Using viral reference databases for phylogeny construction and taxonomic profiling of samples with low viral load
This blog tutorial highlights several recent improvements in the latest update to QIAGEN CLC Microbial Genomics Module 20.1. The update includes improved usability in the Download Microbial Reference Database tool and improved support for long reads in Taxonomic Profiling. Some of the improvements include:
- Faster load times for the selection table, which now loads in just seconds
- Full access to the latest assemblies from NCBI with a taxonomy-aware download selection
- No deduplication: The tool no longer removes duplicate sequences, as this functionality has been moved to Create Taxonomic Profiling Index
With the 20.1 update, it is now easy to customize the Microbial Reference Database to fit your needs. Here we demonstrate two use cases:
- Visualizing phylogenetic relationships of all coronavirus genomes
- Creating a taxonomic profiling index of all viral genomes and carrying out taxonomic profiling of viral metagenome samples containing severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in a few simple steps
Visualizing phylogenetic relationships made easy
The updated downloader makes it simple to visualize phylogenetic relationships. To create a dendrogram of the four coronavirus genera, we first create a microbial database containing only coronaviruses:
- Run the Download Microbial Reference Database tool to load the Database builder
- Filter the table to show only entries where the Taxonomy column contains ‘coronaviridae’
- Aggregate rows on Genus; five references have no genus assigned
- Use Quick Selection: “Complete genomes in RefSeq” to quickly select all complete coronavirus genomes
Approximately 200 references remained and were downloaded with a minimum contig length of 1000. The five references with an unknown genus were included in the downloaded database.
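The selection steps above happen in the graphical Database builder, but the logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the column names (`taxonomy`, `genus`, `complete_refseq`) and the helper functions are hypothetical and do not reflect the tool's internal data model.

```python
from collections import Counter

# Hypothetical rows standing in for the selection table; the column names
# ("taxonomy", "genus", "complete_refseq") are assumptions for illustration.
ROWS = [
    {"name": "ref_A", "taxonomy": "Nidovirales; Coronaviridae; Alphacoronavirus",
     "genus": "Alphacoronavirus", "complete_refseq": True},
    {"name": "ref_B", "taxonomy": "Nidovirales; Coronaviridae; Betacoronavirus",
     "genus": "Betacoronavirus", "complete_refseq": True},
    {"name": "ref_C", "taxonomy": "Nidovirales; Coronaviridae",
     "genus": None, "complete_refseq": True},
    {"name": "ref_D", "taxonomy": "Pimascovirales; Iridoviridae; Ranavirus",
     "genus": "Ranavirus", "complete_refseq": True},
    {"name": "ref_E", "taxonomy": "Nidovirales; Coronaviridae; Betacoronavirus",
     "genus": "Betacoronavirus", "complete_refseq": False},
]


def filter_taxonomy(rows, term):
    """Keep rows whose taxonomy string contains term (case-insensitive)."""
    needle = term.lower()
    return [row for row in rows if needle in row["taxonomy"].lower()]


def aggregate_on_genus(rows):
    """Count rows per genus; rows without a genus are grouped together."""
    return Counter(row["genus"] or "(no genus)" for row in rows)


def select_complete_refseq(rows):
    """Mimic the quick selection: keep rows flagged as complete RefSeq genomes."""
    return [row for row in rows if row["complete_refseq"]]


corona = filter_taxonomy(ROWS, "coronaviridae")
genus_counts = aggregate_on_genus(corona)
selected = select_complete_refseq(corona)
```

Grouping the rows without a genus under one label mirrors what aggregating on Genus reveals in the table: references that belong to the family but lack a genus assignment still pass the filter and are kept in the selection.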
The phylogenies of the downloaded database of assemblies can be easily visualized using Create K-mer Tree. In Create K-mer Tree, select the downloaded database of coronavirus genomes. The dendrogram shown was created with default settings, except “Only index k-mers with prefix” was left blank due to the short length of coronavirus genomes.
Figure 1 shows a circular dendrogram with added genus metadata. For ease of viewing, 50% of both the alphacoronavirus and betacoronavirus genomes have been excluded from the tree.
In the tree, the five references without a genus are selected and their branches are shown in dark blue. From the tree, we can see that three of these references cluster with the betacoronaviruses, one clusters with the alphacoronaviruses and one clusters between the alphacoronaviruses and gammacoronaviruses.
This highlights a quick and easy way to download a database of viral genomes and to use that database to create a phylogeny. The phylogeny can then be used to place references of unknown genus.
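The text does not describe how Create K-mer Tree computes distances between genomes, so the following is not the tool's method. It is a generic sketch of the underlying idea, comparing sequences by their shared k-mer content, using the Jaccard distance between k-mer sets. The toy sequences, the value of `k` and the function names are assumptions chosen for illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """1 minus the fraction of k-mers shared between the two sequences."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    if not union:
        return 0.0  # both sequences are shorter than k
    return 1.0 - len(set_a & set_b) / len(union)


def distance_matrix(seqs, k=4):
    """Pairwise distances for a dict of name -> sequence."""
    return {
        (a, b): jaccard_distance(seqs[a], seqs[b], k)
        for a, b in combinations(sorted(seqs), 2)
    }


# Toy sequences (made up): genome_1 and genome_2 differ by one base,
# genome_3 is unrelated.
SEQS = {
    "genome_1": "ACGTACGTGGCATTAC",
    "genome_2": "ACGTACGTGGCATTAG",
    "genome_3": "TTTTGGGGCCCCAAAA",
}
distances = distance_matrix(SEQS)
```

A pairwise distance table like `distances` is the kind of input a clustering method needs to build a dendrogram: near-identical sequences end up close together, and a sequence without a genus label can be placed by looking at which labeled sequences it sits nearest to.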
Create K-mer Tree also works with reads. In the next section, we demonstrate how to create a taxonomic profile with metagenome samples.
Create a taxonomic profiling index and detect abundance of coronavirus in metagenome samples with low coronavirus copy number
With the recent updates to the Download Microbial Reference Database and Taxonomic Profiling functions in QIAGEN CLC Microbial Genomics Module, it is now fast and easy to detect coronavirus presence in metagenome samples containing only a few virus reads. Taxonomic profiling now also supports long reads such as those generated by Oxford Nanopore and PacBio sequencing technologies.
For the first-time setup, we create a viral database:
- Run the Download Microbial Reference Database tool to load the Database builder
- Filter the table to show only entries where the Taxonomy column contains ‘virae’ (the suffix of virus kingdom names); we skip the remaining viruses in the interest of speed
- Use Quick Selection: “Complete genomes in RefSeq” to quickly select all complete viral genomes
All complete virus genomes to date, approximately 18,500, remained and were downloaded with a minimum contig length of 1000.
The downloaded database was used to create a taxonomic profiling index using default settings.
The analysis can be carried out in a simple workflow using the curated Microbial Reference Database, with the human genome used to create a Taxonomic Profiling index for host genome filtering (Figure 2).
Results are presented from three different studies with a low fraction of viral reads (Table 1).
- SRR10948550: Long-read sequencing using Oxford Nanopore (1)
- SRR11092061: Paired-end sequencing using Illumina HiSeq 3000 (2)
- ERR4385803: Paired-end sequencing using Illumina HiSeq 2500 (gut virome sample, negative for SARS-CoV-2)
Virus abundance values have been aggregated to species level, and the table has been filtered to abundance >10. The % viral reads is the percentage of reads in the sample matching the virus database.
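The post-processing described here (aggregate to species, filter on abundance, report the percentage of viral reads) is done in the Workbench, but the arithmetic can be sketched as follows. The hit list, the species names and the read counts are hypothetical, and treating abundance as a simple per-species count is an assumption made for illustration.

```python
from collections import defaultdict


def aggregate_to_species(hits):
    """Sum abundance over (species, abundance) pairs that share a species."""
    totals = defaultdict(int)
    for species, abundance in hits:
        totals[species] += abundance
    return dict(totals)


def filter_abundance(totals, threshold=10):
    """Keep only species whose abundance is strictly above the threshold."""
    return {species: n for species, n in totals.items() if n > threshold}


def percent_viral_reads(viral_reads, total_reads):
    """Percentage of all reads in the sample that matched the virus database."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * viral_reads / total_reads


# Hypothetical strain-level hits for one sample (not taken from Table 1).
HITS = [
    ("Virus species A", 30),
    ("Virus species A", 12),
    ("Virus species B", 7),
    ("Virus species C", 10),
]
species_totals = aggregate_to_species(HITS)
reported = filter_abundance(species_totals)
pct = percent_viral_reads(viral_reads=50, total_reads=10_000)
```

Because the filter is strict (>10), a species with an abundance of exactly 10 is dropped, which matches every row of Table 1 having an abundance of 12 or more.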
Table 1. Abundances for the different samples (results have been aggregated to species level)
Sample | % viral reads | Species | Taxonomy | Abundance |
SRR10948550 | 1.0556 | Severe acute respiratory syndrome-related coronavirus | Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus | 985 |
SRR10948550 | 1.0556 | Ambystoma tigrinum virus | Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Ambystoma tigrinum virus | 39 |
SRR10948550 | 1.0556 | Common midwife toad virus | Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus | 26 |
SRR11092061 | 0.0045 | Severe acute respiratory syndrome-related coronavirus | Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus | 1304 |
SRR11092061 | 0.0045 | Spodoptera frugiperda rhabdovirus | Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Rhabdoviridae; Spodoptera frugiperda rhabdovirus | 822 |
SRR11092061 | 0.0045 | Saccharomyces 20S RNA narnavirus | Orthornavirae; Lenarviricota; Amabiliviricetes; Wolframvirales; Narnaviridae; Narnavirus; Saccharomyces 20S RNA narnavirus | 336 |
SRR11092061 | 0.0045 | Stenotrophomonas virus SMA7 | Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Subteminivirus; Stenotrophomonas virus SMA7 | 126 |
SRR11092061 | 0.0045 | Influenza A virus | Orthornavirae; Negarnaviricota; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus | 112 |
SRR11092061 | 0.0045 | Nipah henipavirus | Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Paramyxoviridae; Henipavirus; Nipah henipavirus | 48 |
SRR11092061 | 0.0045 | Common midwife toad virus | Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus | 12 |
SRR11092061 | 0.0045 | Inoviridae sp | Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Inoviridae sp | 12 |
ERR4385803 | 0.6578 | Gokushovirus WZ-2015a | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Gokushovirus WZ-2015a | 19753 |
ERR4385803 | 0.6578 | Human gut gokushovirus | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Human gut gokushovirus | 3883 |
ERR4385803 | 0.6578 | Microviridae sp | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Microviridae sp | 1726 |
ERR4385803 | 0.6578 | Microviridae | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae | 47 |
The negative control sample ERR4385803 correctly reports no coronavirus. The abundance of virus was correctly reported in both positive samples (Table 1).
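The check described above, whether a sample contains any coronavirus, can also be done programmatically on an exported profile. The sketch below assumes the taxonomy column is a semicolon-separated lineage, as it appears in Table 1, and uses a handful of rows copied from that table; the function names are hypothetical.

```python
# A few rows copied from Table 1: (sample, taxonomy, abundance).
TABLE_ROWS = [
    ("SRR10948550",
     "Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; "
     "Betacoronavirus; Severe acute respiratory syndrome-related coronavirus", 985),
    ("SRR10948550",
     "Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; "
     "Ranavirus; Ambystoma tigrinum virus", 39),
    ("SRR11092061",
     "Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; "
     "Betacoronavirus; Severe acute respiratory syndrome-related coronavirus", 1304),
    ("ERR4385803",
     "Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; "
     "Gokushovirus WZ-2015a", 19753),
]


def lineage(taxonomy):
    """Split a semicolon-separated taxonomy string into its ranks."""
    return [rank.strip() for rank in taxonomy.split(";")]


def abundance_of_taxon(rows, taxon):
    """Total abundance per sample for rows whose lineage contains taxon."""
    totals = {}
    for sample, taxonomy, abundance in rows:
        totals.setdefault(sample, 0)  # samples without a hit still appear, with 0
        if taxon in lineage(taxonomy):
            totals[sample] += abundance
    return totals


coronavirus = abundance_of_taxon(TABLE_ROWS, "Coronaviridae")
```

Matching on an exact rank name rather than a substring avoids accidental hits on species names that merely contain the word, and keeping zero-abundance samples in the result makes a negative control explicit rather than silently absent.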
References:
- Zhou, P. et al. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798): 270-273.
- Chan, J.F.W. et al. (2020) A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. The Lancet 395(10223): 514-523.