OTU clustering using the latest version of SILVA OTU database from within QIAGEN CLC Microbial Genomics Module

Author:

QIAGEN Digital Insights

We’ve got a useful tip that will help you get even more value out of QIAGEN CLC Microbial Genomics Module when performing OTU clustering: how to use the latest version of the SILVA OTU database within the module with minimal effort outside of QIAGEN CLC Genomics Workbench, even before that version is released through the Microbial Genomics Module. The SILVA databases are updated more regularly than the corresponding QIIME versions, which the database downloader currently relies on. To avoid waiting for QIIME updates, the newest SILVA database can be used with the Create Annotated Sequence List tool, with just a bit of reformatting required.

SILVA releases are available on the FTP server https://ftp.arb-silva.de/, where each release is stored in a separate folder. Here we focus on the latest release, release_138, more specifically the non-redundant database at 99% sequence similarity (NR99). If you are interested in another version, please consult the corresponding README file and change the surl and the corresponding turl at the top of the script accordingly. To download the correct files and format them properly for import into the QIAGEN CLC Genomics Workbench, the following script may be used:

import gzip, urllib.request, zipfile, io, os

# URLs of the SILVA release 138 NR99 fasta export, the SILVA-to-NCBI taxonomy
# mapping and the NCBI Taxonomy dump; adjust surl and turl for other releases
surl="https://ftp.arb-silva.de/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva.fasta.gz"
turl="https://ftp.arb-silva.de/release_138/Exports/taxonomy/taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz"
nurl="https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"
print("Downloading "+nurl[nurl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)

# Ranks kept in the 7-step lineage and their QIIME-style prefixes
allowedRanks = {"superkingdom":"k__", "phylum":"p__","class":"c__","order":"o__","family":"f__","genus":"g__","species":"s__"}

def sp(line):
    # Split a line of an NCBI *.dmp file into its fields
    return line.replace(b"\n",b"\t").split(b"\t|\t")

# Build the NCBI taxonomy in memory: nodes maps taxid -> [parent taxid, rank, scientific name]
with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(nurl).read())) as zip_ref:
    with zip_ref.open([name for name in zip_ref.namelist() if os.path.basename(name) == "nodes.dmp"][0]) as zf:
        nodes = {sp(line)[0]:[sp(line)[1], sp(line)[2].decode("UTF-8"), ""] for line in zf}
    with zip_ref.open([name for name in zip_ref.namelist() if name.endswith("names.dmp")][0]) as zf:
        for line in zf:
            s = sp(line)
            if s[3]==b"scientific name":
                nodes[s[0]][2] = s[1].decode("UTF-8")

def getLineage(byteTaxID):
    # Walk from the given taxid towards the root and collect the names of the allowed ranks
    lin = {r:v for r,v in allowedRanks.items()}
    pid = byteTaxID
    if pid in nodes:
        mid = nodes[pid]
        while pid != b"1" and pid != mid[0]:
            if mid[1] in allowedRanks:
                lin[mid[1]] += mid[2]
            pid = mid[0]
            mid = nodes[pid]
    return "; ".join(v for k,v in lin.items())
print("done")

oname1 = surl[surl.rfind("/")+1:].replace("fasta.gz","fa.gz")
oname2 = oname1.replace("fa.gz","txt")

print("Downloading "+turl[turl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
# Write the tab-separated taxonomy file: sequence name -> QIIME-formatted lineage
with gzip.GzipFile(fileobj=urllib.request.urlopen(turl)) as gzTax, open(oname2,'w') as tO:
    next(gzTax)  # skip the header line of the taxmap file
    tO.write("Name"+"\t"+"Taxonomy"+"\n")
    for line in gzTax:
        cols = line.strip().split(b"\t")
        # SILVA sequence names have the form accession.start.stop; the sixth column of the taxmap holds the NCBI taxid
        tO.write(cols[0].decode("UTF-8")+"."+cols[1].decode("UTF-8")+"."+cols[2].decode("UTF-8")+"\t"+getLineage(cols[5])+"\n")
print("done")
print("Taxonomy output: "+oname2)

print("Downloading "+surl[surl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
# Write the fasta file: keep only the sequence name in the header and convert RNA (U) to DNA (T)
with gzip.GzipFile(fileobj=urllib.request.urlopen(surl)) as gzSilva, gzip.open(oname1,'wb') as fO:
    for line in gzSilva:
        if line.startswith(b">"):
            fO.write(line[:line.rfind(b" ", 0, line.find(b";"))]+b"\n")
        else:
            fO.write(line.replace(b"U",b"T"))
print("done")
print("Fasta output:    "+oname1)

To run this script, you need a standard installation of Python 3. All you need to do is copy and paste the content above, modify the URLs (if necessary), save it to a file and execute it on your system. For example, you may save the file as “get_silva.py”, then open a terminal and navigate to the folder where the script is located. Finally, execute it with:

$ python3 get_silva.py
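
If the script runs through successfully, the progress messages printed to the terminal should look roughly like this (they come directly from the print statements in the script):

Downloading taxdmp.zip may take some time ... done
Downloading taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz may take some time ... done
Taxonomy output: SILVA_138_SSURef_NR99_tax_silva.txt
Downloading SILVA_138_SSURef_NR99_tax_silva.fasta.gz may take some time ... done
Fasta output:    SILVA_138_SSURef_NR99_tax_silva.fa.gz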

Depending on your connection, this script will run for about 5 to 10 minutes. It downloads three files and processes them as follows:

  • The most recent NCBI Taxonomy: taxdmp.zip. The script loads the taxids, parent ids, ranks and names of the taxonomy into memory (see the sketch after this list).
  • Taxonomy mappings from SILVA: taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz. The script uses this file to map the SILVA sequence names to taxids in the NCBI taxonomy. Note that the SILVA database is updated biannually while the corresponding NCBI taxonomy is updated daily, so there is not always a one-to-one correspondence between the final taxonomies and the original SILVA taxonomies.
  • The SILVA rRNA database: SILVA_138_SSURef_NR99_tax_silva.fasta.gz. The script strips the provided taxonomies from this file, keeps the sequence names and translates U to T.
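
Conceptually, the in-memory NCBI taxonomy built by the script is a plain dictionary mapping each taxid (kept as bytes) to its parent taxid, rank and scientific name. The two entries below are purely illustrative:

# Shape of the nodes dictionary built by the script: taxid -> [parent taxid, rank, scientific name].
# The two entries are illustrative examples, not an excerpt of the real data.
nodes = {
    b"561": [b"543", "genus", "Escherichia"],
    b"562": [b"561", "species", "Escherichia coli"],
}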

For each of the rRNA taxids, a 7-step lineage is constructed from the allowed ranks (superkingdom down to species). The output of the script is two files, written to the folder where it is executed:

  • SILVA_138_SSURef_NR99_tax_silva.fa.gz: Fasta file with the rRNA sequences and the sequence names in the header
  • SILVA_138_SSURef_NR99_tax_silva.txt: A tab-separated file connecting the name of an rRNA sequence to its taxonomy in QIIME format
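
A row in the taxonomy file simply pairs a SILVA sequence name (accession.start.stop) with its QIIME-formatted lineage, with the two columns separated by a tab (rendered here with spaces). The row below is illustrative only and not taken from the release:

Name             Taxonomy
AB123456.1.1492  k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; s__Escherichia coli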

These two files can now be used with the Create Annotated Sequence List tool:

  • Import the SILVA_138_SSURef_NR99_tax_silva.fa.gz file using a standard import, or drag and drop the file into the CLC Genomics Workbench
  • Run the Create Annotated Sequence List tool on the resulting CLC file in the Workbench and click “Next”
  • Select SILVA_138_SSURef_NR99_tax_silva.txt as the taxonomy file
  • Set the similarity percentage to 99% (if you have selected the NR99 version of SILVA; otherwise this should be adjusted)
  • Click “Next” and, in the “Select input file and map columns to attributes” step, set the Separator under Parsing to “Tab”
  • Click “Next” and “Finish”

Now you have version 138 of the SILVA database available for OTU clustering. Quick and easy, right?

For questions about this or other tips, tricks or functionalities related to QIAGEN CLC Microbial Genomics Module or QIAGEN CLC Genomics Workbench, contact us at bioinformaticssales@qiagen.com.

Disclaimer: QIAGEN does not support the SILVA databases constructed this way, and the information provided in this article is given without any warranty, expressed or implied. Users are solely responsible for the application of any code or information provided. The SILVA databases version 138 are free for academic and commercial use under the Creative Commons Attribution 4.0 (CC-BY 4.0) license.