We've got a useful tip that will help you get even more value out of the QIAGEN CLC Microbial Genomics Module when performing OTU clustering: use the latest version of the SILVA OTU database with minimal effort outside of the QIAGEN CLC Genomics Workbench, even before that version becomes available through the Microbial Genomics Module. The SILVA databases are updated more regularly than the corresponding QIIME releases, which the downloader currently relies on. To avoid waiting for QIIME updates, the newest SILVA database can be imported with the Create Annotated Sequence List tool, with just a bit of reformatting required.
SILVA releases are available on the FTP server https://ftp.arb-silva.de/, where each release is stored in a separate folder. Here we focus on the latest release_138, more specifically the non-redundant database at 99% sequence similarity. If you are interested in another version, please consult the corresponding README file and change surl and the corresponding turl at the top of the script accordingly. To download the correct files and format them properly for import into the QIAGEN CLC Genomics Workbench, the following script may be used:
import gzip, urllib.request, zipfile, io, os

surl = "https://ftp.arb-silva.de/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva.fasta.gz"
turl = "https://ftp.arb-silva.de/release_138/Exports/taxonomy/taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz"
nurl = "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"

print("Downloading " + nurl[nurl.rfind("/")+1:] + " may take some time ... ", end="", flush=True)

# QIIME-style prefixes for the seven ranks used in the lineage strings.
allowedRanks = {"superkingdom": "k__", "phylum": "p__", "class": "c__", "order": "o__",
                "family": "f__", "genus": "g__", "species": "s__"}

def sp(line):
    # Split a line of the NCBI taxonomy dump into its "\t|\t"-separated fields.
    return line.replace(b"\n", b"\t").split(b"\t|\t")

# Build a taxid -> [parent taxid, rank, scientific name] lookup from the NCBI taxonomy dump.
with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(nurl).read())) as zip_ref:
    with zip_ref.open([name for name in zip_ref.namelist() if os.path.basename(name) == "nodes.dmp"][0]) as zf:
        nodes = {sp(line)[0]: [sp(line)[1], sp(line)[2].decode("UTF-8"), ""] for line in zf}
    with zip_ref.open([name for name in zip_ref.namelist() if name.endswith("names.dmp")][0]) as zf:
        for line in zf:
            s = sp(line)
            if s[3] == b"scientific name":
                nodes[s[0]][2] = s[1].decode("UTF-8")

def getLineage(byteTaxID):
    # Walk up the taxonomy from the given taxid and fill in the seven allowed ranks.
    lin = {r: v for r, v in allowedRanks.items()}
    pid = byteTaxID
    if pid in nodes:
        mid = nodes[pid]
        while pid != b"1" and pid != mid[0]:
            if mid[1] in allowedRanks:
                lin[mid[1]] += mid[2]
            pid = mid[0]
            mid = nodes[pid]
    return "; ".join(v for k, v in lin.items())

print("done")

oname1 = surl[surl.rfind("/")+1:].replace("fasta.gz", "fa.gz")
oname2 = oname1.replace("fa.gz", "txt")

# Write a tab-separated Name/Taxonomy table from the SILVA taxonomy map.
print("Downloading " + turl[turl.rfind("/")+1:] + " may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(turl)) as gzTax, open(oname2, 'w') as tO:
    next(gzTax)  # skip the header line
    tO.write("Name" + "\t" + "Taxonomy" + "\n")
    for line in gzTax:
        fields = line.strip().split(b"\t")
        tO.write(fields[0].decode("UTF-8") + "." + fields[1].decode("UTF-8") + "." + fields[2].decode("UTF-8")
                 + "\t" + getLineage(fields[5]) + "\n")
print("done")
print("Taxonomy output: " + oname2)

# Shorten the fasta headers and convert the RNA sequences (U) to DNA (T).
print("Downloading " + surl[surl.rfind("/")+1:] + " may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(surl)) as gzSilva, gzip.open(oname1, 'wb') as fO:
    for line in gzSilva:
        if line.startswith(b">"):
            fO.write(line[:line.rfind(b" ", 0, line.find(b";"))] + b"\n")
        else:
            fO.write(line.replace(b"U", b"T"))
print("done")
print("Fasta output: " + oname1)
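If you want to target a different SILVA release, only the surl and turl assignments at the top of the script need to change. As a purely hypothetical sketch (the exact file names differ between releases, so check the release's Exports folder and README on the FTP server first):

# Hypothetical placeholders only - verify the real file names for your release
# under https://ftp.arb-silva.de/ before editing the script.
surl = "https://ftp.arb-silva.de/release_XYZ/Exports/SILVA_XYZ_SSURef_NR99_tax_silva.fasta.gz"
turl = "https://ftp.arb-silva.de/release_XYZ/Exports/taxonomy/taxmap_embl-ebi_ena_ssu_ref_nr99_XYZ.txt.gz"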
To run this script, you need a standard installation of Python 3. All you need to do is copy and paste the content above, modify the URLs if necessary, save it to a file and execute it on your system. For example, you may save the file as “get_silva.py”, then open a terminal and navigate to the folder where the script is located. Finally, execute it with:
$ python get_silva.py
Depending on your connection, this script will run for about 5 to 10 minutes. It downloads three files and performs actions on and with them:
- taxdmp.zip from the NCBI FTP server, which provides the taxonomy tree (nodes.dmp) and the scientific names (names.dmp) used to build the lineages.
- taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz from the SILVA FTP server, which maps each rRNA sequence to an NCBI taxid.
- SILVA_138_SSURef_NR99_tax_silva.fasta.gz from the SILVA FTP server, which contains the rRNA sequences; the fasta headers are shortened and the RNA bases (U) are converted to DNA bases (T).
For each of the taxids of the rRNAs, a 7-step lineage is constructed on the levels of the allowed ranks (superkingdom, phylum, class, order, family, genus and species). The script writes two output files to the folder where it is executed:
- SILVA_138_SSURef_NR99_tax_silva.txt, a tab-separated table with a Name and a Taxonomy column, and
- SILVA_138_SSURef_NR99_tax_silva.fa.gz, the gzipped fasta file with matching sequence names.
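If you would like to sanity-check the output before importing it, a quick look at the first lines of both files is enough. The short sketch below assumes the release_138 file names produced by the script above (adjust them if you changed surl):

import gzip

# Quick peek at the two files written by get_silva.py
# (file names assume release_138; adjust if you changed surl in the script).
with open("SILVA_138_SSURef_NR99_tax_silva.txt") as tax:
    for _ in range(3):
        # "Name<TAB>Taxonomy" header, then lines like
        # "<accession>.<start>.<stop><TAB>k__...; p__...; c__...; o__...; f__...; g__...; s__..."
        print(next(tax).rstrip())

with gzip.open("SILVA_138_SSURef_NR99_tax_silva.fa.gz", "rt") as fa:
    for _ in range(4):
        # ">accession.start.stop" headers followed by sequence lines with U converted to T
        print(next(fa).rstrip())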
These two files can now be used in the Create Annotated Sequence List tool.
Now you have version 138 of the SILVA database available for OTU clustering. Quick and easy, right?
For questions about this or other tips, tricks or functionalities related to the QIAGEN CLC Microbial Genomics Module or QIAGEN CLC Genomics Workbench, contact us at bioinformaticssales@qiagen.com.