We've got a useful tip that will help you get even more value out of the QIAGEN CLC Microbial Genomics Module when performing OTU clustering: use the latest version of the SILVA OTU database with minimal effort outside of the QIAGEN CLC Genomics Workbench, even before that version becomes available through the Microbial Genomics Module. The SILVA databases are updated more regularly than the corresponding QIIME releases, which the downloader currently relies on. To avoid waiting for QIIME updates, the newest SILVA database can be imported with the Create Annotated Sequence List tool, with just a bit of reformatting required.
SILVA releases are available on the FTP server https://ftp.arb-silva.de/, where each release is stored in a separate folder. Here we focus on the latest release_138, more specifically the non-redundant database at 99% sequence similarity. If you are interested in another version, please consult the corresponding README file and change surl and the corresponding turl at the top of the script accordingly. To download the correct files and format them properly for import into the QIAGEN CLC Genomics Workbench, the following script may be used:
import gzip, urllib.request, zipfile, io, os

surl = "https://ftp.arb-silva.de/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva.fasta.gz"
turl = "https://ftp.arb-silva.de/release_138/Exports/taxonomy/taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz"
nurl = "https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"

print("Downloading " + nurl[nurl.rfind("/")+1:] + " may take some time ... ", end="", flush=True)

# QIIME-style prefixes for the seven ranks used in the lineage strings.
allowedRanks = {"superkingdom": "k__", "phylum": "p__", "class": "c__", "order": "o__",
                "family": "f__", "genus": "g__", "species": "s__"}

def sp(line):
    # Split a line of the NCBI taxonomy dump into its "\t|\t"-separated fields.
    return line.replace(b"\n", b"\t").split(b"\t|\t")

# Build a taxid -> [parent taxid, rank, scientific name] lookup from the NCBI taxonomy dump.
with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(nurl).read())) as zip_ref:
    with zip_ref.open([name for name in zip_ref.namelist() if os.path.basename(name) == "nodes.dmp"][0]) as zf:
        nodes = {sp(line)[0]: [sp(line)[1], sp(line)[2].decode("UTF-8"), ""] for line in zf}
    with zip_ref.open([name for name in zip_ref.namelist() if name.endswith("names.dmp")][0]) as zf:
        for line in zf:
            s = sp(line)
            if s[3] == b"scientific name":
                nodes[s[0]][2] = s[1].decode("UTF-8")

def getLineage(byteTaxID):
    # Walk up the taxonomy from the given taxid and fill in the seven allowed ranks.
    lin = {r: v for r, v in allowedRanks.items()}
    pid = byteTaxID
    if pid in nodes:
        mid = nodes[pid]
        while pid != b"1" and pid != mid[0]:
            if mid[1] in allowedRanks:
                lin[mid[1]] += mid[2]
            pid = mid[0]
            mid = nodes[pid]
    return "; ".join(v for k, v in lin.items())

print("done")

oname1 = surl[surl.rfind("/")+1:].replace("fasta.gz", "fa.gz")
oname2 = oname1.replace("fa.gz", "txt")

# Write a tab-separated Name/Taxonomy table from the SILVA taxonomy map.
print("Downloading " + turl[turl.rfind("/")+1:] + " may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(turl)) as gzTax, open(oname2, 'w') as tO:
    next(gzTax)  # skip the header line
    tO.write("Name" + "\t" + "Taxonomy" + "\n")
    for line in gzTax:
        fields = line.strip().split(b"\t")
        tO.write(fields[0].decode("UTF-8") + "." + fields[1].decode("UTF-8") + "." + fields[2].decode("UTF-8")
                 + "\t" + getLineage(fields[5]) + "\n")
print("done")
print("Taxonomy output: " + oname2)

# Shorten the fasta headers and convert the RNA sequences (U) to DNA (T).
print("Downloading " + surl[surl.rfind("/")+1:] + " may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(surl)) as gzSilva, gzip.open(oname1, 'wb') as fO:
    for line in gzSilva:
        if line.startswith(b">"):
            fO.write(line[:line.rfind(b" ", 0, line.find(b";"))] + b"\n")
        else:
            fO.write(line.replace(b"U", b"T"))
print("done")
print("Fasta output: " + oname1)
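If you want to target a different SILVA release, only the surl and turl assignments at the top of the script need to change. As a purely hypothetical sketch (the exact file names differ between releases, so check the release's Exports folder and README on the FTP server first):

# Hypothetical placeholders only - verify the real file names for your release
# under https://ftp.arb-silva.de/ before editing the script.
surl = "https://ftp.arb-silva.de/release_XYZ/Exports/SILVA_XYZ_SSURef_NR99_tax_silva.fasta.gz"
turl = "https://ftp.arb-silva.de/release_XYZ/Exports/taxonomy/taxmap_embl-ebi_ena_ssu_ref_nr99_XYZ.txt.gz"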
To run this script, you need a standard installation of Python 3. All you need to do is copy and paste the content above, modify the URLs if necessary, save it to a file and execute it on your system. For example, you may save the file as “get_silva.py”, then open a terminal and navigate to the folder where the script is located. Finally, execute it with:
$ python get_silva.py
Depending on your connection, this script will run for about 5 to 10 minutes. It downloads three files and performs actions on and with them:
- taxdmp.zip from the NCBI FTP server, which provides the taxonomy tree (nodes.dmp) and the scientific names (names.dmp) used to build the lineages.
- taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz from the SILVA FTP server, which maps each rRNA sequence to an NCBI taxid.
- SILVA_138_SSURef_NR99_tax_silva.fasta.gz from the SILVA FTP server, which contains the rRNA sequences; the fasta headers are shortened and the RNA bases (U) are converted to DNA bases (T).
For each of the taxids of the rRNAs, a 7-step lineage is constructed on the levels of the allowed ranks (superkingdom, phylum, class, order, family, genus and species). The script writes two output files to the folder where it is executed:
- SILVA_138_SSURef_NR99_tax_silva.txt, a tab-separated table with a Name and a Taxonomy column, and
- SILVA_138_SSURef_NR99_tax_silva.fa.gz, the gzipped fasta file with matching sequence names.
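If you would like to sanity-check the output before importing it, a quick look at the first lines of both files is enough. The short sketch below assumes the release_138 file names produced by the script above (adjust them if you changed surl):

import gzip

# Quick peek at the two files written by get_silva.py
# (file names assume release_138; adjust if you changed surl in the script).
with open("SILVA_138_SSURef_NR99_tax_silva.txt") as tax:
    for _ in range(3):
        # "Name<TAB>Taxonomy" header, then lines like
        # "<accession>.<start>.<stop><TAB>k__...; p__...; c__...; o__...; f__...; g__...; s__..."
        print(next(tax).rstrip())

with gzip.open("SILVA_138_SSURef_NR99_tax_silva.fa.gz", "rt") as fa:
    for _ in range(4):
        # ">accession.start.stop" headers followed by sequence lines with U converted to T
        print(next(fa).rstrip())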
These two files can now be used in the Create Annotated Sequence List tool.
Now you have version 138 of the SILVA database available for OTU clustering. Quick and easy, right?
For questions about this or other tips, tricks or functionalities related to the QIAGEN CLC Microbial Genomics Module or QIAGEN CLC Genomics Workbench, contact us at bioinformaticssales@qiagen.com.