Author: QIAGEN Digital Insights
March 8, 2024

Using trusted cancer data can accelerate drug discovery and development

How expert-curated cancer data from COSMIC and HSMD can help biopharmaceutical researchers identify and validate targets faster and optimize clinical trial design.

In cancer drug discovery and development, data is king. From identifying potential molecular targets to helping predict drug toxicity and optimizing clinical trial design, high-quality data can significantly improve the efficiency and success rate of bringing new cancer therapies to market.

The Catalogue Of Somatic Mutations In Cancer (COSMIC) and the Human Somatic Mutation Database (HSMD) are two expert-curated somatic databases exclusively licensed through QIAGEN that enable biopharmaceutical researchers to avoid pitfalls in early cancer drug discovery, confidently qualify candidate drug targets, and accelerate indication expansion and repurposing of existing cancer therapies.

In this blog post, we take a closer look at COSMIC and HSMD for biopharmaceutical research, providing an overview of the expert curation processes, the types of data found in each database, and examples of how this data can be applied throughout the cancer drug discovery and development pipeline.

How COSMIC's cancer data supports oncology drug discovery

COSMIC is an expert-curated knowledge base providing data on somatic variants in cancer, supported by a comprehensive suite of tools for interpreting genomic data, discerning the impact of somatic alterations on disease, and facilitating translational research. Thousands of cancer and biopharmaceutical researchers and clinicians use the catalogue daily to quickly retrieve information from an immense pool of data curated from more than 29,000 scientific publications and large studies.

COSMIC integrates somatic data from multiple sources published around the world and allows researchers to access and scrutinize information about somatic mutations and their impact in cancer. Over the past two decades, COSMIC has been diligently collecting, cleaning, and organizing genomic data and associated metadata from cancer studies published in scientific literature and various bioinformatics sources. This data is then translated into a standardized format, integrated, and made available to the research community through well-structured datasets and user-friendly data exploration websites and tools.

In addition to the main catalogue of somatic mutations, six accompanying resources focus on different aspects of oncology (Figure 1). The Cancer Gene Census (CGC) and Cancer Mutation Census (CMC) provide additional annotations on the roles of genes and mutations in oncogenesis. These annotations follow a defined set of rules and rest on sufficient evidence obtained through dedicated literature curation and analysis of the content of the core catalogue.

→ View the complete database numbers in the latest COSMIC v99 (December 2023) here.

Figure 1. COSMIC’s seven key resources for understanding cancer and improving cancer patient care. The main catalogue of somatic mutations is supported by six further resources that together add layers of knowledge, helping to interpret the impact of somatic mutations on cancer development and presenting available therapeutic options (graphic from Sondka et al. 2024).

COSMIC's expert curation process

COSMIC’s workflows for manually curating cancer genetic data have been built to deliver high-quality, biologically and clinically relevant data to the research community. Different data sources and types of curated data require different approaches (Figure 2). However, each case shares common core elements.

  • First, the source of the information is identified from the peer-reviewed literature or bioinformatic resources and checked for the quality and relevance of its content.
  • To enable meaningful analysis by end users, data need to be adequately and transparently categorized. This is achieved by combining the use of controlled vocabularies that label data and a database schema that is able to represent these vocabularies.
  • Before data extraction, all curated features and terms are converted to vocabularies, ontologies and data conventions used by COSMIC. Genes, variants, and transcripts use external vocabularies and ontologies. For interoperability, all COSMIC disease classifications have been mapped to the NCI thesaurus ontology and these mappings can be downloaded from the COSMIC website.
  • Acquiring the data itself is the final stage of curation. The minimum unit of curation is a genetic variant, the tumour type, and the scope of the study, i.e. which genes were tested. In addition, whenever reported by the publication, other clinical features for the patient are curated, e.g. age, gender, ethnicity, therapeutic history, family history of cancer, or exposure to DNA-damaging agents. At the tumour level, the curation team extracts information on cancer stage and grade, metastases, drug response, and therapy relationship, i.e. whether a sample was collected prior to, during, or post-therapy.
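
To make the idea of a "minimum unit of curation" more concrete, here is a minimal sketch in Python of how such a record and a controlled vocabulary could be represented. It is an illustration only and not COSMIC's actual schema: the names `CurationRecord`, `TUMOUR_VOCABULARY`, `normalise_tumour_type` and `make_record` are hypothetical, the vocabulary entries are toy values, and the check that a variant's gene lies within the study scope is our own assumption about a sensible validation rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional

# Hypothetical controlled vocabulary: free-text labels -> one standard term.
# Toy values for illustration; these are not COSMIC's actual classifications.
TUMOUR_VOCABULARY = {
    "melanoma": "melanoma",
    "malignant melanoma": "melanoma",
    "cutaneous melanoma": "melanoma",
}


@dataclass(frozen=True)
class CurationRecord:
    """Minimum unit of curation: a variant, a tumour type and the study scope."""
    gene: str
    variant: str
    tumour_type: str
    genes_tested: FrozenSet[str]
    # Optional clinical feature, filled in only when the publication reports it.
    therapy_relationship: Optional[str] = None  # "prior", "during" or "post"


def normalise_tumour_type(label: str) -> str:
    """Map a free-text tumour label onto the controlled vocabulary."""
    key = " ".join(label.lower().split())  # lower-case and collapse whitespace
    if key not in TUMOUR_VOCABULARY:
        raise ValueError(f"Tumour type not in controlled vocabulary: {label!r}")
    return TUMOUR_VOCABULARY[key]


def make_record(gene: str, variant: str, tumour_label: str,
                genes_tested: Iterable[str],
                therapy_relationship: Optional[str] = None) -> CurationRecord:
    """Build a record, rejecting variants in genes outside the study scope."""
    scope = frozenset(genes_tested)
    if gene not in scope:
        raise ValueError(f"{gene} is outside the study scope {sorted(scope)}")
    return CurationRecord(gene, variant, normalise_tumour_type(tumour_label),
                          scope, therapy_relationship)


# Illustrative input, not taken from any real curated study.
record = make_record("BRAF", "p.V600E", "Malignant  Melanoma",
                     {"BRAF", "NRAS"}, therapy_relationship="prior")
print(record.tumour_type)  # melanoma
```

Recording the study scope alongside each variant matters because it distinguishes "gene tested, no mutation found" from "gene not tested", which any downstream frequency analysis depends on.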

Figure 2. COSMIC data curation flowchart. Depending on the data source and curation objectives, there are three main curation paths in COSMIC (graphic from Sondka et al. 2024).

How HSMD's cancer data supports oncology drug development

HSMD is a web-based application that allows biopharmaceutical researchers and clinical NGS testing labs to harness genetic insights from QIAGEN’s real-world oncology dataset combined with knowledge from two decades of expert curation.

The latest version of HSMD focuses on providing deep insight into small variants (such as SNVs, indels, and frameshifts), as well as fusions and copy number variants, that have been clinically observed or curated from scientific literature, helping users better understand and define precise function and actionability. This expert-curated resource contains content from over 547,000 real-world clinical oncology cases combined with content from the QIAGEN Knowledge Base (QKB), providing gene-level, alteration-level, and disease-level information.

HSMD enables users to easily search and explore mutational characteristics across genes, synthesize key findings from drug labels, clinical trials, and professional guidelines, and receive detailed annotations for each observed variant (Figure 3).


Figure 3. HSMD home screen. HSMD enables users to search by gene, alteration, disease, drugs, and clinical trials.

HSMD's expert curation process

HSMD leverages variant content from two sources: expert-curated content from the QIAGEN Knowledge Base (QKB) and data from real-world oncology cases sourced from our professional clinical interpretation services (Figure 4).

When a variant has been “clinically observed,” it means our professional clinical interpretation service has encountered this alteration in a real-world clinical case. For these variants, QIAGEN's team has assessed the clinical and biological relevance and calculated the gene and variant prevalence across observed tumor types. Conversely, content from the QKB is proactively curated from scientific literature; therefore, not all variants have yet been directly clinically observed by our professional clinical interpretation services.
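
As a rough illustration of what "variant prevalence across observed tumor types" can mean, the sketch below computes, for each tumor type, the fraction of cases in which a given variant was seen. This is an assumed, simplified definition and not a description of how HSMD calculates its figures; the function name `variant_prevalence` and the toy case list are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Case = Tuple[str, Set[str]]  # (tumor type, variants observed in that case)


def variant_prevalence(cases: Iterable[Case], variant: str) -> Dict[str, float]:
    """Fraction of cases per tumor type in which `variant` was observed."""
    totals: Counter = Counter()  # number of cases per tumor type
    hits: Counter = Counter()    # number of those cases carrying the variant
    for tumor_type, variants in cases:
        totals[tumor_type] += 1
        if variant in variants:
            hits[tumor_type] += 1
    return {tumor_type: hits[tumor_type] / totals[tumor_type]
            for tumor_type in totals}


# Toy, made-up cases for illustration only; not real-world data.
toy_cases = [
    ("melanoma", {"BRAF p.V600E"}),
    ("melanoma", {"NRAS p.Q61R"}),
    ("colorectal", {"KRAS p.G12D"}),
]
print(variant_prevalence(toy_cases, "BRAF p.V600E"))
# {'melanoma': 0.5, 'colorectal': 0.0}
```

The same counting approach extends to gene-level prevalence by testing whether any observed variant in a case belongs to the gene of interest.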

Figure 4. HSMD curation workflow. HSMD contains content from the QKB, which pulls information from all public and proprietary databases, clinical articles for the most relevant cancer genes, and thousands of clinical articles for somatic genes. Curation then occurs by artificial intelligence (AI) approaches, manual curation, or a combination of both. All content then goes through rigorous quality control to ensure consistency, accuracy, and reproducibility. In addition, HSMD contains content from over 500,000 somatic mutations submitted to QIAGEN's professional variant interpretation service, QCI Precision Insights (formerly N-of-One). This is de-identified patient data that provides even greater insight into real-world clinical cases.

Trusted cancer data to accelerate drug discovery and development

COSMIC and HSMD are two expert-curated databases licensed exclusively through QIAGEN that enable biopharmaceutical companies to improve the drug discovery process, develop more effective clinical trials, and enhance the treatment of rare cancers. To learn more about how your research team can use COSMIC and HSMD, visit our product webpage or click the button below for a free trial and personal consultation with our biopharmaceutical research experts.
