With the scientific research community publishing over two million peer-reviewed articles every year since 2012 (1) and next-generation sequencing fueling a data explosion, the need for comprehensive yet accurate, reliable and analysis-ready information on the path to biomedical discoveries is now more pressing than ever.
Manual curation has become essential to producing such data. Data scientists spend an estimated 80% of their time collecting, cleaning and processing data, leaving only about 20% for analyzing the data to generate insights (2,3). But manual curation is not just time-consuming; it is also costly and difficult to scale.
We at QIAGEN take on the task of manual curation so researchers like you can focus on making discoveries. Our human-certified data enables you to concentrate on generating insights rather than collecting data. QIAGEN has been curating biomedical and clinical data for over 25 years. We’ve invested heavily in a biomedical and clinical knowledge base that contains millions of manually reviewed findings from the literature, plus information from commonly used third-party databases and ‘omics dataset repositories. With our knowledge and databases, scientists can generate high-quality, novel hypotheses quickly and efficiently using advanced approaches, including artificial intelligence.
Here are seven best practices for manual curation that QIAGEN’s 200 dedicated curation experts follow, which we presented at the November 2021 Pistoia Alliance event.
- Efficient yet thorough information capture: Reading and understanding articles is the rate-limiting step in curation, so efficiency is imperative. All essential elements must be captured in a single reading. But because critical information may be distributed throughout the article, curators must read it in its entirety to deliver accurate findings and context.
- Standardization: We use an ontology of more than 2 million concepts and dozens of relationship types to capture information. Wherever possible, data are mapped to public identifiers to enhance interoperability.
- Triaging: Document selection is fundamental to efficient manual curation and helps avoid reading articles that lack useful information. We identify relevant sources using criteria such as novelty, use automation to prioritize articles for manual curation, and rely on delivery workflows to orchestrate the work.
- Training: For consistency, we use internally developed curation protocols, training documents and editorial reviews. Trainees receive continuous feedback for several months before advancing to our production environment.
- Tooling: Good curation tools are fundamental to accuracy and efficiency. Our internally created tools ensure we capture information consistently through guided forms, pull-down menus, constraints on slots and other features.
- Revisions: Knowledge constantly evolves and needs to be updated based on new evidence. Articles may become deprecated or have corrections published, and drug labels and guidelines undergo revisions. Our workflows deal with all of these situations.
- Quality control: We measure accuracy at several levels, including QC built into curation tools, editor reviews, author error reviews and database consistency checks.
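To make the standardization and tooling practices above more concrete, here is a minimal Python sketch of how a guided form might constrain each slot of a finding and map free-text terms to identifiers. It is an illustration only, not a description of QIAGEN's tools or ontology: the `EX:` identifiers, the tiny vocabulary, and the names `CONCEPT_IDS`, `RELATIONSHIP_TYPES`, `normalize_term`, `Finding` and `make_finding` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary: lowercase synonym -> made-up identifier.
CONCEPT_IDS = {
    "aspirin": "EX:0001",
    "acetylsalicylic acid": "EX:0001",
    "ptgs1": "EX:0002",
    "cox-1": "EX:0002",
}

# Hypothetical list of allowed relationship types (the "pull-down menu").
RELATIONSHIP_TYPES = {"activates", "inhibits", "binds"}


def normalize_term(term: str) -> str:
    """Map a free-text term to its identifier, or raise if it is unmapped."""
    key = term.strip().lower()
    if key not in CONCEPT_IDS:
        raise ValueError(f"unmapped term: {term!r}")
    return CONCEPT_IDS[key]


@dataclass(frozen=True)
class Finding:
    subject_id: str
    relationship: str
    object_id: str
    source: str  # citation for the article the finding was taken from


def make_finding(subject: str, relationship: str, obj: str, source: str) -> Finding:
    """Build a finding only if every slot satisfies its constraint."""
    if relationship not in RELATIONSHIP_TYPES:
        raise ValueError(f"unknown relationship type: {relationship!r}")
    if not source.strip():
        raise ValueError("a finding must cite its source")
    return Finding(normalize_term(subject), relationship, normalize_term(obj), source)


# Synonyms and inconsistent casing collapse to the same identifiers.
finding = make_finding(" Acetylsalicylic acid ", "inhibits", "COX-1", "example article")
print(finding)
```

Rejecting unmapped terms outright, rather than storing them as free text, is one possible design choice; a real workflow might instead route them to an editor for review.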
These principles ensure that our knowledge base and integrated ‘omics database deliver timely, highly accurate, reliable and analysis-ready data. In our experience, 40% of public ‘omics datasets include typos or other potentially critical errors in an essential element (cell lines, treatments, etc.); 5% require us to contact the authors to resolve inconsistent terms, mislabeled treatments or infections, inaccurate sample groups or errors mapping subjects to samples. Thanks to our stringent manual curation processes, we can correct such errors.
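As a rough sketch of how typos in an essential element such as a cell line name could be surfaced for a curator, the Python snippet below compares dataset labels against a controlled vocabulary using the standard library's `difflib`. The vocabulary, the 0.8 similarity cutoff and the function name `check_label` are assumptions made for illustration; this is not QIAGEN's actual method, and a flagged label would still need human review.

```python
import difflib

# Hypothetical controlled vocabulary of cell line names.
KNOWN_CELL_LINES = ["HeLa", "MCF-7", "HEK293", "A549"]


def check_label(label, vocabulary=KNOWN_CELL_LINES, cutoff=0.8):
    """Classify a label as 'ok', 'suspect' (near a known term) or 'unknown'.

    Returns a (status, suggestion) tuple; suggestion is None when no
    vocabulary term is similar enough.
    """
    cleaned = label.strip()
    if cleaned in vocabulary:
        return ("ok", cleaned)
    # Compare case-insensitively so that 'Hela' is matched to 'HeLa'.
    by_lower = {term.lower(): term for term in vocabulary}
    matches = difflib.get_close_matches(
        cleaned.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if matches:
        return ("suspect", by_lower[matches[0]])
    return ("unknown", None)


for raw in ["HeLa", "Hela", "MCF7", "XYZ-1"]:
    status, suggestion = check_label(raw)
    print(f"{raw!r}: {status}, suggestion={suggestion}")
```

A string-similarity check like this only proposes candidates. Distinct cell lines can have very similar names, so the decision to correct a label, or to contact the authors, stays with the curator.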
Our extensive investment in high-quality manual curation means that scientists like you don’t need to spend 80% of your time aggregating and cleaning data. We’ve scaled our rigorous manual curation procedures to collect and structure accurate and reliable information from many different sources, from journal articles to drug labels to ‘omics datasets. In short, we accelerate your journey to comprehensive yet accurate, reliable and analysis-ready data.
Ready to get your hands on reliable biomedical, clinical and ‘omics data that we’ve manually curated using these best practices? Learn about QIAGEN knowledge and databases, and request a consultation to find out how our accurate and reliable data will save you time and get you quick answers to your questions.
References:
1. The STM Report 2018: An overview of scholarly and scientific publishing. https://www.stm-assoc.org/2018_10_04_STM_Report_2018.pdf
2. H. Sarih, A. P. Tchangani, K. Medjaher and E. Pere (2019) Data preparation and preprocessing for broadcast systems monitoring in PHM framework. 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), 1444–1449. DOI:10.1109/CoDIT.2019.8820370
3. Big data to good data: Andrew Ng urges ML community to be more data-centric and less model-centric (06/04/2021). https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/