Bioinformatics workflows for routine analysis of many samples

Author: QIAGEN Digital Insights


Discover an easy way to get reproducible analyses, traceability of results and efficient bulk analysis with QIAGEN CLC workflows.

Stringing bioinformatics tools together into pipelines enables reproducible execution of complex workflows that turn raw NGS data into outputs such as QC reports, data visualizations, statistical analyses, and annotated and filtered results. Combined with parallel execution, QIAGEN CLC workflows offer efficient throughput of reproducible analyses with traceable results.

In QIAGEN CLC Workbenches, workflows are easily created and configured using a graphical editor (1, 2). Tools can be added through drag-and-drop from the Workbench Toolbox or by selecting from a list. The output of one element is defined as the input to another simply by drawing a line between them. Fine-grained control over the execution pattern within a workflow can be added with control flow elements, supporting cases such as RNA-Seq and differential expression analysis in a single workflow or providing sets of different inputs per workflow run.
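Conceptually, connecting the output of one element to the input of another turns a workflow into a directed graph, and a valid execution order is any order that respects those connections. The sketch below illustrates that idea in plain Python; it is not the QIAGEN CLC workflow format or API, and the element names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each element maps to the set of elements
# whose output it consumes (the "lines" drawn in a graphical editor).
workflow = {
    "Trim Reads": {"Import Reads"},
    "Map Reads": {"Trim Reads"},
    "QC Report": {"Trim Reads"},
    "Call Variants": {"Map Reads"},
}


def execution_order(connections):
    """Return one valid order in which the elements can be run."""
    return list(TopologicalSorter(connections).static_order())


order = execution_order(workflow)
print(order)
```

Elements with no connection between them, such as the QC report and read mapping here, have no fixed relative order, which is what leaves room for running parts of a workflow in parallel.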

With a QIAGEN CLC Genomics Server, third-party applications (e.g., your own tools or open-source tools) can be configured as external applications, thereby expanding the analysis potential beyond the software provided by QIAGEN (3). External applications can be added to workflows using the graphical workflow editor.

Getting started with workflows is simple. Examples are provided in the Template Workflows folder in the Workbench Toolbox. These workflows can be run directly or edited to add or remove tools, change parameters, reconfigure output naming patterns and much more. The Template Workflows folder initially contains two subfolders: Basic Workflow Designs (containing RNA-seq and DNA-seq workflows) and Prepare Raw Data. When QIAGEN CLC Workbench plugins containing workflows are installed, additional subfolders are created containing those template workflows.

Outputs generated using QIAGEN CLC workflows include provenance information relevant for auditing or publication. This history information includes the version of the software used, the tool and parameter settings used, the name of the user who ran the workflow, the date and time the element was created and the data that the output was derived from. When analyses are run on a QIAGEN CLC Server, a record of the analysis is also written to the audit log.
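To make the kinds of fields listed above concrete, here is a minimal sketch of a provenance record as a Python data class. The class name, field names and example values are all assumptions chosen for illustration; they do not reflect how QIAGEN CLC software stores history information internally.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class HistoryEntry:
    """Hypothetical provenance record attached to one output element."""
    software_version: str   # version of the software used
    tool: str               # tool that produced the output
    parameters: dict        # parameter settings used
    user: str               # who ran the workflow
    derived_from: tuple     # data the output was derived from
    created: str = field(   # creation date and time (UTC, ISO 8601)
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record(tool, parameters, user, inputs, software_version):
    """Build a history entry for an output produced from `inputs`."""
    return HistoryEntry(software_version, tool, dict(parameters), user, tuple(inputs))


# All values below are placeholders.
entry = record("Trim Reads", {"quality_limit": 0.05}, "alice",
               ["sample_01.fastq"], "0.0-example")
print(asdict(entry))
```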

For bulk processing, a workflow can be submitted in batch mode, where the workflow is run multiple times, once for each input, or set of inputs, specified. When a workflow is launched in batch mode on a QIAGEN CLC Workbench, the individual jobs in that batch are carried out serially – one workflow run after another. For small analyses, this is fine. However, for routine analyses and for large analyses, we recommend the parallel execution potential and intelligent queuing facilities afforded by QIAGEN CLC Genomics Server using a Job Node or Grid Node setup.

When a workflow is submitted in batch mode to a QIAGEN CLC Genomics Server with nodes, the individual workflow runs can be executed in parallel. The server administrator can choose the level of parallelization desired. Options include executing each workflow run on a single node, splitting execution of individual workflows across nodes or specifying parallelization at the level of sub-workflows (blocks), which are created behind the scenes during execution.

Figure 1. Serial (top) versus parallel (bottom) execution of workflows. On CLC Servers with nodes, queuing and parallel execution capacity support optimal use of computational resources. On a QIAGEN CLC Genomics Server without nodes, workflows are queued and processed serially, regardless of how they were submitted. On QIAGEN CLC Workbenches, batch jobs are run serially. Triggering parallel execution of workflows on a QIAGEN CLC Workbench by launching multiple individual jobs in relatively quick succession is not recommended: each workflow run assumes it has access to the entire system, so there is a risk that jobs will crash due to issues such as memory limitations.
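The difference between serial and parallel batch execution can be sketched in a few lines of Python. This is a conceptual illustration only, not how QIAGEN CLC software schedules jobs: `run_workflow` is a hypothetical stand-in for one workflow run, and `max_workers` loosely plays the role of the number of nodes available.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(sample):
    """Stand-in for one workflow run; returns a label for its output."""
    return f"{sample}: report"


def run_batch_serial(samples):
    """One workflow run after another, as in a batch on a single machine."""
    return [run_workflow(s) for s in samples]


def run_batch_parallel(samples, max_workers):
    """Up to `max_workers` runs at once; the rest wait in a queue."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, whatever the finish order.
        return list(pool.map(run_workflow, samples))


samples = ["sample_01", "sample_02", "sample_03", "sample_04"]
serial = run_batch_serial(samples)
parallel = run_batch_parallel(samples, max_workers=2)
print(parallel)
```

Both functions produce the same outputs; only the scheduling differs. Capping the number of concurrent runs mirrors the caution in the caption: launching every run at once on a single machine would leave each run competing for resources it assumes it has to itself.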

Learn more about the features and request your free trial of QIAGEN CLC Genomics Server and QIAGEN CLC Genomics Workbench, and explore the benefits for yourself.

Have questions? Request a consultation today.

References:

  1. An introduction to workflows using QIAGEN CLC Workbenches
  2. Theiagen Consulting LLC video tutorials on how to build workflows in CLC Genomics Workbench for SARS-CoV-2 analysis
  3. Theiagen Consulting LLC video tutorials on how to include external applications in the CLC Genomics Server: a) RAxML b) MAFFT c) iVar