Reference multi-nucleotide variants (MNVs) are removed when applying filter or annotation tools under some circumstances

Issue description

In rare cases, reference MNVs may be removed in error by tools that add information to or remove information from variants, and by tools that filter variants. This occurs when an MNV is called as the reference allele for a region containing multiple non-reference variants and some of those non-reference variants are subsequently filtered away. Reference variants left without a perfectly matching alternate variant may then, in some cases, be removed.

Expected impact

This issue is expected to affect a minority of variants, arising where there is low support for some of the called variants. This is most likely in analyses where very low frequency variants are being considered.

Details

The chance of calling multiple alternate alleles in the same region increases when detecting very low frequency variants, especially if the data contained in a read mapping are of low quality. If variant detection is followed by filtering steps using tools such as “Remove False Positives”, as would be common in a workflow context, any reference MNV allele appearing without a corresponding variant MNV would be filtered away. The remaining variants would then be represented without corresponding reference alleles, even though the data supported the existence of reference alleles. In affected software, SNVs in this situation would be reported with incorrect or misleading zygosity.

Prior to CLC Genomics Workbench 12.0 and CLC Genomics Server 11.0, affected variants could also cause downstream issues after export to VCF. Two or more overlapping heterozygous variants without corresponding reference alleles would be reported on separate lines as “reference allele unknown” (a “.” in the GT field), when they should have been reported on a single line as heterozygous alleles.
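
The following is a minimal sketch, not part of any CLC product, of how an exported VCF might be screened for the pattern described above: records on the same chromosome whose reference spans overlap and whose GT value contains a “.”. The function name is hypothetical, only the first sample column is inspected, and the example records are hand-written for illustration rather than taken from an actual export.

```python
def find_unknown_reference_overlaps(vcf_text):
    """Return pairs of VCF data lines that overlap on the same chromosome
    and both carry a missing allele ('.') in the GT of the first sample."""
    records = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no genotype to inspect
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt = fields[9].split(":")[format_keys.index("GT")]
        has_missing = "." in gt.replace("|", "/").split("/")
        # Store the inclusive span covered by the REF allele.
        records.append((chrom, pos, pos + len(ref) - 1, has_missing, line))

    flagged = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            overlap = a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
            if overlap and a[3] and b[3]:
                flagged.append((a[4], b[4]))
    return flagged


# Hand-written illustrative records (assumed layout, single sample column).
example = "\n".join([
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chr1\t101\t.\tA\tG\t.\t.\t.\tGT\t./1",
    "chr1\t101\t.\tA\tT\t.\t.\t.\tGT\t./1",
    "chr1\t500\t.\tC\tT\t.\t.\t.\tGT\t0/1",
])
flagged = find_unknown_reference_overlaps(example)
print(len(flagged))
```

Any pairs returned are candidates for manual review only; a “.” in GT can also arise for reasons unrelated to this issue.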

Further background

In one of the later steps of variant detection, if contiguous single nucleotide variants (SNVs) have been found, evidence is sought in the reads for the presence of an MNV. If this evidence exists, the contiguous SNVs are reported as an MNV. Otherwise, they remain classified as SNVs.
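
As a rough illustration of this step, the toy sketch below reports contiguous SNVs as a single MNV only when enough reads carry all of the alternate bases together. It is not the variant caller's actual algorithm: the read representation, the function name and the `min_supporting_reads` threshold are all assumptions made for the example.

```python
def merge_contiguous_snvs(reads, snvs, min_supporting_reads=2):
    """Report contiguous SNVs as one MNV if enough reads support it.

    reads: list of dicts mapping reference position -> observed base
    snvs:  list of (position, alternate_base) at contiguous positions
    Returns [(start, alternate_bases)] for an MNV, or the SNVs unchanged.
    """
    # A read supports the MNV only if it shows every alternate base.
    supporting = sum(
        1 for read in reads
        if all(read.get(pos) == alt for pos, alt in snvs)
    )
    if supporting >= min_supporting_reads:
        start = snvs[0][0]
        return [(start, "".join(alt for _, alt in snvs))]
    return list(snvs)


snvs = [(100, "G"), (101, "T")]

# Both alternate bases occur together on the same reads: reported as an MNV.
linked_reads = [{100: "G", 101: "T"}, {100: "G", 101: "T"}, {100: "A", 101: "C"}]
as_mnv = merge_contiguous_snvs(linked_reads, snvs)

# The alternate bases occur on different reads: they remain separate SNVs.
unlinked_reads = [{100: "G", 101: "C"}, {100: "A", 101: "T"}]
as_snvs = merge_contiguous_snvs(unlinked_reads, snvs)
```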

During the final step of variant detection, count and coverage filters are applied, and potential variants identified before this point will be filtered out if they do not meet the necessary cut-off criteria.

For a region with two or more contiguous, heterozygous SNVs in the data, it is possible for a potential variant MNV allele to be filtered out at this step, such that the variants are represented as multiple SNVs. The reference MNV may then be filtered out by downstream tools, leaving the SNV alleles with no corresponding reference variant, even though the data supported the presence of one.
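
The scenario above can be reproduced with a small toy model. This is a sketch under stated assumptions, not the logic of any specific CLC tool: the `Variant` class and the exact-match rule in `drop_unmatched_reference` are hypothetical stand-ins for a downstream filter that keeps a reference variant only when a non-reference variant covers exactly the same region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    start: int    # position of the first reference base
    ref: str      # reference bases covered by this variant
    allele: str   # observed bases; equal to ref for a reference variant

    @property
    def region(self):
        return (self.start, self.start + len(self.ref))

    @property
    def is_reference(self):
        return self.allele == self.ref


def drop_unmatched_reference(variants):
    """Hypothetical exact-match filter: keep a reference variant only if
    some non-reference variant covers exactly the same region."""
    alt_regions = {v.region for v in variants if not v.is_reference}
    return [v for v in variants if not v.is_reference or v.region in alt_regions]


# The variant MNV (AC -> GT) did not pass the count/coverage filters, so the
# region is represented by a reference MNV plus two separate SNVs.
called = [
    Variant(100, "AC", "AC"),  # reference MNV
    Variant(100, "A", "G"),    # SNV
    Variant(101, "C", "T"),    # SNV
]
remaining = drop_unmatched_reference(called)
print(remaining)  # the reference MNV is gone; only the two SNVs remain
```

Neither SNV spans the same two-base region as the reference MNV, so the exact-match rule discards the reference allele and the SNVs are left without one.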

Changes to VCF export introduced in CLC Genomics Workbench 12.0 and CLC Genomics Server 11.0 partially mitigate this situation through the introduction of four options for how complex variants should be represented. Of particular note for this issue is the “Reference overlap” option, which is the default. When it is selected, overlapping alternate alleles are appropriately assigned a heterozygous genotype when the reference variant does not perfectly match the overlapping alternate variants.

Affected software

The problem exists as described for the following software versions:

The downstream effects of the problem were partially mitigated through changes to the VCF export functionality introduced in CLC Genomics Workbench 12.0 and CLC Genomics Server 11.0, as described in the “Further background” section above.

Further mitigation of this problem was introduced with CLC Genomics Workbench 12.0.1 and CLC Genomics Server 11.0.1, where the behavior of the following tools was changed so that reference variants without exactly matching non-reference variants are retained if they partially overlap non-reference variants: Annotate with Flanking Sequence, Annotate with Conservation Score, Annotate with Exon Numbers, Remove Variants Present in Control Reads, Remove Marginal Variants, Remove Orphan Reference Variants, Filter against Known Variants, Filter Based on Overlap, GO Enrichment Analysis, Link Variants to 3D Protein Structure, Predict Splice Site Effect, TRIO Analysis, Identify Shared Variants, Add Information from Overlapping Genes (legacy), Compare Simple Variant Tracks (legacy) and Remove Variants Found in Allele Frequency Community (from the Ingenuity Variant Analysis plugin).
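
The changed retention rule can be pictured with the sketch below. It is an illustration of the principle only, assuming a simplified allele model with half-open coordinates; `Allele` and `keep_overlapping_references` are hypothetical names and do not correspond to the tools' internal implementation.

```python
from typing import NamedTuple


class Allele(NamedTuple):
    start: int          # first covered position
    end: int            # one past the last covered position
    is_reference: bool


def overlaps(a, b):
    """True if the two half-open regions share at least one position."""
    return a.start < b.end and b.start < a.end


def keep_overlapping_references(alleles):
    """Retain a reference allele if it overlaps any non-reference allele,
    even when no non-reference allele matches its region exactly."""
    alts = [a for a in alleles if not a.is_reference]
    return [
        a for a in alleles
        if not a.is_reference or any(overlaps(a, alt) for alt in alts)
    ]


alleles = [
    Allele(100, 102, True),   # reference MNV, partially overlapped by SNVs
    Allele(100, 101, False),  # SNV
    Allele(101, 102, False),  # SNV
    Allele(200, 202, True),   # reference allele with no overlapping variant
]
retained = keep_overlapping_references(alleles)
print(retained)
```

Under this rule the reference MNV at 100 to 102 is kept because both SNVs fall within it, while the reference allele at 200 to 202, which overlaps nothing, is still removed in this sketch.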