Transcript Discovery latest improvements

Transcript Discovery 22.0

Released on January 11, 2022

Updated to be compatible with CLC Genomics Workbench 22 and CLC Genomics Server 22.

Bug fixes

Fixed an issue in the Transcript Discovery tool so that the output mRNA track has the same annotation type as the input mRNA track.

Transcript Discovery 21.0

Released on January 12, 2021

Updated to be compatible with CLC Genomics Workbench 21 and CLC Genomics Server 21.

Bug fixes

Fixed an issue where the presence of spliced reads that mapped across the origin of a circular chromosome, where the start of a mapped segment lay at least 15 bases from the origin, caused Transcript Discovery to fail. Such reads are now ignored. Note that this tool is not designed to discover transcripts or genes that wrap around the chromosome.

Transcript Discovery 20.0

Released on December 11, 2019

Updated to be compatible with CLC Genomics Workbench 20 and CLC Genomics Server 20.

Transcript Discovery 3.0

Released on November 28, 2018

The tools delivered by Transcript Discovery 3.0 are no longer betas. Further information about this release can be found on https://staging.digitalinsights.supremeclients.com/plugins/transcript-discovery/.

Changes affecting both the Large Gap Read Mapping and Transcript Discovery tools

The Large Gap Read Mapping and Transcript Discovery tools can now be included in workflows.

Both tools have been changed to produce track-based outputs. Benefits of this change include:

Existing annotations can be easily imported for use in the Transcript Discovery tool by using Import | Tracks. Previously the Annotate with GFF file plugin would have been needed.
Different results obtained by varying the settings in the Transcript Discovery tool can be compared side-by-side in a Track List.
Predicted genes and transcripts can be used directly in the RNA-Seq Analysis tool.
High coverage regions can be inspected more easily, due to the superior performance of the track viewer for deep mappings.

Large Gap Read Mapping

Improvements

The Large Gap Read Mapping tool now runs up to several times faster than it did in earlier releases.
The accuracy of the results has been improved.
Handling of long reads such as PacBio CCS reads has been improved. Reads of length up to 99 999 bp are supported (previously the limit was 8000 bp).
Reads can now be mapped across the origin of circular reference sequences.
The estimation of paired end distances in the presence of an unknown set of intervening splice junctions has been improved. This typically leads to more reads being mapped in proper pairs.
Paired end reads are now annotated with their mapping orientation. Forward pairs are shown in dark blue by default, and reverse pairs are shown in light blue by default.

Bug fixes

Some anomalies in the mapping report for paired end reads have been addressed. In particular:

- An issue where estimates of paired end distances could sometimes be reported as negative.
- An issue where unaligned ends of individual reads in a pair would be reported as “unaligned internal gaps”.

Transcript Discovery

Improvements

Accuracy has been substantially improved. Changes that have contributed to these improvements include:
- There is now greater emphasis on using inferred strand information when reconstructing transcripts from regions with overlapping genes on different strands.
- When reconstructing transcripts in regions with closely packed genes, the set of inferred genes is now re-evaluated several times.
- The order in which events are filtered has been changed and events are also merged, where necessary, between filter steps, typically leading to more of the underlying signal being recovered from the data.
The process of determining which transcript isoforms to report has been improved. The tool now considers up to 10 000 possible transcripts per gene, which is usually exhaustive, and performs an EM quantification to find those transcripts that are most likely to be expressed. This approach has the additional benefit that transcripts can now be detected even when they include only splice junctions present in other transcripts.
The tool is now considerably faster for most data sets. This is especially noticeable for high coverage regions.
The tool is designed to allow predictions to be updated over time. If a “Predicted transcript” and “Predicted Gene” track generated by the Transcript Discovery tool are supplied as annotations in another run of the tool, then:
- The newly generated “Predicted transcript” and “Predicted Gene” tracks will contain the list of all the previously detected transcripts or genes as well as newly detected ones.
- New, unique names are assigned to any newly predicted genes, avoiding clashes with names of previously predicted genes.
- Previously detected transcripts that are identified in the new data set are annotated with an estimate of their expression level in the new data set.
We recommend that, after updating predictions, the RNA-Seq Analysis tool is run using the new “Predicted transcript” and “Predicted Gene” tracks as annotations. This identifies older predictions that could be removed because they are more parsimoniously explained by the newly identified transcripts.
The extra information in paired end reads is now better utilized when reconstructing transcripts. We note that even with these changes, the tool is still expected to be at least as accurate with single end reads as with paired end reads of equivalent length. For example, 300 bp single end reads are slightly preferred over 2×150 bp paired end reads.
It is now possible to supply several Large Gap Read Mappings as input when launching the tool. These are then processed as one data set.
The tool can now analyze data for chromosomes with high read coverage. Previously, chromosomes where reads / chromosome length > 1000 were skipped. This change may lead to high-coverage transcripts being discovered on short contigs, such as mitochondrial genomes.
CDSs are now annotated from the first observed start codon if doing so allows the resulting ORF to exceed the specified minimum ORF length. In other cases, the behavior is the same as previously: the CDS annotation starts at the first codon in the open reading frame. This change improves the recovery of full-length ORFs.
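
The CDS start rule described in the last item above can be sketched as follows. This is a minimal illustration, not the plugin's implementation: the function name `choose_cds_start` and the parameter `min_orf_length` are hypothetical, the open reading frame is assumed to be given as an in-frame DNA string (so the start codon appears as ATG rather than AUG), and treating the minimum length as inclusive is an assumption.

```python
def choose_cds_start(orf: str, min_orf_length: int) -> int:
    """Return the 0-based offset in `orf` at which the CDS annotation starts.

    `orf` is assumed to be an in-frame open reading frame (DNA alphabet).
    `min_orf_length` is the minimum ORF length in nucleotides (hypothetical name).
    """
    orf = orf.upper()
    # Walk the frame codon by codon, looking for the first start codon.
    for offset in range(0, len(orf) - 2, 3):
        if orf[offset:offset + 3] == "ATG":
            # Use the start codon only if the remaining ORF is long enough
            # (inclusive comparison is an assumption of this sketch).
            if len(orf) - offset >= min_orf_length:
                return offset
            break
    # Otherwise fall back to the first codon of the open reading frame.
    return 0
```

For a 12 nt frame such as `CCCATGAAACCC`, the sketch starts the CDS at the ATG when the minimum length is 6, and at the first codon when the minimum length is 12.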

Bug fixes

Fixed an issue where the realignment of reads at splice junctions could return a sub-optimal result, sometimes leading to the prediction of incorrect splice junctions.
The Transcript Discovery tool is now able to detect alternative splicing arising from intron retention. Previously, if two transcripts in an input mapping differed by a retained intron, then only the fully-spliced transcript would be reported.
Fixed an issue that could occur when running with the options “Exclude uncertain splice sites” and “Extend known annotations”. The error message would include the text “found an unrepresented event”.
Various minor bug fixes.

Changes

Genetic code: This new option is available when predicting Open Reading Frames. The code choice is important for determining possible stop codons. This information is not used to determine start codons: only AUG is treated as a start codon, as alternative start codons are rare in practice.
Ignore chimeric reads: This new filter identifies and removes reads that are inferred to map across two nearby genes.
Strand specific: This option has now been removed. It supported the less common type of strand-specificity where single end reads (or the first read in a pair) map in the direction of the transcript. The tool now infers read direction accurately from splice signatures without the need for this option.
Extend existing annotations: This option has now been removed. If previously generated predictions are provided as input to the tool, the results will include both the previous and new predictions.
Splice sites: The choice of splice sites has been replaced by the use of GT-AG, GC-AG, and AT-AC splice signatures. This is because other forms of splicing are typically the result of assembly errors or noise in the read mapping.
Predict open reading frames: This option has been removed. The prediction of open reading frames is now always performed.
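
To illustrate how the GT-AG, GC-AG, and AT-AC splice signatures listed above can also indicate transcript direction, here is a minimal sketch. It is not the tool's algorithm: the function names and the 0-based, end-exclusive intron coordinates are assumptions made for this example. An intron whose forward-strand sequence matches a supported signature is taken to be on the plus strand, and one whose reverse complement matches is taken to be on the minus strand.

```python
from typing import Optional

# Donor/acceptor dinucleotide pairs named in the release notes.
SUPPORTED_SIGNATURES = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def infer_intron_strand(reference: str, intron_start: int, intron_end: int) -> Optional[str]:
    """Return "+", "-", or None for the intron reference[intron_start:intron_end].

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    intron = reference[intron_start:intron_end].upper()
    if len(intron) < 4:
        return None
    first2, last2 = intron[:2], intron[-2:]
    if (first2, last2) in SUPPORTED_SIGNATURES:
        return "+"
    # On the minus strand the donor is the reverse complement of the last
    # two bases and the acceptor is the reverse complement of the first two.
    if (reverse_complement(last2), reverse_complement(first2)) in SUPPORTED_SIGNATURES:
        return "-"
    return None
```

None of the three signatures is the reverse complement of another, so in this sketch a matching intron is never assigned to both strands.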

Known limitations

The Large Gap Read Mapping tool can only align reads when at least 10% of the read maps without splicing. This requirement means that reads spanning more than 10 exons are less likely to be mapped.
Alternative isoforms that are a strict subset of existing transcripts (i.e. they differ only by having TSS and TES at different positions/exons but share all intervening exons) cannot be distinguished. Only the longest transcript will be reported in these cases.
Transcripts spanning the origin of circular chromosomes will be reported as two disconnected transcripts: one at the start of the chromosome and one at the end.
If the predictions generated by the Transcript Discovery tool are supplied as annotations, and a new round of prediction is performed on the same input read mapping, then a small number of novel transcripts and genes will still be identified. This is because the set of known annotations can affect which events are filtered, and lead to small changes in the predicted genes and transcripts.
When used with short read data, all tools that attempt to recover full length transcripts are likely to produce many false positives (typically at least 50% for human RNA-seq data). See, for example, Hayer et al., Bioinformatics, Volume 31, Issue 24, 15 December 2015, Pages 3938–3945, https://doi.org/10.1093/bioinformatics/btv488
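
The 10% requirement in the first limitation above can be made concrete with a small sketch. It assumes one possible reading of the rule: that the longest contiguous (unspliced) segment of a read must cover at least 10% of the read length. The function name `meets_unspliced_requirement` and the representation of a read as a list of exon segment lengths are hypothetical.

```python
from typing import List


def meets_unspliced_requirement(segment_lengths: List[int], min_percent: int = 10) -> bool:
    """Return True if the longest unspliced segment covers at least
    `min_percent` of the read.

    `segment_lengths` holds the number of read bases falling in each exon,
    in order along the read (an assumed representation).
    """
    total = sum(segment_lengths)
    if total == 0:
        return False
    # Integer arithmetic avoids floating point rounding at the threshold.
    return max(segment_lengths) * 100 >= min_percent * total


# A 150 bp read split evenly over 10 exons: each segment is exactly 10%.
print(meets_unspliced_requirement([15] * 10))
# A 150 bp read over 11 exons where no segment reaches 15 bp.
print(meets_unspliced_requirement([14] * 10 + [10]))
```

Under this reading, a read split across 10 or fewer exons always has at least one segment covering 10% of its length, whereas a read split across more exons may not, which is consistent with the stated limitation.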