In easyRNASeq: Count summarization and normalization for RNA-Seq data

Synthetic transcripts - createSyntheticTranscripts

One major caveat of estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, might sometimes align to several features (e.g. transcripts or genes) with alignments of equivalent quality.

This, for example, might happen as a result of gene duplication and the presence of repetitive or common domains. To avoid counting unique mRNA fragments multiple times, the stringent approach is to keep only uniquely mapping reads - being aware of potential consequences, see the note below.

Not only can multiple counting arise from a biological reason, but also from technical artifacts, introduced mostly by poorly formatted gff3/gtf annotation files. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a synthetic transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript: a gene structure with a one to one relationship.

To create such a structure, we use the createSyntheticTranscripts function on the file we just downloaded, simply by passing our annotParam object as argument.

annotParam <- createSyntheticTranscripts(annotParam,verbose=FALSE)

This function returns an updated annotParam object that contains the newly created, flattened transcript annotation. This object can then be saved as an rda file for later re-use or for sharing with collaborators.

save(annotParam,
file="./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda")

Alternatives

Instead of updating the annotParam object, we could have created an object of class Genome_Intervals from the genomeIntervals package, using the same function but using the actual datasource of the previous annotParam object as argument rather than the object itself.

gI <- createSyntheticTranscripts(
    "./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
    verbose=FALSE)

This gI object can then be exported as a gff3 file.

writeGff3(gI,
          file="./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts.gff3.gz")

Note: Ignoring multi-mapping reads may introduce biases in the read counts of some genes (such as that of paralogs or of very conserved gene families), but in the context of a conservative first analysis we are of the current opinion that they are best ignored. One should of course assess how many reads are multi-mapping (check for example the STAR output) and possibly extract them from the alignment read file to visualize them using a genome browser so as to understand where they are located and how they may affect any analysis. Based on this, one may, at a later stage, decide to relax the counting parameters to accept multi-mapping reads.