knitr::opts_chunk$set(echo = TRUE)
The purpose of annotation plots is to provide a graphical overview of how were processed the sequences in CAGE libraries. How many were discarded, and why ? Where do align the remaining ones ? The plotAnnot function displays this information as stacked bar plots, with error bars if multiple libraries are grouped together. The default scale is in percentage points, from 0 to 100 %
The main command to produce annotation plots in smallCAGEqc is called
plotAnnot
. It takes a table containing the sample metadata, a scope,
a title and optionally a factor to group similar plots together.
The scope determines what data is plotted and how it is normalised. The available scopes will be explained with an example plot in a later part of this document, but first, let's see the input in more details.
Here, we will use some of the example data that is distributed in smallCAGEqc.
The commands below load the R package and load the example data in a data frame
called libs
.
library(smallCAGEqc) libs <- read.table(system.file("extdata/libs-with-all-metadata.tsv", package="smallCAGEqc"))
The following columns in the metadata table describe the total remaining pairs step after step in the processing.
total: The total number of pairs before tag extraction (called "raw" in some other pipelines). In cases where this number is not available per sample, for example when demultiplexing and tag extraction are performed at the same stage, it is set arbitrarily to zero.
extracted: The number of pairs where the linkers and unique molecular identifier (if present) were successfully extracted.
cleaned: The number of pairs remaining after filtering out spike, rRNA, low-complexity, primer artefact and other unwanted sequences.
mapped: The number of pairs mapped to the genome with at least one successful alignment, regardless of its quality. This called "genome_mapped" in some other pipelines.
properpairs: The number of pairs remaining after filtering out the non-proper alignments, that are unlikely to represent a real transcript because the mates are not aligned to the same chromosome, or are too far apart, or not on the same strand, or tail to tail. This is called "properly_mapped" in some other pipelines. Note that some real transcripts may be missed by this filtering step, for instance if there was a recombination not reflected in the reference genome, or in case of trans-splicing, etc.
counts: The number of unique molecules counted after alignment. This is called "transcript_count" in some other pipelines. The annotation statistics will be calculated on these counts.
The following columns describe the number of pairs removed at some step of the processing.
spikes, rdna: The number of pairs removed because they matched spikes or rRNA reference sequences, respectively.
tagdust: The number of pairs removed because of low-complexity or similarity to primer artefacts.
The following columns describe the number of TSS (after proper pairing and deduplication) aligning to known regions in the genome.
promoter: Promoter regions.
exon: Known exons.
intron: Know introns (that is: the TSS matches a transcript but none of its exons).
unknown: None of the above; basically intergenic.
other: Normally there should not be anything in this category since unknown should catch all intergenic TSS. But sometimes there is a misdesign, for instance in the example data here, the annotation file contained one more category, "boundary", which was not handled correctly by smallCAGEqc.
The annotation is hierarchical (promoters have priority on exons, etc.), so the sum of the annotation columns above should be be equal to the counts column.
Shows how many pairs are removed by the extraction, cleaning, mapping (proper pairs) and transcript counting steps described above.
Extraction: The total number of pairs where a tag could not be extracted. Calculated as extracted − cleaned.
Cleaning: The total number of pairs removed because at least one mate matched a reference or artefact sequence, or because it had low complexity Calculated as tagdust + rdna + spikes, or cleaned − mapped.
Mapping: The number of pairs removed by the mapping, because at least one mate did not align or because the alignmens were not "properly paired" (see "proerpairs" above) Calculated as mapped − properpairs.
Deduplication: The number of pairs removed because they represent the same molecule. Thus, this definition is broader than just removing the identical pairs (for instance with the rmdups command of samtools). Calculated as properpairs − counts.
Counts: The number of molecule counts.
plotAnnot(libs, SCOPE="steps", TITLE="steps")
Pairs are categorised as tag dust, rDNA, spikes, unmapped, non-proper, duplicates and counts, and normalised by the total number of extracted pairs. Non-extracted pairs are ignored.
Compared to "steps", this scope gives more details on the sequences removed at the TagDust and mapping stages of the processing pipeline.
Tag_dust, rDNA, Spikes: same numbers as in the tagdust, rdna and spikes columns of the metadata table.
Unmapped: The number of pairs that could not be mapped.
Non_proper: The number of pairs that did not have a proper alignment.
Duplicates: The number of pairs that do not add a molecule count.
Counts: same as in "steps".
plotAnnot(libs, SCOPE="qc", TITLE="qc")
The unique molecule counts are grouped in annotation categories ("promoter", "exon", "intron" and "intergenic"), as described above.
plotAnnot(libs[libs$counts > 0,], SCOPE="counts", TITLE="counts")
Same as "counts", with the addition of duplicates and non-proper pairs. Therefore the plot represents all the mapped data.
plotAnnot(libs, SCOPE="mapped", TITLE="mapped")
Pairs are categorised by extraction step and genome annotation.
plotAnnot(libs, SCOPE="all", TITLE="all")
Same as all
except that normalisation is relative to the number of mapped reads
plotAnnot(libs, SCOPE="annotation", TITLE="annotation")
libs
tablelibs
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.