## nolint start
suppressPackageStartupMessages({
    library(goalie)
    library(basejump)
    library(bcbioRNASeq)
})
prepareTemplate()
source("_setup.R")
## nolint end

Load bcbioRNASeq object.

object <- import(params[["bcb_file"]])
assert(
    is(object, "bcbioRNASeq"),
    validObject(object)
)
print(object)

bcbio-nextgen run data was imported from:
r metadata(object)[["uploadDir"]].

Sample metadata

as.data.frame(sampleData(object))

Read metrics {.tabset}

Total reads

High quality RNA-seq samples ideally should have at least 10 million reads per sample.

plotTotalReads(object)

Mapped reads

The number of mapped reads should correspond to the number of total reads.

plotMappedReads(object)

Mapping rate

The genomic mapping rate represents the percentage of reads mapping to the reference genome. Low mapping rates are indicative of sample contamination, poor sequencing quality or other artifacts.

plotMappingRate(object)

Number of features (genes) detected

plotFeaturesDetected(object)

Feature (gene) detection saturation

We should observe a linear trend in the number of genes detected with the number of mapped reads, which indicates that the sample input was not overloaded.

plotFeatureSaturation(object)

Exonic mapping rate

Ideally, at least 60% of total reads should map to exons.

plotExonicMappingRate(object)

Intronic mapping rate

The majority of reads should map to exons and not introns.

plotIntronicMappingRate(object)

rRNA mapping rate

Samples should have a ribosomal RNA (rRNA) contamination rate below 10%.

plotRrnaMappingRate(object)

5'->3' bias

RNA-seq data can have specific biases at either the 5’ or 3’ end of sequenced fragments. It is common to see a small amount of bias, especially if polyA enrichment was performed, or if there is any sample degradation. If a large amount of bias is observed here, be sure to analyze the samples with a Bioanalyzer and check the RIN scores.

plot5Prime3PrimeBias(object)

Counts per gene

Generally, we expect similar count spreads for all genes between samples unless the library sizes or total RNA expression are different.

We recommend visualizing counts normalized with the Trimmed Mean of M-Values (TMM) method here [@Robinson2010-np]. TMM normalization equates the overall expression levels of genes between samples under the assumption that the majority of them are not differentially expressed. Therefore, by normalizing for total RNA expression by sample, we expect the spread of the TMM-normalized counts per gene to be similar for every sample.

plotCountsPerGene(object, geom = "boxplot")
plotCountsPerGene(object, geom = "density")

Counts per biotype

Different RNA-seq processing methods can preferentially capture a subset of the RNA species from the total RNA. For example, polyA selection should select for mostly coding genes and skip a large percentage of non-polyA non-coding RNA.

plotCountsPerBiotype(object)

Counts per broad biotype class

The Ensembl biotype clasifications are too specific to plot them all. Here we have grouped the biotypes into broad classes.

plotCountsPerBroadClass(object)

Variance stabilization

These plots show the standard deviation of normalized counts.

plotMeanSd(object)

Sample similarity analysis

Before performing similarity analysis, we transform counts to log2, which acts to minimize large differences in sequencing depth and helps normalize all samples to a similar dynamic range. For RNA-seq count data, variance increases with the mean. Logarithmic transformation of normalized count values with a small pseudocount will account for large variations seen between the highest expressing genes so that these genes won't dominate the PCA plots. However, due to the strong noise among low count values due to Poisson, the general log2 transformation will amplify this noise, and instead, low count genes will now dominate the PCA plots. So instead, we use a variance-stabilizing transformation that gives similar results for high counts as a log2 transformation but also shrinks the values of low counts towards the genes’ average across samples [@Love2014-sq].

Principal component analysis (PCA)

PCA [@Jolliffe2002-jz] is a multivariate technique that allows us to summarize the systematic patterns of variations in the data. PCA takes the expression levels for genes and transforms it in principal component space, reducing each sample into one point. Thereby, we can separate samples by expression variation, and identify potential sample outliers. The PCA plot is a way to look at how samples are clustering.

plotPca(object)

Hierarchical clustering

Inter-correlation analysis (ICA) is another way to look at how well samples cluster by plotting the correlation between the expression profiles of the samples.

plotCorrelationHeatmap(object)

Export results

files <- export(
    object = object,
    con = params[["output_dir"]],
    compress = FALSE
)

These normalized matrices are suitable for comparison across samples, and to generate plots.

Only use the raw counts to perform a new differential expression analysis. These are not suitable for plots.

Methods

RNA-seq counts were generated by [bcbio-nextgen][] and [bcbioRNASeq][] [@Steinbaugh2018-rc]. Counts were imported into [R][] using [tximport][] [@Soneson2016-tu] and [DESeq2][] [@Love2014-sq]. Gene annotations were obtained from [Ensembl][]. Plots were generated by [ggplot2][] [@Wickham2009-si]. Heatmaps were generated by [pheatmap][] [@Kolde2015-dy].





hbc/bcbioRNASeq documentation built on March 28, 2024, 3:01 p.m.