ggplot2::theme_set(AcidPlots::acid_theme_light())
knitr::opts_chunk[["set"]](message = FALSE)

Introduction

For a high-level overview of our bcbio RNA-seq analysis pipeline, including detailed explanation of the bcbioRNASeq S4 class definition, first consult our workflow paper published in F1000 Research [@Steinbaugh2018-rc]. This vignette is focused on more advanced usage and edge cases that a user may encounter when attempting to load a bcbio dataset and perform downstream quality control analysis.

Note: if you use bcbioRNASeq in published research, please include this citation:

citation("bcbioRNASeq")
## nolint start
suppressPackageStartupMessages({
    library(basejump)
    library(bcbioRNASeq)
})
## nolint end

Loading bcbio data

The bcbioRNASeq constructor function is the main interface connecting bcbio] output data to interactive use in R. It is highly customizable and supports a number of options for advanced use cases. Consult the documentation for additional details.

help(topic = "bcbioRNASeq", package = "bcbioRNASeq")

Upload directory

We have designed the constructor to work as simply as possible by default. The only required argument is uploadDir, the path to the bcbio final upload directory specified with upload: in the YAML configuration. Refer to the bcbio configuration documentation for detailed information on how to set up a bcbio run, which is outside the scope of this vignette.

For example, let's load up the example bcbio dataset stored internally in the package.

uploadDir <- system.file("extdata", "bcbio", package = "bcbioRNASeq")

bcbio outputs RNA-seq data in a standardized directory structure, which is described in detail in our workflow paper.

sort(list.files(path = uploadDir, full.names = FALSE, recursive = TRUE))

Counts level

By default, bcbioRNASeq imports counts at gene level, which are required for standard differential expression analysis (level = "genes"). For pseudo-aligned counts (e.g. Salmon, Kallisto, Sailfish) [@Bray2016-me; @Patro2014-qz; @Patro2017-pi], tximport [@Soneson2016-tu] is used internally to aggregate transcript-level counts to gene-level counts, and generates length-scaled transcripts per million (TPM) values. For aligned counts processed with featureCounts [@Liao2014-js] (e.g. STAR, HISAT2) [@Dobin2013-hr; @Dobin2016-qj; @Kim2015-rr], these values are already returned at gene level, and therefore not handled by tximport. Once the gene-level counts are imported during the bcbioRNASeq call, the DESeq2 package [@Love2014-sq] is then used to generate an internal DESeqDataSet from which we derive normalized and variance-stabilized counts.

object <- bcbioRNASeq(uploadDir = uploadDir, level = "genes")
print(object)

Alternatively, if you want to perform transcript-aware analysis, such as differential exon usage or splicing analysis, transcript-level counts can be obtained using level = "transcripts". Note that when counts are loaded at transcript level, TPMs are generated with tximport internally, but no additional normalizations or transformations normally calculated for gene-level counts with DESeq2 are generated.

object <- bcbioRNASeq(uploadDir = uploadDir, level = "transcripts", fast = TRUE)
print(object)

Expression callers

Since bcbio is flexible and supports a number of expression callers, we have provided advanced options in the bcbioRNASeq constructor to support a variety of workflows using the caller argument. Salmon, Kallisto, and Sailfish counts are supported at either gene or transcript level. Internally, these are loaded using tximport. STAR and HISAT2 aligned counts processed with featureCounts are also supported, but only at gene level.

object <- bcbioRNASeq(uploadDir = uploadDir, caller = "star", fast = TRUE)
print(object)

Sample selection and metadata

If you'd like to load up only a subset of samples, this can be done easily using the samples argument. Note that the character vector declared here must match the description column specified in the sample metadata. Conversely, if you're working with a large dataset and you simply want to drop a few samples, this can be accomplished with the censorSamples argument.

When working with a bcbio run that has incorrect or outdated metadata, the simplest way to fix this issue is to pass in new metadata from an external spreadsheet (CSV or Excel) using the sampleMetadataFile argument. Note that this can also be used to subset the bcbio dataset, similar to the samples argument (see above), based on the rows that are included in the spreadsheet.

sampleMetadataFile <- file.path(uploadDir, "sample_metadata.csv")
stopifnot(file.exists(sampleMetadataFile))
import(sampleMetadataFile)
object <- bcbioRNASeq(
    uploadDir = uploadDir,
    sampleMetadataFile = sampleMetadataFile,
    fast = TRUE
)
sampleData(object)

Genome annotations

When analyzing a dataset against a well-annotated genome, we recommend importing the corresponding metadata using AnnotationHub and ensembldb. This functionality is natively supported in the bcbioRNASeq constructor with using the organism, ensemblRelease, and genomeBuild arguments. For example, with our internal bcbio dataset, we're analyzing counts generated against the Ensembl Mus musculus GRCm38 genome build (release 87). These parameters can be defined in the object load call to ensure that the annotations match up exactly with the genome used.

object <- bcbioRNASeq(
    uploadDir = uploadDir,
    level = "genes",
    organism = "Mus musculus",
    genomeBuild = "GRCm38",
    ensemblRelease = 87L
)

This will return a GRanges object using the GenomicRanges package [@Lawrence2013-vz], which contains coordinates and rich metadata for each gene or transcript. These annotations are accessible with the rowRanges and rowData functions defined in the SummarizedExperiment package [@Huber2015-nw].

summary(rowRanges(object))
summary(rowData(object))

Alternatively, transcript-level annotations can also be obtained automatically using this method.

When working with a dataset generated against a poorly-annotated or non-standard genome, we provide a fallback method for loading gene annotations from a general feature format (GFF) file with the gffFile argument. If possible, we recommend providing a general transfer format (GTF) file, which is identical to GFF version 2. GFFv3 is more complicated and non-standard, but Ensembl GFFv3 files are also supported.

If your dataset contains transgenes (e.g. EGFP, TDTOMATO), these features can be defined with the transgeneNames argument, which will automatically populate the rowRanges slot with placeholder metadata.

We recommend loading up data per genome in its own bcbioRNASeq object when possible, so that rich metadata can be imported easily. In the edge case where you need to look at multiple genomes simultaneously, set organism = NULL, and bcbioRNASeq will skip the gene annotation acquisition step.

Refer to the the GenomicRanges and SummarizedExperiment package documentation for more details on working with the genome annotations defined in the rowRanges slot of the object. Here are some useful examples:

seqnames(object)
ranges(object)
strand(object)

Variance stabilization

During the bcbioRNASeq constructor call, log2 variance stabilizaton of gene-level counts can be calculated automatically, and is recommended. This is performed internally by the DESeq2 package, using the varianceStabilizingTransformation and/or rlog functions. These transformations will be slotted into assays.

summary(counts(object, normalized = "vst"))

Use case dataset

To demonstrate the functionality and configuration of the package, we have taken an experiment from the Gene Expression Omnibus (GEO) public repository of expression data to use as an example use case. The RNA-seq data is from a study of acute kidney injury in a mouse model (GSE65267) [@Craciun2016-ha]. The study aims to identify differentially expressed genes in progressive kidney fibrosis and contains samples from mouse kidneys at several time points (n = 3, per time point) after folic acid treatment. From this dataset, we are using a subset of the samples for our use case: before folic acid treatment, and 1, 3, 7 days after treatment.

For the vignette, we are loading a pre-computed version of the example bcbioRNASeq object used in our workflow paper.

object <- import(
    con = cacheUrl(
        pasteUrl(
            "github.com",
            "hbc",
            "bcbioRNASeq",
            "raw",
            "f1000v2",
            "data",
            "bcb.rda",
            protocol = "https"
        ),
        pkg = "bcbioRNASeq"
    )
)
object <- updateObject(object)
print(object)

Sample metadata

For reference, let's take a look at the sample metadata. By comparison, using colData will return all sample-level metadata, including our quality control metrics generated by bcbio. We recommend instead using sampleData with the clean = TRUE argument in reports, which only returns factor columns of interest.

sampleData(object)

Interesting groups

Groups of interest to be used for coloring and sample grouping in the quality control plots can be defined in the bcbioRNASeq object using the interestingGroups argument in the bcbioRNASeq constructor call. Assignment method support with the interestingGroups function is also provided, which can modify the groups of interest after the object has been created.

The interestingGroups definition defaults to sampleName, which is automatically generated from the bcbio description metadata for demultiplexed bulk RNA-seq samples. In this case, the samples will be grouped and colored uniquely.

Interesting groups must be defined using a character vector and refer to column names defined in the colData slot of the object. Note that the bcbioRNASeq package uses lower camel case formatting for column names (e.g. "sampleName"), so the interesting groups should be defined using camel case, not snake (e.g. "sample_name") or dotted case (e.g. "sample.name").

This approach was inspired by the DESeq2 package, which uses the argument intgroup in some functions, such as plotPca for labeling groups of interest. We took this idea and ran with it for a number of our quality control functions, which are described in detail below.

interestingGroups(object) <- c("treatment", "day")

Quality control

Our workflow paper describes the quality control process in detail. In addition, our Quality Control R Markdown template contains notes detailing each metric. Here were are going into more technical detail regarding how to customize the appearance of the plots. Note that all quality control plotting functions inherit the interestingGroups defined inside the bcbioRNASeq object. You can also change this dynamially for each function call using the interestingGroups argument.

Read counts

plotTotalReads(object)
plotMappedReads(object)

Mapping rates

Note that the overall mapping rate is relatively low per sample, while the exonic mapping rate is acceptable. High quality samples should have low intronic mapping rate, and high values are indicative of sample degradation and/or contamination.

plotMappingRate(object)
plotExonicMappingRate(object)
plotIntronicMappingRate(object)

rRNA mapping rate

Note that the samples here have a high rRNA mapping rate. This can be indicative of the polyA enrichment or ribo depletion protocol not having removed all ribosomal RNA (rRNA) transcripts. This will reduce the number of biologically meaningful reads in the experiment and is best avoided.

plotRrnaMappingRate(object)

5'->3' bias

RNA-seq data can have specific biases at either the 5’ or 3’ end of sequenced fragments. It is common to see a small amount of bias, especially if polyA enrichment was performed, or if there is any sample degradation. If a large amount of bias is observed here, be sure to analyze the samples with a Bioanalyzer and check the RIN scores.

plot5Prime3PrimeBias(object)

Feature (gene) distributions

plotFeaturesDetected(object)
plotFeatureSaturation(object)
plotCountsPerFeature(object, geom = "boxplot")
plotCountsPerFeature(object, geom = "density")

Sample similarity

We can visualize sample similarity with principal component analysis (PCA) and hierarchical clustering of sample correlations. These functions support multiple normalization methods with the normalized argument.

plotPca(object)
plotCorrelationHeatmap(object)

Export

The counts accessor function returns raw counts by default. Multiple normalization methods are accessible with the normalized argument.

rawCounts <- counts(object, normalized = FALSE)
normalizedCounts <- counts(object, normalized = TRUE)
tpm <- counts(object, normalized = "tpm")
log2VstCounts <- counts(object, normalized = "vst")

We are providing bcbioRNASeq method support for the export function, which automatically exports matrices defined in assays and corresponding metadata (e.g. rowData, colData) to disk in a single call.

files <- export(object, con = tempdir())
print(files)

Differential expression

bcbioRNASeq integrates with DESeq2 and edgeR for differential expression.

DESeq2

To prepare our dataset for analysis, we need to coerce the bcbioRNASeq object to a DESeqDataSet. We have defined an S4 coercion method that uses the as function in the package.

dds <- as(object, "DESeqDataSet")
print(dds)

Since both bcbioRNASeq and DESeqDataSet classes extend RangedSummarizedExperiment, internally we coerce the original bcbioRNASeq object to a RangedSummarizedExperiment and then use the DESeqDataSet constructor, which requires a SummarizedExperiment with integer counts.

The source code for our bcbioRNASeq to DESeqDataSet coercion is accessible with getMethod.

getMethod(
    f = "coerce",
    signature = signature(from = "bcbioRNASeq", to = "DESeqDataSet")
)

We also provide a parameterized starter R Markdown template for standard DESeq2 differential expression, supporting analysis and visualization of multiple contrasts inside a single report. In our workflow paper, we also describe an example LRT analysis.

Now you can perform differential expression analysis with DESeq2. The authors of that package have provided a number of detailed, well documented references online:

edgeR

To prepare our dataset for analysis, we need to coerce the bcbioRNASeq object to a DGEList. Similar to our DESeq2 approach, we have defined an S4 coercion method that hands off to edgeR.

dge <- as(object, "DGEList")
summary(dge)

The source code for our bcbioRNASeq to DGEList coercion is accessible with getMethod.

getMethod(
    f = "coerce",
    signature = signature(from = "bcbioRNASeq", to = "DGEList")
)

Functional analysis

We provide a starter R Markdown template for gene set enrichment analysis (GSEA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis [@Kanehisa2000-hk; @Subramanian2005-zk], which leverages the functionality of the clusterProfiler package [@Yu2012-ae]. This workflow is described in detail in our workflow paper, and is outside the scope of this advanced use vignette. For more information on functional analysis, consult the clusterProfiler vignette or Stephen Turner's excellent DESeq2 to fgsea workflow.

R session information

utils::sessionInfo()

References

The papers and software cited in our workflows are available as a shared library on Paperpile.



hbc/bcbioRNASeq documentation built on March 28, 2024, 3:01 p.m.