Data included in this package was generated by harmonizing all publicly
available bulk RNA-seq samples from human primary colorectal tissue available
through NCBI's Sequence Read Archive and
database of Genotypes and Phenotypes, NCI's
Genomic Data Commons, and the NIH-funded
BarcUVa-Seq project. FASTQ files were downloaded
directly or generated from BAM files using
biobambam2. Gene expression
estimates were generated from FASTQ files using
Salmon and aggregated from quant.sf
files using
tximport.
The original processed data are stored as tab-delimited text on
Synapse under Synapse ID
syn22237139 and
packaged as SummarizedExperiment
objects on ExperimentHub
to facilitate
reproduction and extension of results published in Dampier et al. (PMCID:
PMC7386360, PMID: 32764205).
The following example demonstrates how to download the dataset from
ExperimentHub
.
First, make sure the Bioconductor packages BiocManager
,
SummarizedExperiment
, and ExperimentHub
are installed:
if (!requireNamespace("BiocManager", quietly=TRUE)) { install.packages("BiocManager") } BiocManager::install(c("SummarizedExperiment", "ExperimentHub", "FieldEffectCrc"))
Second, load the SummarizedExperiment
and ExperimentHub
packages:
library(SummarizedExperiment) library(ExperimentHub)
Third, find the FieldEffectCrc
data objects and look at their descriptions:
hub = ExperimentHub() ## simply list the resource names ns <- listResources(hub, "FieldEffectCrc") ns ## query the hub for full metadata crcData <- query(hub, "FieldEffectCrc") crcData ## extract metadata df <- mcols(crcData) df ## see where the cache is stored hc <- hubCache(crcData) hc
Fourth, explore the metadata of a particular file. Let's look at the data for cohort A:
md <- hub["EH3524"] md
Fifth, download the data object itself as a SummarizedExperiment
:
## using resource ID data1 <- hub[["EH3524"]] data1 ## using loadResources() data2 <- loadResources(hub, "FieldEffectCrc", "cohort A") data2
Since the data is a SummarizedExperiment
, the different phenotypes can be
accessed using colData()
. The phenotype of interest in the field effect
analysis is referred to as sampType
:
summary(colData(data1)$sampType)
To immediately access the count data for the 834 samples in cohort A, load the
SummarizedExperiment
object for cohort A as demonstrated in the
Installation section immediately above. Then use the assays()
accessor function to extract the counts
assay, which is the second assay in
the SummarizedExperiment
:
se <- data1 counts1 <- assays(se)[["counts"]]
Note that assay()
, the singular as opposed to plural, will by default extract
the first assay element from a SummarizedExperiment
but will extract the 'i'th
assay element with the assay(se, i)
syntax:
counts2 <- assay(se, 2)
The two alternatives are equivalent:
all(counts1==counts2)
We are interested in conducting a differential expression analysis using
DESeq2
to identify genes over- and under-expressed in different colorectal
tissue phenotypes. Due to substantial technical artifacts between single-end and
paired-end library formats observed in the full FieldEffectCrc
data set,
samples are grouped according to library format, with paired-end samples in
cohorts A and B and single-end samples in cohort C. Cohort A is the primary
analysis cohort and the one we want to use to discover differentially expressed
genes. To begin, load cohort A as demonstrated in the
Installation section immediately above if not already loaded.
In the following steps, we will use DESeq2
to perform differential expression
analysis on expression data stored in the SummarizedExperiment
object for
cohort A. We will adjust for latent batch effects with surrogate variable
analysis as implemented in the sva
package. This analysis would typically
consume excess memory and time, so for the purpose of demonstration, we perform
the analysis on a small subset of the full data.
In order to take advantage of the feature lengths preserved by tximport
, we
will recreate a tximport
object from the SummarizedExperiment
object
downloaded from ExperimentHub
and use the DESeqDataSetFromTximport()
function to create the DESeqDataSet
object. Alternatively, the
SummarizedExperiment
object itself could be used to create the DESeqDataSet
object using the DESeqDataSet()
function, but the order of the assays would
need to be re-arranged and the counts would need to be rounded, as the
DESeqDataSet()
function expects counts
with integer values to be the first
assay in a SummarizedExperiment
object, and abundance
with floating-point
values is the first assay in the FieldEffectCrc
objects to facilitate
conversion to tximport
objects. Such a re-ordering is facilitated by the FieldEffectCrc::reorder_assays()
function. Yet another option is to extract
the counts
assay as a matrix and use the DESeqDataSetFromMatrix()
function
to create the DESeqDataSet
object. Since there are probably some benefits to
feature length normalization as implemented in DESeq2
when such information is
available, we demonstrate creation of the tximport
object.
First, make the tximport
object, which is just an object of class list
from
base R:
## manual construction txi <- as.list(assays(se)) txi$countsFromAbundance <- "no" ## convenience function construction txi <- make_txi(se)
Next, we could assign the clinical annotations from colData(se)
to an
S4Vectors::DataFrame
or a base::data.frame
object, since that is what
DESeqDataSetFromTximport()
expects for the colData
argument. However, the
colData
slot of a SummarizedExperiment
object already stores the clinical
annotations as an S4Vectors::DataFrame
, so we are ready to create the
DESeqDataSet
object:
if (!requireNamespace("DESeq2", quietly=TRUE)) { BiocManager::install("DESeq2") } library(DESeq2) dds <- DESeqDataSetFromTximport(txi, colData(se), ~ sampType)
Note that we are interested in expression differences between phenotypes, which
are coded in the sampType
field, so we include sampType
in our preliminary
design.
Normally, we would pre-filter genes to select genes whose quantities are likely to be estimated accurately. We could use something like the following code, which selects for genes with a count greater than 10 RNA molecules in at least a third of all samples:
countLimit <- 10 sampleLimit <- (1/3) * ncol(dds) keep <- rowSums( counts(dds) > countLimit ) >= sampleLimit dds <- dds[keep, ]
However, to expedite subsequent computational steps in this demonstration, we will select a random subset of genes and samples for analysis and then filter out any gene with a count less than 10 across all samples:
r <- sample(seq_len(nrow(dds)), 6000) c <- sample(seq_len(ncol(dds)), 250) dds <- dds[r, c] r <- rowSums(counts(dds)) >= 10 dds <- dds[r, ]
Amazingly, DESeq2
estimates size factors (i.e. normalization factors based on
library size and feature length) and dispersions for every gene, fits negative
binomial models for every gene, and performs a Wald test for every gene in a
single function:
dds <- DESeq(dds)
Pretty incredible. Unfortunately, we have to do some more work before we can fit
useful models. The size factor estimates are the key result of the DESeq()
function at this stage.
If we trusted our high-throughput assays to be perfect representations of the
underlying biology they were designed to measure, we would be ready to test for
differential expression. However, our dataset is composed of samples from
several studies collected and processed in various ways over several years, so
we have reason to suspect there are latent factors driving non-biological
variation in the expression data. Fortunately, we can use sva
to estimate the
latent factors and include them in our DESeq2
design.
For a full tutorial on how to use the sva
package, see the package
vignette.
For this demonstration, we will provide only cursory motivations for the code.
The sva
functions will take the normalized count data as well as model
matrices as inputs. The full model matrix should include the variables of
interest (e.g. sampType
) as well as adjustment variables (e.g. sex
, age
),
which we do not use here. The null model matrix should include adjustment
variables only (i.e. no variables of interest). Without any adjustment
variables, the null model includes an intercept term alone.
dat <- counts(dds, normalized=TRUE) ## extract the normalized counts mod <- model.matrix(~sampType, data=colData(dds)) ## set the full model mod0 <- model.matrix(~1, data=colData(dds)) ## set the null model
The sva
package offers two approaches for estimating the number of surrogate
variables that should be included in a differential expression model, the Buja
Eyuboglu and Leek approaches. Buja Eyuboglu is the default approach.
if (!requireNamespace("sva", quietly=TRUE)) { BiocManager::install("sva") } library(sva) nsv <- num.sv(dat, mod, method=c("be"), B=20, seed=1) ## Buja Eyuboglu method
The crucial function for estimating surrogate variables in RNA-seq data is
svaseq()
:
svs <- svaseq(dat, mod, mod0, n.sv=nsv)
The output is a list of 4 elements, the first of which is the latent factor
estimate with a vector for each factor. The name of this element is sv
.
We want to include the latent factors as adjustment variables in our statistical
testing. Here, we add the surrogate variables to colData
and update the
DESeqDataSet
design formula:
for (i in seq_len(svs$n.sv)) { newvar <- paste0("sv", i) colData(dds)[ , newvar] <- svs$sv[, i] } nvidx <- (ncol(colData(dds)) - i + 1):ncol(colData(dds)) newvars <- colnames(colData(dds))[nvidx] d <- formula( paste0("~", paste(paste(newvars, collapse="+"), "sampType", sep="+")) ) design(dds) <- d
Now that we have the surrogate variables in the DESeqDataSet
, we can perform
the differential expression analysis we set out to do:
dds <- DESeq(dds)
We want to extract results from each of the following 3 comparisons:
With three categories under consideration, the DESeq()
function call tests the
first level against the other two, but the second and third levels are not
tested against each other. (Note that the order of the levels is simply
alphabetical if they are not explicitly set.) Furthermore, the results reported
with the results()
function are for the comparison of the first level against
the last by default. Therefore, in order to extract all the results of interest,
we set contrasts as follows:
cons <- list() m <- combn(levels(colData(dds)$sampType), 2) for (i in seq_len(ncol(m))) { cons[[i]] <- c("sampType", rev(m[, i])) names(cons) <- c( names(cons)[seq_len(length(cons) - 1)], paste(rev(m[, i]), collapse="v") ) } res <- list() for (i in seq_len(length(cons))) { res[[i]] <- results(dds, contrast=cons[[i]], alpha=0.05) ## default alpha is 0.1 } names(res) <- names(cons)
Let's take a look at the results:
lapply(res, head) lapply(res, summary)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.