BiocStyle::markdown() options(max.print=1000) suppressPackageStartupMessages({ library(genefilter) library(airway) library(DESeq2) library(GenomicAlignments) library(GenomicFeatures) })
Keep it simple
Replicate
r Biocpkg("RNASeqPower")
Avoid confounding experimental factors with other factors
Record co-variates
Be aware of batch effects
Known
Unknown
r Biocpkg("sva")
.Surrogate variable analysis
r Biocpkg("sva")
Bioconductor package HapMap samples from one facility, ordered by date of processing.
Confounding factors
Artifacts of your particular protocols
Axes of variation
Application-specific, e.g.,
Alignment strategies
Splice-aware aligners (and Bioconductor wrappers)
r Biocpkg("Rbowtie")
)r Biocpkg("Rsubread")
)GenomicAlignments::summarizeOverlaps()
Rsubread::featureCount()
Unique statistical aspects
Summarization
Normalization
Appropriate error model
Pre-filtering
r Biocpkg("edgeR")
vignette; work flow
today.Borrowing information
Placing differentially expressed regions in context
Correlation between genomic copy number and mRNA expression identified 38 mis-labeled samples in the TCGA ovarian cancer Affymetrix microarray dataset.
r Biocpkg("DESeq2")
estimateSizeFactors()
, Anders and Huber,
2010
r Biocpkg("edgeR")
calcNormFactors()
TMM method of Robinson and
Oshlack, 2010
r Biocpkg("DESeq2")
estimateDispersions()
r Biocpkg("edgeR")
estimateDisp()
t.test()
x
: vector of univariate measurementsy
: factor
describing experimental designvar.equal=TRUE
: appropriate for relatively small experiments where
no additional information available?formula
: alternative representation, y ~ x
.head(sleep) plot(extra ~ group, data = sleep) ## Traditional interface with(sleep, t.test(extra[group == 1], extra[group == 2])) ## Formula interface t.test(extra ~ group, sleep) ## equal variance between groups t.test(extra ~ group, sleep, var.equal=TRUE)
lm()
and anova()
lm()
: fit linear model.anova()
: statisitcal evaluation.## linear model; compare to t.test(var.equal=TRUE) fit <- lm(extra ~ group, sleep) anova(fit)
formula
: translated into model matrix, used in
lm.fit()
.## underlying model, used in `lm.fit()` model.matrix(extra ~ group, sleep) # last column indicates group effect model.matrix(extra ~ 0 + group, sleep) # contrast between columns
fit0 <- lm(extra ~ ID, sleep) fit1 <- lm(extra ~ ID + group, sleep) anova(fit0, fit1) t.test(extra ~ group, sleep, var.equal=TRUE, paired=TRUE)
genefilter::rowttests()
x
: matrix of expression valuesfeatures x samples (reverse of how a 'statistician' would represent the data -- samples x features)
fac
: factor of one or two levels describing experimental design
Limitations
Consequences
count ~ factor
. Alternative: count ~ 0 + factor
and
contrastscount ~ covariate + factor
count ~ factor
or count ~ 0 + factor
count ~ factor1 + factor2
; main
effects and interactions, count ~ factor1 * factor2
. Contrasts to
ask specific questionsr Biocpkg("limma")
approach:
duplicateCorrelation()
Adapted from Love, Anders, and Huber's Bioconductor work flow
Michael Love [1], Simon Anders [2], Wolfgang Huber [2]
[1] Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, US;
[2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
The data used in this workflow is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is:
Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.
Paired-end sequencing leading to FASTQ files of reads and their quality scores.
Reads aligned to a reference genome or transcriptome, resulting in BAM files. Reads for this experiment were aligned to the Ensembl release 75 human reference genome using the STAR aligner
We use the r Biocexptpkg("airway")
package to illustrate
reduction. The package provides sample information, a subset of eight
BAM files, and the known gene models required to count the reads.
library(airway) path <- system.file(package="airway", "extdata") dir(path)
The ingredients for counting include are:
a. Metadata describing samples. Read this using read.csv()
.
csvfile <- dir(path, "sample_table.csv", full=TRUE) sampleTable <- read.csv(csvfile, row.names=1) head(sampleTable)
b. BAM files containing aligned reads. Create an object that
references these files. What does the yieldSize
argument mean?
library(Rsamtools) filenames <- dir(path, ".bam$", full=TRUE) bamfiles <- BamFileList(filenames, yieldSize=1000000) names(bamfiles) <- sub("_subset.bam", "", basename(filenames))
c. Known gene models. These might come from an existing TxDb
package, or created from biomart or UCSC, or from a
GTF file. We'll
take the hard road, making a TxDb object from the GTF file used to
align reads and using the TxDb to get all exons, grouped by gene.
library(GenomicFeatures) gtffile <- file.path(path, "Homo_sapiens.GRCh37.75_subset.gtf") txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character()) genes <- exonsBy(txdb, by="gene")
After these preparations, the actual counting is easy. The function
summarizeOverlaps()
from the r Biocpkg("GenomicAlignments")
package will do this. This produces a SummarizedExperiment
object,
which contains a variety of information about an experiment
library(GenomicAlignments) se <- summarizeOverlaps(features=genes, reads=bamfiles, mode="Union", singleEnd=FALSE, ignore.strand=TRUE, fragments=TRUE) colData(se) <- as(sampleTable, "DataFrame") se colData(se) rowData(se) head(assay(se))
r Biocpkg("DESeq2")
The previous section illustrates the reduction step on a subset of the data; here's the full data set
data(airway) se <- airway
This object contains an informative colData
slot -- prepared as
described in the r Biocexptpkg("airway")
vignette. In particular,
the colData()
include columns describing the cell line cell
and
treatment dex
for each sample
colData(se)
r Biocpkg("DESeq2")
makes the analysis particularly easy, simply add
the experimental design, run the pipeline, and extract the results
library(DESeq2) dds <- DESeqDataSet(se, design = ~ cell + dex) dds <- DESeq(dds) res <- results(dds)
Simple visualizations / sanity checks include
Look at counts of strongly differentiated genes, to get a sense of how counts translate to the summary statistics reported in the result table
r
topGene <- rownames(res)[which.min(res$padj)]
res[topGene,]
plotCounts(dds, gene=topGene, intgroup=c("dex"))
An 'MA' plot shows for each gene the between-group log-fold-change
versus average log count; it should be funnel-shaped and
approximately symmetric around y=0
, with lots of between-treatment
variation for genes with low counts.
r
plotMA(res, ylim=c(-5,5))
Plot the distribution of (unadjusted) P values, which should be uniform (under the null) but with a peak at small P value (true positives, hopefully!)
r
hist(res$pvalue, breaks=50)
Look at a 'volcano plot' of adjusted P-value versus log fold change, to get a sense of the fraction of up- versus down-regulated genes
r
plot(-log10(padj) ~ log2FoldChange, as.data.frame(res), pch=20)
Many additional diagnostic approaches are described in the DESeq2 (and edgeR) vignettes, and in the RNA-seq gene differential expression work flow.
see Part E, Gene Set Enrichment
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.