BiocStyle::markdown()
options(max.print=1000)
suppressPackageStartupMessages({
    library(genefilter)
    library(airway)
    library(DESeq2)
    library(GenomicAlignments)
    library(GenomicFeatures)
})

Work flow

1. Experimental design

Keep it simple

Replicate

Avoid confounding experimental factors with other factors

Record co-variates

Be aware of batch effects

HapMap samples from one facility, ordered by date of processing.

2. Wet-lab

Confounding factors

Artifacts of your particular protocols

3. Sequencing

Axes of variation

Application-specific, e.g.,

4. Alignment

Alignment strategies

Splice-aware aligners (and Bioconductor wrappers)

(5a. Bowtie2 / tophat / Cufflinks / Cuffdiff / etc)

5. Reduction to 'count tables'

6. Analysis

Unique statistical aspects

Summarization

Normalization

Appropriate error model

Pre-filtering

Borrowing information

7. Comprehension

Placing differentially expressed regions in context

Copy number / expression QC Correlation between genomic copy number and mRNA expression identified 38 mis-labeled samples in the TCGA ovarian cancer Affymetrix microarray dataset.

Experimental and statistical issues in depth

Normalization

r Biocpkg("DESeq2") estimateSizeFactors(), Anders and Huber, 2010

r Biocpkg("edgeR") calcNormFactors() TMM method of Robinson and Oshlack, 2010

Dispersion

r Biocpkg("DESeq2") estimateDispersions()

r Biocpkg("edgeR") estimateDisp()

Analysis of designed experiments in R

Example: t-test

t.test()

head(sleep)
plot(extra ~ group, data = sleep)
## Traditional interface
with(sleep, t.test(extra[group == 1], extra[group == 2]))
## Formula interface
t.test(extra ~ group, sleep)
## equal variance between groups
t.test(extra ~ group, sleep, var.equal=TRUE)

lm() and anova()

## linear model; compare to t.test(var.equal=TRUE)
fit <- lm(extra ~ group, sleep)
anova(fit)
## underlying model, used in `lm.fit()`
model.matrix(extra ~ group, sleep)     # last column indicates group effect
model.matrix(extra ~ 0 + group, sleep) # contrast between columns
fit0 <- lm(extra ~ ID, sleep)
fit1 <- lm(extra ~ ID + group, sleep)
anova(fit0, fit1)
t.test(extra ~ group, sleep, var.equal=TRUE, paired=TRUE)

genefilter::rowttests()

Limitations

Consequences

Common experimental designs

Practical: RNA-Seq gene-level differential expression

Adapted from Love, Anders, and Huber's Bioconductor work flow

Michael Love [1], Simon Anders [2], Wolfgang Huber [2]

[1] Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, US;

[2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.

1. Experimental design

The data used in this workflow is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is:

Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.

2, 3, and 4: Wet lab, sequencing, and alignment

5. Reduction

We use the r Biocexptpkg("airway") package to illustrate reduction. The package provides sample information, a subset of eight BAM files, and the known gene models required to count the reads.

library(airway)
path <- system.file(package="airway", "extdata")
dir(path)

Setup

The ingredients for counting include are:

a. Metadata describing samples. Read this using read.csv().

csvfile <- dir(path, "sample_table.csv", full=TRUE)
sampleTable <- read.csv(csvfile, row.names=1)
head(sampleTable)

b. BAM files containing aligned reads. Create an object that references these files. What does the yieldSize argument mean?

library(Rsamtools)
filenames <- dir(path, ".bam$", full=TRUE)
bamfiles <- BamFileList(filenames, yieldSize=1000000)
names(bamfiles) <- sub("_subset.bam", "", basename(filenames))

c. Known gene models. These might come from an existing TxDb package, or created from biomart or UCSC, or from a GTF file. We'll take the hard road, making a TxDb object from the GTF file used to align reads and using the TxDb to get all exons, grouped by gene.

library(GenomicFeatures)
gtffile <- file.path(path, "Homo_sapiens.GRCh37.75_subset.gtf")
txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character())
genes <- exonsBy(txdb, by="gene")

Counting

After these preparations, the actual counting is easy. The function summarizeOverlaps() from the r Biocpkg("GenomicAlignments") package will do this. This produces a SummarizedExperiment object, which contains a variety of information about an experiment

library(GenomicAlignments)
se <- summarizeOverlaps(features=genes, reads=bamfiles,
                        mode="Union",
                        singleEnd=FALSE,
                        ignore.strand=TRUE,
                        fragments=TRUE)
colData(se) <- as(sampleTable, "DataFrame")
se
colData(se)
rowData(se)
head(assay(se))

6. Analysis using r Biocpkg("DESeq2")

The previous section illustrates the reduction step on a subset of the data; here's the full data set

data(airway)
se <- airway

This object contains an informative colData slot -- prepared as described in the r Biocexptpkg("airway") vignette. In particular, the colData() include columns describing the cell line cell and treatment dex for each sample

colData(se)

r Biocpkg("DESeq2") makes the analysis particularly easy, simply add the experimental design, run the pipeline, and extract the results

library(DESeq2)
dds <- DESeqDataSet(se, design = ~ cell + dex)
dds <- DESeq(dds)
res <- results(dds)

Simple visualizations / sanity checks include

Many additional diagnostic approaches are described in the DESeq2 (and edgeR) vignettes, and in the RNA-seq gene differential expression work flow.

7. Comprehension

see Part E, Gene Set Enrichment



Bioconductor/UseBioconductor documentation built on May 6, 2019, 7:52 a.m.