options(width=100) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({ library(ENAR2016) library(ggplot2) library(airway) library(DESeq2) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(BSgenome.Hsapiens.UCSC.hg19) library(AnnotationHub) })
Functions
rnorm(10) # sample 10 random normal deviates
rnorm(10)
the same as rnorm(n=10)
rnorm(n=10, mean=0, sd=1)
?rnorm
Vectors
c(TRUE, FALSE)
), integer (1:5
), numeric (rnorm(10)
),
character (c("alpha", "beat")
), complexNA
), factors
(factor(c("Female", "Male"))
), formulas (y ~ x
)x <- rnorm(1000) y <- x + rnorm(1000) plot(y ~ x)
Objects: classes and methods
data.frame()
, matrix()
lm()
df <- data.frame(Y = y, X = x) fit <- lm(Y ~ X, df) plot(Y ~ X, df) abline(fit, col="red", lwd=2) anova(fit)
Packages
stats
, graphics
, Matrix
ggplot2
, dplyr
, ade
install.packages()
(or biocLite()
, below)library(ggplot2) ggplot(df, aes(x=X, y=Y)) + geom_point() + geom_smooth(color="blue") + geom_smooth(method="lm", color="red")
Help!
help.start()
?rnorm
class(fit)
; methods(plot)
; methods(class=class(fit))
?"plot<tab>...
Biological domains
Principles
Interoperability
DNAStringSet()
, rather
than character vector.Reproducibility
Installation and use
biocLite("DESeq2")
; also works for CRAN & github
packagesSix stages
Key packages: data access
Coordinated management: SummarizedExperiment
This is derived from: RNA-Seq workflow: gene-level exploratory analysis and differential expression, by Michael Love, Simon Anders, Wolfgang Huber; modified by Martin Morgan, October 2015.
We walk through an end-to-end RNA-Seq differential expression workflow, using DESeq2 along with other Bioconductor packages.
The complete work flow starts from the FASTQ files, but we will start after reads have been aligned to a reference genome and reads overlapping known genes have been counted.
Our main focus is on differential gene expression analysis with DESeq2. Other Bioconductor packages are important in statistical inference of differential expression at the gene level, including Rsubread, edgeR, limma, BaySeq, and others.
The data are from an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways.
In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is:
Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.
Counts
summarizeOverlaps()
, etc.)SummarizedExperiment
library(airway) data("airway") se <- airway
Assay data
Row-wise tests
r
head(assay(se))
Experimental design
cell
) plus treatment (dex
)R's formula interface allows for complicated designs, contrasts, etc
r
colData(se)
design = ~ cell + dex
se$dex <- relevel(se$dex, "untrt") # 'untrt' as reference level
Features (genes)
Genomic coordinates of gene models
r
rowRanges(se)
Create DESeqDataSet
object
library(DESeq2) dds <- DESeqDataSet(se, design = ~ cell + dex)
Run the pipeline
dds <- DESeq(dds)
Summarize results
res <- results(dds) head(res) mcols(res) # what does each column mean? head(res[order(res$padj),]) # 'top table' by p-adj
Keep it simple
Replicate
Avoid confounding experimental factors with other factors
Be aware of batch effects
HapMap samples from one facility, ordered by date of processing.
Confounding factors
Artifacts of your particular protocols
Axes of variation
Application-specific, e.g.,
Alignment
Splice-aware aligners (and Bioconductor wrappers)
Reduction to 'count tables'
GenomicAlignments::summarizeOverlaps()
Rsubread::featureCount()
Integration with gene-level analyses -- Soneson et al.
```r
library(tximportData) dir <- system.file("extdata", package="tximportData") samples <- read.table(file.path(dir,"samples.txt"), header=TRUE) files <- file.path(dir,"salmon", samples$run, "quant.sf") names(files) <- paste0("sample",1:6)
tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))
txi <- tximport(files, type="salmon", tx2gene=tx2gene) ```
tx2gene <- read.csv(file.path(dir, "tx2gene.csv"))
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
Unique statistical aspects
Summarization
Counts per se, rather than a summary (RPKM, FPKM, ...), are relevant for analysis
Normalization
Detail
Appropriate error model
Pre-filtering
Borrowing information
'Annotation' packages
'org': map between gene identifiers; also GO.db, KEGGREST
r
library(org.Hs.eg.db)
ttbl <- head(res[order(res$padj),]) # 'top table' by p-adj
(ensid <- rownames(ttbl))
mapIds(org.Hs.eg.db, ensid, "SYMBOL", "ENSEMBL")
select(org.Hs.eg.db, ensid, c("SYMBOL", "GENENAME"), "ENSEMBL")
columns(org.Hs.eg.db)
- 'TxDb'
```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
egid <- mapIds(org.Hs.eg.db, ensid, "ENTREZID", "ENSEMBL") transcriptsBy(txdb, "gene")[egid] ``` - 'BSgenome': BSgenome.Hsapiens.UCSC.hg19
r
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
gn <- genes(txdb)[egid]
up1000 <- flank(gn, width=1000)
getSeq(genome, up1000)
r
library(AnnotationHub)
hub = AnnotationHub()
query(hub, c("ensembl", "gtf", "release-83"))
hub["AH50417"]
hub[["AH50417"]]
Genomic Ranges: GenomicRanges
A ton of functionality, especially findOverlaps()
r
gr <- GRanges("chr1",
IRanges(c(1000, 2000, 3000), width=100),
strand=c("+", "-", "*"),
score=c(10, 20, 30))
gr
Data / metadata integration: SummarizedExperiment
Very easy to coordinate data manipulation -- avoid those embarassing 'off-by-one' and other errors!
r
library(airway)
data(airway)
airway
Efficient R programming
*apply()
are forms of iteration, not vectorizationR works best when memory is pre-allocated
Worst: no pre-allocation, no vectorization; scales quadratically with number of elements
r
result <- integer()
for (i in 1:10)
result[i] = sqrt(i)
Better: pre-allocate and fill; scales linearly with number of elements, but at the interpretted level
```r result <- integer(10) for (i in 1:10) result[[i]] = sqrt(i)
result <- sapply(1:10, sqrt) ```
Best: vectorize; scales linearly, but at the compiled level so 100x faster
r
result <- sqrt(1:10)
Restriction and 'chunk' iteration
ScanBamParam()
, VcfParam()
BamFile(..., yieldSize=1000000)
See also GenomicFiles::reduceByYield()
r
library(GenomicAlignments)
file <- system.file("extdata", "ex1.bam", package="Rsamtools")
aln <- readGAlignments(file) # all data
length(aln)
bamfile <- open(BamFile(file, yieldSize=1500)) # chunks
length(readGAlignments(file)) # better: yieldSize=1e6
length(readGAlignments(file))
length(readGAlignments(file))
length(readGAlignments(file))
Parallel evaluation: BiocParallel
lapply
-likeDifferent 'back-ends', e.g., cores on a computer; computers in a cluster; instances in a cloud
r
library(BiocParallel)
system.time({
res <- lapply(1:8, function(i) { Sys.sleep(1); i })
})
system.time({
res <- bplapply(1:8, function(i) { Sys.sleep(1); i })
})
In Bioc-devel
r
X <- list(1, "2", 3)
res <- bptry(bplapply(X, sqrt))
X <- list(1, 2, 3)
res <- bplapply(X, sqrt, BPREDO=res)
Scripts
Vignettes
Light weight / accessible: R-flavored markdown
--- title: "A title" author: "An Author" vignette: > % \VignetteEngine{knitr::rmarkdown} --- # Heading ## Sub-heading Some text. ```r x <- rnorm(1000) plot(x) ```
Check out the knitr package!
Packages
Standard on-disk representation
/MyPackage /MyPackage/DESCRIPTION /MyPackage/NAMESPACE /MyPackage/R/fun_one.R /MyPackage/R/fun_another.R /MyPackage/vignettes
Very easy to create
Individuals
Annual conference: https://bioconductor.org/BioC2016
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.