BiocStyle::markdown()
options(digits=3) suppressPackageStartupMessages({ library(csaw) library(edgeR) library(GenomicRanges) library(ChIPseeker) library(genefilter) })
Key references
Kharchenko et al. (2008).
ChIP-seq for differential binding
Novel statistical issues
Experimental design and execution
Single sample
Designed experiments
Sequencing & alignment
Peak calling
MACS: Model-based Analysis for ChIP-Seq, Liu et al., 2008
Down-stream analysis
'Known' ranges
de novo windows
de novo peak calling
Various strategies for calling peaks -- Lun & Smyth, Table 1
Relevant slides pdf
ID | Mode | Library | Operation |
---|---|---|---|
1 | Single-sample | Individual | Union |
2 | Single-sample | Individual | Intersection |
3 | Single-sample | Individual | At least 2 |
4 | Single-sample | Pooled over group | Union |
5 | Single-sample | Pooled over group | Intersection |
6 | Two-sample | Pooled over group | Union |
7 | Single-sample | Pooled over all | – |
How to choose? -- Lun & Smyth,
ID | Error rate | ||
---|---|---|---|
0.01 | 0.05 | 0.1 | |
RA | 0.010 (0.000) | 0.051 (0.001) | 0.100 (0.002) |
1 | 0.002 (0.000) | 0.019 (0.001) | 0.053 (0.001) |
2 | 0.003 (0.000) | 0.030 (0.000) | 0.073 (0.001) |
3 | 0.006 (0.000) | 0.042 (0.001) | 0.092 (0.001) |
4 | 0.033 (0.001) | 0.145 (0.001) | 0.261 (0.002) |
5 | 0.000 (0.000) | 0.001 (0.000) | 0.005 (0.000) |
6 | 0.088 (0.006) | 0.528 (0.013) | 0.893 (0.006) |
7 | 0.010 (0.000) | 0.049 (0.001) | 0.098 (0.001) |
## 100,000 t-tests under the null, n = 6 n <- 6; m <- matrix(rnorm(n * 100000), ncol=n) P <- genefilter::rowttests(m, factor(rep(1:2, each=3)))$p.value quantile(P, c(.001, .01, .05)) hist(P, breaks=20)
de novo hybrid strategies
r Biocpkg("csaw")
)This exercise is based on the r Biocpkg("csaw")
vignette, where more
detail can be found.
The experiment involves changes in binding profiles of the NFYA
protein between embryonic stem cells and terminal neurons. It is a
subset of the data provided by Tiwari et
al. 2012 available as
GSE25532. There
are two es (embryonic stem cell) and two tn (terminal neuron)
replicates. Single-end FASTQ files were extracted from GEO, aligned
using r Biocpkg("Rsubread")
, and post-processed (sorted and indexed)
using r Biocpkg("Rsamtools")
with the script available at
system.file(package="UseBioconductor", "scripts", "ChIPSeq", "NFYA", "preprocess.R")
Create a data frame summarizing the files used.
files <- local({ acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417", tn_2="SRR074418") data.frame(Treatment=sub("_.*", "", names(acc)), Replicate=sub(".*_", "", names(acc)), sra=sprintf("%s.sra", acc), fastq=sprintf("%s.fastq.gz", acc), bam=sprintf("%s.fastq.gz.subread.BAM", acc), row.names=acc, stringsAsFactors=FALSE) })
Change to the directory where the BAM files are located
setwd("~/UseBioconductor-data/ChIPSeq/NFYA")
Load the csaw library and count reads in overlapping windows. This
returns a SummarizedExperiment
, so explore it a bit...
library(csaw) library(GenomicRanges) frag.len <- 110 system.time({ data <- windowCounts(files$bam, width=10, ext=frag.len) }) # 156 seconds acc <- sub(".fastq.*", "", data$bam.files) colData(data) <- cbind(files[acc,], colData(data))
Filtering (vignette Chapter 3) Start by filtering low-count windows. There are likely to be many of these (how many?). Is there a rational way to choose the filtering threshold?
library(edgeR) # for aveLogCPM() keep <- aveLogCPM(assay(data)) >= -1 data <- data[keep,]
frag.len <- 110 fl <- system.file(package="UseBioconductor", "extdata", "csaw-data-filtered.Rds") data <- readRDS(fl)
Normalization (composition bias) (vignette Chapter 4) csaw uses
binned counts in normalization. The bins are large relative to the
ChIP peaks, on the assumption that the bins primarily represent
non-differentially bound regions. The sample bin counts are normalized
using the r Biocpkg("edgeR")
TMM (trimmed median of M values) method
seen in the RNASeq differential expression lab. Explore vignette
chapter 4 for more on normalization (this is a useful resource when
seeking to develop normalization methods for other protocols!).
system.time({ binned <- windowCounts(files$bam, bin=TRUE, width=10000) }) #139 second normfacs <- normalize(binned)
fl <- system.file(package="UseBioconductor", "extdata", "csaw-normfacs.Rds") normfacs <- readRDS(fl)
Experimental design and Differential binding (vignette Chapter 5)
Differential binding will be assessed using r Biocpkg("edgeR")
,
where we need to specify the experimental design
design <- model.matrix(~Treatment, colData(data))
Apply a standard r Biocpkg("edgeR")
work flow to identify
differentially bound regions. Creatively explore the results.
y <- asDGEList(data, norm.factors=normfacs) y <- estimateDisp(y, design) fit <- glmQLFit(y, design, robust=TRUE) results <- glmQLFTest(fit, contrast=c(0, 1)) head(results$table)
Multiple testing (vignette Chapter 6) The challenge is that FDR across all detected differentially bound regions is what one is interested in, but what is immediately available is the FDR across differentially bound windows; region will often consist of multiple overlapping windows. As a first step, we'll take a 'quick and dirty' approach to identifying regions by merging 'high-abundance' windows that are within, e.g., 1kb of one another
merged <- mergeWindows(rowRanges(data), tol=1000L)
Combine test results across windows within regions. Several strategies are explored in section 6.5 of the vignette.
tabcom <- combineTests(merged$id, results$table) head(tabcom)
Section 6.6 of the vignette discusses approaches to identifying the 'best' windows within regions.
Finally, create a GRangesList
that associated with two result tables
and the genomic ranges over which the results were calculated.
gr <- rowRanges(data) mcols(gr) <- as(results$table, "DataFrame") grl <- split(gr, merged$id) mcols(grl) <- as(tabcom, "DataFrame")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.