
library(knitr)
knitr::opts_chunk$set(out.width="100%", cache=TRUE)
knitr::opts_knit$set(root.dir="../inst/extdata")





Workshop participation


Instructions on how to run and follow along during the workshop.

Workshop participation -- local R installation

In order to run the workshop, make sure the following packages are installed from Bioconductor. Here is a list of packages that we will be using:

library(DiffBind)
library(GreyListChIP)
library(csaw)
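If any of these are missing, they can be installed from Bioconductor; a minimal sketch using BiocManager:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
# Install the workshop dependencies from Bioconductor
BiocManager::install(c("DiffBind", "GreyListChIP", "csaw"))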

To run the workshop package, you can download it from here:

https://content.cruk.cam.ac.uk/bioinformatics/software/DiffBind/QuantitativeChIPseqWorkshop_0.1.4.tar.gz

NOTE: As this package contains all the workshop data, it is about 500MB and may take some time to download and install.

Install this package into your local R instance.
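One way to do this, assuming the downloaded file is in your current working directory, is to install the tarball from source (the dependencies above must already be installed):

# Install the workshop package from the downloaded source tarball
install.packages("QuantitativeChIPseqWorkshop_0.1.4.tar.gz",
                 repos=NULL, type="source")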

How to follow the workshop

First load the workshop package:

library(QuantitativeChIPseqWorkshop)

Then set the working directory to where the package is installed:

setwd(system.file(package="QuantitativeChIPseqWorkshop"))

There are a number of ways to follow the workshop markdown document. You can either view an HTML version of the document that includes all the results, or step through the markdown source and build the results as you go.

Option: Open the pre-built HTML in an RStudio "Help" window

You can open the fully completed workbook (with all code and results) in the "Help" pane in RStudio as follows:

vignette("Quantitative-ChIPseq-Workshop","QuantitativeChIPseqWorkshop")

Option: Open the pre-built HTML in a separate browser window or tab

You can open the fully built workshop (with all code and results) in a browser tab as follows:

rstudioapi::viewer("doc/Quantitative-ChIPseq-Workshop.html")

It may be helpful to move the tab into its own window so you can see the vignette and the RStudio window at the same time.

Option: Open the workshop mark-down file in an edit window

You can open the workshop "source" .Rmd file in the RStudio "Source" pane, and follow along, executing each code chunk in sequence, to build it as you go.

file.edit("doc/Quantitative-ChIPseq-Workshop.Rmd")

Set directory to where the data are

Now set the directory to where the data is located:

setwd("extdata")

You can check that the data are there:

list.files()

You should see a sample sheet ("tamoxifen.csv") and directories containing peaks and reads.

Now we're ready to go!





Loading an experiment into DiffBind


The first step in a quantitative analysis is setting up the experiment so that DiffBind has all the information it needs. For many projects, this can be the most time-consuming step!

What do you need to do a quantitative analysis of an experiment?

In order to run a quantitative analysis, all the experimental data must be ready. This includes three types of data:

Sample metadata

An experiment is based on a number of samples. Each sample needs a unique SampleID. A comparative analysis requires that the samples are associated with at least two classes that can be compared. The classes the samples belong to are indicated by associated sample metadata, which can include fields such as Tissue, Factor, Condition, and Treatment, as well as a Replicate number.

Note that not all of these are required, but there should be at least one metadata factor to divide the samples into classes.

Finally, it is imperative that the experimental classes be represented by Replicate samples. It is not possible to perform a meaningful quantitative analysis without replicated data to capture variance. Generally, a minimum of three replicates is required to obtain meaningful results.

Aligned sequencing read data

The second type of data needed for an analysis is aligned sequencing reads, generally in the form of '.bam' files. Each sample needs an aligned sequencing library for the ChIP (bamReads), and may also include an aligned library for a matching control, such as an Input (bamControl).

Called peaks

The type of analysis we are discussing today requires that a peak caller (such as MACS) has been used on the aligned reads to call peaks. While having called peaks for the samples is useful in a variety of ways, it is possible to perform a quantitative analysis without using called peaks. For example, in some cases, the regions of interest may be known in advance (such as a list of known gene promoters). An alternative is to perform a quantitative analysis using windows that cover the entire genome. The csaw package [@lun2016csaw] provides a tool to perform such an analysis.
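As a rough illustration only (not part of this workshop's analysis, and the .bam file names are placeholders), a windowed count matrix in csaw can be obtained along these lines:

library(csaw)
# Hypothetical aligned ChIP libraries
bam.files <- c("sample1.bam", "sample2.bam")
# Count reads in 150bp sliding windows across the genome,
# extending single-end reads to an estimated fragment length of 200bp
win.counts <- windowCounts(bam.files, width=150, ext=200)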

Making a samplesheet

The easiest way to set up an experiment for analysis in DiffBind is to use a sample sheet. This can take the form of a .csv file or a data frame.

We can read in the sample sheet provided for the example data set:

samples <- read.csv("tamoxifen.csv")
samples

Here we see the example ChIP-seq data set used for this tutorial. It consists of 11 samples, all ChIPed for the same factor (ER, the estrogen receptor). There are four breast-cancer cell lines involved, representing two experimental conditions: cells that are either Responsive to the drug tamoxifen, or Resistant to it. Note that one cell line (MCF7) is marked as being both Responsive and Resistant; these cells are naturally Responsive, but a Resistant version has been derived by treating with tamoxifen until only resistant cells remain. The goal of the experiment is to examine changes in ER binding patterns between the two conditions.

There are seven Responsive replicates, encompassing three different cell lines, with two or three replicates for each Responsive cell line. There are four Resistant replicates, from two cell lines with two replicates each. For each sample we have a set of aligned reads (note that only reads aligned to chromosome 18 are included, to speed up the example analysis), and an Input control is available for each cell line/condition combination.

Peaks are included in .bed format (they were originally called using MACS [@zhang2008model]).

Loading the experiment into DiffBind

At this point a complete default end-to-end analysis could be run using a single command (Do not run this code during the workshop!):

result <- dba.analyze(samples)

This tells DiffBind to run through all the steps of an analysis, roughly (Do not run this code during the workshop!):

peaks  <- dba(sampleSheet=samples)
peaks  <- dba.blacklist(peaks)
counts <- dba.count(peaks)
counts <- dba.normalize(counts)
model  <- dba.contrast(counts)
result <- dba.analyze(model)

Instead, we are going to walk through each of the steps of the analysis, one at a time.

First load the experiment into DiffBind either by supplying the original .csv file or using the loaded samples dataframe:

tam.peaks <- dba(sampleSheet="tamoxifen.csv")

This creates a DiffBind object (a DBA object) we've called tam.peaks, which can be examined:

tam.peaks

This shows the metadata for the 11 samples, along with how many called peaks were included for each. After loading, overlapping peaks are merged; there are a total of 3,795 unique regions specified, with 2,845 of them overlapping peaks called in at least two samples.





Blacklists and greylists


Now that the peaks are loaded, it is good practice to filter out peaks called in problematic regions. This is accomplished using a standard, published Blacklist of areas in the reference genome known to be problematic. The best known lists have been identified as part of the ENCODE project [@amemiya2019blacklist] and are available for a variety of reference genomes and genome versions. The current ENCODE blacklists are available through the dba.blacklist() function.

In addition, if control alignments are available for an experiment, experiment-specific exclusion lists can be generated reflecting the tissue and experimental conditions used. We call these Greylists, and they are generated using the GreyListChIP package [@brown2015greylistchip]. The idea is to analyze libraries that are not meant to show systematic enrichment (such as Inputs, in which no antibody is introduced), and identify anomalous regions where a disproportionate degree of signal is present.

Applying a blacklist

If an ENCODE blacklist exists for the reference genome used to align the experimental data, it can be applied automatically using the dba.blacklist() function. The reference genome is automatically detected from the supplied .bam files.

dba.blacklist(tam.peaks, blacklist=TRUE, greylist=FALSE)

This shows that the blacklist filtered out one merged peak that had been called in three of the samples.

Generating and applying a greylist

Greylists can be automatically generated from the controls as DiffBind provides an interface to the GreyListChIP package. This can be a time-consuming operation, as each of the controls is analyzed for anomalous regions.

The blacklist can be applied, and the greylists generated and applied, in a single step. NB: This is a compute-intensive step that may take too long to run in the workshop; see the alternative method below for loading pre-computed greylists:

tam.peaks <- dba.blacklist(tam.peaks)
## Genome detected: Hsapiens.UCSC.hg19
## Applying blacklist...
## Removed: 3 of 14102 intervals.
## Counting control reads for greylist...
## Building greylist: reads/Chr18_BT474_input.bam
## coverage: 166912 bp (0.21%)
## Building greylist: reads/Chr18_MCF7_input.bam
## coverage: 106495 bp (0.14%)
## Building greylist: reads/Chr18_T47D_input.bam
## coverage: 56832 bp (0.07%)
## Building greylist: reads/Chr18_TAMR_input.bam
## coverage: 122879 bp (0.16%)
## Building greylist: reads/Chr18_ZR75_input.bam
## coverage: 72704 bp (0.09%)
## BT474c: 58 ranges, 166912 bases
## MCF7c: 14 ranges, 106495 bases
## T47Dc: 11 ranges, 56832 bases
## TAMRc: 10 ranges, 122879 bases
## ZR75c: 15 ranges, 72704 bases
## Master greylist: 72 ranges, 255487 bases
## Removed: 420 of 14099 intervals.
## Removed: 52 merged (of 3795) and 50 (of 2845) consensus.

To save time, greylists for the example data set are included, and we can apply them directly:

data(tamoxifen_greylist)
names(tamoxifen.greylist)

In this object, there is a "master" greylist formed from the individual greylists (one computed per control):

tamoxifen.greylist$master
names(tamoxifen.greylist$controls)

Now we can apply it:

tam.peaks <- dba.blacklist(tam.peaks, greylist=tamoxifen.greylist)

The greylist results in 52 merged peaks being filtered out (50 in the at-least-two-sample consensus), representing some 420 of the originally supplied peaks.





Consensus peak sets and counting reads


The central vehicle for conducting a quantitative analysis is a binding matrix, with rows representing genomic intervals, columns representing samples, and values representing overlapping read counts (similar to an expression matrix in RNA-seq). To compute this, we need to define the rows in the form of a consensus peak set.

Consensus peak set

One straightforward way to determine the consensus peaks is to consider genomic intervals that are identified as peaks in one or more of the samples. It can be helpful to make a plot of how many peaks overlap in how many samples:

olap.rate <- dba.overlap(tam.peaks, mode=DBA_OLAP_RATE)
olap.rate
plot(olap.rate, xlab="Overlapping samples", ylab="Overlapping peaks", type="b")

This shows that there are r olap.rate[1] total merged peaks, representing the union of all intervals. At the other extreme, there are r olap.rate[11] peaks that overlap in all 11 samples, representing the intersection of all the samples.

Which should we choose? Given the rigor of the underlying statistical analysis, we can choose a more inclusive consensus set. The default is to make the consensus peak set using peaks identified in at least two samples (r olap.rate[2]).

More complex schemes, such as forming a consensus peak set separately from the replicates for each condition and then taking the union of these, are also possible.
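For example, a sketch of the per-condition approach (not run in this workshop; the overlap thresholds used are a matter of judgment):

# Form a separate consensus peak set for each Condition,
# requiring peaks to be called in at least two of its samples
per.condition <- dba.peakset(tam.peaks, consensus=DBA_CONDITION, minOverlap=2)
# Take the union of the condition-level consensus sets
condition.consensus <- dba(per.condition,
                           mask=per.condition$masks$Consensus,
                           minOverlap=1)
consensus.peaks <- dba.peakset(condition.consensus, bRetrieve=TRUE)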

The default consensus peak set, which includes merged peaks that overlap in at least two samples, can be retrieved:

consensus.peaks <- dba.peakset(tam.peaks, bRetrieve=TRUE)
consensus.peaks[,0]

Counting overlapping reads

The next step is to form the binding matrix, using the consensus peak set and computing overlapping read counts for each peak in each sample, whether or not it was called as a peak in that sample.

The function used to accomplish this is dba.count(). NB: This is a compute-intensive step that may take too long to run in the workshop; see below for an alternative way to load pre-computed counts:

tam.counts <- dba.count(tam.peaks)
## Computing summits...
## Re-centering peaks...
## Reads will be counted as Single-end.

You can, alternatively, obtain the results as follows:

data(tamoxifen_counts)
tam.counts <- tamoxifen

dba.count() takes a number of parameters to control its default behavior, notably summits (which controls how peaks are re-centered and trimmed around their points of highest read overlap, as seen in the log output above) and minOverlap (the number of samples a peak must be called in to be included in the consensus set).
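For instance, a sketch of an explicit call (the values shown are illustrative rather than the workshop settings):

# Re-center each consensus interval around its summit, keeping a
# fixed-width window of 250bp on either side, and only include
# peaks called in at least two samples
tam.counts <- dba.count(tam.peaks, summits=250, minOverlap=2)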

After counting the reads, the DBA object reflects the consensus peak set rather than the original peak data:

tam.counts

Now all eleven samples have counts for all 2,845 consensus sites. Two columns have been added to the display, showing the total number of aligned Reads in each .bam file, and the Fraction of Reads in Peaks (FRiP), representing the proportion of reads that were counted as overlapping a consensus site.

Examining and normalizing the binding matrix

At this stage it is worth looking more closely at certain aspects of the binding matrix in preparation for normalization and modeling.

Plots to look for a batch effect

Of particular interest is the presence of technical effects in the experiment, such as batch effects.

We can look at how the samples cluster using heatmaps and PCA plots. A correlation heatmap (in which correlation values between each pair of columns in the binding matrix are calculated) can be shown using dba.plotHeatmap(), which is also the default plot:

plot(tam.counts)

This shows a very high correlation between the replicates for each sample type, which indicates that there is not a large batch effect. The main clustering divides the samples into those derived from the MCF7 cell line (which are all highly correlated) and those from the other cell lines, which cluster together. At this point, we do not see a natural separation into tamoxifen Responsive and Resistant clusters.

We can verify that there is no obvious batch effect with a PCA plot:

dba.plotPCA(tam.counts,DBA_REPLICATE, label=DBA_TISSUE)

Normalization (default)

The next step is to normalize the data. Normalization of ChIP and ATAC sequencing data is not always straightforward, especially compared to normalization of RNA-seq data. We will look more closely at these issues in a subsequent section. For now we will use the default normalization in DiffBind, which makes minimal assumptions about the data and seeks to "do no harm".

First let's make an MA plot of the data before normalization. An MA plot shows how the fold changes between conditions are distributed as the mean concentration of reads counts increases. For this plot, we'll use the samples in the Responsive condition as the baseline, and see how the samples in the Resistant condition compare.

dba.plotMA(tam.counts, bNormalized=FALSE, sub="Non-Normalized",
           contrast=list(Resistant=tam.counts$masks$Resistant,
                         Responsive=tam.counts$masks$Responsive))

In this plot, each consensus site is a point (smoothed), with the X-axis ordering the sites according to their mean number of overlapping read counts and the Y-axis showing the log2 fold change. Points above the 0-fold line (blue) show greater enrichment of reads in the Resistant condition, while points below the 0-fold line are more enriched in the Responsive condition. The red curve is a loess fit showing the overall trend in the data.

This plot shows that in the raw data, the Responsive condition samples have an overall higher concentration of overlapping reads in consensus peaks. This could either be a technical bias, or an indication that there is a loss of ER binding in the tamoxifen Resistant condition.

By default, DiffBind normalizes only using the full library sizes of the samples (total aligned reads in the .bam). Perform a default normalization and see how the plot changes:

tam.counts <- dba.normalize(tam.counts)
dba.plotMA(tam.counts, sub="Normalized (Default)",
           contrast=list(Resistant=tam.counts$masks$Resistant,
                         Responsive=tam.counts$masks$Responsive))

Compared to the non-normalized data, the fold changes are somewhat more balanced and moved closer to the central 0-fold line, although the loess fit shows there is still a bias towards greater enrichment in the Responsive condition.

We will consider normalization in more depth after modeling the data, as it is useful to consider the impact of normalization on the modeling results.





Modeling and testing


The core differential analysis in DiffBind is performed by one or more underlying read-count modeling Bioconductor packages. The default package is DESeq2 [@Love2014]; the edgeR package [@Robinson:2010p249] may also be used, either as an alternative or in parallel. These packages, developed for modeling RNA-seq read counts, enable generalized linear models (GLMs) to be fit.

Default model

The DiffBind function dba.contrast() is used to specify the model and establish contrasts to be tested against the model. A default model and contrast(s) can easily be established (note we explicitly specify that the baseline Condition should be Responsive):

tam.model <- dba.contrast(tam.counts,
                          reorderMeta=list(Condition="Responsive"))
tam.model

The default model is based on observing that there is only one metadata field, Condition, that has (a) multiple values and (b) at least three samples with each value. A design formula is established using this factor, along with a contrast comparing Resistant samples against Responsive samples, with four samples in the Resistant condition and seven in the Responsive condition.

Fitting and testing

With the design formula in place, and a contrast established, the model can be fit with DESeq2 and tests applied for the contrast:

tam.model <- dba.analyze(tam.model)
dba.show(tam.model,bContrasts=TRUE)

At the default threshold of FDR < 0.05, some r dba.show(tam.model,bContrasts=TRUE)$DB.DESeq2 sites are identified as being differentially bound.





Examining analysis results


After conducting an analysis, the results can be examined in a number of ways, including obtaining a report with the differential sites, as well as some useful plots.

Reporting: obtaining differential sites

The differential sites may be retrieved as a GRanges object:

tam.db <- dba.report(tam.model)
tam.db

For each site, the genomic interval is reported, as well as the key statistics: the mean read concentration over all samples (Conc), the mean concentration within each group, the log2 fold change between groups (Fold), and the p-value and FDR for the comparison.

We can look at the balance between Gain and Loss sites:

sum(tam.db$Fold>0)
sum(tam.db$Fold<0)

The report object can now be used for downstream analysis, such as motif analysis, annotation to genomic features, etc.
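For example, as the report is a standard GRanges object, the intervals can be written out for use with external tools; a minimal sketch using rtracklayer (the output file name is arbitrary):

library(rtracklayer)
# Drop the metadata columns and write just the intervals to a BED file
export(granges(tam.db), "tamoxifen_differential_sites.bed")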

Plotting: MA plot

After an analysis, it is useful to generate an MA plot with the differential sites superimposed:

dba.plotMA(tam.model)

Plotting: Volcano plot

Similar to an MA plot, a volcano plot shows the log fold change (on the X-axis instead of the Y-axis), plotted against the negative log of the FDR on the Y-axis, to show how FDR and LFC are related:

dba.plotVolcano(tam.model)

Plotting: Clustering correlation heatmap

We can also re-visit the clustering plots we originally did on the full binding matrix, except this time only taking differential sites into consideration:

plot(tam.model, contrast=1)

As we would hope, the differential sites separate the samples into two clusters, one Resistant and one Responsive.

Plotting: PCA plot

Likewise, a PCA plot based only on the differential sites shows how the samples can be separated by condition:

dba.plotPCA(tam.model, contrast=1, label=DBA_TISSUE)

Plotting: Read concentration heatmap

In addition to the correlation heatmap, a plot showing the read concentrations can also be generated. In this case we scale the read concentrations for each site, to emphasize the difference between the lowest and highest values:

hmap <- colorRampPalette(c("red", "black", "green"))(n = 13)
readscores <- dba.plotHeatmap(tam.model, contrast=1, correlations=FALSE,
                              scale="row", colScheme = hmap)





Multi-factor design


The default design and contrasts may not be the correct ones, as they may not be focused on the factors of interest. In the current case, it turns out that it is important to use a multi-factor design in order to model the use of different cell lines with each condition, and most importantly the fact that a single cell origin (MCF7) is used across conditions.

In the current case, we will change the model to the "correct" multi-factor design before proceeding:

tam.model <- dba.contrast(tam.model, design="~Tissue + Condition")
dba.show(tam.model,bDesign=TRUE)
tam.model <- dba.analyze(tam.model)
tam.model

With the multi-factor model taking into account the underlying tissue types, we now identify r dba.show(tam.model,bContrasts=TRUE)$DB.DESeq2 differential sites at the default threshold of FDR < 0.05. The resulting MA plot shows these are biased towards loss of binding in the Resistant condition:

dba.plotMA(tam.model, contrast=1)





Normalization discussion


Normalization of experimental data is particularly important in ChIP-seq (and ATAC-seq) analysis, and may require more careful consideration than is needed for RNA-seq analysis. This is because the range of ChIP-seq experiments covers more cases than RNA-seq experiments, which usually involve a similar set of possible expressed genes and/or transcripts, many of which are not expected to significantly change expression. ChIP, ATAC, and similar enrichment-based sequencing data may not follow the assumptions inherent in popular methods for normalizing RNA-seq data, and may exhibit different types of efficiency and other biases.

Core normalization methods

The default normalization method in DiffBind is to look only at the relative library sizes (number of aligned reads) of the experimental samples. Why don't we use the underlying normalization methods inherent in the DESeq2 (RLE, or Relative Log Expression) and edgeR (TMM, Trimmed Mean of M-values) differential analysis packages?

Consider the example experiment. We saw in the first two MA plots that there was a bias towards greater read concentration in the Responsive condition; this was reduced somewhat, but still remained, after library-size normalization, and the analysis identified more differential sites that gained binding in the Responsive condition (alternatively interpretable as a loss of binding in the Resistant condition):

dba.plotMA(tam.model,bNormalized=FALSE, th=0, sub="Non-Normalized")
dba.plotMA(tam.model,bNormalized=TRUE, sub="Normalized: Library Size")

Let's see what happens if we normalize against the binding matrix (RiP) using the "native" method in DESeq2, RLE:

tam.model <- dba.normalize(tam.model, normalize="RLE")
tam.model <- dba.analyze(tam.model)
dba.plotMA(tam.model,bNormalized=TRUE, sub="Normalized: RLE [RiP]")

The result is quite different! The read concentrations are closely balanced and fit near the zero log fold change line. The differential sites are also more balanced between those losing and gaining affinity in the Resistant condition:

rep <- dba.report(tam.model)
sum(rep$Fold < 0)
sum(rep$Fold > 0)

What accounts for the difference? Which is the "correct" analysis?

Reference reads

It turns out that for the type of data we are dealing with in DNA enrichment assays, the reads that are used as the reference for normalizing are more important than the normalization method itself. While in RNA-seq experiments the expression matrix can be normalized directly, based on the reads that uniquely overlap genes or transcripts, this does not apply to a binding matrix based on a consensus peakset.

In a windowing-based analysis such as is done using the csaw package [@lun2016csaw], where there is no consensus binding matrix, this issue is addressed by using large (>10,000 bp) "bins" of reads across the genome. The idea is that the bins, being so much larger than binding sites or stretches of open chromatin, contain mostly "background" reads, and present a less biased way of providing a read count matrix on which to base normalization calculations. This "background" method is included in DiffBind. NB: This is a compute-intensive step; however, if you have loaded the pre-computed counts, these already include the background reads, so this operation may be safely run:

tam.model <- dba.normalize(tam.model, normalize="RLE",
                           background=TRUE)
tam.model <- dba.analyze(tam.model)
dba.plotMA(tam.model,bNormalized=TRUE, sub="Normalized: RLE [Background]")

This yields results that more closely match those obtained using the default library-size-based normalization.

Comparison of normalization and reference reads methods

It is interesting to compare the impact of six different normalization schemes:

  1. Library size normalization based on total aligned reads (lib_full)
  2. Library size normalization based on Reads in Peaks (lib_RiP)
  3. RLE (native to DESeq2) normalization based on Reads in Peaks (RLE_RiP)
  4. RLE (native to DESeq2) normalization based on background bins (RLE_BG)
  5. TMM (native to edgeR) normalization based on Reads in Peaks (TMM_RiP)
  6. TMM (native to edgeR) normalization based on background bins (TMM_BG)
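Each of these schemes corresponds to a combination of dba.normalize() arguments; a rough sketch of the mapping, using DiffBind's DBA_NORM_* and DBA_LIBSIZE_* constants (not run here, as the script described below performs these analyses):

# 1. lib_full: library-size normalization using full library sizes (the default)
norm <- dba.normalize(tam.model, normalize=DBA_NORM_LIB, library=DBA_LIBSIZE_FULL)
# 2. lib_RiP: library-size normalization using only reads in peaks
norm <- dba.normalize(tam.model, normalize=DBA_NORM_LIB, library=DBA_LIBSIZE_PEAKREADS)
# 3. and 5. RLE_RiP / TMM_RiP: native methods applied to the binding matrix
norm <- dba.normalize(tam.model, normalize=DBA_NORM_RLE, background=FALSE)
norm <- dba.normalize(tam.model, normalize=DBA_NORM_TMM, background=FALSE)
# 4. and 6. RLE_BG / TMM_BG: native methods applied to background bins
norm <- dba.normalize(tam.model, normalize=DBA_NORM_RLE, background=TRUE)
norm <- dba.normalize(tam.model, normalize=DBA_NORM_TMM, background=TRUE)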

There is an R script in the inst/scripts directory, compare_norm.R, that applies each of the six normalization schemes, performs a DESeq2 analysis for each, and collects the differential sites into a report-based DBA object that can be loaded as follows:

norm.comp <- dba.load("norm_comp")
norm.comp

We can look at the differences in a couple of ways. One is to consider how closely correlated are the resulting differential sites:

cor <- dba.plotHeatmap(norm.comp)

This shows how the results cluster. While all the analyses have a core set of sites in common (the lowest correlation between analyses is r min(cor)), there are two distinct clusters driven not by the normalization methods, but by the choice of reference reads. One cluster uses all the reads (background bins, BG, or full library sizes, full), while the other cluster normalizes based only on reads in the binding matrix (RiP). The least important difference is the choice of normalization method: edgeR's TMM method and DESeq2's RLE method produce essentially the same results when using the same reference reads.

Next we can look at differences in the differential sites themselves by plotting a heatmap of the log fold changes (LFC) of each site:

hmap <- colorRampPalette(c("red", "black", "green"))(n = 7)
dba.plotHeatmap(norm.comp, correlations=FALSE, 
                bLog=FALSE, colScheme=hmap, 
                minval=-4, maxval=4,
                key.title = "LFC", main = "DESeq2 Differentially Bound Sites")

Sites in green are ones that gain binding intensity in the Resistant condition, while those in red lose binding intensity. Sites in black are ones that are not identified as differential in a given analysis. This plot again shows how the analyses cluster first by the choice of reference reads, but also shows how the balance of sites changes, with the background/full-library cluster identifying mostly Loss (red) sites, and the RiP cluster identifying many more Gain sites.

The biological conclusions one might reach could be quite different depending on which normalization scheme is chosen.

Spike-ins and parallel factors

While it is beyond the scope of this workshop, it is worth noting that alternative sets of reference reads can be used. DiffBind supports the use of spike-ins, where the reads used to normalize the libraries are those aligned to exogenous chromatin (e.g. from Drosophila melanogaster), as well as parallel factors, where normalization is based on peaks for a factor or mark that is known not to change under the experimental conditions. See the DiffBind vignette, and the help page for dba.normalize(), for guidance on how to perform such normalization.
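As a very rough sketch (assuming spike-in alignments have been supplied, for example via a Spikein column in the sample sheet), spike-in normalization might look like:

# Normalize using reads aligned to the exogenous spike-in chromatin
# rather than reads aligned to the experimental reference genome
tam.model <- dba.normalize(tam.model, spikein=TRUE)
tam.model <- dba.analyze(tam.model)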





Sneak peak: Profile Plots (coming in DiffBind 3.2)

Profile Heatmap for MCF7 samples





sessionInfo()

sessionInfo()





References


