ChIPSeqSpike: ChIP-seq data scaling with spike-in control


Chromatin Immuno-Precipitation followed by Sequencing (ChIP-Seq) is used to determine the binding sites of any protein of interest, such as transcription factors or histones with or without a specific modification, at a genome scale [@barski2007; @park2009]. ChIP-Seq entails treating cells with a cross-linking reagent such as formaldehyde; isolating the chromatin and fragmenting it by sonication; immuno-precipitating with antibodies directed against the protein of interest; reversing crosslink; DNA purification and amplification before submission to sequencing. These many steps can introduce biases that make ChIP-Seq more qualitative than quantitative. Different efficiencies in nuclear extraction, DNA sonication, DNA amplification or antibody recognition make it challenging to distinguish between true differential binding events and technical variability.

This problem was addressed by using an external spike-in control to keep track of technical biases between conditions [@orlando2014; @bonhoure2014]. Exogenous DNA from a different non-closely related species is inserted during the protocol to infer scaling factors. This modification was shown to be especially important for revealing global histone modification differences, that are not caught by traditional downstream data normalization techniques, such as Histone H3 lysine-27 trimethyl (H3K27me3) upon Ezh2 inhibitor treatment [@trojer2016].

ChIPSeqSpike provides tools for ChIP-Seq spike-in normalization, assessment and analysis. Conversely to a majority of ChIP-Seq related tools, ChIPSeqSpike does not rely on peak detection. However, if one wants to focus on peaks, ChIPSeqSpike is flexible enough to do so. ChIPSeqSpike provides ready to use scaled bigwig files as output and scaling factors values. We believe that ChIPSeqSpike will be of great value to the genomics community.

Standard workflow

Note for Windows users

On Windows operating system, reading of BigWig files fail due to the Bioconductor package rtracklayer >= 1.37.6 that does not support this file format. Therefore, the boost mode, the input subtraction method, and scaling method do not work on this operating system.

Quick start

A case study reported Chromatin Immuno-Precipitation followed by Sequencing (ChIP-Seq) experiments that did not show differences upon inhibitor treatment with traditional normalization procedures [@orlando2014]. These experiments looked at the effect of DOT1L inhibitor EPZ5676 [@daigle2013] treatment on the Histone H3 lysine-79 dimethyl (H3K79me2) modification in Jurkat cells. DOT1L is involved in the RNA Polymerase II pause release and licensing of transcriptional elongation. H3K79me2 ChIP-Seq were performed on cells treated with 0%, 50% and 100% EPZ5676 inhibitor (see next section for details on data).

The following code performs spike-in normalization with a wrapper function on data sub-samples. A 'test_chipseq' temporary results folder is also created.


## Preparing testing data
info_file_csv <- system.file("extdata/info.csv", package="ChIPSeqSpike")
bam_path <- system.file("extdata/bam_files", package="ChIPSeqSpike")
bigwig_path <- system.file("extdata/bigwig_files", package="ChIPSeqSpike")
gff_vec <- system.file("extdata/test_coord.gff", package="ChIPSeqSpike")
genome_name <- "hg19"
output_folder <- "test_chipseqspike"
bigwig_files <- system.file("extdata/bigwig_files", 
                              ""), package="ChIPSeqSpike")

## Copying example files
mock <- file.copy(bigwig_files, "test_chipseqspike")

## Performing spike-in normalization
if (.Platform$OS.type != "windows") {
    csds_test <- spikePipe(info_file_csv, bam_path, bigwig_path, gff_vec, 
                           genome_name, outputFolder = output_folder)


The data used in this documentation represent a gold-standard example of the importance of using spike-in controls with ChIP-Seq experiments. It uses chromatin from Drosophila Melanogaster as exogenous spike-in control to correct experimental biases. Without a spike-in control and using only RPM normalization, proper differences in the H3K79me2 histone modification in human Jurkat cells upon EPZ5676 inhibitor treatment were not observed [@orlando2014].

This dataset is made of bigwig and bam files of H3K79me2 ChIP-Seq data and corresponding input DNA controls (see input subtraction section). Bam files contain data aligned to the Human reference genome Hg19 or to the Drosophila reference genome dm3. The latest is used to compute external spike-in scaling factors. All above mentioned data are available at 0%, 50% and 100% EPZ5676 inhibitor treatment.

Testing data

For the sake of memory and computation time efficiency, bigwig files used in this vignette are limited to chromosome 1. Reads falling in the top 10% mostly bound genes (at 0% treatment) with length between 700-800 bp were kept in bam files. For efficient plotting functions testing, only binding values of the top 100 mostly bound genes are used and can be accessed with data(result_extractBinding). Scores for factors and read counts were computed on the whole dataset and are available through data(result_estimateScalingFactors) (see below).

Complete data

The whole dataset is accessible at GSE60104. Specifically, the data used are H3K79me2 0% (GSM1465004), H3K79me2 50% (GSM1465006), H3K79me2 100% (GSM1465008), input DNA 0% (GSM1511465), input DNA 50% (GSM1511467) and input DNA 100% (GSM1511469).

The data were treated as follows: Quality of sequencing was assessed with FastQC v0.11.4. Reads having less than 80% of quality scores above 25 were removed with NGSQCToolkit v2.3.3 [@ngsqctoolkit]. Homo Sapiens Hg19 and Drosophila Melanogaster dm3 from Illumina igenomes UCSC Collection were used. ChIP-Seq data were aligned with Bowtie2 v2.1.0 [@bowtie2] with default parameters. Sam outputs were converted to Bam with Samtools v1.0.6 [@samtools] and sorted with Picard tools v1.88. Data were further processed with Pasha v0.99.21 [@pasha]. Fixed steps wiggle files were converted to bigwigs with wigToBigWig.

Results with the complete dataset are also provided in this documentation.

Detailed operations

The spike-in normalization procedure consists of 4 steps: RPM scaling, input DNA subtraction, RPM scaling reversal and exogenous spike-in DNA scaling. Below is detailed the different steps included in the above mentioned wrapper function 'spikePipe'.

Dataset generation

The different data necessary for proper spike-in scaling are provided in a csv or a tab separated txt file. The columns must contain proper names and are organized as follows: Experiment name (expName); bam file name of data aligned to the endogenous reference genome (endogenousBam); bam file name of data aligned to the exogenous reference genome (exogenousBam); the corresponding input DNA bam file aligned to the endogenous reference genome (inputBam); the fixed steps bigwig file name of data aligned to the endogenous reference genome (bigWigEndogenous) and the fixed steps bigwig file names of the corresponding input DNA experiment aligned to the endogenous reference genome (bigWigInput). \newline ```r info_file <- read.csv(info_file_csv) head(info_file)

From the info file, two kinds of objects can be generated: either a 
ChIPSeqSpikeDataset or a ChIPSeqSpikeDatasetList depending upon the number of 
input DNA experiments. A ChIPSeqSpikeDatasetList object is a list of 
ChIPSeqSpikeDataset object that is created if several input DNA experiments are
 used. In this latter case, ChIP-Seq experiments are grouped by their 
corresponding input DNA. The function spikeDataset creates automatically the 
suitable object. The folder path to the bam and fixed steps bigwig files must 
be provided.
csds_test <- spikeDataset(info_file_csv, bam_path, bigwig_path)

Boost mode

Reading and processing bigwig and bam files can be memory-greedy. By default, files are read, processed and written at each step of the normalization procedure to enable data treatment on a regular desktop computer. One could wish to reduce the time of computation especially when processing a lot of data. ChIPSeqSpike reduces such time by providing a boost mode. \newline ```r if (.Platform$OS.type != "windows") { csds_testBoost <- spikeDataset(info_file_csv, bam_path, bigwig_path, boost = TRUE) is(csds_testBoost) }

Binding scores for each experiment are stored in a GRanges object and are 
directly accessible by functions.
if (.Platform$OS.type != "windows") {

Even if optimizing greatly the time of computing, one should know that loading binding scores of all experiments is greedy in memory and should be used with caution. The boost mode is ignored in the rest of the vignette, but all code provided in the following sections, with the exception of section 3.1 ( plotTransform), can be run with csds_testBoost.

Summary and control

A ChIPSeqSpikeDataset object, at this point, is made of slots storing paths to files. In order to compute scaling factors, bam counts are first computed. A scaling factor is defined as 1000000/bam_count. The method estimateScalingFactors returns bam counts and endogenous/exogenous scaling factors for all experiments. In the following example, scores are computed using chromosome 1 only. Scores on the whole dataset are also indicated below. \newline ```r csds_test <- estimateScalingFactors(csds_test, verbose = FALSE)

The different scores can be visualized:
## Scores on testing sub-samples

##Scores on whole dataset

An important parameter to keep in mind when performing spike-in with ChIP-seq is the percentage of exogenous DNA relative to that of endogenous DNA. The amount of exogenous DNA should be between 2-25% of endogenous DNA. The method getRatio returns the percentage of exogenous DNA and throws a warning if this percentage is not within the 2-25% range. In theory, having more than 25% exogenous DNA should not affect the normalization, whereas having less than 2% is usually not sufficient to perform a reliable normalization. \newline ```r getRatio(csds_test)

Result on the whole dataset

data(ratio) ratio

### RPM scaling

The first normalization applied to the data is the 'Reads Per Million' (RPM) 
mapped reads. The method 'scaling' is used to achieve such normalization using 
default parameters. It is also used to reverse the RPM normalization and apply 
exogenous scaling factors (see sections [2.3.5](#RPMreversal) and 
if (.Platform$OS.type != "windows") {
    csds_test <- scaling(csds_test, outputFolder = output_folder)

In the context of this vignette, output_folder is precised since the testing files were copied to a temporary 'test_chipseqspike' folder. The slots containing paths to files will be updated to this folder. If not precised, the RPM scaled bigwig files are written to the same folder containing bigwigs. This statement is also applicable for the next operations below.

Input Subtraction {#inputSubtraction}

When Immuno-Precipitating (IP) DNA bound by a given protein, a control is needed to distinguish background noise from true signal. This is typically achieved by performing a mock IP, omitting the use of antibody. After mock IP sequencing, one can notice peaks of signal above background. These peaks have to be removed from the experiment since they represent false positives. The inputSubtraction method simply subtracts scores of the input DNA experiment from the corresponding ones. If in boost mode, the input subtracted values are stored in the dataset object and no files are written. For this latter case, the method exportBigWigs can be used to output the transformed files. \newline ```r if (.Platform$OS.type != "windows") { csds_test <- inputSubtraction(csds_test) }

### RPM scaling reversal {#RPMreversal}

After RPM and input subtraction normalization, the RPM normalization is 
reversed in order for the data to be normalized by the exogenous scaling 
if (.Platform$OS.type != "windows") {
    csds_test <- scaling(csds_test, reverse = TRUE)


Exogenous scaling {#exoscaling}

Finally, exogenous scaling factors are applied to the data. \newline ```r if (.Platform$OS.type != "windows") { csds_test <- scaling(csds_test, type = "exo") }

### Extract binding values

The last step of data processing is to extract and format binding scores in 
order to use plotting methods. The 'extractBinding' method extracts binding 
scores at different locations and stores these values in the form of 
PlotSetArray objects and matrices (see ?extractBinding for more details). The 
scores are retrieved on annotations provided in a gff file. If one wishes to 
focus on peaks, their coordinates should be submitted at this step. The genome 
name must also be provided. For details about installing the required BSgenome 
package corresponding to the endogenous organism, see the 
package documentation.
if (.Platform$OS.type != "windows") {
    csds_test <- extractBinding(csds_test, gff_vec, genome_name)

Plotting data

ChIPSeqSpike offers several graphical methods for normalization diagnosis and data exploration. These choices enable one to visualize each step of the normalization through exploring inter-samples differences using profiles, heatmaps, boxplots and correlation plots.

In the following sections, the testing data are restricted to the 100 mostly bound genes. Results on the complete set of hg19 genes are also indicated.

Meta-profiles and transformations

The first step of spike-in normalized ChIP-Seq data analysis is an inter-sample comparison by meta-gene or meta-annotation profiling. The method 'plotProfile' automatically plots all experiments at the start, midpoint, end and composite locations of the annotations provided to the method extractBinding in gff format. Here is the result of profiling H3K79me2 on the 100 mostly bound genes at 0% inhibitor treatment (figure \ref{figure1}). \newline ```r", fig.height = 6} data(result_extractBinding) plotProfile(csds, legend = TRUE)

The unspiked data (however RPM scaled and input subtracted) can be added 
to the plot (figure \ref{figure2}).
```r", fig.height = 7}
plotProfile(csds, legend = TRUE, notScaled = TRUE)

\newpage The effect of the individual processing steps for each experiment can also be plotted. \newline ```r plotTransform(csds, legend = TRUE, separateWindows = TRUE)

## Heatmaps

plotHeatmaps is a versatile method based on the plotHeatmap method of the 
seqplots package [@seqplots]. This method enables one to represent data at 
different locations (start, end, midpoint, composite) and at different stages 
of the normalization process. Different scaling (log, zscore, etc) and 
different clustering approaches (k-means, hierarchical, etc) can be used (see 
documentation for more details).

Figure \ref{figure3} shows a k-means clustering of spiked data, each group 
being sub-sorted by decreasing values.
```r", fig.height = 5}
plotHeatmaps(csds, nb_of_groups = 2, clustering_method = "kmeans")

\newpage Figure \ref{figure4} illustrates a clustering by decreasing values on the whole dataset.

Spiked data organized by decreasing values at start position of all refseq Hg19 genes \label{figure4}


boxplotSpike plots boxplots of the mean values of ChIP-seq experiments on the annotations given to the extractBinding method. It offers a wide range of graphical representations that includes violin plots (see documentation for details). Figure \ref{figure5} illustrates all transformations of all dataset indicating confidence intervals. \newpage ```r", fig.height = 6} par(cex.axis=0.5) boxplotSpike(csds, rawFile = TRUE, rpmFile = TRUE, bgsubFile = TRUE, revFile = TRUE, spiked = TRUE, outline = FALSE)

Figure \ref{figure6} shows spiked experiments indicating each mean value, mean
 and standard deviation with a violin plot representation.
```r", fig.height = 6}
boxplotSpike(csds, outline = FALSE, violin=TRUE, mean_with_sd = TRUE,
 jitter = TRUE)

Correlation plots

The plotCor method plots the correlation between ChIP-seq experiments using heatscatter plot or, if heatscatterplot = FALSE, correlation tables. For heatscatter plots, ChIPSeqSpike makes use of the heatscatter function of the package LSD and the corrplot function of the package corrplot is used to generate correlation tables. This offers a wide range of graphical possibilities for assessing the correlation between experiments and transformation steps (see documentation for more details).

Figure \ref{figure7} shows two correlation table representations between spiked experiments. \newline ```r", fig.height = 6, fig.width=10} par(mfrow=c(1,2)) plotCor(csds, heatscatterplot = FALSE) plotCor(csds, heatscatterplot = FALSE, method_corrplot = "number")

Figure \ref{figure8} illustrates a heatscatter plot of spiked data after log 
transformation (only positive mean binding values are kept) and figure 
\ref{figure9} is the result of running the same code on the whole refseq Hg19 
gene set.
```r", fig.height=6}
plotCor(csds, method_scale = "log")

Heatscatter of spiked data after log transformation on all Hg19 refseq genes \label{figure9}

```r unlink("test_chipseqspike/", recursive = TRUE)

# Session info




Try the ChIPSeqSpike package in your browser

Any scripts or data that you put into this service are public.

ChIPSeqSpike documentation built on Nov. 8, 2020, 5:29 p.m.