# Modeling ChIP-Seq data with GenoGAM 2.0: A genome-wide generalized additive model

```r
knitr::opts_chunk$set(tidy = FALSE, cache = TRUE,
                      dev = "png",
                      message = FALSE, error = FALSE, warning = FALSE)
```

**Note:** If you use GenoGAM in published research, please cite:

Stricker and Engelhardt, et al. (2017) GenoGAM: Genome-wide generalized additive models for ChIP-seq analysis. *Bioinformatics*, 33(15). doi: 10.1093/bioinformatics/btx150

and

Stricker, Galinier and Gagneur (2018) GenoGAM 2.0: scalable and efficient implementation of genome-wide generalized additive models for gigabase-scale genomes. *BMC Bioinformatics*, 19:247. doi: 10.1186/s12859-018-2238-7

# Quick guide (for small organisms)

This is a brief version of the usual GenoGAM workflow. It involves the steps shown below.

## Getting started

The example dataset is restricted to the first 100kb of the first yeast chromosome.

1. Set the required meta variables:

```r
library(GenoGAM)

## A.
## specify the data folder and the path to the experiment design
wd <- system.file("extdata/Set1", package = "GenoGAM")
folder <- file.path(wd, "bam")
expDesign <- file.path(wd, "experimentDesign.txt")

## B.
## register the parallel backend (default is the number of available cores - 2)
BiocParallel::register(BiocParallel::MulticoreParam(workers = 2))

## C.
## specify chunk and overhang size
chunkSize <- 50000
overhangSize <- 1000

## D.
## the design formula reflecting the underlying GAM
## x = position
design <- ~ s(x) + s(x, by = genotype)
```
2. Read in the data to obtain a `GenoGAMDataSet` object:

```r
## build the GenoGAMDataSet
ggd <- GenoGAMDataSet(
  expDesign, directory = folder,
  chunkSize = chunkSize, overhangSize = overhangSize,
  design = design)

ggd
```
3. Compute size factors:

```r
## compute size factors
ggd <- computeSizeFactors(ggd)

ggd

## the data
assay(ggd)
```

## Fitting the model

4. Compute the model with fixed hyperparameters:

```r
## compute the model without parameter estimation to save time in the vignette
result <- genogam(ggd, lambda = 4601, theta = 4.51)

result

## the fit and standard error
fits(result)
se(result)
```
5. Compute log p-values:

```r
result <- computeSignificance(result)

pvalue(result)
```

Plot the results of the central 10kb, where the two tiles are joined together:

```r
ranges <- GRanges("chrI", IRanges(45000, 55000))
plotGenoGAM(result, ranges = ranges)
```

# GenoGAM 2.0 workflow overview

# Standard ChIP-Seq analysis

In addition to basic smoothing and point-wise significance computation, this version of GenoGAM also supports differential analysis and peak calling with position-wise confidence intervals on ChIP-Seq data.

## Goal of the analysis

A small dataset is provided to illustrate the ChIP-Seq functionalities. It is a subset of the data published by Thornton et al. [@Thornton2014], who assayed histone H3 lysine 4 trimethylation (H3K4me3) by ChIP-Seq, comparing wild-type yeast with a mutant carrying a truncated form of Set1, the yeast H3 lysine 4 methylase. The goal of this analysis is the identification of genomic positions that are significantly differentially methylated in the mutant compared to the wild-type strain.

To this end, we will build a GenoGAM model that models the logarithm of the expected ChIP-seq fragment counts $y$ as sums of smooth functions of the genomic position $x$. Specifically, we write (with simplified notations) that:

\begin{equation}
\log(\operatorname{E}(y)) = f(x) + \text{genotype} \times f_\text{mutant/wt}(x)
(\#eq:mutantmodel)
\end{equation}

where genotype is 1 for data from the mutant samples and 0 for the wild type. Here the function $f(x)$ is the reference level, i.e. the log-rate in the wild-type strain. The function $f_\text{mutant/wt}(x)$ is the log-ratio of mutant over wild type. We will then statistically test the null hypothesis $f_\text{mutant/wt}(x) = 0$ at each position $x$. In the following, we show how to build the dataset, fit the model, and perform the testing.
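This model maps directly onto the design formula used when building the dataset below; a minimal sketch of the correspondence:

```r
## design formula encoding Equation (1):
## s(x)                -> f(x), the wild-type reference level
## s(x, by = genotype) -> genotype * f_mutant/wt(x), the mutant/wt log-ratio
design <- ~ s(x) + s(x, by = genotype)
```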

## Registering a parallel backend

The parallel backend is registered using the BiocParallel package. See the BiocParallel documentation for correct usage. Note also that BiocParallel is just an interface to multiple parallel packages; for example, in order to use GenoGAM on a cluster, the BatchJobs package might be required. The parallel backend can be registered at any time, as GenoGAM will simply use the currently registered one.

IMPORTANT: According to posts on the Bioconductor support page and the R-devel mailing list, the most important core feature of the multicore backend, shared memory, is compromised by R's own garbage collector, resulting in a replication of the entire workspace across all cores. Because the amount of data held in memory is large (without the HDF5 backend), this might crash the entire system. We strongly advise registering the SnowParam backend when working on larger organisms. The overhead is a little bigger, but only the necessary data is copied to the workers, keeping memory consumption relatively low. We never experienced a load higher than 4GB per core on the human genome, usually around 2GB, and even less with the HDF5 backend.

```r
library(GenoGAM)

## on multicore machines, the number of available cores - 2
## is registered by default on the 'Multicore' backend
BiocParallel::registered()[1]
```

For this small example we would like to assign fewer workers. Check BiocParallel for other possible backends and more options for SnowParam.

```r
BiocParallel::register(BiocParallel::SnowParam(workers = 3))
```

If we check the currently registered backend, we see that the number of workers and the backend have changed.

```r
BiocParallel::registered()[1]
```

## Building a GenoGAMDataSet

BAM files restricted to the first 100kb of yeast chromosome I are provided in the inst/extdata folder of the GenoGAM package. This folder also contains a flat file describing the experimental design.

We start by loading the experimental design from the tab-separated text file experimentDesign.txt into a data frame:

```r
folder <- system.file("extdata/Set1", package = "GenoGAM")

expDesign <- read.delim(
  file.path(folder, "experimentDesign.txt")
)

expDesign
```

Each row of the experiment design corresponds to the alignment file in BAM format of one ChIP sample. In case of multiplexed sequencing, the BAM files must have been demultiplexed. The experiment design has named columns. Three column names have a fixed meaning for GenoGAMDataSet and must be provided: ID, file, and paired. The field ID stores a unique identifier for each alignment file; short and easily understood identifiers are recommended, because they are subsequently used for labelling data and plots. The field file stores the BAM file name without the directory. The field paired is boolean, with value TRUE for paired-end and FALSE for single-end sequencing data. All other columns are taken as the experimental design and must be of binary type; their names must be identical to the names provided later in the design formula. They allow us to model differential occupancy or call peaks later on. Here, the important one is the genotype column.
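For illustration, a minimal design file could look as follows (the IDs and file names are hypothetical; the actual file shipped with the package is printed above):

```
ID        file            paired  genotype
wt_1      wt_1.bam        FALSE   0
wt_2      wt_2.bam        FALSE   0
mutant_1  mutant_1.bam    FALSE   1
mutant_2  mutant_2.bam    FALSE   1
```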

We will now count sequencing fragment centers per genomic position and store these counts in a GenoGAMDataSet object. GenoGAM reduces ChIP-Seq data to fragment center counts rather than full base coverage so that each fragment is counted only once, which reduces artificial correlation between adjacent nucleotides. For single-end libraries, the fragment center is estimated by shifting the read end position by a constant. The estimation of this constant uses methods from the chipseq package (see the help of the chipseq::estimate.mean.fraglen method).
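As a rough sketch of the underlying idea for single-end data (this is not the GenoGAM API; `bam_file` is a hypothetical path and the method choice is an assumption):

```r
## estimate the mean fragment length from single-end reads, then extend
## each read to that length and take the fragment midpoint
library(GenomicAlignments)
reads <- granges(readGAlignments(bam_file))  # bam_file: hypothetical path
fraglen <- round(median(chipseq::estimate.mean.fraglen(reads, method = "coverage")))
fragments <- resize(reads, width = fraglen, fix = "start")  # strand-aware extension
centers <- resize(fragments, width = 1, fix = "center")     # fragment centers
```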

The GenoGAMDataSet is then created from the experiment design, the directory containing the BAM files, and the design formula:

```r
## build the GenoGAMDataSet
wd_folder <- file.path(folder, "bam")
ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype)
)

assay(ggd)

## overhang size
getOverhangSize(ggd)

## chunk size
getChunkSize(ggd)
```

The GenoGAMDataSet class stores the count data in a structure that indexes genomic positions over tiles, defined by chunkSize and overhangSize. A bit of background is required to understand these parameters. The smoothing in GenoGAM is based on splines (Figure 1), which are piecewise polynomials. The knots are the positions where the polynomials connect. In our experience, one knot every 20 to 50 bp is required for sufficient resolution of the smooth fits in typical applications. The fitting of generalized additive models involves steps whose number of operations grows with the square of the number of knots, preventing fits along whole chromosomes. To make the fitting of GAMs possible genome-wide, GenoGAM performs fitting on overlapping intervals (tiles) and joins the fits at the midpoints of the overlaps of consecutive tiles. The parameters chunkSize and overhangSize define the tiles, where the chunk is the core part of a tile that does not overlap other tiles, and the overhangs are the two overlapping parts.
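For example, with the quick-guide settings (chunkSize = 50000, overhangSize = 1000), the 100kb example region is covered by two tiles: chunks spanning roughly positions 1–50,000 and 50,001–100,000, each extended by up to 1kb of overhang into its neighbor. The two fits are joined at the midpoint of the 2kb overlap, around position 50,000, which is why the quick-guide plot of the 45,000–55,000 window shows the region where both tiles are joined.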

Both variables influence computation speed and memory usage, as they define into how many (and how big) tiles the genome is split for parallel computation. Additionally, a too small overhang size might cause jagged fits at the junction points where the chunks are glued back together. It is therefore advised to keep the overhang size around 1kb (the default). The chunk size is by default roughly estimated to fit the overall modelling problem without using too much memory. However, this estimation is a very simple heuristic, so if problems arise it is advised to set the chunk size yourself. We had good experience with values between 50 and 100kb.
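Both parameters can be passed directly to the constructor, as in the quick guide:

```r
## not evaluated: explicit tiling with 50kb chunks and 1kb overhangs
ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  chunkSize = 50000, overhangSize = 1000,
  design = ~ s(x) + s(x, by = genotype)
)
```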

**Figure 1:** An example spline. Displayed are seven cubic B-spline basis functions which together make up the complete spline (dark blue, above). The knots are depicted as dark-grey dots at the bottom center of each basis function. All basis functions have been multiplied by their respective coefficients and thus differ.

### More options with GenoGAMSettings and the HDF5 backend

For most genomes beyond yeast, keeping all data in memory is impractical or even impossible. Therefore, the HDF5 backend can be employed to store all relevant data on hard drive. This is done by simply setting the parameter hdf5 to TRUE. It also triggers a split of the underlying data into a list of data frames by chromosome. Splitting the data is theoretically not necessary; however, due to problems representing long integers in the Bioconductor infrastructure, data frames bigger than 2^31 rows (around 2.14 gigabase pairs) cannot be stored easily, or only through workarounds that would require more storage. Moreover, because HDF5 data is stored on hard drive, it can easily be moved to a different device, but has to be read in again. To make this easy, the parameter fromHDF5 can be set to TRUE to trigger GenoGAMDataSet to read from already present HDF5 files.

```r
ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype), hdf5 = TRUE
)

## Note the difference in data type compared to the DataFrame above
assay(ggd)
```
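If HDF5 files from a previous run are already present, the dataset can be restored from them instead of re-counting from the BAM files (a sketch, not evaluated; it assumes the HDF5 folder of the previous run is registered in the settings, see below):

```r
## not evaluated: read from already present HDF5 files
ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype),
  hdf5 = TRUE, fromHDF5 = TRUE
)
```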

By default, the HDF5 files are stored in the temp folder, which is impractical if the data has to be kept and shared. The HDF5 folder can therefore be specified explicitly in the GenoGAMSettings object. This object holds a number of meta parameters that control the functionality of GenoGAM. In particular, it guards the areas covered in the following subsections: data read-in, parameter estimation, and data control (including HDF5 storage).

All settings can be inspected by simply printing the GenoGAMSettings object. In addition to the meta parameters, this also shows the parallel backend information and the logger threshold, although neither is part of the GenoGAMSettings object.

```r
GenoGAMSettings()
```

#### Reading in data

Reading in data is guarded by the bamParams slot, which makes direct use of the ScanBamParam class from the Rsamtools package. It is the easiest way to restrict the read-in to particular regions. Restriction to entire chromosomes is done more simply through the chromosomeList slot. Finally, the center slot specifies whether fragments should be centered or not.

```r
range <- GRanges("chrI", IRanges(30000, 50000))
params <- Rsamtools::ScanBamParam(which = range)
settings <- GenoGAMSettings(center = FALSE,
                            bamParams = params)

ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype),
  settings = settings)

## Note the higher counts
assay(ggd)
rowRanges(ggd)
```

```r
## now with chromosome specification; as only chrI is present
## in the example file, the read-in will log an error and generate
## an empty GenoGAMDataSet object
settings <- GenoGAMSettings(chromosomeList = c("chrII", "chrIII"))

ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype),
  settings = settings)

## empty GenoGAMDataSet due to lack of data in the example BAM file
length(ggd)
ggd
```

#### Parameter estimation

Cross-validation is performed to estimate the penalization parameter $\lambda$ and the dispersion parameter $\theta$. This is done by iteratively maximizing the likelihood, where each iteration involves one round of k-fold cross-validation. The default method for maximizing the likelihood is Nelder-Mead, as the likelihood function cannot be specified analytically and has no derivatives. The method can be changed to any method accepted by R's optim function, but the only reasonable alternative is L-BFGS-B. Moreover, the number of iterations can be changed. The same holds for the model fitting function, where the maximum number of iterations and the convergence threshold can be changed.

```r
## optimControl and estimControl accept more parameters,
## but only the ones shown here are usually of interest
settings <- GenoGAMSettings(optimMethod = "L-BFGS-B", optimControl = list(maxit = 100L),
                            estimControl = list(eps = 1e-10, maxiter = 500L))
settings
```

#### Data control

Probably the most important parameter to set through GenoGAMSettings is the HDF5 folder. Additionally, the compression level can be changed, as an integer between 0 (lowest) and 9 (highest). Note that specifying the folder also creates it!

```r
## set the HDF5 folder to 'myFolder'
tmp <- tempdir()
settings <- GenoGAMSettings(hdf5Control = list(dir = file.path(tmp, "myFolder"), level = 0))
HDF5Array::getHDF5DumpDir()
HDF5Array::getHDF5DumpCompressionLevel()

## another way to set the folder is through the HDF5Array package
HDF5Array::setHDF5DumpDir(file.path(tmp, "myOtherFolder"))
settings
```

As mentioned before, the fitted splines depend on the placement of knots. The more knots are placed, the wigglier the fit and the more the spline has to be penalized to avoid overfitting; this penalization is done automatically. However, too few knots lead to underfitting, with no automatic resolution. The default setting is one knot every 20bp. In our experience this is a conservative setting that works for all applications. For example, narrow peak calling might need more knots than broad peak calling, as high resolution is required to identify two very close peaks as two peaks instead of one. Because the computational fitting problem grows quadratically with the number of knots, fewer knots can speed up computation, but might also worsen the results of the downstream application. This parameter should therefore be handled with care.
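For example, a narrow peak-calling application might halve the knot spacing (a sketch; since the cost grows quadratically with the number of knots, this roughly quadruples the per-tile fitting cost):

```r
## place a knot every 10bp instead of the default 20bp
settings <- GenoGAMSettings(dataControl = list(bpknots = 10))
```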

Additionally, computation is sped up during cross-validation by using smaller tiles. Since $\lambda$ and $\theta$ are independent of tile length, smaller tiles can be used to speed up parameter estimation. The parameter is set as a chunk size (regionSize) and defaults to 4kb. Generally one shouldn't go below 2kb, as the computation might become unstable in combination with sparse data.

```r
## for example, choose a region twice as long but with double the knot
## spacing, resulting in the same number of knots per tile as before
settings <- GenoGAMSettings(dataControl = list(regionSize = 8000, bpknots = 40))
settings
```

**Note:** It is generally not advised to change the settings parameters, unless it concerns the specification of regions for data read-in or the HDF5 folder.

## Size factor estimation
Sequencing libraries typically vary in sequencing depth. Such variation is controlled for in GenoGAM by adding a sample-specific constant to the right term of Equation \@ref(eq:mutantmodel). The estimation of these constants is performed by the function `computeSizeFactors()` as follows:

```r
## read in the data again, since the last read-in produced an empty GenoGAMDataSet
ggd <- GenoGAMDataSet(
  expDesign, directory = wd_folder,
  design = ~ s(x) + s(x, by = genotype)
)

ggd <- computeSizeFactors(ggd)
sizeFactors(ggd)
```

**Note:** The size factors in GenoGAM are on the natural logarithm scale. Factor groups are identified automatically from the design; for more complex designs this might fail, or different groupings might be desired. In that case, the groups can be provided separately through the *factorGroups* argument.
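A sketch of supplying the groups manually (the IDs are hypothetical, and it is assumed that *factorGroups* takes a list of ID vectors, one vector per group):

```r
## normalize wild-type and mutant samples as separate groups
groups <- list(c("wt_1", "wt_2"), c("mutant_1", "mutant_2"))
ggd <- computeSizeFactors(ggd, factorGroups = groups)
```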

## Model fitting
A GenoGAM model requires two further parameters to be fitted: the regularization parameter $\lambda$ and the dispersion parameter $\theta$. The regularization parameter $\lambda$ controls the amount of smoothing; the larger $\lambda$, the smoother the smooth functions. The dispersion parameter $\theta$ controls how much the observed counts deviate from their expected value modeled by Equation \@ref(eq:mutantmodel). The dispersion captures biological and technical variation, which one typically sees across replicate samples, as well as errors of the model. In GenoGAM, the dispersion is modeled by assuming the counts to follow a negative binomial distribution with mean $\mu=\operatorname{E}(y)$, whose logarithm is modeled by Equation \@ref(eq:mutantmodel), and variance $v = \mu + \mu^2/\theta$.
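For instance, with the dispersion used below ($\theta = 4.51$), a position with expected count $\mu = 10$ has variance $v = 10 + 10^2/4.51 \approx 32$, about three times the variance a Poisson model ($v = \mu$) would permit.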

If not provided, the parameters $\lambda$ and $\theta$ are obtained by cross-validation. This step is too time-consuming for a vignette. For the sake of going through this example quickly, we provide the values manually:

```r
## fit the model without parameter estimation
fit <- genogam(ggd, lambda = 4601, theta = 4.51)
fit
```

**Remark on parameter estimation:** To estimate the parameters $\lambda$ and $\theta$ by cross-validation, call genogam() without setting their values. This performs a k-fold cross-validation on each tile with initial parameter values and iterates until convergence, often for about 30 iterations. The number of folds can be set with the parameter kfolds. We recommend doing it for 20 to 40 different regions (the number can be set with the parameter regions). During this procedure, the underlying optimization algorithm (Nelder-Mead by default) explores the possible parameter space. In low-count regions this might result in unstable matrix operations, in particular in the resolution of linear systems. To stabilize the computation, a small first-order regularization can be added through the parameter eps, which represents $\epsilon$ below. This changes the penalization term from $\beta^T \textbf{S} \beta$ to $\beta^T (\textbf{S} + \epsilon \textbf{I}) \beta$, where the identity matrix $\textbf{I}$ has the same dimension as the penalty matrix $\textbf{S}$. A too large $\epsilon$ will have a significant effect on the general fit, while a small $\epsilon$ mostly affects values close to zero. We recommend an $\epsilon$ in the range of 1e-4 to 1e-3.
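Putting these parameters together, a full cross-validation run could look like this (not evaluated; the values shown are illustrative, within the ranges given above):

```r
## not evaluated: estimate lambda and theta by cross-validation
fit_CV <- genogam(ggd, kfolds = 10, regions = 20, eps = 1e-4)
```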

An important difference of cross-validation in GenoGAM is that instead of leaving out single data points, entire intervals are left out. This improves the parameter estimation by taking into account local correlation, where leaving out single points might lead to overfitting. If replicates are present in the data, cross-validation can be further improved by enlarging the left-out intervals (20bp by default) to around the size of a ChIP-Seq fragment, i.e. 200bp. The large left-out interval is then still covered by the data in the replicate.

```r
## not evaluated
fit_CV <- genogam(ggd, intervalSize = 200)
```

**Remark on parallel computing:** When using the SnowParam backend, the initiation of a worker and the copying of relevant data might take a few seconds, especially if done sequentially. During cross-validation, where computation is done on small tiles, the computation itself might therefore be only a fraction of the total overhead. Hence, the number of workers is automatically set to an optimal lower number during cross-validation and reset afterwards.

## Plotting results

Following the convention of the mgcv and gam packages, the name of the fit of a smooth function defined by the by variable in the design formula (which is the same as the column name in the experiment design) follows the pattern s(x):by-variable. Here, the smooth function of interest, $f_\text{mutant/wt}(x)$, is thus named s(x):genotype.
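The available smooth names can be inspected directly on the fit object (a sketch; it assumes the fits returned by fits() carry one column per smooth):

```r
## list the smooth names, e.g. "s(x)" and "s(x):genotype"
colnames(fits(fit))
```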

```r
ranges <- GRanges("chrI", IRanges(45000, 55000))
plotGenoGAM(fit, ranges = ranges)
```

Because the data is generally too big to plot in its entirety, the plotting function requires a specific range, best provided as a GRanges object through the ranges argument. Additionally, it can be specified whether the smooths should be plotted on the same scale (scale), whether the values should be displayed on the log scale (log), and whether the raw count data should be included (by providing the GenoGAMDataSet through ggd, unfortunately hard to show in the vignette).

```r
plotGenoGAM(fit, ranges = ranges, scale = FALSE)
```

## Statistical testing

For each smooth and at each position $x$, we test the null hypothesis that its value is 0 by a call to computeSignificance(). This gives us point-wise p-values.

```r
fit <- computeSignificance(fit)
pvalue(fit)
```

### Differential binding

If region-wise significance is needed, as in the case of differential binding, we call computeRegionSignificance(). This returns the provided GRanges object updated by p-value and FDR columns. If smooth is not provided, it is computed for all smooths.

```r
gr <- GRanges("chrI", IRanges(c(1000, 7000), c(4000, 9000)))
db <- computeRegionSignificance(fit, regions = gr, smooth = "s(x):genotype")
db
```

### Peak calling

The computed fit allows for easy calling of narrow and broad peaks via callPeaks(). The resulting data.table object has a structure similar to the ENCODE narrowPeak and broadPeak formats. Note that the score is given as the negative natural logarithm of the p-value. Also, no peak borders are provided (a function to compute them, as in GenoGAM 1.0, is being implemented). As this is not peak-calling data, we use a high threshold to demonstrate the functionality.

```r
peaks <- callPeaks(fit, threshold = 1, smooth = "s(x):genotype")
peaks

peaks <- callPeaks(fit, threshold = 1, peakType = "broad",
                   cutoff = 0.75, smooth = "s(x):genotype")
peaks
```
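Because the score is the negative natural logarithm of the p-value, it can be mapped back if needed (a sketch; it assumes the score column of the returned data.table is named score):

```r
## recover p-values from the reported scores
peaks$pvalue <- exp(-peaks$score)
```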

The function writeToBEDFile() provides an easy way to write the peaks data.table to a narrowPeaks or broadPeaks file. The suffix is determined automatically.

```r
## not evaluated
writeToBEDFile(peaks, file = 'myPeaks')
```

# FAQ

## Armadillo Error

Problem: An error occurs during model fitting, along these lines:

```
error: matrix multiplication: incompatible matrix dimensions: 22333830147200x5360120267008000 and 4294972496x1
```

Solution: First, make sure you have all Armadillo dependencies installed correctly.

Second, the error is most likely related to the fact that Armadillo uses 32-bit matrices by default, causing problems for the large matrices GenoGAM uses. The solution is to enable ARMA_64BIT_WORD, which is not enabled in RcppArmadillo by default. This should have been done during compilation, but if it fails for some reason, you can do it manually with `#define ARMA_64BIT_WORD 1` in my_R_Directory/lib/R/library/RcppArmadillo/include/RcppArmadilloConfig.h.

# Acknowledgments

We thank Alexander Engelhardt, Mathilde Galinier, Simon Wood, Hervé Pagès, and Martin Morgan for their input in the development of GenoGAM.

# Session Info

```r
sessionInfo()
```

# References


