mixNBHMM is an R package for the detection of genomic regions exhibiting
short and broad differential enrichment under multi-replicate,
multi-condition settings. It is applicable to data from ChIP-seq,
ATAC-seq, DNase-seq, and related experiments. mixNBHMM
detects these
regions using a three-state hidden Markov model (HMM) with a finite
mixture of negative binomials as emission distribution in the HMM’s
differential state. The HMM framework is particularly suitable for the
detection of differential broad peaks and the mixture model allows the
classification of differential combinatorial patterns of enrichment
across conditions. Code to replicate the results from the mixNBHMM paper
is available at https://github.com/plbaldoni/mixNBHMMPaper.
You can install the released version of mixNBHMM from this repository with:
devtools::install_github("plbaldoni/mixNBHMM")
This example illustrates the necessary steps to run mixNBHMM
given a
set of BAM file paths as input. Using
ChIP-seq data, we would like to detect differential regions of
enrichment for the histone modification
H3K36me3 across distinct cell
lines, which
typically has broad regions of enrichment.
Here, I will use BAM files from the ENCODE Consortium for the cell lines
GM12878, H1-hESC, and HepG2. You can download the files
here.
The mixNBHMM package already contains these data in RData format that is
suitable for a quick run of mixNBHMM
. You can access the processed
data and their documentation by calling data(ENCODE)
and ?ENCODE
,
respectively. These data have gone through some of the necessary
preprocessing steps, namely alignment, filtering of low quality reads,
and PCR duplicate removal.
Here
is an overview of these steps.
First, download and unzip the data that will be used in this example.
# Downloading the data
download.file(url = 'https://www.dropbox.com/s/xyvhvr47lts6joo/mixNBHMM.zip?raw=1',
destfile = './mixNBHMM.zip')
# Unzipping BAM files
unzip(zipfile = './mixNBHMM.zip')
The function mixNBHMMDataSetFromBam
reads in the BAM files and outputs
a RangedSummarizedExperiment
object that contains the experimental
read counts from your data. Additional required inputs include
colData
, a data frame specifying the Condition and Replicate
labels, genome
, a
GRanges
object with the chromosome lengths of the reference genome, and
windowSize
, an integer specifying the size of the windows where read
counts will be computed. Optional parameters include blackList
, a
GRanges object with a set of genomic coordinates to be discarded, and
fragLength
, a vector of estimated fragment lengths (one per BAM file).
If you do not specify fragLength
, the package will estimate the
fragment length of your experiments internally.
The current implementation of the package has a built-in
hg19
genome and its curated
blacklist
for ease of access. Check ?mixNBHMMDataSetFromBam
for the full manual.
Because H3K36me3 is a histone mark that exhibit broad regions of
enrichment, we will use a fairly large windowSize
of 500 base pairs.
library(mixNBHMM)
# Specifying the path for the bam files
bamFiles <- c('./wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
'./wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam',
'./wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
'./wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam',
'./wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
'./wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam')
# Creating the Condition & Replicate labels. It must agree with bamFiles.
colData <- data.frame(Condition = c('GM12878','GM12878',
'H1-hESC','H1-hESC',
'HepG2','HepG2'),
Replicate = c(1,2,1,2,1,2))
# Setting up the data
ENCODE <- mixNBHMMDataSetFromBam(bamFiles = bamFiles,
colData = colData,
genome = 'hg19',
blackList = 'hg19',
windowSize = 500)
From ENCODE
, one can extract the information about the samples, the
matrix of read counts (rows are genomic windows, columns are samples),
and the genomic coordinates associated with the genomic windows
# Sample information
colData(ENCODE)
#> DataFrame with 6 rows and 2 columns
#> Condition Replicate
#> <factor> <numeric>
#> GM12878.1 GM12878 1
#> GM12878.2 GM12878 2
#> H1-hESC.1 H1-hESC 1
#> H1-hESC.2 H1-hESC 2
#> HepG2.1 HepG2 1
#> HepG2.2 HepG2 2
# Summary of read counts
summary(assay(ENCODE))
#> GM12878.1 GM12878.2 H1-hESC.1 H1-hESC.2
#> Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.000
#> Median : 1.000 Median : 1.00 Median : 1.00 Median : 2.000
#> Mean : 2.218 Mean : 1.54 Mean : 2.06 Mean : 2.021
#> 3rd Qu.: 3.000 3rd Qu.: 2.00 3rd Qu.: 3.00 3rd Qu.: 3.000
#> Max. :36.000 Max. :35.00 Max. :72.00 Max. :39.000
#> HepG2.1 HepG2.2
#> Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 1.000
#> Median : 0.000 Median : 2.000
#> Mean : 1.156 Mean : 2.743
#> 3rd Qu.: 1.000 3rd Qu.: 3.000
#> Max. :52.000 Max. :65.000
# Genomic coordinates
rowRanges(ENCODE)
#> GRanges object with 470905 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr2 1-499 *
#> [2] chr2 500-999 *
#> [3] chr2 1000-1499 *
#> [4] chr2 1500-1999 *
#> [5] chr2 2000-2499 *
#> ... ... ... ...
#> [470901] chr2 243050101-243050600 *
#> [470902] chr2 243050601-243051100 *
#> [470903] chr2 243051101-243051600 *
#> [470904] chr2 243051601-243052100 *
#> [470905] chr2 243199301-243199373 *
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
In differential peak calling, it is critical to account for systematic
differences across samples due to technical artifacts. Otherwise, these
can introduce all sorts of biases in downstream analyses. Lun and Smyth
(2015) present an overview on
these issues in the context of differential peak calling from ChIP-seq
experiments. The mixNBHMM
accounts for sample- and window-specific
scaling factors through model offsets, which can be estimated by the
function createOffset
.
In data exhibiting broad enrichment regions, such as data from H3K36me3,
it is common to observe that differences in local ChIP-seq signal
depends non linearly on the local abundance level of mapped read counts
across samples. The createOffset
function implements a loess smoothing
to adjust for these non linear differences[1]. To do so, the object
ENCODE
has to be passed as an argument to createOffset
with
type='loess'
. The returned object will then contain a new assay matrix
offset
with the model offsets.
# Calculating model offsets
ENCODE <- createOffset(object = ENCODE,type = 'loess')
# Summary of model offsets
summary(assay(ENCODE,'offset'))
#> GM12878.1 GM12878.2 H1-hESC.1 H1-hESC.2
#> Min. :-0.30252 Min. :-0.44769 Min. :-0.19705 Min. :-0.25584
#> 1st Qu.:-0.09019 1st Qu.:-0.31800 1st Qu.: 0.01525 1st Qu.: 0.03453
#> Median : 0.08472 Median :-0.16899 Median : 0.12479 Median : 0.14327
#> Mean : 0.08682 Mean :-0.15405 Mean : 0.10690 Mean : 0.11859
#> 3rd Qu.: 0.24343 3rd Qu.:-0.03606 3rd Qu.: 0.22274 3rd Qu.: 0.22771
#> Max. : 0.49778 Max. : 0.38673 Max. : 0.35804 Max. : 0.31985
#> HepG2.1 HepG2.2
#> Min. :-0.6171 Min. :-0.23370
#> 1st Qu.:-0.5286 1st Qu.: 0.04593
#> Median :-0.4814 Median : 0.21193
#> Mean :-0.3940 Mean : 0.20567
#> 3rd Qu.:-0.3540 3rd Qu.: 0.36405
#> Max. : 0.8420 Max. : 0.80563
Once the offsets have been calculated, one may call mixNBHMM
to detect
differential peaks across conditions. The function mixNBHMM
takes as
an argument a list of parameters that control the behavior of its EM
algorithm. These parameters are provided by the function controlEM
and
you can check them from the manual by calling ?controlEM
.
# Calling peaks
ENCODE <- mixNBHMM(object = ENCODE,control = controlEM())
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#> Setting up the data...
#> Ordered conditions: GM12878 H1-hESC HepG2
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#> Algorithm initialization...
#> Initialization completed!
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#> The EM algorithm begins...
#> Done!
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
The relevant results from the EM algorithm are saved as metadata in the
ENCODE
object. Please, refer to the manual for a detailed description
of these results by calling ?mixNBHMM
. For differential peak
detection, one might be interested in looking at the HMM posterior
probabilities or the Viterbi path of HMM states.
# HMM posterior probabilities
metadata(ENCODE)$prob
#> Background Differential Enrichment
#> 1: 1.0000000 4.231317e-61 4.218731e-134
#> 2: 0.9999966 3.425834e-06 2.698578e-11
#> 3: 0.9999965 3.482477e-06 2.724026e-11
#> 4: 0.9999965 3.483414e-06 2.724447e-11
#> 5: 0.9999965 3.483429e-06 2.724454e-11
#> ---
#> 470901: 0.9999838 1.622012e-05 1.046935e-10
#> 470902: 0.9999921 7.866768e-06 8.398454e-11
#> 470903: 0.9999437 5.630309e-05 5.829989e-10
#> 470904: 0.9993762 6.237519e-04 1.850620e-08
#> 470905: 0.9997153 2.846611e-04 1.237726e-08
# Viterbi sequence of HMM states
# B = consensus background
# D = differential
# E = consensus enrichment
head(metadata(ENCODE)$viterbi)
#> [1] "B" "B" "B" "B" "B" "B"
table(metadata(ENCODE)$viterbi)
#>
#> B D E
#> 384222 53915 32768
The function plotPeaks
can be used to plot normalized read counts from
a given genomic region along with the called peaks (and their associated
combinatorial pattern of enrichment) and the HMM posterior
probabilities. To this end, plotPeaks
determines the most likely
differential combinatorial pattern of enrichment from each genomic
window and then checks for the most frequent pattern across genomic
windows pertaining to each called peak.
plotPeaks
takes as input the object ENCODE
, a GRanges objects with
the genomic coordinate to be plotted, and the type of peak calls. By
default, plotPeaks
shows the peak calls based on the Viterbi sequence
of HMM states (type='viterbi'
). Conversely, one may set a desired
false discovery rate control level, such as type=0.05
, and the
function will internally threshold HMM posterior probabilities to call
peaks.
# Peak visualization
# Black track represents the differential peaks
# Different colors represent the most likely combinatorial pattern of enrichment on the window level
#
# B indicates Background, E indicates Enrichment
#
# For example, 'BBE' represents the window-based posterior probabiltiy of
# local enrichment in cell line Hepg2 only (third condition,
# see mixNBHMM output), conditional on the differential HMM state.
plotPeaks(object = ENCODE,ranges = GRanges('chr2',IRanges(168750000,169200000)))
Two of the main features of mixNBHMM are its efficient EM-based
framework and the embedded mixture model. One may use the estimated
posterior probabilities from the mixture model to classify differential
combinatorial patterns of enrichment across groups from windows assigned
to belong to the differential HMM state. These posterior probabilities
are saved as metadata in the output of
mixNBHMM
.
# Logical vector for differential windows using the Viterbi sequence of states
viterbi <- (metadata(ENCODE)$viterbi=='D')
# Mixture model posterior probabilities
metadata(ENCODE)$mixProb[viterbi,]
#> EBB BEB BBE EEB EBE
#> 1: 3.491076e-06 7.997855e-01 2.542504e-05 5.032848e-03 3.977382e-07
#> 2: 4.109254e-02 1.127634e-01 7.530103e-01 8.086910e-04 1.342490e-02
#> 3: 2.576082e-02 8.144664e-01 8.148354e-02 4.327836e-03 1.076373e-03
#> 4: 1.283931e-02 8.876068e-01 3.449080e-03 3.938827e-02 3.804915e-04
#> 5: 1.140612e-01 7.026276e-01 5.929330e-02 3.117967e-02 6.541048e-03
#> ---
#> 53911: 3.347313e-02 2.632016e-05 2.302879e-01 3.378894e-05 7.349412e-01
#> 53912: 2.286187e-03 5.564515e-07 1.047326e-01 1.907502e-06 8.925135e-01
#> 53913: 9.992599e-02 1.336579e-04 2.297672e-01 1.563344e-04 6.681029e-01
#> 53914: 2.279330e-02 2.121583e-03 9.346748e-01 3.074116e-05 3.366792e-02
#> 53915: 2.629660e-04 8.027613e-07 3.101436e-01 7.133294e-07 6.851126e-01
#> BEE
#> 1: 0.1951523165
#> 2: 0.0789002076
#> 3: 0.0728850634
#> 4: 0.0563360064
#> 5: 0.0862972332
#> ---
#> 53911: 0.0012376760
#> 53912: 0.0004652566
#> 53913: 0.0019139111
#> 53914: 0.0067116810
#> 53915: 0.0044793087
The mixNBHMM package contains a function called summarizePeaks
that
exports the results for further visualization in genome browsers, such
as the UCSC Genome Browser. In particular,
it exports the set of called peaks and the associated mixture model
posterior probabilities in
BED and
bedGraph
formats, respectively.
By default, the function summarizePeaks
takes ENCODE
as input and
outputs a GRanges object with the differential peaks based on the
Viterbi path of HMM states (type='viterbi'
). One may also want to use
type='0.05'
to output the set of peaks by controlling the false
discovery rate at 0.05 level. In addition, the arguments name
, path
,
and description
specify the name of the output files (without the BED
and bedGraph extension), the location where the files will be saved, and
a brief text description, respectively.
# Calling summarizePeaks
output <- summarizePeaks(object = ENCODE,
name = 'ENCODE',
path = './output/',
description = 'mixNBHMM peak calls')
#> The following files have been saved:
#> ./output/ENCODE.bed
#> ./output/ENCODE_EBB.bedGraph
#> ./output/ENCODE_BEB.bedGraph
#> ./output/ENCODE_BBE.bedGraph
#> ./output/ENCODE_EEB.bedGraph
#> ./output/ENCODE_EBE.bedGraph
#> ./output/ENCODE_BEE.bedGraph
Once uploaded on a genome browser, one can visualize the peak calls and posterior probabilities associated to different combinatorial patterns of enrichment across cell lines.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.