library(knitr) library(mixNBHMM) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" ) opts_knit$set(root.dir = '~/Downloads/inst/extdata/')
mixNBHMM is an R package for the detection of genomic regions exhibiting short and broad differential enrichment under multi-replicate, multi-condition settings. It is applicable to data from ChIP-seq, ATAC-seq, DNase-seq, and related experiments. mixNBHMM
detects these regions using a three-state hidden Markov model (HMM) with a finite mixture of negative binomials as emission distribution in the HMM’s differential state. The HMM framework is particularly suitable for the detection of differential broad peaks and the mixture model allows the classification of differential combinatorial patterns of enrichment across conditions. Code to replicate the results from the mixNBHMM paper is available at https://github.com/plbaldoni/mixNBHMMPaper.
You can install the released version of mixNBHMM from this repository with:
devtools::install_github("plbaldoni/mixNBHMM")
This example illustrates the necessary steps to run mixNBHMM
given a set of BAM file paths as input. Using ChIP-seq data, we would like to detect differential regions of enrichment for the histone modification H3K36me3 across distinct cell lines, which typically has broad regions of enrichment.
Here, I will use BAM files from the ENCODE Consortium for the cell lines GM12878, H1-hESC, and HepG2. You can download the files here. The mixNBHMM package already contains these data in RData format that is suitable for a quick run of mixNBHMM
. You can access the processed data and their documentation by calling data(ENCODE)
and ?ENCODE
, respectively. These data have gone through some of the necessary preprocessing steps, namely alignment, filtering of low quality reads, and PCR duplicate removal. Here is an overview of these steps.
First, download and unzip the data that will be used in this example.
# Downloading the data download.file(url = 'https://www.dropbox.com/s/xyvhvr47lts6joo/mixNBHMM.zip?raw=1', destfile = './mixNBHMM.zip') # Unzipping BAM files unzip(zipfile = './mixNBHMM.zip')
The function mixNBHMMDataSetFromBam
reads in the BAM files and outputs a RangedSummarizedExperiment
object that contains the experimental read counts from your data. Additional required inputs include colData
, a data frame specifying the Condition and Replicate labels, genome
, a GRanges object with the chromosome lengths of the reference genome, and windowSize
, an integer specifying the size of the windows where read counts will be computed. Optional parameters include blackList
, a GRanges object with a set of genomic coordinates to be discarded, and fragLength
, a vector of estimated fragment lengths (one per BAM file). If you do not specify fragLength
, the package will estimate the fragment length of your experiments internally.
The current implementation of the package has a built-in hg19 genome and its curated blacklist for ease of access. Check ?mixNBHMMDataSetFromBam
for the full manual.
Because H3K36me3 is a histone mark that exhibit broad regions of enrichment, we will use a fairly large windowSize
of 500 base pairs.
library(mixNBHMM) # Specifying the path for the bam files bamFiles <- c('./wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam', './wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam', './wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam', './wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam', './wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam', './wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam') # Creating the Condition & Replicate labels. It must agree with bamFiles. colData <- data.frame(Condition = c('GM12878','GM12878', 'H1-hESC','H1-hESC', 'HepG2','HepG2'), Replicate = c(1,2,1,2,1,2)) # Setting up the data ENCODE <- mixNBHMMDataSetFromBam(bamFiles = bamFiles, colData = colData, genome = 'hg19', blackList = 'hg19', windowSize = 500)
From ENCODE
, one can extract the information about the samples, the matrix of read counts (rows are genomic windows, columns are samples), and the genomic coordinates associated with the genomic windows
# Sample information colData(ENCODE)
# Summary of read counts summary(assay(ENCODE))
# Genomic coordinates rowRanges(ENCODE)
In differential peak calling, it is critical to account for systematic differences across samples due to technical artifacts. Otherwise, these can introduce all sorts of biases in downstream analyses. Lun and Smyth (2015) present an overview on these issues in the context of differential peak calling from ChIP-seq experiments. The mixNBHMM
accounts for sample- and window-specific scaling factors through model offsets, which can be estimated by the function createOffset
.
In data exhibiting broad enrichment regions, such as data from H3K36me3, it is common to observe that differences in local ChIP-seq signal depends non linearly on the local abundance level of mapped read counts across samples. The createOffset
function implements a loess smoothing to adjust for these non linear differences^[Similar techniques have been previously used by Shao et al (2012) and Lun and Smyth (2015)]. To do so, the object ENCODE
has to be passed as an argument to createOffset
with type='loess'
. The returned object will then contain a new assay matrix offset
with the model offsets.
# Calculating model offsets ENCODE <- createOffset(object = ENCODE,type = 'loess')
# Summary of model offsets summary(assay(ENCODE,'offset'))
Once the offsets have been calculated, one may call mixNBHMM
to detect differential peaks across conditions. The function mixNBHMM
takes as an argument a list of parameters that control the behavior of its EM algorithm. These parameters are provided by the function controlEM
and you can check them from the manual by calling ?controlEM
.
# Calling peaks ENCODE <- mixNBHMM(object = ENCODE,control = controlEM())
The relevant results from the EM algorithm are saved as metadata in the ENCODE
object. Please, refer to the manual for a detailed description of these results by calling ?mixNBHMM
. For differential peak detection, one might be interested in looking at the HMM posterior probabilities or the Viterbi path of HMM states.
# HMM posterior probabilities metadata(ENCODE)$prob
# Viterbi sequence of HMM states # B = consensus background # D = differential # E = consensus enrichment head(metadata(ENCODE)$viterbi) table(metadata(ENCODE)$viterbi)
The function plotPeaks
can be used to plot normalized read counts from a given genomic region along with the called peaks (and their associated combinatorial pattern of enrichment) and the HMM posterior probabilities. To this end, plotPeaks
determines the most likely differential combinatorial pattern of enrichment from each genomic window and then checks for the most frequent pattern across genomic windows pertaining to each called peak.
plotPeaks
takes as input the object ENCODE
, a GRanges objects with the genomic coordinate to be plotted, and the type of peak calls. By default, plotPeaks
shows the peak calls based on the Viterbi sequence of HMM states (type='viterbi'
). Conversely, one may set a desired false discovery rate control level, such as type=0.05
, and the function will internally threshold HMM posterior probabilities to call peaks.
# Peak visualization # Black track represents the differential peaks # Different colors represent the most likely combinatorial pattern of enrichment on the window level # # B indicates Background, E indicates Enrichment # # For example, 'BBE' represents the window-based posterior probabiltiy of # local enrichment in cell line Hepg2 only (third condition, # see mixNBHMM output), conditional on the differential HMM state. plotPeaks(object = ENCODE,ranges = GRanges('chr2',IRanges(168750000,169200000)))
Two of the main features of mixNBHMM are its efficient EM-based framework and the embedded mixture model. One may use the estimated posterior probabilities from the mixture model to classify differential combinatorial patterns of enrichment across groups from windows assigned to belong to the differential HMM state. These posterior probabilities are saved as metadata in the output of mixNBHMM
.
# Logical vector for differential windows using the Viterbi sequence of states viterbi <- (metadata(ENCODE)$viterbi=='D') # Mixture model posterior probabilities metadata(ENCODE)$mixProb[viterbi,]
The mixNBHMM package contains a function called summarizePeaks
that exports the results for further visualization in genome browsers, such as the UCSC Genome Browser. In particular, it exports the set of called peaks and the associated mixture model posterior probabilities in BED and bedGraph formats, respectively.
By default, the function summarizePeaks
takes ENCODE
as input and outputs a GRanges object with the differential peaks based on the Viterbi path of HMM states (type='viterbi'
). One may also want to use type='0.05'
to output the set of peaks by controlling the false discovery rate at 0.05 level. In addition, the arguments name
, path
, and description
specify the name of the output files (without the BED and bedGraph extension), the location where the files will be saved, and a brief text description, respectively.
# Calling summarizePeaks output <- summarizePeaks(object = ENCODE, name = 'ENCODE', path = './output/', description = 'mixNBHMM peak calls')
Once uploaded on a genome browser, one can visualize the peak calls and posterior probabilities associated to different combinatorial patterns of enrichment across cell lines.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.