README.md

mixNBHMM

mixNBHMM is an R package for the detection of genomic regions exhibiting short and broad differential enrichment under multi-replicate, multi-condition settings. It is applicable to data from ChIP-seq, ATAC-seq, DNase-seq, and related experiments. mixNBHMM detects these regions using a three-state hidden Markov model (HMM) with a finite mixture of negative binomials as emission distribution in the HMM’s differential state. The HMM framework is particularly suitable for the detection of differential broad peaks and the mixture model allows the classification of differential combinatorial patterns of enrichment across conditions. Code to replicate the results from the mixNBHMM paper is available at https://github.com/plbaldoni/mixNBHMMPaper.

Installation

You can install the released version of mixNBHMM from this repository with:

devtools::install_github("plbaldoni/mixNBHMM")

Example

This example illustrates the necessary steps to run mixNBHMM given a set of BAM file paths as input. Using ChIP-seq data, we would like to detect differential regions of enrichment for the histone modification H3K36me3 across distinct cell lines, which typically has broad regions of enrichment.

Here, I will use BAM files from the ENCODE Consortium for the cell lines GM12878, H1-hESC, and HepG2. You can download the files here. The mixNBHMM package already contains these data in RData format that is suitable for a quick run of mixNBHMM. You can access the processed data and their documentation by calling data(ENCODE) and ?ENCODE, respectively. These data have gone through some of the necessary preprocessing steps, namely alignment, filtering of low quality reads, and PCR duplicate removal. Here is an overview of these steps.

Step 1: loading BAM files

First, download and unzip the data that will be used in this example.

# Downloading the data
download.file(url = 'https://www.dropbox.com/s/xyvhvr47lts6joo/mixNBHMM.zip?raw=1',
              destfile = './mixNBHMM.zip')

# Unzipping BAM files
unzip(zipfile = './mixNBHMM.zip')

The function mixNBHMMDataSetFromBam reads in the BAM files and outputs a RangedSummarizedExperiment object that contains the experimental read counts from your data. Additional required inputs include colData, a data frame specifying the Condition and Replicate labels, genome, a GRanges object with the chromosome lengths of the reference genome, and windowSize, an integer specifying the size of the windows where read counts will be computed. Optional parameters include blackList, a GRanges object with a set of genomic coordinates to be discarded, and fragLength, a vector of estimated fragment lengths (one per BAM file). If you do not specify fragLength, the package will estimate the fragment length of your experiments internally.

The current implementation of the package has a built-in hg19 genome and its curated blacklist for ease of access. Check ?mixNBHMMDataSetFromBam for the full manual.

Because H3K36me3 is a histone mark that exhibit broad regions of enrichment, we will use a fairly large windowSize of 500 base pairs.

library(mixNBHMM)

# Specifying the path for the bam files 
bamFiles <- c('./wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam')

# Creating the Condition & Replicate labels. It must agree with bamFiles.
colData <- data.frame(Condition = c('GM12878','GM12878',
                                   'H1-hESC','H1-hESC',
                                   'HepG2','HepG2'),
                     Replicate = c(1,2,1,2,1,2))

# Setting up the data
ENCODE <- mixNBHMMDataSetFromBam(bamFiles = bamFiles,
                                 colData = colData,
                                 genome = 'hg19',
                                 blackList = 'hg19',
                                 windowSize = 500)

From ENCODE, one can extract the information about the samples, the matrix of read counts (rows are genomic windows, columns are samples), and the genomic coordinates associated with the genomic windows

# Sample information
colData(ENCODE)
#> DataFrame with 6 rows and 2 columns
#>           Condition Replicate
#>            <factor> <numeric>
#> GM12878.1   GM12878         1
#> GM12878.2   GM12878         2
#> H1-hESC.1   H1-hESC         1
#> H1-hESC.2   H1-hESC         2
#> HepG2.1       HepG2         1
#> HepG2.2       HepG2         2
# Summary of read counts
summary(assay(ENCODE))
#>    GM12878.1        GM12878.2       H1-hESC.1       H1-hESC.2     
#>  Min.   : 0.000   Min.   : 0.00   Min.   : 0.00   Min.   : 0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 1.00   1st Qu.: 1.000  
#>  Median : 1.000   Median : 1.00   Median : 1.00   Median : 2.000  
#>  Mean   : 2.218   Mean   : 1.54   Mean   : 2.06   Mean   : 2.021  
#>  3rd Qu.: 3.000   3rd Qu.: 2.00   3rd Qu.: 3.00   3rd Qu.: 3.000  
#>  Max.   :36.000   Max.   :35.00   Max.   :72.00   Max.   :39.000  
#>     HepG2.1          HepG2.2      
#>  Min.   : 0.000   Min.   : 0.000  
#>  1st Qu.: 0.000   1st Qu.: 1.000  
#>  Median : 0.000   Median : 2.000  
#>  Mean   : 1.156   Mean   : 2.743  
#>  3rd Qu.: 1.000   3rd Qu.: 3.000  
#>  Max.   :52.000   Max.   :65.000
# Genomic coordinates
rowRanges(ENCODE)
#> GRanges object with 470905 ranges and 0 metadata columns:
#>            seqnames              ranges strand
#>               <Rle>           <IRanges>  <Rle>
#>        [1]     chr2               1-499      *
#>        [2]     chr2             500-999      *
#>        [3]     chr2           1000-1499      *
#>        [4]     chr2           1500-1999      *
#>        [5]     chr2           2000-2499      *
#>        ...      ...                 ...    ...
#>   [470901]     chr2 243050101-243050600      *
#>   [470902]     chr2 243050601-243051100      *
#>   [470903]     chr2 243051101-243051600      *
#>   [470904]     chr2 243051601-243052100      *
#>   [470905]     chr2 243199301-243199373      *
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Step 1: Data Normalization

In differential peak calling, it is critical to account for systematic differences across samples due to technical artifacts. Otherwise, these can introduce all sorts of biases in downstream analyses. Lun and Smyth (2015) present an overview on these issues in the context of differential peak calling from ChIP-seq experiments. The mixNBHMM accounts for sample- and window-specific scaling factors through model offsets, which can be estimated by the function createOffset.

In data exhibiting broad enrichment regions, such as data from H3K36me3, it is common to observe that differences in local ChIP-seq signal depends non linearly on the local abundance level of mapped read counts across samples. The createOffset function implements a loess smoothing to adjust for these non linear differences[1]. To do so, the object ENCODE has to be passed as an argument to createOffset with type='loess'. The returned object will then contain a new assay matrix offset with the model offsets.

# Calculating model offsets
ENCODE <- createOffset(object = ENCODE,type = 'loess')
# Summary of model offsets
summary(assay(ENCODE,'offset'))
#>    GM12878.1          GM12878.2          H1-hESC.1          H1-hESC.2       
#>  Min.   :-0.30252   Min.   :-0.44769   Min.   :-0.19705   Min.   :-0.25584  
#>  1st Qu.:-0.09019   1st Qu.:-0.31800   1st Qu.: 0.01525   1st Qu.: 0.03453  
#>  Median : 0.08472   Median :-0.16899   Median : 0.12479   Median : 0.14327  
#>  Mean   : 0.08682   Mean   :-0.15405   Mean   : 0.10690   Mean   : 0.11859  
#>  3rd Qu.: 0.24343   3rd Qu.:-0.03606   3rd Qu.: 0.22274   3rd Qu.: 0.22771  
#>  Max.   : 0.49778   Max.   : 0.38673   Max.   : 0.35804   Max.   : 0.31985  
#>     HepG2.1           HepG2.2        
#>  Min.   :-0.6171   Min.   :-0.23370  
#>  1st Qu.:-0.5286   1st Qu.: 0.04593  
#>  Median :-0.4814   Median : 0.21193  
#>  Mean   :-0.3940   Mean   : 0.20567  
#>  3rd Qu.:-0.3540   3rd Qu.: 0.36405  
#>  Max.   : 0.8420   Max.   : 0.80563

Step 2: Peak Calling

Once the offsets have been calculated, one may call mixNBHMM to detect differential peaks across conditions. The function mixNBHMM takes as an argument a list of parameters that control the behavior of its EM algorithm. These parameters are provided by the function controlEM and you can check them from the manual by calling ?controlEM.

# Calling peaks

ENCODE <- mixNBHMM(object = ENCODE,control = controlEM())
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
#> Setting up the data...
#> Ordered conditions: GM12878 H1-hESC HepG2 
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
#> Algorithm initialization...
#> Initialization completed!
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
#> The EM algorithm begins...
#> Done!
#> # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

The relevant results from the EM algorithm are saved as metadata in the ENCODE object. Please, refer to the manual for a detailed description of these results by calling ?mixNBHMM. For differential peak detection, one might be interested in looking at the HMM posterior probabilities or the Viterbi path of HMM states.

# HMM posterior probabilities

metadata(ENCODE)$prob
#>         Background Differential    Enrichment
#>      1:  1.0000000 4.231317e-61 4.218731e-134
#>      2:  0.9999966 3.425834e-06  2.698578e-11
#>      3:  0.9999965 3.482477e-06  2.724026e-11
#>      4:  0.9999965 3.483414e-06  2.724447e-11
#>      5:  0.9999965 3.483429e-06  2.724454e-11
#>     ---                                      
#> 470901:  0.9999838 1.622012e-05  1.046935e-10
#> 470902:  0.9999921 7.866768e-06  8.398454e-11
#> 470903:  0.9999437 5.630309e-05  5.829989e-10
#> 470904:  0.9993762 6.237519e-04  1.850620e-08
#> 470905:  0.9997153 2.846611e-04  1.237726e-08
# Viterbi sequence of HMM states
# B = consensus background
# D = differential
# E = consensus enrichment

head(metadata(ENCODE)$viterbi)
#> [1] "B" "B" "B" "B" "B" "B"

table(metadata(ENCODE)$viterbi)
#> 
#>      B      D      E 
#> 384222  53915  32768

Step 3: Peak Visualization

The function plotPeaks can be used to plot normalized read counts from a given genomic region along with the called peaks (and their associated combinatorial pattern of enrichment) and the HMM posterior probabilities. To this end, plotPeaks determines the most likely differential combinatorial pattern of enrichment from each genomic window and then checks for the most frequent pattern across genomic windows pertaining to each called peak.

plotPeaks takes as input the object ENCODE, a GRanges objects with the genomic coordinate to be plotted, and the type of peak calls. By default, plotPeaks shows the peak calls based on the Viterbi sequence of HMM states (type='viterbi'). Conversely, one may set a desired false discovery rate control level, such as type=0.05, and the function will internally threshold HMM posterior probabilities to call peaks.

# Peak visualization
# Black track represents the differential peaks
# Different colors represent the most likely combinatorial pattern of enrichment on the window level
#
# B indicates Background, E indicates Enrichment
# 
# For example, 'BBE' represents the window-based posterior probabiltiy of
# local enrichment in cell line Hepg2 only (third condition, 
# see mixNBHMM output), conditional on the differential HMM state.

plotPeaks(object = ENCODE,ranges = GRanges('chr2',IRanges(168750000,169200000)))

Step 4: Summarizing Peaks

Two of the main features of mixNBHMM are its efficient EM-based framework and the embedded mixture model. One may use the estimated posterior probabilities from the mixture model to classify differential combinatorial patterns of enrichment across groups from windows assigned to belong to the differential HMM state. These posterior probabilities are saved as metadata in the output of mixNBHMM.

# Logical vector for differential windows using the Viterbi sequence of states
viterbi <- (metadata(ENCODE)$viterbi=='D')

# Mixture model posterior probabilities
metadata(ENCODE)$mixProb[viterbi,]
#>                 EBB          BEB          BBE          EEB          EBE
#>     1: 3.491076e-06 7.997855e-01 2.542504e-05 5.032848e-03 3.977382e-07
#>     2: 4.109254e-02 1.127634e-01 7.530103e-01 8.086910e-04 1.342490e-02
#>     3: 2.576082e-02 8.144664e-01 8.148354e-02 4.327836e-03 1.076373e-03
#>     4: 1.283931e-02 8.876068e-01 3.449080e-03 3.938827e-02 3.804915e-04
#>     5: 1.140612e-01 7.026276e-01 5.929330e-02 3.117967e-02 6.541048e-03
#>    ---                                                                 
#> 53911: 3.347313e-02 2.632016e-05 2.302879e-01 3.378894e-05 7.349412e-01
#> 53912: 2.286187e-03 5.564515e-07 1.047326e-01 1.907502e-06 8.925135e-01
#> 53913: 9.992599e-02 1.336579e-04 2.297672e-01 1.563344e-04 6.681029e-01
#> 53914: 2.279330e-02 2.121583e-03 9.346748e-01 3.074116e-05 3.366792e-02
#> 53915: 2.629660e-04 8.027613e-07 3.101436e-01 7.133294e-07 6.851126e-01
#>                 BEE
#>     1: 0.1951523165
#>     2: 0.0789002076
#>     3: 0.0728850634
#>     4: 0.0563360064
#>     5: 0.0862972332
#>    ---             
#> 53911: 0.0012376760
#> 53912: 0.0004652566
#> 53913: 0.0019139111
#> 53914: 0.0067116810
#> 53915: 0.0044793087

The mixNBHMM package contains a function called summarizePeaks that exports the results for further visualization in genome browsers, such as the UCSC Genome Browser. In particular, it exports the set of called peaks and the associated mixture model posterior probabilities in BED and bedGraph formats, respectively.

By default, the function summarizePeaks takes ENCODE as input and outputs a GRanges object with the differential peaks based on the Viterbi path of HMM states (type='viterbi'). One may also want to use type='0.05' to output the set of peaks by controlling the false discovery rate at 0.05 level. In addition, the arguments name, path, and description specify the name of the output files (without the BED and bedGraph extension), the location where the files will be saved, and a brief text description, respectively.

# Calling summarizePeaks
output <- summarizePeaks(object = ENCODE,
                         name = 'ENCODE',
                         path = './output/',
                         description = 'mixNBHMM peak calls')
#> The following files have been saved:
#> ./output/ENCODE.bed 
#> ./output/ENCODE_EBB.bedGraph 
#> ./output/ENCODE_BEB.bedGraph 
#> ./output/ENCODE_BBE.bedGraph 
#> ./output/ENCODE_EEB.bedGraph 
#> ./output/ENCODE_EBE.bedGraph 
#> ./output/ENCODE_BEE.bedGraph

Once uploaded on a genome browser, one can visualize the peak calls and posterior probabilities associated to different combinatorial patterns of enrichment across cell lines.

  1. Similar techniques have been previously used by Shao et al (2012) and Lun and Smyth (2015)


plbaldoni/mixNBHMM documentation built on Dec. 24, 2019, 1:31 p.m.