library(knitr)
library(mixNBHMM)

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.path = "man/figures/README-",
    out.width = "100%"
)
opts_knit$set(root.dir = '~/Downloads/inst/extdata/')

mixNBHMM

mixNBHMM is an R package for the detection of genomic regions exhibiting short and broad differential enrichment under multi-replicate, multi-condition settings. It is applicable to data from ChIP-seq, ATAC-seq, DNase-seq, and related experiments. mixNBHMM detects these regions using a three-state hidden Markov model (HMM) with a finite mixture of negative binomials as emission distribution in the HMM’s differential state. The HMM framework is particularly suitable for the detection of differential broad peaks and the mixture model allows the classification of differential combinatorial patterns of enrichment across conditions. Code to replicate the results from the mixNBHMM paper is available at https://github.com/plbaldoni/mixNBHMMPaper.

Installation

You can install the released version of mixNBHMM from this repository with:

devtools::install_github("plbaldoni/mixNBHMM")

Example

This example illustrates the necessary steps to run mixNBHMM given a set of BAM file paths as input. Using ChIP-seq data, we would like to detect differential regions of enrichment for the histone modification H3K36me3 across distinct cell lines, which typically has broad regions of enrichment.

Here, I will use BAM files from the ENCODE Consortium for the cell lines GM12878, H1-hESC, and HepG2. You can download the files here. The mixNBHMM package already contains these data in RData format that is suitable for a quick run of mixNBHMM. You can access the processed data and their documentation by calling data(ENCODE) and ?ENCODE, respectively. These data have gone through some of the necessary preprocessing steps, namely alignment, filtering of low quality reads, and PCR duplicate removal. Here is an overview of these steps.

Step 1: loading BAM files

First, download and unzip the data that will be used in this example.

# Downloading the data
download.file(url = 'https://www.dropbox.com/s/xyvhvr47lts6joo/mixNBHMM.zip?raw=1',
              destfile = './mixNBHMM.zip')

# Unzipping BAM files
unzip(zipfile = './mixNBHMM.zip')

The function mixNBHMMDataSetFromBam reads in the BAM files and outputs a RangedSummarizedExperiment object that contains the experimental read counts from your data. Additional required inputs include colData, a data frame specifying the Condition and Replicate labels, genome, a GRanges object with the chromosome lengths of the reference genome, and windowSize, an integer specifying the size of the windows where read counts will be computed. Optional parameters include blackList, a GRanges object with a set of genomic coordinates to be discarded, and fragLength, a vector of estimated fragment lengths (one per BAM file). If you do not specify fragLength, the package will estimate the fragment length of your experiments internally.

The current implementation of the package has a built-in hg19 genome and its curated blacklist for ease of access. Check ?mixNBHMMDataSetFromBam for the full manual.

Because H3K36me3 is a histone mark that exhibit broad regions of enrichment, we will use a fairly large windowSize of 500 base pairs.

library(mixNBHMM)

# Specifying the path for the bam files 
bamFiles <- c('./wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneGm12878H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneH1hescH3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep1.markdup.q10.chr2.sorted.bam',
              './wgEncodeBroadHistoneHepg2H3k36me3StdAlnRep2.markdup.q10.chr2.sorted.bam')

# Creating the Condition & Replicate labels. It must agree with bamFiles.
colData <- data.frame(Condition = c('GM12878','GM12878',
                                   'H1-hESC','H1-hESC',
                                   'HepG2','HepG2'),
                     Replicate = c(1,2,1,2,1,2))

# Setting up the data
ENCODE <- mixNBHMMDataSetFromBam(bamFiles = bamFiles,
                                 colData = colData,
                                 genome = 'hg19',
                                 blackList = 'hg19',
                                 windowSize = 500)

From ENCODE, one can extract the information about the samples, the matrix of read counts (rows are genomic windows, columns are samples), and the genomic coordinates associated with the genomic windows

# Sample information
colData(ENCODE)
# Summary of read counts
summary(assay(ENCODE))
# Genomic coordinates
rowRanges(ENCODE)

Step 1: Data Normalization

In differential peak calling, it is critical to account for systematic differences across samples due to technical artifacts. Otherwise, these can introduce all sorts of biases in downstream analyses. Lun and Smyth (2015) present an overview on these issues in the context of differential peak calling from ChIP-seq experiments. The mixNBHMM accounts for sample- and window-specific scaling factors through model offsets, which can be estimated by the function createOffset.

In data exhibiting broad enrichment regions, such as data from H3K36me3, it is common to observe that differences in local ChIP-seq signal depends non linearly on the local abundance level of mapped read counts across samples. The createOffset function implements a loess smoothing to adjust for these non linear differences^[Similar techniques have been previously used by Shao et al (2012) and Lun and Smyth (2015)]. To do so, the object ENCODE has to be passed as an argument to createOffset with type='loess'. The returned object will then contain a new assay matrix offset with the model offsets.

# Calculating model offsets
ENCODE <- createOffset(object = ENCODE,type = 'loess')
# Summary of model offsets
summary(assay(ENCODE,'offset'))

Step 2: Peak Calling

Once the offsets have been calculated, one may call mixNBHMM to detect differential peaks across conditions. The function mixNBHMM takes as an argument a list of parameters that control the behavior of its EM algorithm. These parameters are provided by the function controlEM and you can check them from the manual by calling ?controlEM.

# Calling peaks

ENCODE <- mixNBHMM(object = ENCODE,control = controlEM())

The relevant results from the EM algorithm are saved as metadata in the ENCODE object. Please, refer to the manual for a detailed description of these results by calling ?mixNBHMM. For differential peak detection, one might be interested in looking at the HMM posterior probabilities or the Viterbi path of HMM states.

# HMM posterior probabilities

metadata(ENCODE)$prob
# Viterbi sequence of HMM states
# B = consensus background
# D = differential
# E = consensus enrichment

head(metadata(ENCODE)$viterbi)

table(metadata(ENCODE)$viterbi)

Step 3: Peak Visualization

The function plotPeaks can be used to plot normalized read counts from a given genomic region along with the called peaks (and their associated combinatorial pattern of enrichment) and the HMM posterior probabilities. To this end, plotPeaks determines the most likely differential combinatorial pattern of enrichment from each genomic window and then checks for the most frequent pattern across genomic windows pertaining to each called peak.

plotPeaks takes as input the object ENCODE, a GRanges objects with the genomic coordinate to be plotted, and the type of peak calls. By default, plotPeaks shows the peak calls based on the Viterbi sequence of HMM states (type='viterbi'). Conversely, one may set a desired false discovery rate control level, such as type=0.05, and the function will internally threshold HMM posterior probabilities to call peaks.

# Peak visualization
# Black track represents the differential peaks
# Different colors represent the most likely combinatorial pattern of enrichment on the window level
#
# B indicates Background, E indicates Enrichment
# 
# For example, 'BBE' represents the window-based posterior probabiltiy of
# local enrichment in cell line Hepg2 only (third condition, 
# see mixNBHMM output), conditional on the differential HMM state.

plotPeaks(object = ENCODE,ranges = GRanges('chr2',IRanges(168750000,169200000)))

Step 4: Summarizing Peaks

Two of the main features of mixNBHMM are its efficient EM-based framework and the embedded mixture model. One may use the estimated posterior probabilities from the mixture model to classify differential combinatorial patterns of enrichment across groups from windows assigned to belong to the differential HMM state. These posterior probabilities are saved as metadata in the output of mixNBHMM.

# Logical vector for differential windows using the Viterbi sequence of states
viterbi <- (metadata(ENCODE)$viterbi=='D')

# Mixture model posterior probabilities
metadata(ENCODE)$mixProb[viterbi,]

The mixNBHMM package contains a function called summarizePeaks that exports the results for further visualization in genome browsers, such as the UCSC Genome Browser. In particular, it exports the set of called peaks and the associated mixture model posterior probabilities in BED and bedGraph formats, respectively.

By default, the function summarizePeaks takes ENCODE as input and outputs a GRanges object with the differential peaks based on the Viterbi path of HMM states (type='viterbi'). One may also want to use type='0.05' to output the set of peaks by controlling the false discovery rate at 0.05 level. In addition, the arguments name, path, and description specify the name of the output files (without the BED and bedGraph extension), the location where the files will be saved, and a brief text description, respectively.

# Calling summarizePeaks
output <- summarizePeaks(object = ENCODE,
                         name = 'ENCODE',
                         path = './output/',
                         description = 'mixNBHMM peak calls')

Once uploaded on a genome browser, one can visualize the peak calls and posterior probabilities associated to different combinatorial patterns of enrichment across cell lines.




plbaldoni/mixNBHMM documentation built on Dec. 24, 2019, 1:31 p.m.