knitr::opts_chunk$set(echo = TRUE)
BiocStyle::markdown()
BiocStyle::markdown(css.files = c('custom.css'))
knitr::opts_chunk$set(tidy         = FALSE,
                      warning      = FALSE,
                      message      = FALSE)
library(MSPC)
library(rtracklayer)
library(GenomicRanges)

Abstract

The primary emphasis of MSPC Package is to rescue weakly enriched regions in single sample by co-localized overlapping evidence in multiple ChIP-Seq replicates. The simultaneous presence of an enriched regions in replicates experiment would justify a local decrease of the stringency criterion, leveraging on the principal that repeated evidence is compensating for weak evidence. MSPC provides set of functions to facilitate the output for downstream analysis on ChIP-Seq replicates, facilitate jointly analyzes the enriched regions of multiple sample, distinguishing between biological and technical replicates.

Introduction

Chromatin immunoprecipitation (ChIP) followed by massively parallel sequencing (ChIP-seq) is designed to detect genome wide protein–DNA interaction. ChIP-Seq can identify both sharp peaks typically associated with sequence-specific transcription factors, as well as broad histone-modification signals, and has become a central technology for the investigation of gene regulation. A critical question in the computational analysis of ChIP-Seq data relates to finding peaks in ChIP-Seq data that correspond to protein–DNA binding sites. We are interested in some genomic regions which are enriched and may cooperate with TFBS or gene regulation, but discarded because of significance score below the permissive threshold, which eventually cause misleading of studying important genomic features which has potential biological meaning.

Here we have developed MSPC package, R/Bioconductor Package for Multiple Sample Peak Calling based on original method that presented on [@ Using combined evidence from replicates to evaluate ChIP-seq peaks.], to rigorously combine the weakly enriched peaks in ChIP-Seq replicates, with the options to set a permissive p-value threshold on the repeated evidence and a minimum number of replicates bearing this evidence.

We assess the presence of overlapping enriched regions (A.K.A peaks) across multiple Replicates, the significance of overlapping peak is rigorously combined with Fisher method, which increase the statistical significance of peaks detected in the ChIP-Seq experiment; it assigns peaks to different sets, and in addition provides analysis features that allow performing further assessments and functional analyses on the identified peaks. we applied our method to Myc transcription factor ChIP-Seq datasets in k562 cells available in Encode consortium. Using replicates, we could extend up to 3 times the peak number with respect to single sample analysis with an equivalent significance permissive p-value threshold.

Citation

Original method is presented in [@Vahid_Jalili_MSPC_2015]. Vahid Jalili, Matteo Matteucci, Marco Masseroli,and Marco J. Morelli : Using combined evidence from replicates to evaluate ChIP-seq peaks. Bioinformatics 2015, 31(17):2761-2769.doi:[10.1093/bioinformatics/btv293] (http://bioinformatics.oxfordjournals.org/content/31/17/2761.full)

Downstream analysis workflow for Chip-Seq experiments

This workflow shows how to read Chip-Seq replicates to GRanges objects, using Bioconductor Package GenomicRanges to manipulate peak interval and finding overlapping regions across multiple sample, to increase the statistical significance of weak evidence which provides potential biological insights for genomic research.

Import Chip-seq replicates

The first step, we are going to read input Chip-Seq replicates and all enriched regions are stored in GRanges objects. MSPC provides readPeakFiles to load input peak files, either accept BED format, or data.frame like object, where several peak files can be read simultaneously using lapply and returned as GRangesList for next workflow. Some data sources provide Chip-seq peaks associated with significant score under specific conditions. We are going to evaluate each enriched region with permissive p-value threshold, so representing peak’s score as p-value is needed, p-value can be represented -1*log10(score), -10*log10(score), -100*log10(score), readPeakFiles ask user to select the p-value format of input peak files. For detailed information, please see? readPeakFiles method.

## load all peak files as GRanges objects
bedfiles <- getPeakFile()[1:3]
myData <- readPeakFiles(peakFolder = bedfiles, pvalueBase = 1L)
lapply(myData, head)

Data Cleaning

In previous workflow, we obtained peak files in GRangesList, where peak associated with peaks’ score, p-value as metadata. We set up permissive p-value threshold to exclude extremely weakly enriched regions, usually refers to background signal or noise. In our study, we are interested in moderately enriched regions with relatively lower significance score (or called weak peak), which may get involved in binding DNA-protein interaction. Clean up all background noise from input peak files increase the efficiency of downstream analysis. For detailed information, please see? denoise_ERs method.

## clean up all background signal by using permissive threshold
total.ERs <- denoise_ERs(peakGRs = myData , tau.w = 1.0E-04,
                         dest.dir = getwd(), nmtab = "noise", overwrite = TRUE)
options(scipen = 0)
lapply(total.ERs, head)

Simultaneous evaluation of peak overlapping across multiple Chip-Seq replicate

We designed a general methodological framework to rigorously combine the evidence of enriched regions in Chip-Seq replicates, with the option to set permissive threshold on the repeated evidence and a minimum number of samples bearing this evidence. In previous workflow, we cleaned up all background noise from input peak files, returned enriched regions in GRangesList object that used for finding overlap peak. In the original method, we are going to assess each enriched region from chosen sample with the support of rest of ChIP-Seq replicates for identifying overlapping peaks. Because identifying set of overlapping peaks across multiple sample can give rise to ambiguities, especially number of selected input bed files n>2, so global approach can depend on the order of the input peak files, permuting sample is needed. Working with original approach will bring memory inefficiency, increase the computational effort for retrieving important genomic features. However, we have inverted problem of finding overlapping of single peak interval respect to many others, using vectorized approach bring us efficient, succinct result. MSPC depends on GenomicRanges packages which provides core functionality to manipulate genomic intervals efficiently. In this solution, we have used DataFrame object to hold all overlap-hit index including self-hit associated with peaks’ score and p-value as metadata, which gives unique representation that easy to work with and efficient for downstream computation. runMSPC is the core function to evaluate enriched regions in peaks files for overlapping under several workflows that presented in original method, returned GRangesList as an output of Fisher combined method. Because of vectorized approach, all wanted step becomes intuitive and efficient. If multiple overlapping regions were detected, runMSPC only accept the one with most stringent enriched or least enriched (highest or lowest p-value), to do so, user required to use the parameter whichType. We need to further check minimum overlapping peak requirement, by using parameter replicate.type we could obtain minimum requirement criteria, only the peaks that comply with this condition can be further evaluated for fisher combined method. For detailed information, please see? runMSPC method.

Overview the result with different parameter usage

To see the confirmed enriched regions that both comply with minimum overlapping peak requirement criteria, and combined stringency test, we need to set up the parameter isConformed as True. Here is quick insight of confirmed peaks in GRangesList :

## rescued ERs by Fisher's combined test
confirmedERs <- runMSPC(peakset = total.ERs, 
                        whichType = "max",
                        replicate.type = "Biological",
                        cmbStrgThreshold = 1.0E-08, 
                        isConfirmed = TRUE)

lapply(confirmedERs, head)

In order to see the all discarded peaks which either failed from minimum overlapping condition or combined stringency test, we also need to use the parameter isConfirmed as False.

## ERs that failing for combined stringency test
discardedERs <- runMSPC(peakset = total.ERs, 
                        whichType = "max",
                        replicate.type = "Technical",
                        cmbStrgThreshold = 1.0E-08, 
                        isConfirmed = FALSE)

lapply(discardedERs, head)

Identify and Export stringent/ weak ERs

ChIP-Seq detects genome-wide DNA-protein interaction, returning enriched regions which associated with significance score. Using permissive p-value threshold tau.s for stringent enriched region, we could further identify set of stringent/weak peaks. As we mentioned earlier, repeated evidence across multiple replicates can compensate for lower significance in a single sample, using Fisher method to increase the significance of weak evidence which might get involved in TF or gene regulation. export_ERs can report enriched regions in different output set, result can be exported as standard BED file or csv, to do so, user asked to choose the parameter exportFormat. export_ERs function accept following parameter: parameter peakList_A is set of all confirmed peaks in GRangesList that fulfill both minimum overlapping peak requirement and combined stringency test (Fisher method); peakList_B is set of all discarded peaks either failed from combined stringency test or minimum requirement of overlapping; parameter tau.s is a permissive threshold for stringent peaks, peak’s p-value below this threshold, are considered stringent peak, while above this threshold is weakly enriched regions; exportFormat parameter is used to select format of file to be exported. For detailed information, please see? export_ERs method.

## Identify & Export Stringent/Weak ERs
output <- export_ERs(peakList_A = confirmedERs, 
                     peakList_B = discardedERs, 
                     tau.s = 1.0E-08 ,exportFormat = "bed")

# explore output
output

Visualize identified stringent/weak ERs

In previous workflow, we further classified the output of runMSPC with permissive p-value threshold, each ChIP-Seq replicates yield different output set: stringent, weak, confirmed, discarded respectively. We also provide getPlot function to visualize the peak set for file bar. Using dplyr package to manipulate the output set to get plot data, then visualize it by ggplot, we believe providing graphical plot will ease to understand the result.

inPlot <- getPlot(peakList_A = confirmedERs, 
                  peakList_B = discardedERs, tau.s = 1.0E-08)

inPlot

Multiple Testing Correction

In previous workflow, we obtained set of enriched regions which comply with combined stringency test by using Fisher method. However, we also need to further evaluate set of confirmed peaks with multiple testing corrections procedure, and produce final output set. we need to correct the p-value of peak using the Benjamini-Hochberg multiple testing corrections with user-specified false discovery rate, which yields multiple-testing confirmed or discarded peaks. For the further detailed information, please review help page? FDR_stats method.

## Multiple Testing Correction
mtc_output <- FDR_stats(peakList = confirmedERs, 
                       pAdjustMethod = "BH", 
                       fdr = 0.05, asPlot = FALSE)

# explore the output
mtc_output

Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

Reference



julaiti/MSPC documentation built on Oct. 17, 2019, 9:44 p.m.