BiocStyle::markdown()
library(knitr)


Package: r Rpackage("RJMCMC")
Authors: r packageDescription("RJMCMC")[["Author"]]
Version: r packageDescription("RJMCMC")$Version
Compiled date: r Sys.Date()
License: r packageDescription("RJMCMC")[["License"]]

Licensing and citing

The RJMCMC package and the underlying RJMCMC code are distributed under the Artistic license 2.0. You are free to use and redistribute this software.

If you use this package for a publication, we would ask you to cite the following:

Samb R, Khadraoui K, Belleau P, et al. (2015) Using informative Multinomial-Dirichlet prior in a t-mixture with reversible jump estimation of nucleosome positions for genome-wide profiling. Statistical Applications in Genetics and Molecular Biology. Volume 14, Issue 6, Pages 517-532, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, December 2015, doi:10.1515/sagmb-2014-0098

Introduction

Global gene expression patterns are established and maintained by the concerted actions of transcription factors (TFs) and the proteins that constitute chromatin. The key structural element of chromatin is the nucleosome, which consists of an octameric histone core wrapped by 146 bps of DNA and connected to its neighbour by approximately 10-80 pbs of linker DNA [@Polishko2012].

The literature on nucleosome positioning commonly focuses on frequentist inferences within parametric approaches (see for instance @Chen2010 and @Xi2010). In those works, the detection of nucleosome positions is done using a hidden Markov model with an assumed known order.

The RJMCMC package is an implementation of a fully Bayesian hierarchical model for profiling of nucleosome positions based on high-throughput short-read data (MNase-Seq data). The implementation is based on a strategy which incorporates four aspects. First, it jointly models local concentrations of directional reads. Second, it uses a Multinomial-Dirichlet model in the construction of an informative prior distribution coupled to a t-mixture model with unknown degrees of freedom. Third, the number of nucleosomes is considered to be a random variable and refers to a prior distribution. Fourth, the unknown parameters are simultaneously using the reversible jump Markov chain Monte Carlo (RJMCMC) simulation technique (see for instance @Green1995 and @Richardson1997).

Detailed information about the RJMCMC model can be found in the article mentioned in the precedent section.

Loading the RJMCMC package

As with any R package, the RJMCMC package should first be loaded with the following command:

library(RJMCMC)

RJMCMC analysis

A typical RJMCMC analysis consists of the following steps:

  1. Segment the genome into candidate regions that have sufficient aligned reads. The region cannot be larger than one chromosome.
  2. Estimate nucleosome positions for each region.
  3. Regroup all regions together. The final region cannot be larger than one chromosome.
  4. Post-process predictions of the regrouped region to correct certain predictions.

RJMCMC analysis step by step

A synthetic nucleosome sample containing 100 nucleosomes (80 well-positioned + 20 fuzzy) has been created using the Bioconductor package r Biocpkg("nucleoSim").

## Load nucleoSim package
library(nucleoSim)

val.num       <- 100  ### Number of well-positioned nucleosomes
val.del       <- 20   ### Number of well-positioned nucleosomes to delete
val.var       <- 30   ### variance associated to the well-positioned nucleosomes
val.fuz       <- 20   ### Number of fuzzy nucleosomes
val.fuz.var   <- 50   ### variance associated to the fuzzy nucleosomes
val.max.cover <- 70   ### Maximum coverage for one nucleosome
val.nuc.len   <- 147  ### Distance between nucleosomes
val.len.var   <- 10   ### Variance associated to the length of the reads
val.lin.len   <- 20   ### The length of the DNA linker
val.rnd.seed  <- 100  ### Set seed when result needs to be reproducible
val.offset    <- 10000  ### The number of bases used to offset all nucleosomes and reads

## Create sample using a Normal distribution
sample <- nucleoSim::syntheticNucReadsFromDist(wp.num=val.num,
                                    wp.del=val.del, wp.var=val.var,
                                    fuz.num=val.del, fuz.var=val.fuz.var,
                                    max.cover=val.max.cover,
                                    nuc.len=val.nuc.len,
                                    len.var=val.len.var,
                                    lin.len=val.lin.len,
                                    rnd.seed=val.rnd.seed,
                                    distr="Normal",
                                    offset=val.offset)

## Create visual representation of the synthetic nucleosome sample
plot(sample)

Segment the analyzed region into candidate regions

It is suggested, in order to accelerate the learning process, to segment the analyzed region into candidate regions to accelerate the analysis. Moreover, it is mandatory to analyse each chromosome separately since the \code{rjmcmc} function can only analyze one chromosome at the time.

Region segmentation can be done using the segmentPING() function from the Bioconductor r Biocpkg("PING") package.

## Load needed packages
library("PING")
library("GenomicRanges")


## Transform sample dataset into GRanges object
sampleGRanges <- GRanges(seqnames = sample$dataIP$chr, 
                          ranges = IRanges(start = sample$dataIP$start, 
                                           end = sample$dataIP$end), 
                          strand = sample$dataIP$strand)

## Segment sample into candidate regions
sampleSegmented <- PING::segmentPING(sampleGRanges)

## Number of segments created
length(slot(sampleSegmented, "List"))

Run RJMCMC

The RJMCMC function must be run on each candidate region. As an example, the first candidate region is processed using a very low number of iterations. On real data, the number of iterations should be higher.

## Extract forward and reverse reads associated to the first segment
readsForward <- slot(slot(sampleSegmented, "List")[[1]], "yF")
readsReverse <- slot(slot(sampleSegmented, "List")[[1]], "yR")

## Fix seed to enable replication of the results
set.seed(2221)

## Run RJMCMC analysis
## A higher number of iterations is recommanded for real analysis
resultSegment01 <- rjmcmc(startPosForwardReads = readsForward,
                            startPosReverseReads = readsReverse,
                            nbrIterations = 1000, lambda = 3, kMax = 30,
                            minInterval = 100, maxInterval = 200, minReads = 5)

## Print the predicted nucleosomes for the first segment
resultSegment01

## More information is available from the output, such as
## The variance of the forward reads for each nucleosome
resultSegment01$sigmaf

## The variance of the reverse reads for each nucleosome
resultSegment01$sigmar

## The distance between the maxima of the forward and reverse reads 
## position densities for each nucleosome
resultSegment01$delta

## The degrees of freedom for each nucleosome
resultSegment01$df

## The weight for each nucleosome
resultSegment01$w

Regroup all regions

Once all segments have been analyzed, the predicted nucleosomes can be merged together. Two functions are available to facilitate the merging process:

The segments of the sample created sooner have all been processed (using 100000 iterations) and saved in RDS files. Those will now be merged together.

## The directory containing the results of all segments
## On RDS file has been created for each segment
directory <- system.file("extdata", "demo_vignette", package = "RJMCMC")

## Merging predicted nucleosomes from all segments
resultsAllMerged <- mergeAllRDSFilesFromDirectory(directory)

resultsAllMerged

## More information is available from the output, such as
## The variance of the forward reads for each nucleosome
resultsAllMerged$sigmaf

## The variance of the reverse reads for each nucleosome
resultsAllMerged$sigmar

## The distance between the maxima of the forward and reverse reads 
## position densities for each nucleosome
resultsAllMerged$delta

## The degrees of freedom for each nucleosome
resultsAllMerged$df

Post-process predictions

In some cases the RJMCMC method tends to over split the distribution of reads for a single nucleosome. Although this characteristic increases the number of false positives, it is still beneficial for the region’s rich in nucleosomes.

A function, which merges closely positioned nucleosomes, has been implemented to rectify the over splitting and provide more conservative results.

The \code{postTreatment} function must be run on the entire analyzed region to be efficient. So, it should not be run on segmented results. The function needs the positions of the reads used for the RJMCMC analysis.

## Split reads from the initial sample data into forward and reverse subsets
allReadsForward <- subset(sample$dataIP, strand == "+")
allReadsReverse <- subset(sample$dataIP, strand == "-")

## Number of nucleosomes before the post-treatment
resultsAllMerged$k

## Use the post-treatment function
resultsPostTreatment <- RJMCMC::postTreatment(startPosForwardReads =  allReadsForward$start, 
                                    startPosReverseReads = allReadsReverse$end, 
                                    resultRJMCMC = resultsAllMerged,
                                    chrLength = max(allReadsForward$start, 
                                                        allReadsReverse$start))


## Number of nucleosomes after the post-treatment
length(resultsPostTreatment)

The \code{postTreatment} function significantly reduces the number of nucleosomes by merging closely positioned nucleosomes.

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()

References



ArnaudDroitLab/RJMCMC documentation built on May 5, 2019, 7:06 a.m.