knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
simpleCache::setSharedCacheDir("~")

Overview

The epihet package will allow you to calculate an epigenetic heterogeneity score from DNA methylation data. The score is called PIM, or Proportion of sites with Intermediate Methylation.

Epigenetic heterogeneity and DNA methylation

DNA methylation is fundamentally a binary signal: on a given CpG site in a single DNA strand, the C is either methylated or not methylated (ignoring the rare case of hemimethylation). But experimentally, it is difficult to produce single-allele DNA methylation data, so DNA methylation data typically arises from bulk experiments made up of a population of cells. Bulk experiments produce methylation values that vary anywhere from 0% to 100% (and are typically not at either extreme). The deviation from the binary signal is due to cell-to-cell heterogeneity, and a bulk DNA methylation level is interpreted as the proportion of cells in the population that had a methylated allele.

In a purely homogeneous sample (where every cell has an identical methylation pattern), DNA methylation at any given CpG site would mimic the binary case and be be either 0% or 100%, because there is no cell-to-cell variation. Therefore, as a sample becomes more heterogeneous, it should have decreasing 0% and 100% values and increasing evidence of CpG sites with intermediate methylation. The RPIM package leverages this observation to produce a score for epigenetic heterogeneity from bulk DNA methylation data: the Proportion of sites with Intermediate Methylation, or PIM score. In a sense, this enables you to estimate sample heterogeneity of single cells without doing single cell experiments.

Installation

devtools::install_github("databio/epihet")
library(epihet)

Importing data

To calculate a PIM score for a given sample, we require as input a table that specifies each CpG site assessed in the sample and its methylation level. The functions will accept input data in two different formats:

  1. BSDT objects, which are simply data.table objects with specific column names. Internally the RPIM package uses data.table[^1] to efficiently read and manipulate the bisulfite sequence data. An example of a single sample in BSDT format is included with the package:
data("exampleBSDT", package="epihet")

head(exampleBSDT)
  1. bsseq objects, from the Bioconductor bsseq package. epihet will accept bsseq objects as well, enabling a nice interface to existing bisulfite sequencing analysis. The bsseq package is a widely used toolkit for analyzing Bisulfite sequencing data, and includes a BSseq data structure[^3]. epihet interfaces directly with objects of this class.

The code below shows a trivial example of generating a BSseq object:

# source("https://bioconductor.org/biocLite.R")
# biocLite("bsseq")
library(bsseq)

M <- matrix(0:8, 3, 3)
Cov <- matrix(1:9, 3, 3)
BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1,2,3),
             M = M, Cov = Cov, sampleNames = c("A","B", "C"))
BS1

The bsseqData package[^4] contains more robust examples of this data format:

# source("https://bioconductor.org/biocLite.R")
# biocLite("bsseqData")
library(bsseqData)

data(BS.cancer.ex)
BS.cancer.ex

Calculating the PIM score

The epihet package provides two primary functions that will be of interest: PIM() and RPIM(). With the data prepared as described above, you can input the object directly to either the PIM() or RPIM() functions.

The PIM score is a sample-level score, which means it is relative only to a sample and does not depend at all on any other samples. Every sample has a PIM score, which ranges from 0 to 1; 0 would mean that sample has no CpGs with intermediate methylation, indicating a perfectly homogeneous sample, while 1 would indicate that every CpG has intermediate methylation, indicating a very heterogeneous sample. The PIM() function takes bisulfite sequencing data for a single sample and returns a single value for the proportion of intermediate methylation. Calculate it like this:

PIM(exampleBSDT)

Under the hood, the PIM() function defines a site as IM (Intermediate Methylation) using a Bayesian credibility interval based on the number of reads that covered the CpG. The credibility interval provides a range of values surrounding the estimated DNA methylation level, which varies based on coverage level. If a credibility interval is not completely below .25, or above .75, the CpG is classified as Intermediately Methylated.

Calculating relative PIM (RPIM)

One limitation with the PIM score is that it may not always be comparable from sample to sample if the CpGs covered were of different classes. For example, if one sample covered more CpGs in CpG islands than another, and we know that CpGs in islands are more likely to be uniformly unmethylated (even in a more heterogeneous sample), then then incongruent coverage distribution would skew the PIM scores of the samples, making them incomparable.

The RPIM calculation accounts for differential coverage by doing a pairwise comparisons and restricting each to the subset of CpGs that are present in both samples. RPIM addresses this coverage issue by introducing a relative PIM score, or RPIM for short. RPIM is based on the PIM score, but it operates on two samples at a time by first restricting the set of CpGs to only those that are covered in both samples, and then calculating the log ratio of the number of IM sites in one sample versus the other. This controls for differential coverage and gives a relative score that no longer represents the absolute proportion of IM sites, but can be either positive or negative, depending on whether a sample had more or fewer IM sites than another. Finally, the RPIM score is summarized for a given sample by averaging the pairwise comparisons with all other samples in the dataset.

Thus, each sample has an RPIM score, but this score is relative to all the other samples in the dataset. This means RPIM will change if you change the samples you include. The RPIM shows the relative enrichment or depletion of intermediate methylation for a given sample when compared to all other samples. It can be either negative or positive, with a 0 score indicating that the same is the median sample. The higher the magnitude of the score (or further away it is from 0), the more heterogeneous (positive) or homogenous (negative) that sample is relative to the others.

Calculate RPIM like this:

data("BSDTlist", package="epihet")

RPIM(BSDTlist)
RPIM(BS.cancer.ex)

Caching

In order to make the intermediate methylation calculations as efficient as possible, RPIM employs optional caching via the simpleCache package[^5].

If simpleCache is installed, RPIM will look for a cache directory specified in the "RESOURCES.RACHE" global option. If none is available, you can set this preference as follows:

# devtools::install_github("databio/simpleCache")

# set a shared cache directory
simpleCache::setSharedCacheDir("cache")

PIM(exampleBSDT)
RPIM(BSDTlist)

# alternatively specify a cache directory on each RPIM call
PIM(exampleBSDT, cacheDir = "cache")
RPIM(BSDTlist, cacheDir = "cache")

Note that the simpleCache() creates a directory to store cached data. In this case, the cache consists of the binomial confidence interval used in the PIM and RPIM calculations. For more information about caching with simpleCache visit the development repository: https://github.com/databio/simpleCache

References

[^1]: Matt Dowle and Arun Srinivasan (2017). data.table: Extension of data.frame. R package version 1.10.4. https://CRAN.R-project.org/package=data.table

[^2]: Lawson J, Tomazou E, Bock C and Sheffield NC (2018). MIRA: An R package for DNA methylation-based inference of regulatory activity. R package version 0.99.95, .

[^3]: Hansen KD, Langmead B and Irizarry RA (2012). “BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions.” Genome Biology, 13(10), pp. R83. doi: 10.1186/gb-2012-13-10-r83.

[^4]: Hansen KD (2017). bsseqData: Example whole genome bisulfite data for the bsseq package. R package version 0.14.0.

[^5]: Sheffield, N and Nagraj VP (2017). simpleCache: A Simple Package for Caching R Objects. R package version 0.0.1. http://www.github.com/sheffien/simpleCache

To analyze multiple samples (i.e. calculate relative proportion of intermediate methylation), you can combine the data as a list of data.table objects. The example below demonstrates how to read in multiple sample files in .bed format.

The code includes the BSreadBiSeq() function from the MIRA package[^2], which will generate a BSDT for each sample before combining them all together in a single named list object.

fp = "http://cloud.databio.org.s3.amazonaws.com/vignettes/RPIM_vignette_data.tar.gz"
download.file(fp, destfile = "RPIM_vignette_data.tar.gz")
untar("RPIM_vignette_data.tar.gz")

dat = MIRA::BSreadBiSeq("RRBS_cpgMethylation_EWS_L10.bed")
dat2 = MIRA::BSreadBiSeq("RRBS_cpgMethylation_EWS_T133.bed")
dat3 = MIRA::BSreadBiSeq("RRBS_cpgMethylation_EWS_T111.bed")
dat4 = MIRA::BSreadBiSeq("RRBS_cpgMethylation_EWS_T120.bed")

alldat = rbind(dat,dat2, dat3, dat4)
allsplitdat = split(alldat, alldat$sampleName)


databio/epihet documentation built on May 24, 2019, 2:03 a.m.