Installation

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("BPRMeth")

## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/BPRMeth", build_vignettes = TRUE)

Introduction

DNA methylation is probably the best studied epigenomic mark, due to its well established heritability and widespread association with diseases. Yet its role in gene regulation, and the molecular mechanisms underpinning its association with diseases, are still imperfectly understood. While the methylation status of individual cytosines can sometimes be informative, several recent papers have shown that the functional role of DNA methylation is better captured by a quantitative analysis of the spatial variation of methylation across a genomic region.

The BPRMeth package is a probabilistic method to quantify explicit features of methylation profiles, in a way that would make it easier to formally use such profiles in downstream modelling efforts, such as predicting gene expression levels or clustering genomic regions according to their methylation profiles. The original implementation [1] has now been enhanced in two important ways: we introduced a fast, variational inference approach which enables the quantification of Bayesian posterior confidence measures on the model, and we adapted the method to use several observation models, making it suitable for a diverse range of platforms including single-cell analyses and methylation arrays. Technical details of the updated version are explained in [2].

In addition to being a flexible tool for methylation data, the proposed framework is in principle deployable to other measurements with a similar structure, and indeed the method was already used for single-cell chromatin accessibility data in [3].

Background

Mathematically, BPRMeth is based on a basis function generalised linear model. The basic idea is as follows: the methylation profile associated to a genomic region $D$ is defined as a (latent) function $f\colon D\rightarrow (0,1)$ which takes as input the genomic coordinate along the region and returns the propensity for that locus to be methylated. In order to enforce spatial smoothness, and to obtain a compact representation for this function in terms of interpretable features, we represent the profile function as a linear combination of basis functions

$$ f(x)=\Phi\left(\mathbf{w}^Th(x)\right) $$

where $h(x)$ are the basis functions (Gaussian bells by default), and $\Phi$ is a probit transformation (Gaussian cumulative distribution function) needed in order to map the function output to the $(0,1)$ interval. The latent function is observed at specific loci through a noise model which encapsulates the experimental technology.

knitr::include_graphics("../inst/figures/model.png")

The optimal weight parameters $\mathbf{w}$ can be recovered either by Bayesian inference or maximum likelihood estimation, providing a set of quantitative features which can be used in downstream models: in [1] these features were used in a machine learning predictor of gene expression, and to cluster genomic regions according to their methylation profiles. Modelling details and mathematical derivations for the different models can be found online: http://rpubs.com/cakapourani.

Analysis Pipeline

The workflow diagram and functionalities of the BPRMeth package for analysis of methylation profiles are shown in Figure 2.

knitr::include_graphics("../inst/figures/workflow.png")

Sample data

To illustrate the functionality of the BPRMeth package we will use real datasets that are publicly available from the ENCODE project consortium [4]. More specifically we will focus on the K562 immortalized cell line, with GEO: GSE27584 for the bulk RRBS data and GEO: GSE33480 for the RNA-Seq data. We will use directly the preprocessed files, where we have kept only the protein coding genes, and for the purpose of this vignette we focus only on chr12 and chr13. Full details of where to download the data and how to preprocess them are included in the Supplementary Material [1].

Load preprocessed data

Due to its general approach, BPRMeth can be used for a wide range of technologies such as array based, bulk sequencing (both RRBS and WGBS) and single cell sequencing methylation datasets. It is only required that the data are in the right format and we call the appropriate observation model depending on the type of technology. Since the encode data are generated from bulk RRBS experiments, the observation model for the BPRMeth package will be Binomial.

Methylation data

The BPRMeth package provides methods for reading files for methylation, annotation and expression data, however they do not cover all possible data formats available for describing biological datasets. The user can implement his own methods for reading files with different formats, provided that he can create an object similar to what is described below. First, we load the preprocessed methylation data as follows:

library(BPRMeth)  # Load package
met_region <- encode_met

The met_region object contains the following important slots:

Now let's observe the format of the met_region object. For example, we can access the $20^{th}$ promoter methylation region as follows

head(met_region$met[[20]])

Below are the annotation data

head(met_region$anno)

and the total number of genomic regions are

NROW(met_region$anno)

Read methylation and annotation data (Do not run)

In the previous section for simplicity we provided a pre-processed object which we then can directly use for inferring methylation profiles. The steps required to create this object are really simple and follow the diagram in Figure 2. First we need to read the methylation data using the read_met function. Also we need read annotation data using the read_anno file. Note that the annotation file can contain any genomic context: from promoters and gene bodies to enhancers, Nanog regulatory regions and CTCF regions; hence BPRMeth can be used for a plethora of analyses that want to take spatial genomic correlations into account. Finally, the create_region_object will create the methylation regions object which is the main object for storing methylation data. The following lines of code provide an example of creating this object ( Do not run )

# Read bulk methylation data
met_dt <- read_met(file = "met_file", type = "bulk_seq")
# Read annotation file, which will create annotation regions as well
anno_dt <- read_anno(file = "anno_file")
# Create methylation regions that have CpG coverage
met_region <- create_region_object(met_dt = met_dt, anno_dt = anno_dt)

Expression data

If the goal is to relate methylation profiles to gene expression levels, we need also to store expression data. Again, here we have pre-processed expression data which can be loaded as follows

expr <- encode_expr

The structure of the expression data is rather simple, it is a r CRANpkg("data.table") with two columns as shown below

head(expr)

To create the expression data (which are $log_{2}$ transformed) we could call ( Do not run )

# Read expression data and log2 transform them
expr <- read_expr(file = "expr_file", log2_transf = TRUE)

Create basis objects

For each genomic region, we will learn a methylation profile using the Binomial observation model with a specified number of Radial Basis Functions (RBFs) $M$. For a single input variable $x$, the RBF takes the form $h_{j}(x) = exp(-\gamma || x - \mu_{j} ||^2)$, where $\mu_{j}$ denotes the location of the $j^{th}$ basis function in the input space and $\gamma$ controls the spatial scale. The case when $M = 0$ is equivalent to learning the mena methylation rate for the given region (i.e. learn a constant function).

For our running example, we will create two RBF objects, one with 5 basis functions and the other with 0 basis functions denoting the mean methylation rate

# Create RBF basis object with 5 RBFs
basis_profile <- create_rbf_object(M = 5)
# Create RBF basis object with 0 RBFs, i.e. constant function
basis_mean <- create_rbf_object(M = 0)

The rbf object contains information such as the centre locations $\mu_{j}$ and the value of the spatial scale parameter $\gamma$

# Show the slots of the 'rbf' object
basis_profile

The $\gamma$ is computed by the number of bases $M$, however the user can tune it according to his liking. Except from RBF basis, the BPRMeth pacakge provides polynomial and Fourier basis functions which can be created with the create_polynomial_object and create_fourier_object functions, respectively.

Inferring methylation profiles

Learning the methylation profiles is equivalent to inferring the model parameters $\mathbf{W}$ of the generalised linear regression model. These parameters can be considered as the extracted features which quantitate precisely notions of shape of a methylation profile.

For performing parameter inference, the BPRMeth package is enhanced to provide both maximum likelihood estimation and Bayesian inference for the parameters. Specifically, for Bayesian estimation it provides an efficient mean-field variational inference (Variational Bayes) and a Gibbs sampling algorithm. For computational efficiency one should choose the VB implementation (Note that for Beta observation models, one should use MLE to infer the profiles, since currently the package does not provide Bayesian treatment for Beta likelihood).

To infer the methylation profiles using VB we need to call the infer_profiles_vb function

fit_profiles <- infer_profiles_vb(X = met_region$met, model = "binomial",
                                basis = basis_profile, is_parallel = FALSE)

fit_mean <- infer_profiles_vb(X = met_region$met, model = "binomial", 
                              basis = basis_mean, is_parallel = FALSE)

This results in a infer_profiles object whose most important slot is the inferred weight matrix W, below we show the structure of the fit_profiles object

# Show structure of fit_profiles object
str(fit_profiles, max.level = 1)

Below we show an example promoter region together with the fitted methylation profiles, where the points represent the DNA methylation level of each CpG site. The shape of the methylation profiles is captured by the blue curve, whereas the red line denotes the mean methylation rate

# Choose promoter region 21 -> i.e. LEPREL2 gene
gene_id <- met_region$anno$id[21]
p <- plot_infer_profiles(region = 21, obj_prof = fit_profiles,
                         obj_mean = fit_mean, obs = met_region$met, 
                         title = paste0("Gene ID ", gene_id))
print(p)

Applications

Predicting gene expression

To quantitatively predict expression at each region, we construct a regression model by taking as input the higher-order methylation features learned from the infer_profiles_vb. In addition to these features, we consider two supplementary sources of information: (1) the goodness of fit in RMSE and (2) the CpG density. For our analysis an SVM regression model is considered. We will use $70\%$ of the data for training, and we will test the model's performance on the remaining $30\%$.

All the aforementioned steps are assembled in the predict_expr wrapper function:

# Set seed for reproducible results
set.seed(22)
# Perform predictions using methylation profiles
pred_profile <- predict_expr(prof_obj = fit_profiles, expr = expr,
                             anno = met_region$anno, model_name = "svm",
                             fit_feature = "RMSE", cov_feature = TRUE, 
                             is_summary = FALSE)
# Perform predictions using mean methylation rate
pred_mean <- predict_expr(prof_obj = fit_mean, expr = expr, 
                          anno = met_region$anno, model_name = "svm",
                          is_summary = FALSE)

We can now compare the Pearson's correlation coefficient $r$ for both models and observe that the higher-order methylation features achieve test correlations twice as large compared to mean methylation rate. Note that someone should use cross-validaion to get a better estimate of the correlation performance.

print(paste("Profile r: ", round(pred_profile$test_errors$pcc, 2)))
print(paste("Mean r:", round(pred_mean$test_errors$pcc, 2)))

Below in Figure \@ref(fig:plotpred) we show a scatter plot of the predicted and measured expression values for the \verb|chr12| and \verb|chr13| of the K562 cell line.

g1 <- plot_predicted_expr(pred_obj = pred_profile, 
                          title = "Methylation profile")
g2 <- plot_predicted_expr(pred_obj = pred_mean, 
                          title = "Methylation rate")
g <- cowplot::plot_grid(g1, g2, ncol = 2, nrow = 1)
print(g)

Clustering methylation profiles

Another application of the BPRMeth model is to use the higher-order methylation features to cluster DNA methylation patterns across genomic regions and examine whether distinct methylation profiles are associated to different gene expression levels. To cluster methylation profiles, a mixture modelling approach is considered and we perform inference either via EM algorithm (MLE estimates) or using the mean-field variational inference (Variational Bayes VB) model. In this example we will use the VB model to perform variational inference.

The BPRMeth package provides the cluster_profiles_vb function for performing Bayesian clustering, where the user needs to provide the number of clusters $K$, the methylation regions $X$, the observation model and a basis object. For efficiency we run on 20 VB iterations.

# Set seed for reproducible results
set.seed(12) 
# Create basis object
basis_obj <- create_rbf_object(M = 3)
# Set number of clusters K
K <- 4
# Perform clustering
cl_obj <- cluster_profiles_vb(X = met_region$met, K = K, model = "binomial", 
                              alpha_0 = .5, beta_0 = .1,
                              basis = basis_obj, vb_max_iter = 20)

Figure \@ref(fig:clusterplot) shows the fitted methylation profiles for each cluster.

cluster_plot <- plot_cluster_profiles(cluster_obj = cl_obj)
print(cluster_plot)

Beta methylation profiles

The structure of the package makes it straighforward to work with methylation data generated from different technologies. For example, with methylation array data, the most common approach is to use the Beta distribution to capture the methylation level from the array experiment. Below we provide a minimal analysis step for inferring methylation profiles on synthetic data; note that we just need to define that model = beta. (In some cases, researchers work with M-values, where they transform the Beta values to $(-\inf, \inf)$, i.e. they make them Gaussian random variables, if that is the case the BPRMeth package can handle this by setting model = gaussian.) Note that for the Beta model, we currently provide only maximum likelihood estimation (MLE).

basis_obj <- create_rbf_object(M = 5)
# Infer beta profiles
beta_prof <- infer_profiles_mle(X = beta_data, model = "beta",
                                basis = basis_obj, is_parallel = FALSE)
p <- plot_infer_profiles(region = 15, obj_prof = beta_prof,
                         obs = beta_data, title = "Beta profile")
print(p)

Bernoulli methylation profiles

Similarly, someone might analyse single cell methylation data, where now would need to set model = bernoulli as shown in the code snippet below using again synthetic data.

basis_prof <- create_rbf_object(M = 5)
basis_mean <- create_rbf_object(M = 0)
bern_prof <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
                               basis = basis_prof, is_parallel = FALSE)
bern_mean <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
                               basis = basis_mean, is_parallel = FALSE)
p <- plot_infer_profiles(region = 60, obj_prof = bern_prof, 
                         obj_mean = bern_mean, obs = bernoulli_data, 
                         title = "Bernoulli profile")
print(p)

We can finally cluster the single-cell methylation profiles as follows

# Perform clustering
cl_obj <- cluster_profiles_vb(X = bernoulli_data, K = 3, model = "bernoulli", 
                              basis = basis_obj, vb_max_iter = 40)

cluster_plot <- plot_cluster_profiles(cluster_obj = cl_obj)
print(cluster_plot)

Session Info

This vignette was compiled using:

sessionInfo()

Bibliography

[1] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412.

[2] Kapourani, C. A. and Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129

[3] Clark, S.J., Argelaguet, R., Kapourani, C.A., Stubbs, T.M., Lee, H.J., Alda-Catalinas, C., Krueger, F., Sanguinetti, G., Kelsey, G., Marioni, J.C., Stegle, O. and Reik, W., (2018). scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nature communications, 9(1), p.781.

[4] ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57-74.

Acknowledgements

This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.

This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and by the European Research Council through grant MLCS306999.



andreaskapou/BPRMeth documentation built on June 11, 2020, 10:49 p.m.