In andreaskapou/Melissa: Bayesian clustering and imputationa of single cell methylomes

Synthetic data analysis

Imputation performance

Use the following parameter settings for measuring model performance on all parameters and latent variables:

N = 200 (Number of cells)
M = 100 (Number of genes)
K = 4 (Number of clusters)
pi_k = (.2, .4, .15, .25) (Mixing proportions)
vb_max_iter = 300 (VB iterations)
total_sims = 10 (Total simulations)
gaussian_noise = 0.05 (Noise on the probability of success per CpG)
gaussian_noise = 0.15 (For high noise experiments)

Change CpG coverage

Function for generating synthetic data: preprocessing/synthetic/encode_coverage.R

Parameters

cluster_var = 0.5 (% of regions that are dissimilar across clusters)
data_train_prcg = 0.1 (% of data to keep fully for training)
region_train_prcg = 1 (% of regions kept for training)
cpg_train_prcg = seq(.1, .9, .1) (% of CpGs kept for training in each region)

Change cluster dissimilarity

We set a specific value for the cpg_train_prcg = 0.4 (supplementary figure is cpg_train_prcg = 0.8) and change the dissimilarity percentage to show the robustness of the model across different dissimilarity levels.

Function for generating synthetic data: preprocessing/synthetic/encode_dissimilarity.R

Parameters

cluster_var = seq(0, 1, .1) (% of regions that are dissimilar across clusters)
data_train_prcg = 0.1 (% of data to keep fully for training)
region_train_prcg = 1 (% of regions kept for training)
cpg_train_prcg = 0.4 (% of CpGs kept for training in each region)

Compare with different models

BPRMeth
Rate
Melissa rate
Random Forests (RF)

Cluster performance

From the above analysis we obtain the posterior probability of each cell belonging to each cluster and then we measure cluster performance using the Adjusted Rand Index (ARI).

Model selection

On the same data run 10 simulations with K = 10 clusters and check when the model returns the actual 4 clusters that generated the data, file model-selection.R. To obtain a better understanding on the model selection from the variational approximation we use a broad and strict prior over the weights i.e.

Broad (uninformative) prior

delta_0 <- rep(3, K) (Dirichlet prior)
alpha_0 <- 0.5 (Gamma prior)
beta_0 <- 10 (Gamma prior)

Strict (shrinkage) prior

delta_0 <- rep(1e-2, K) (Dirichlet prior)
alpha_0 <- 2 (Gamma prior)
beta_0 <- 0.5 (Gamma prior)

VB efficiency

Run VB model and Gibbs model on synthetic data (file model-efficiency.R) and show the efficiency of the VB model.

Parameters:

N = c(50, 100, 200, 500, 1000, 2000) (Number of cells)
M = 200 (Number of genes)
K = 3 (Number of clusters)
pi_k = (.3, .4, .3) (Mixing proportions)
vb_max_iter = 400 (VB iterations)
gibbs_nsim = 3000 (VB iterations)
total_sims = 5 (Total simulations)

Synthetic data analysis - Pseudo single cells by subsampling

Data in datasets/ENCODE/scBS-seq/parsed/binarised Scripts in datasets/ENCODE/scBS-seq/preprocessing

Imputation performance

The same preprocessing steps as the real data. Use 40 H1-hESC (pseudo)-single cells and 40 GM12878 (pseudo)-single cells. Filtering process: For 10kb region keep regions with at least 10 CpGs coverage, and for 5kb keep regions with at least 8 CpGs.

During imputation process 3 different training procedures: 20%, 50% and 80% CpGs used for training set. The remaining CpGs are used for test set.

Mouse ESCs dataset - Angermueller 2016 / MT-Seq

Pre-processing

Steps for pre-processing methylation data:

Create different genomic region windows for promoters: 3kb, 5kb, 10kb . Regarding active enhancers, super enhancers and Nanog regions keep only those whose annotation is larger than 1kb region. Finally keep regions with at least 3 CpGs coverage, function create_promoter.R or create_region.R.
Filter processed genomic regions using the filter_met.R function, which will keep regions with at least 10 CpG coverage and that have variability across cells. More specifically:
For 3kb region:
- cov = 10
- met_sd = 0.2
- rna_var = 5
For 5kb region:
- cov = 10
- met_sd = 0.2
- rna_var = 5
For 10kb region:
- cov = 15
- met_sd = 0.2
- rna_var = 5
For active enhancers region:
- cov = 10
- met_sd = 0.2
- region_annot > 1000bp
For super enhancers region:
- cov = 10
- met_sd = 0.2
- region_annot > 1000bp
For Nanog region:
- cov = 10
- met_sd = 0.2
- region_annot > 1000bp
- The remaining regions were broaden to +/-2.5kb regions, since after filtering the number of regions was really low.

Finally, we kept only regions that had 10 CpGs coverage across 50% of the cells, so we could test the assumption of information sharing, and also have an adequate amount of information when inferring methylation states.

Imputation test

For imputation we use the following parameters settings

K = 6 (Number of clusters)
filt_region_cov = 0.5 (Filter genes that have low coverage across cells)
data_train_prcg = 0.4 (% of data that will be used fully for training set)
region_train_prcg = 0.95 (% of samples to keep for training, i.e. some genes will have no coverage for a given cell.)
cpg_train_prcg = 0.5 (% of CpGs to keep for training at each region/cell)
total_sims = 10 (Total simulations)

VB efficiency

Run VB model on MT-seq data (file model-efficiency/mt-seq/model-efficiency.R) and show the efficiency of the VB model for all contexts.

Parameters: The same as the ones when running the model for clustering and imputation.

suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(purrr))
dt <- data.table::data.table("Context" = c("Prom3k", "Prom5k", "Prom10k", "Active Enhancers", "Nanog", "Super Enhancers"), "CpGs (in millions)" = c("0.62", "2.1", "6", "0.7", "0.18", "0.5"),  "Time" = c("09:58 - 11:17", "10:38 - 13:36", "09:57 - 15:34", "09:52 - 11:38", "09:54 - 10:31", "09:53 - 10:54"), "Time elapsed (in hours)" = c("1.31", "2.9", "5.6", "1.76", "0.61", "1"))

dt %>%
  kable("html") %>%
  kable_styling()

Mouse ESCs dataset - Smallwood 2014

Pre-processing

Steps for pre-processing methylation data:

Create different genomic region windows for promoters: 3kb, 5kb, 10kb . Regarding active enhancers, super enhancers and Nanog regions keep only those whose annotation is larger than 1000bp region. Finally keep regions with at least 3 CpGs coverage, function create_region.R.
Filter processed genomic regions using the filter_met.R function, which will keep regions with at least N CpG coverage and that have variability across cells. More specifically:
For 3kb region:
- cov = 10
- met_sd = 0.2
For 5kb region:
- cov = 15
- met_sd = 0.2
For 10kb region:
- cov = 20
- met_sd = 0.2
For active enhancers region:
- cov = 10
- met_sd = 0.2
- region_annot > 1000bp
For super enhancers region:
- cov = 10
- met_sd = 0.2
- region_annot > 1000bp
For Nanog region:
- cov = 10
- met_sd = 0.2
- region_annot > 1000bp
- The remaining regions were broaden to +/-2.5kb regions, since after filtering the number of regions was really low.

Finally, we kept only regions that had N CpGs coverage across 50% of the cells, so we could test the assumption of information sharing, and also have an adequate amount of information when inferring methylation states.

Imputation test

For imputation we use the following parameters settings

K = 5 (Number of clusters)
filt_region_cov = 0.5 (Filter genes that have low coverage across cells)
data_train_prcg = 0.4 (% of data that will be used fully for training set)
region_train_prcg = 0.95 (% of samples to keep for training, i.e. some genes will have no coverage for a given cell.)
cpg_train_prcg = 0.5 (% of CpGs to keep for training at each region/cell)
total_sims = 10 (Total simulations)

VB efficiency

Run VB model on MT-seq data (file model-efficiency/mt-seq/model-efficiency.R) and show the efficiency of the VB model for all contexts.

Parameters: The same as the ones when running the model for clustering and imputation.

suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(purrr))
dt <- data.table::data.table("Context" = c("Prom3k", "Prom5k", "Prom10k", "Active Enhancers", "Nanog", "Super Enhancers"), "CpGs (in millions)" = c("0.98", "1.54", "4.13", "0.85", "0.26", "0.29"), "Time" = c("08:32 - 10:41", "08:31 - 10:44", "11:19 - 15:24", "08:33 - 10:30", "08:35 - 09:30", "08:34 - 09:28"), "Time elapsed (in hours)" = c("1.83", "2.21", "4", "2", "0.91", "0.9"))

dt %>%
  kable("html") %>%
  kable_styling()

DeepCpG ENCODE analysis steps

Filter chromosomes for synthetic data chr1 - chr6: Using file /Melissa/preprocessing/synthetic/encode_sc_deepcpg_filter_chr.R

andreaskapou/Melissa documentation built on June 12, 2020, 5:54 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

andreaskapou/Melissa
Bayesian clustering and imputationa of single cell methylomes

In andreaskapou/Melissa: Bayesian clustering and imputationa of single cell methylomes

Synthetic data analysis

Imputation performance

Change CpG coverage

Change cluster dissimilarity

Compare with different models

Cluster performance

Model selection

VB efficiency

Synthetic data analysis - Pseudo single cells by subsampling

Imputation performance

Mouse ESCs dataset - Angermueller 2016 / MT-Seq

Pre-processing

Imputation test

VB efficiency

Mouse ESCs dataset - Smallwood 2014

Pre-processing

Imputation test

VB efficiency

DeepCpG ENCODE analysis steps

R Package Documentation

Browse R Packages

We want your feedback!

andreaskapou/Melissa Bayesian clustering and imputationa of single cell methylomes

In andreaskapou/Melissa: Bayesian clustering and imputationa of single cell methylomes

Synthetic data analysis

Imputation performance

Change CpG coverage

Change cluster dissimilarity

Compare with different models

Cluster performance

Model selection

VB efficiency

Synthetic data analysis - Pseudo single cells by subsampling

Imputation performance

Mouse ESCs dataset - Angermueller 2016 / MT-Seq

Pre-processing

Imputation test

VB efficiency

Mouse ESCs dataset - Smallwood 2014

Pre-processing

Imputation test

VB efficiency

DeepCpG ENCODE analysis steps

R Package Documentation

Browse R Packages

We want your feedback!

andreaskapou/Melissa
Bayesian clustering and imputationa of single cell methylomes