Synthetic data analysis

Imputation performance

Use the following parameter settings for measuring model performance on all parameters and latent variables:

Change CpG coverage

Function for generating synthetic data: preprocessing/synthetic/encode_coverage.R

Parameters

Change cluster dissimilarity

We fix cpg_train_prcg = 0.4 (cpg_train_prcg = 0.8 for the supplementary figure) and vary the dissimilarity percentage to show the robustness of the model across different dissimilarity levels.

Function for generating synthetic data: preprocessing/synthetic/encode_dissimilarity.R

Parameters

Compare with different models

Cluster performance

From the above analysis we obtain the posterior probability of each cell belonging to each cluster and then we measure cluster performance using the Adjusted Rand Index (ARI).
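A minimal sketch of the ARI computation, written in base R so that it has no package dependencies. The posterior matrix `post`, the true labels, and the hard-assignment step (each cell goes to its most probable cluster) are assumptions made for illustration; the analysis scripts may convert posteriors to labels differently.

```r
# Adjusted Rand Index between two label vectors (base R only).
adjusted_rand_index <- function(truth, pred) {
  tab <- table(truth, pred)                    # contingency table of the two labelings
  n_pairs <- function(n) n * (n - 1) / 2       # number of unordered pairs among n items
  sum_cells <- sum(n_pairs(tab))
  sum_rows  <- sum(n_pairs(rowSums(tab)))
  sum_cols  <- sum(n_pairs(colSums(tab)))
  expected  <- sum_rows * sum_cols / n_pairs(length(truth))
  max_index <- (sum_rows + sum_cols) / 2
  # Note: undefined (NaN) when both labelings put all cells in one cluster.
  (sum_cells - expected) / (max_index - expected)
}

# Hypothetical posterior probabilities: rows are cells, columns are clusters.
post <- rbind(c(0.9, 0.1, 0.0),
              c(0.8, 0.1, 0.1),
              c(0.1, 0.7, 0.2),
              c(0.2, 0.6, 0.2),
              c(0.0, 0.2, 0.8),
              c(0.1, 0.1, 0.8))
true_labels <- c(1, 1, 2, 2, 3, 3)

# Assumed hard assignment: the most probable cluster for each cell.
pred_labels <- max.col(post, ties.method = "first")
ari <- adjusted_rand_index(true_labels, pred_labels)
```

The ARI is invariant to relabelling of clusters, so the arbitrary cluster indices returned by a mixture model do not need to be matched to the true labels first.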

Model selection

On the same data, run 10 simulations with K = 10 clusters and check whether the model recovers the 4 clusters that actually generated the data (file model-selection.R). To better understand model selection under the variational approximation, we use both a broad and a strict prior over the weights, i.e.

Broad (uninformative) prior

Strict (shrinkage) prior

VB efficiency

Run the VB model and the Gibbs model on synthetic data (file model-efficiency.R) to show the efficiency of the VB model.

Parameters:

Synthetic data analysis - Pseudo single cells by subsampling

Data in datasets/ENCODE/scBS-seq/parsed/binarised

Scripts in datasets/ENCODE/scBS-seq/preprocessing

Imputation performance

Apply the same preprocessing steps as for the real data. Use 40 H1-hESC (pseudo-)single cells and 40 GM12878 (pseudo-)single cells. Filtering: for 10kb regions, keep regions with coverage of at least 10 CpGs; for 5kb regions, keep regions with coverage of at least 8 CpGs.
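The filtering rule can be sketched as follows. The data layout (one cell as a named list of regions, each region a matrix with one row per covered CpG) and the helper names `filter_regions` and `make_region` are hypothetical and only serve the illustration.

```r
# Keep only the regions of a cell that are covered by at least `min_cpgs` CpGs.
# Assumed layout: each region is a matrix with one row per covered CpG
# (column 1: CpG location, column 2: binary methylation state).
filter_regions <- function(cell, min_cpgs) {
  n_cpgs <- vapply(cell, nrow, integer(1))
  cell[n_cpgs >= min_cpgs]
}

# Hypothetical helper that simulates a region with n covered CpGs.
make_region <- function(n) {
  cbind(loc = sort(runif(n, -1, 1)), state = rbinom(n, 1, 0.5))
}

set.seed(1)
cell <- list(r1 = make_region(12), r2 = make_region(9),
             r3 = make_region(10), r4 = make_region(5))

kept_10kb <- filter_regions(cell, min_cpgs = 10)  # threshold used for 10kb regions
kept_5kb  <- filter_regions(cell, min_cpgs = 8)   # threshold used for 5kb regions
```

Here `names(kept_10kb)` contains r1 and r3, while `names(kept_5kb)` additionally contains r2.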

For imputation we use three training settings, in which 20%, 50% and 80% of the CpGs form the training set. The remaining CpGs form the test set.
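A minimal sketch of such a split for a single region, assuming the CpGs of a region are stored as rows of a matrix and are assigned to the training set uniformly at random. The function name `split_cpgs` and the rounding rule are illustrative assumptions, not taken from the analysis scripts.

```r
# Randomly split the CpGs (rows) of one region into training and test sets.
split_cpgs <- function(region, train_frac) {
  n <- nrow(region)
  n_train <- max(1, round(n * train_frac))   # assumed: keep at least one training CpG
  idx <- sort(sample.int(n, n_train))
  list(train = region[idx, , drop = FALSE],
       test  = region[-idx, , drop = FALSE])
}

set.seed(42)
# Hypothetical region with 10 covered CpGs (location, binary methylation state).
region <- cbind(loc = seq(-0.9, 0.9, length.out = 10),
                state = rbinom(10, 1, 0.5))

# The three training settings: 20%, 50% and 80% of CpGs used for training.
splits <- lapply(c(0.2, 0.5, 0.8), function(f) split_cpgs(region, f))
n_train <- vapply(splits, function(s) nrow(s$train), integer(1))
n_test  <- vapply(splits, function(s) nrow(s$test), integer(1))
```

Each CpG ends up in exactly one of the two sets, so imputation accuracy is always measured on CpGs the model has not seen.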

Mouse ESCs dataset - Angermueller 2016 / MT-Seq

Pre-processing

Steps for pre-processing methylation data:

Finally, we kept only regions that had coverage of at least 10 CpGs in at least 50% of the cells, so that we could test the assumption of information sharing and also have an adequate amount of information when inferring methylation states.
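This filter can be sketched on a matrix of per-region CpG coverage counts. The matrix `cov_counts` and its values are made up for illustration, and reading both thresholds as "at least" is an assumption.

```r
# Hypothetical coverage matrix: rows are regions, columns are cells,
# entries are the number of CpGs covered in that region for that cell.
cov_counts <- matrix(c(12, 15,  3, 11,
                        2, 10,  4,  0,
                       10,  0, 12,  1,
                        0,  0,  0, 25),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("region", 1:4), paste0("cell", 1:4)))

min_cpgs  <- 10    # minimum CpG coverage per region and cell
min_cells <- 0.5   # minimum fraction of cells meeting that coverage

# Fraction of cells with sufficient coverage, computed per region.
frac_covered <- rowMeans(cov_counts >= min_cpgs)
keep <- frac_covered >= min_cells
filtered <- cov_counts[keep, , drop = FALSE]
```

In this toy example region1 (3 of 4 cells) and region3 (2 of 4 cells) pass the filter, while region2 and region4 are dropped.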

Imputation test

For imputation we use the following parameter settings

VB efficiency

Run the VB model on the MT-seq data (file model-efficiency/mt-seq/model-efficiency.R) and show the efficiency of the VB model for all contexts.

Parameters: The same as the ones when running the model for clustering and imputation.

suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(purrr))
dt <- data.table::data.table(
  "Context" = c("Prom3k", "Prom5k", "Prom10k", "Active Enhancers", "Nanog", "Super Enhancers"),
  "CpGs (in millions)" = c("0.62", "2.1", "6", "0.7", "0.18", "0.5"),
  "Time" = c("09:58 - 11:17", "10:38 - 13:36", "09:57 - 15:34", "09:52 - 11:38", "09:54 - 10:31", "09:53 - 10:54"),
  "Time elapsed (in hours)" = c("1.31", "2.9", "5.6", "1.76", "0.61", "1")
)

dt %>%
  kable("html") %>%
  kable_styling()
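The elapsed time in hours can be derived from a "Time" entry of the form "HH:MM - HH:MM". The sketch below assumes that each run starts and ends on the same day; the helper names `to_minutes` and `elapsed_hours` are illustrative.

```r
# Convert a clock time "HH:MM" to minutes since midnight.
to_minutes <- function(hhmm) {
  hm <- as.integer(strsplit(hhmm, ":", fixed = TRUE)[[1]])
  hm[1] * 60 + hm[2]
}

# Elapsed hours for ranges written as "HH:MM - HH:MM" (same-day runs assumed).
elapsed_hours <- function(time_range) {
  vapply(strsplit(time_range, " - ", fixed = TRUE), function(p) {
    (to_minutes(p[2]) - to_minutes(p[1])) / 60
  }, numeric(1))
}

# Two of the time ranges from the table above (Prom3k and Nanog).
elapsed <- elapsed_hours(c("09:58 - 11:17", "09:54 - 10:31"))
```

For these two ranges the result is 79/60 and 37/60 hours respectively.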

Mouse ESCs dataset - Smallwood 2014

Pre-processing

Steps for pre-processing methylation data:

Finally, we kept only regions that had coverage of at least N CpGs in at least 50% of the cells, so that we could test the assumption of information sharing and also have an adequate amount of information when inferring methylation states.

Imputation test

For imputation we use the following parameter settings

VB efficiency

Run the VB model on the MT-seq data (file model-efficiency/mt-seq/model-efficiency.R) and show the efficiency of the VB model for all contexts.

Parameters: The same as the ones when running the model for clustering and imputation.

suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(purrr))
dt <- data.table::data.table(
  "Context" = c("Prom3k", "Prom5k", "Prom10k", "Active Enhancers", "Nanog", "Super Enhancers"),
  "CpGs (in millions)" = c("0.98", "1.54", "4.13", "0.85", "0.26", "0.29"),
  "Time" = c("08:32 - 10:41", "08:31 - 10:44", "11:19 - 15:24", "08:33 - 10:30", "08:35 - 09:30", "08:34 - 09:28"),
  "Time elapsed (in hours)" = c("1.83", "2.21", "4", "2", "0.91", "0.9")
)

dt %>%
  kable("html") %>%
  kable_styling()

DeepCpG ENCODE analysis steps

  1. Filter chromosomes for synthetic data chr1 - chr6: Using file /Melissa/preprocessing/synthetic/encode_sc_deepcpg_filter_chr.R


andreaskapou/Melissa documentation built on June 12, 2020, 5:54 p.m.