Home

/

GitHub

/

xyz111131/SIMPLEs

/

SIMPLE: SIMPLE: Imputing zero entries and clustering for scRNASeq...

SIMPLE: SIMPLE: Imputing zero entries and clustering for scRNASeq...
In xyz111131/SIMPLEs: SIMPLEs: single-cell RNA sequencing imputation and cell clustering methods by modeling gene module variation

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/SIMPLE.R

SIMPLE imputes zeros in the gene expression data using the expression level in similar cells and gene-gene correlation. Zero entries in the observed expression matrix come from molecule loss during the experiment ('dropout') or too low expression to be measured. We used Monte Carlo EM algorithm to sample the imputed values and obtain MLEs of all the parameters.

SIMPLE(dat, K0, M0 = 1, iter = 10, est_lam = 1, impt_it = 5,
  penl = 1, sigma0 = 100, pi_alpha = 1, beta = NULL, verbose = F,
  max_lambda = F, lambda = NULL, sigma = NULL, mu = NULL,
  est_z = 1, clus = NULL, p_min = 0.8, cutoff = 0.1, K = 20,
  min_gene = 300, num_mc = 3, fix_num = F, mcmc = 50, burnin = 2)

`dat`	scRNASeq data matrix. Each row is a gene, each column is a cell.
`K0`	Number of latent gene modules. See details.
`M0`	Number of clusters. See details.
`iter`	The number of EM iterations using full data set. See details.
`est_lam`	The iteration starts to estimate lambda.
`impt_it`	The iteration starts to sample new imputed values in initial phase. See details.
`penl`	L1 penalty for the factor loadings.
`sigma0`	The variance of the prior distribution of μ.
`pi_alpha`	The hyperparameter of the prior distribution of π. See details.
`beta`	A G by K0 matrix. Initial values for factor loadings (B). If null, beta will be initialized from normal distribution with mean zero and variance M0/K0. See details.
`verbose`	Whether to show some intermediate results. Default = False.
`max_lambda`	Whether to maximize over lambda.
`lambda`	A M0 by K0 matrix. Initial values for the variances of factors. Each column is for a cell cluster. If null, lambda will initialize to be 1/M0. See details.
`sigma`	A G by M0 matrix. Initial values for the variance of idiosyncratic noises. Each column is for a cell cluster. If null, sigma will initialize to be 1. See details.
`mu`	A G by M0 matrix. Initial values for the gene expression mean of each cluster. Each column is for a cell cluster. If NULL, it will take the sample mean of cells weighted by the probability in each cluster. See details.
`est_z`	The iteration starts to update Z.
`clus`	Initial clustering of scRNASeq data. If NULL, the function will use PCA and Kmeans to do clustering initially.
`p_min`	Initialize parameters using genes expressed in at least p_min proportion of cells. If the number genes selected is less than min_gene, select min_gene genes with higest proportion of non zeros. Default = 0.8.
`cutoff`	The value below cutoff is treated as no expression. Default = 0.1.
`K`	The number of PCs used in the initial clustering. Default = 20.
`min_gene`	Minimal number of genes used in the initial phase. See details.
`num_mc`	The number of Gibbs steps for generating new imputed data after the parameters have been updated during Monte Carlo EM. Default = 3.
`fix_num`	If true, always use min_gene number of genes with the highest proportion of non zeros in the initial phase. Default = F. See details.
`mcmc`	The number of Gibbs steps to sample imputed data after EM. Default = 50.
`burnin`	The number of burnin steps before sample imputed data after EM. Default = 2.

We assume that the cells come from M0 clusters. Within each cell cluster, the 'true' gene expression is modeled by a multivariate Gaussian distribution whose covariance matrix can be composed into a low rank matrix (a couple of latent gene modules) and idiosyncratic noises. Gene modules are shared among cell clusters, although the coexpression level of each gene module in different cell cluster can be different.
Suppose there are G genes and n cells. For each cell cluster, the gene expression follows Y|Z=m~MVN(μ_m, BΛ_m B^T + Σ_m) where B is a G by K0 matrix, Σ_m is a G by G diagonal matrix whose diagonal entries are specified by sigma, and Λ_m is a K0 by K0 diagonal matrix whose diagonal entries are specified by lambda. P(Z_m) = π_m where π~Dir(α).

The algorithm first runs Monte Carlo EM using only the genes with low dropout rate (initial phase) and initializes factor loadings and clustering membership. Then it runs more rounds of Monte Carlo EM using all the genes. In the initial phase, we use the genes with dropout rate less than 1 - p_min; if the number of such genes is less than min_gene, we rank the genes by the number cells with nonzero expression and keep the top min_gene genes. If fix_num is true, then we always keep the top min_gene genes in the initial phase.

After Monte Carlo EM, we obtain MLE of B, Λ_m, etc. If num_mc > 0, this function will sample multiple imputed values and cell factors at the MLEs of all the parameters by Gibbs sampling. Based on multiple imputed values, it will evaluate cluster stability for each cell (consensus_cluster). It will also output the posterior mean and variance for the imputed values and cell factors.

SIMPLE returns a list of results in the following order.

loglik The log-likelihood of the full imputed gene expression at each iteration.
pi The prior probabilites of cells belong to each cluster.
mu Mean expression for each gene in each cluster
sigma Variances of idiosyncratic noises for each gene in each cluster.
beta Factor loadings.
lambda Variances of factors for each cluster.
z The posterior probability of each cell belonging to each cluster.
Yimp0 A matrix contains the expectation of gene expression specified by the model.
pg A G by M0 matrix, dropout rate for each gene in each cluster estimated from initial clustering.
initclus Output initial cluster results.
impt A matrix contains the mean of each imputed entry by sampling multiple imputed values while the parameters are MLE. If mcmc <= 0, output the imputed expressoin matrix at last step of EM
impt_var A matrix contains the variance of each imputed entry by sampling multiple imputed values while the parameters are MLE. NULL if mcmc <= 0.
Ef If mcmc >0, output posterior means of factors given observed data (a n by K0 matrix). If mcmc <= 0, output conditional expectation of the factors for each cluster E(f_i|z_i= m) at the last step of EM. A list with length M0, each element in the list is a n by K0 matrix.
Varf If mcmc >0, output posterior variances of factors given observed data (a n by K0 matrix). If mcmc <= 0, output conditional covariance matrix of factors for each cluster Var(f_i|z_i = m) at the last step of EM. A list with length M0, each element in the list is a K0 by K0 matrix.
consensus_cluster Score for the clustering stability of each cell by multiple imputations. NULL if mcmc <=0.

Zhirui Hu, zhiruihu@g.harvard.edu

Songpeng Zu, songpengzu@g.harvard.edu

SIMPLE_B

library(foreach) 
library(doParallel) 
library(SIMPLEs) 

# simulate number of clusters
M0 = 3 
# number of cells
n = 300
# simulation_bulk and getCluster is defined in the util.R under the util directory of the corresponding github repository.
source("utils/utils.R")
simu_data = simulation_bulk(n=300, S0 = 20, K = 6, MC=M0, block_size = 32, indepG = 1000 - 32*6, verbose=F, overlap=0)
Y2 = simu_data$Y2 
# number of factors
K0 = 6
# parallel
registerDoParallel(cores = 6)
# estimate the parameters and sample imputed values 
result <- SIMPLE(Y2, K0, M0, clus = NULL, K = 20, p_min = 0.5, max_lambda=T, min_gene = 200,cutoff=0.01)
# sample imputed values 
# evaluate cluster performance
celltype_true = simu_data$Z 
mclust::adjustedRandIndex(apply(result$z,1, which.max), celltype_true)
# or redo clustering based on imputed values (sometimes work better for real data)
getCluster(result$impt, celltype_true, Ks = 20, M0 = M0)[[1]]

xyz111131/SIMPLEs documentation built on Jan. 8, 2020, 2:48 a.m.

xyz111131/SIMPLEs index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

xyz111131/SIMPLEs
SIMPLEs: single-cell RNA sequencing imputation and cell clustering methods by modeling gene module variation

SIMPLE: SIMPLE: Imputing zero entries and clustering for scRNASeq...
In xyz111131/SIMPLEs: SIMPLEs: single-cell RNA sequencing imputation and cell clustering methods by modeling gene module variation

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Related to SIMPLE in xyz111131/SIMPLEs...

R Package Documentation

Browse R Packages

We want your feedback!

xyz111131/SIMPLEs SIMPLEs: single-cell RNA sequencing imputation and cell clustering methods by modeling gene module variation

SIMPLE: SIMPLE: Imputing zero entries and clustering for scRNASeq... In xyz111131/SIMPLEs: SIMPLEs: single-cell RNA sequencing imputation and cell clustering methods by modeling gene module variation

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Related to SIMPLE in xyz111131/SIMPLEs...

R Package Documentation

Browse R Packages

We want your feedback!

xyz111131/SIMPLEs
SIMPLEs: single-cell RNA sequencing imputation and cell clustering methods by modeling gene module variation

SIMPLE: SIMPLE: Imputing zero entries and clustering for scRNASeq...
In xyz111131/SIMPLEs: SIMPLEs: single-cell RNA sequencing imputation and cell clustering methods by modeling gene module variation