SIMPLE_B: Impute zero entries and clustering for scRNASeq data...

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/SIMPLE_B.R

Description

Impute zero entries in the gene expression matrix based on the expression level in the similar cells and gene-gene correlation. Zero entries in the observed expression matrix come from either molecule loss during the experiment ('dropout') or too low expression to be measured. We use Monte Carlo EM algorithm to sample the imputed values and learn the parameters. We use this function to further integrate the bulk RNASeq data for the similar cell types as the scRNASeq to infer the dropout rate.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
SIMPLE_B(
  dat,
  K0,
  bulk,
  celltype,
  M0 = 1,
  clus = NULL,
  K = 20,
  b = 1,
  iter = 10,
  est_z = 1,
  impt_it = 3,
  max_lambda = F,
  est_lam = 1,
  penl = 1,
  sigma0 = 100,
  pi_alpha = 1,
  beta = NULL,
  lambda = NULL,
  sigma = NULL,
  mu = NULL,
  p_min = 0.8,
  min_gene = 300,
  cutoff = 0.1,
  verbose = F,
  num_mc = 3,
  fix_num = F,
  mcmc = 50,
  burnin = 2
)

Arguments

dat

scRNASeq data matrix. Each row is a gene, each column is a cell.

K0

Number of latent gene modules. See details.

bulk

Bulk RNASeq data matrix. Should be log(1 + tpm/fpkm/rpkm). Each row is a gene which must be ordered the same as the scRNASeq data. Each column is a cell type which must be ordered as the cell type label in celltype.

celltype

A numeric vector for labels of cells in the scRNASeq. Each cell type has corresponding mean expression in the bulk RNASeq data. The labels must start from 1 to the number of types. If NULL, all cells are treated as a single cell type and the input bulk RNASeq should also have one column that is the mean expression over all the cell types.

M0

Number of clusters. See details.

clus

Initial clustering of scRNASeq data. If NULL, the function will use PCA and Kmeans to do clustering initially.

K

The number of PCs used in the initial clustering. Default = 20.

b

The scaling factor between scRNASeq and bulk RNASeq. If NULL, will do a weighted linear regression between mean of scRNASeq and bulk RNASeq for each cell type. Default = 1.

iter

The number of EM iterations using full data set. See details.

est_z

The iteration starts to update Z.

impt_it

The iteration starts to sample new imputed values in initial phase. See details.

max_lambda

Whether to maximize over lambda.

est_lam

The iteration starts to estimate lambda.

penl

L1 penalty for the factor loadings.

sigma0

The variance of the prior distribution of μ.

pi_alpha

The hyperparameter of the prior distribution of π. See details.

beta

A G by K0 matrix. Initial values for factor loadings (B). If null, beta will initialze from normal distribution with mean zero and variance M0/K0. See details.

lambda

A M0 by K0 matrix. Initial values for the variances of factors. Each column is for a cell cluster. If null, lambda will initialize to be 1/M0. See details.

sigma

A G by M0 matrix. Initial values for the variance of idiosyncratic noises. Each column is for a cell cluster. If null, sigma will initialize to be 1. See details.

mu

A G by M0 matrix. Initial values for the gene expression mean of each cluster. Each column is for a cell cluster. If NULL, it will take the sample mean of cells weighted by the probability in each cluster. See details.

p_min

Initialize parameters using genes expressed in at least p_min proportion of cells. If the number genes selected is less than min_gene, select min_gene genes with higest proportion of non zeros. Default = 0.8.

min_gene

Initialize parameters using genes expressed in at least p_min proportion of cells. If the number genes selected is less than min_gene, select min_gene genes with higest proportion of non zeros. Default = 1000.

cutoff

The value below cutoff is treated as no expression. Default = 0.1.

verbose

Whether to show some intermediate results. Default = False.

num_mc

The number of Gibbs steps for generating imputed data when the parameters are updated during Monte Carlo EM. Default = 3.

fix_num

If true, always use top min_gene genes with higest proportion of non zeros in the initial phase. Default = F. See details.

mcmc

The number of Gibbs steps to sample imputed data after EM. Default = 50.

burnin

The number of burnin steps before sample imputed data after EM. Default = 2.

Details

We assume that the cells come from M0 clusters. Within each cell cluster, the 'true' gene expression is modeled by a multivariate Gaussian distribution whose covariance matrix can be composed into a low rank matrix (a couple of latent gene modules) and idiosyncratic noises. Gene modules are shared among cell clusters though the coexpression level of each gene module can be different. Suppose there are G genes and n cells. For each cell cluster, the gene expression follows Y|Z=m~MVN(μ_m, BΛ_m B^T + Σ_m) where B is a G by K0 matrix, Σ_m is a G by G diagonal matrix whose diagonal entries are specified by sigma, and Λ_m is a K0 by K0 diagonal matrix whose diagonal entries are specified by lambda. P(Z_m) = π_m where π~Dir(α).

The algorithm first runs Monte Carlo EM using only the genes with low dropout rate (initial phase) and initializes factor loadings and clustering membership. Then it runs another rounds of Monte Carlo EM using all the genes. In the initial phase, we use the genes with dropout rate less than 1 - p_min; if the number of genes is less than min_gene, we rank the genes by the number cells with nonzero expression and keep the top min_gene genes. If fix_num is true, then we always keep the top min_gene genes in the initial phase.

Value

SIMPLE_B returns a list of results in the following order.

  1. loglikThe log-likelihood of the imputed gene expression at each iteration.

  2. loglik_tot The log-likelihood of the full imputed gene expression at each iteration and the prior of B matrix.

  3. BIC BIC which is -2 *loglik_tot + penalty on the number of parameters. Can be used to select paramters.

  4. piProbabilites of cells belong to each cluster.

  5. muMean expression for each cluster

  6. sigmaVariances of idiosyncratic noises for each cluster.

  7. betaFactor loadings.

  8. lambdaVariances of factors for each cluster.

  9. zThe probability of each cell belonging to each cluster.

  10. Yimp0A matrix contains the expectation of imputed expression.

  11. pgA G by M0 matrix, dropout rate for each gene in each cluster.

  12. imptA matrix contains the mean of each imputed entry by sampling multiple imputed values at MLE. If mcmc <= 0, output imputed expressoin matrix at last step of EM

  13. impt_varA matrix contains the variance of each imputed entry by sampling multiple imputed values at MLE. NULL if mcmc <= 0.

  14. Ef If mcmc >0, output posterior means of factors given observed data (a n by K0 matrix). If mcmc <= 0, output conditional expectation of the factors for each cluster E(f_i|z_i= m) at the last step of EM. A list with length M0, each element in the list is a n by K0 matrix.

  15. Varf If mcmc >0, output posterior variances of factors given observed data (a n by K0 matrix). If mcmc <= 0, output conditional covariance matrix of factors for each cluster Var(f_i|z_i = m) at the last step of EM. A list with length M0, each element in the list is a K0 by K0 matrix.

  16. consensus_clusterScore for the clustering stability of each cell by multiple imputations. NULL if mcmc <=0

Author(s)

Zhirui Hu, zhiruihu@g.harvard.edu

Songpeng Zu, songpengzu@g.harvard.edu

See Also

SIMPLE

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
library(foreach) 
library(doParallel) 
library(SIMPLE) 

M0 = 3  # simulate number of clusters 
n = 300 # number of cells 

simu_data = simulation_bulk(n=300, S0 = 20, K = 6, MC=M0, block_size = 32, indepG = 1000 - 32*6, verbose=F, overlap=0) 

Y2 = simu_data$Y2 
K0 = 6 # number of factors 

registerDoParallel(cores = 6)  # parallel 

# estimate the parameters, only input the overall mean of all cell types from bulk 
result <- SIMPLE_B(Y2, K0, data.frame(simu_data$bulk), M0, celltype=rep(1, n), clus = NULL, K = 20, p_min = 0.5, min_gene = 200,cutoff=0.01, max_lambda=T) 

# evaluate cluster performance 
celltype_true = simu_data$Z 
mclust::adjustedRandIndex(apply(result$z,1, which.max), celltype_true) 
# or redo clustering based on imputed values (sometimes work better for real data) 
getCluster(result$impt, celltype_true, Ks = 20, M0 = M0)[[1]] 

JunLiuLab/SIMPLEs documentation built on March 18, 2021, 3:10 a.m.