EM_impute: Monte Carlo EM algorithm for imputation and clustering

View source: R/EM_impute.R

Description

Monte Carlo EM algorithm to sample the imputed values, cluster the cells and learn the correlation structure of genes in each cluster.

Usage

EM_impute(Y, Y0, pg, M0, K0, cutoff, iter, beta, sigma, lambda, pi, z,
  mu = NULL, celltype = NULL, penl = 1, est_z = 2,
  max_lambda = T, est_lam = 2, impt_it = 5, sigma0 = 100,
  pi_alpha = 1, verbose = F, num_mc = 3, lower = -Inf,
  upper = Inf)

Arguments

Y

An initial imputed gene expression matrix.

Y0

Original scRNASeq data matrix.

pg

A matrix of dropout rates for each cell type. Each row is a gene and each column holds the dropout rates of one cell type. The columns should follow the order of the cell type labels in clus.

M0

Number of clusters.

K0

Number of latent gene modules.

cutoff

Values below cutoff are treated as no expression.

iter

Number of EM steps.

beta

A G by K0 matrix. Initial values for factor loadings (B). See details.

sigma

A G by M0 matrix. Initial values for the variance of idiosyncratic noises. Each column is for a cell cluster. See details.

lambda

An M0 by K0 matrix. Initial values for the variances of factors. Each column is for a cell cluster. See details.

pi

A vector of initial probabilities that a cell belongs to each cluster.

z

An n by M0 matrix of the probability that each cell belongs to each cluster. Can be initialized as the one-hot encoding of the cluster membership of cells. If NULL, z will be updated in the first iteration.
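As a minimal sketch of the one-hot initialization described for z, assume a hypothetical vector `clus_init` of initial cluster labels in 1..M0 (for example from k-means). The names `clus_init`, `z_init` and `pi_init` are illustrative and not part of the package, and using cluster proportions as the starting value for pi is one simple choice rather than a requirement.

```r
# Hypothetical initial cluster labels for n = 6 cells and M0 = 3 clusters.
clus_init <- c(1, 3, 2, 2, 1, 3)
M0 <- 3
n <- length(clus_init)

# One-hot encoding: row i has a 1 in column clus_init[i] and 0 elsewhere.
z_init <- matrix(0, nrow = n, ncol = M0)
z_init[cbind(seq_len(n), clus_init)] <- 1

# One possible starting value for pi (an assumption): the share of cells per cluster.
pi_init <- colMeans(z_init)
```

Each row of `z_init` sums to 1, so it can be read as a degenerate membership probability for that cell.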

mu

A G by M0 matrix. Initial values for the gene expression mean of each cluster. Each column is for a cell cluster. If NULL, it will take the sample mean of cells weighted by the probability in each cluster. See details.
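The weighted sample mean mentioned for mu can be sketched as follows. This assumes the expression matrix is stored genes by cells (G by n) and uses toy data; it illustrates the formula only and is not taken from the package source.

```r
set.seed(1)
G <- 4; n <- 6; M0 <- 2

# Toy genes-by-cells expression matrix (assumed orientation).
Y <- matrix(rnorm(G * n), nrow = G, ncol = n)

# Membership probabilities: the first three cells in cluster 1, the rest in cluster 2.
z <- matrix(c(1, 0,
              1, 0,
              1, 0,
              0, 1,
              0, 1,
              0, 1), nrow = n, ncol = M0, byrow = TRUE)

# Column m of mu_init is sum_i z[i, m] * Y[, i] / sum_i z[i, m].
mu_init <- sweep(Y %*% z, 2, colSums(z), "/")
```

With one-hot z this reduces to the plain per-cluster mean of each gene; with soft probabilities every cell contributes to every cluster in proportion to its weight.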

celltype

A numeric vector of cell type labels for the cells in the scRNASeq data. Each cell type has its own dropout rate. If bulk RNASeq data are provided, each cell type has a corresponding mean expression in the bulk RNASeq data. The labels must be integers from 1 to the number of cell types. If NULL, all cells are treated as a single cell type.

penl

L1 penalty for the factor loadings.

est_z

The iteration at which updating z starts.

max_lambda

Whether to maximize over lambda.

est_lam

The iteration at which estimating lambda starts.

impt_it

The iteration at which sampling new imputed values starts.

sigma0

The variance of the prior distribution of μ.

pi_alpha

The hyperparameter of the prior distribution of π. See details.

verbose

Whether to show some intermediate results. Default is FALSE.

Details

Suppose there are G genes and n cells. Within each cell cluster m, the gene expression follows Y | Z = m ~ MVN(μ_m, B Λ_m B^T + Σ_m), where B is a G by K0 matrix of factor loadings, Σ_m is a G by G diagonal matrix whose diagonal entries are specified by sigma, and Λ_m is a K0 by K0 diagonal matrix whose diagonal entries are specified by lambda. The cluster membership satisfies P(Z = m) = π_m, where π ~ Dir(α). The overall mean of each gene is removed before running the algorithm, and all parameters are estimated from the mean-centered gene expression matrix. The overall mean is returned as geneM.
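To make the model concrete, the sketch below simulates data from it in base R using the equivalent latent factor form Y_i = μ_m + B f_i + e_i, with f_i ~ N(0, Λ_m) and e_i ~ N(0, Σ_m). All values are toy numbers. The layout of `lambda` (row m holds the factor variances of cluster m) is an assumption made for this example, and π is fixed here rather than drawn from its Dirichlet prior.

```r
set.seed(42)
G <- 5; K0 <- 2; M0 <- 2; n <- 200

B      <- matrix(rnorm(G * K0), G, K0)            # factor loadings, G by K0
mu     <- matrix(rnorm(G * M0), G, M0)            # cluster means, one column per cluster
sigma  <- matrix(runif(G * M0, 0.5, 1), G, M0)    # idiosyncratic variances, one column per cluster
lambda <- matrix(runif(M0 * K0, 0.5, 2), M0, K0)  # factor variances, row m for cluster m (assumed layout)
pi_m   <- c(0.3, 0.7)                             # fixed cluster probabilities for the example

# Implied covariance of cluster m: B Lambda_m B^T + Sigma_m.
cov_m <- function(m) {
  B %*% diag(lambda[m, ], K0) %*% t(B) + diag(sigma[, m], G)
}

# Draw a cluster label per cell, then Y_i = mu_m + B f_i + e_i.
z_lab <- sample(M0, n, replace = TRUE, prob = pi_m)
Y_sim <- sapply(z_lab, function(m) {
  f <- rnorm(K0, sd = sqrt(lambda[m, ]))  # latent factors
  e <- rnorm(G,  sd = sqrt(sigma[, m]))   # idiosyncratic noise
  mu[, m] + as.vector(B %*% f) + e
})
```

`Y_sim` is a G by n matrix, and `cov_m(m)` gives the covariance that a sample restricted to cluster m should approach as n grows.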

Value

EM_impute returns a list of results in the following order.

  1. loglik: The log-likelihood of the imputed gene expression at each iteration.

  2. pi: Probabilities that a cell belongs to each cluster.

  3. mu: Mean expression for each cluster.

  4. sigma: Variances of idiosyncratic noises for each cluster.

  5. beta: Factor loadings.

  6. lambda: Variances of factors for each cluster.

  7. z: The probability of each cell belonging to each cluster.

  8. Ef: Conditional expectation of the factors for each cluster, E(f_i|z_i = m). A list of length M0; each element is an n by K0 matrix.

  9. Varf: Conditional covariance of the factors for each cluster, Var(f_i|z_i = m). A list of length M0; each element is a K0 by K0 matrix.

  10. Y: Last sample of the imputed matrix.

  11. geneM: Overall mean of each gene's expression. See details.

  12. geneSd: Equal to 1 for each gene.
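For readers interpreting Ef and Varf, the sketch below applies standard Gaussian conditioning to the factor model in Details for one cell and one cluster. It is a textbook calculation on toy values, with all variable names hypothetical; the package may compute these quantities differently, for example by averaging over Monte Carlo samples of the imputed values.

```r
set.seed(7)
G <- 5; K0 <- 2

B     <- matrix(rnorm(G * K0), G, K0)  # factor loadings
lam_m <- c(1.5, 0.5)                   # diagonal of Lambda_m
sig_m <- runif(G, 0.5, 1)              # diagonal of Sigma_m
mu_m  <- rnorm(G)                      # cluster mean
y_i   <- rnorm(G)                      # expression of one cell

Lambda <- diag(lam_m, K0)
V      <- B %*% Lambda %*% t(B) + diag(sig_m, G)  # marginal covariance of y_i in cluster m
gain   <- Lambda %*% t(B) %*% solve(V)            # K0 by G regression matrix

Ef_i   <- as.vector(gain %*% (y_i - mu_m))  # conditional mean of f_i given y_i and z_i = m
Varf_m <- Lambda - gain %*% B %*% Lambda    # conditional covariance of f_i given z_i = m
```

Under this calculation the conditional covariance does not depend on the cell's expression, which matches Varf holding one K0 by K0 matrix per cluster while Ef holds one row per cell.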

Author(s)

Zhirui Hu, zhiruihu@g.harvard.edu

Songpeng Zu, songpengzu@g.harvard.edu


xyz111131/SIMPLEs documentation built on Jan. 8, 2020, 2:48 a.m.