selectKM: Wrapper of SIMPLE or SIMPLE-B to select optimal K and M based...

Description Usage Arguments Value Author(s) See Also Examples

View source: R/select_KM.R

Description

Wrapper of SIMPLE or SIMPLE-B to select optimal K and M based on BIC

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
selectKM(
  dat,
  bulk = NULL,
  celltype = NULL,
  b = 1,
  K0 = 10,
  M0 = 1,
  iter = 10,
  est_lam = 1,
  impt_it = 5,
  penl = 1,
  sigma0 = 100,
  pi_alpha = 1,
  beta = NULL,
  verbose = F,
  max_lambda = T,
  lambda = NULL,
  sigma = NULL,
  mu = NULL,
  est_z = 1,
  clus = NULL,
  p_min = 0.4,
  cutoff = 0.1,
  K = 20,
  min_gene = 300,
  num_mc = 3,
  fix_num = F,
  mcmc = 50,
  burnin = 2,
  rel = 0.01
)

Arguments

dat

scRNASeq data matrix. Each row is a gene, each column is a cell.

bulk

Bulk RNASeq data matrix. Should be log(1 + tpm/fpkm/rpkm). Each row is a gene which must be ordered the same as the scRNASeq data. Each column is a cell type which must be ordered as the cell type label in celltype. Default: NULL

celltype

A numeric vector for labels of cells in the scRNASeq. Each cell type has corresponding mean expression in the bulk RNASeq data. The labels must start from 1 to the number of types. If NULL, all cells are treated as a single cell type and the input bulk RNASeq should also have one column that is the mean expression over all the cell types. Default: NULL

b

The scaling factor between scRNASeq and bulk RNASeq. If NULL, will do a weighted linear regression between mean of scRNASeq and bulk RNASeq for each cell type. Default = 1. Only relevant for SIMPLE-B.

K0

a vector for number of latent gene modules to be selected. Default is 10. Note that the minimum K0 is 2.

M0

a vector for Number of clusters to be selected. Default is 1. If 1 is not in the sequence of M0, it will add 1 to the sequence; and the imputed matrix when M=1 is used to initialize imputation for other Ms.

iter

The number of EM iterations using full data set. See details.

est_lam

The iteration starts to estimate lambda.

impt_it

The iteration starts to sample new imputed values in initial phase. See details.

penl

L1 penalty for the factor loadings.

sigma0

The variance of the prior distribution of μ.

pi_alpha

The hyperparameter of the prior distribution of π. See details.

beta

A G by K0 matrix. Initial values for factor loadings (B). If null, beta will be initialized from normal distribution with mean zero and variance M0/K0. See details.

verbose

Whether to show some intermediate results. Default = False.

max_lambda

Whether to maximize over lambda. Default is True.

lambda

A M0 by K0 matrix. Initial values for the variances of factors. Each column is for a cell cluster. If null, lambda will initialize to be 1/M0. See details.

sigma

A G by M0 matrix. Initial values for the variance of idiosyncratic noises. Each column is for a cell cluster. If null, sigma will initialize to be 1. See details.

mu

A G by M0 matrix. Initial values for the gene expression mean of each cluster. Each column is for a cell cluster. If NULL, it will take the sample mean of cells weighted by the probability in each cluster. See details.

est_z

The iteration starts to update Z.

clus

Initial clustering of scRNASeq data. If NULL, the function will use PCA and Kmeans to do clustering initially.

p_min

Initialize parameters using genes expressed in at least p_min proportion of cells. If the number genes selected is less than min_gene, select min_gene genes with higest proportion of non zeros. Default = 0.4.

cutoff

The value below cutoff is treated as no expression. Default = 0.1.

K

The number of PCs used in the initial clustering. Default is 20.

min_gene

Minimal number of genes used in the initial phase. See details.

num_mc

The number of Gibbs steps for generating new imputed data after the parameters have been updated during Monte Carlo EM. Default = 3.

fix_num

If true, always use min_gene number of genes with the highest proportion of non zeros in the initial phase. Default = F. See details.

mcmc

The number of Gibbs steps to sample imputed data after EM. Default = 50.

burnin

The number of burnin steps before sample imputed data after EM. Default = 2.

rel

when increasing K, the algorithm will stop when the relative decrease of BIC is less than rel; but it will enumerate all given Ms. Default: 0.01.

Value

selectKM returns a matrix of BIC for all Ks and Ms tests, mK, the best K; mM, the best M; result, list for the result with smallest BIC in the following order.

  1. loglik The log-likelihood of each MCMC sample of imputed gene expressionafter EM. NULL if mcmc <= 0

  2. loglik_tot The log-likelihood of the full imputed gene expression at each iteration and the prior of B matrix.

  3. BIC BIC which is -2 *loglik_tot + penalty on the number of parameters. Can be used to select paramters.

  4. pi The prior probabilites of cells belong to each cluster.

  5. mu Mean expression for each gene in each cluster

  6. sigma Variances of idiosyncratic noises for each gene in each cluster.

  7. beta Factor loadings.

  8. lambda Variances of factors for each cluster.

  9. z The posterior probability of each cell belonging to each cluster.

  10. Yimp0 A matrix contains the expectation of gene expression specified by the model.

  11. pg A G by M0 matrix, dropout rate for each gene in each cluster estimated from initial clustering.

  12. initclus Output initial cluster results.

  13. impt A matrix contains the mean of each imputed entry by sampling multiple imputed values while the parameters are MLE. If mcmc <= 0, output the imputed expressoin matrix at last step of EM

  14. impt_var A matrix contains the variance of each imputed entry by sampling multiple imputed values while the parameters are MLE. NULL if mcmc <= 0.

  15. Ef If mcmc >0, output posterior means of factors given observed data (a n by K0 matrix). If mcmc <= 0, output conditional expectation of the factors for each cluster E(f_i|z_i= m) at the last step of EM. A list with length M0, each element in the list is a n by K0 matrix.

  16. Varf If mcmc >0, output posterior variances of factors given observed data (a n by K0 matrix). If mcmc <= 0, output conditional covariance matrix of factors for each cluster Var(f_i|z_i = m) at the last step of EM. A list with length M0, each element in the list is a K0 by K0 matrix.

  17. consensus_cluster Score for the clustering stability of each cell by multiple imputations. NULL if mcmc <=0.

Author(s)

Zhirui Hu, zhiruihu@g.harvard.edu

Songpeng Zu, songpengzu@g.harvard.edu

See Also

SIMPLE SIMPLE_B

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
library(foreach)
library(doParallel)
library(SIMPLEs)

# simulate number of clusters
M0 <- 3
# number of cells
n <- 300
# simulation_bulk and getCluster is defined in the util.R under the util directory of the corresponding github repository.
source("utils/utils.R")
simu_data <- simulation_bulk(n = 300, S0 = 20, K = 6, MC = M0, block_size = 32, indepG = 1000 - 32 * 6, verbose = F, overlap = 0)
Y2 <- simu_data$Y2
# number of factors
K <- c(6,10)
M <- c(1, 3)
# parallel
registerDoParallel(cores = 6)
# estimate the parameters and sample imputed values
results <- selectKM(Y2, K0=K, M0=M, clus = NULL, K = 20, p_min = 0.5, max_lambda = T, min_gene = 200, cutoff = 0.01)
print(sprintf("best M and K: %d, %d", results$mM, results$mK))
result = results$result
# evaluate cluster performance
celltype_true <- simu_data$Z
mclust::adjustedRandIndex(apply(result$z, 1, which.max), celltype_true)
# or redo clustering based on imputed values (sometimes work better for real data)
getCluster(result$impt, celltype_true, Ks = 20, M0 = M0)[[1]]

JunLiuLab/SIMPLEs documentation built on March 18, 2021, 3:10 a.m.