clone_id: Infer clonal identity of single cells

View source: R/clone_id.R

clone_id R Documentation

Infer clonal identity of single cells

Description

Infer clonal identity of single cells

Assign cells to clones using an EM algorithm

Assign cells to clones using a Gibbs sampling algorithm

Usage

clone_id(
  A,
  D,
  Config = NULL,
  n_clone = NULL,
  Psi = NULL,
  relax_Config = TRUE,
  relax_rate_fixed = NULL,
  inference = "sampling",
  n_chain = 1,
  n_proc = 1,
  verbose = TRUE,
  ...
)

clone_id_EM(
  A,
  D,
  Config,
  Psi = NULL,
  min_iter = 10,
  max_iter = 1000,
  logLik_threshold = 1e-05,
  verbose = TRUE
)

clone_id_Gibbs(
  A,
  D,
  Config,
  Psi = NULL,
  relax_Config = TRUE,
  relax_rate_fixed = NULL,
  relax_rate_prior = c(1, 9),
  keep_base_clone = TRUE,
  prior0 = c(0.2, 99.8),
  prior1 = c(0.45, 0.55),
  min_iter = 5000,
  max_iter = 20000,
  buin_frac = 0.5,
  wise = "variant",
  relabel = FALSE,
  verbose = TRUE
)

Arguments

A

variant x cell matrix of integers; number of alternative allele reads in variant i cell j

D

variant x cell matrix of integers; number of total reads covering variant i cell j

Config

variant x clone matrix of binary values. The clone-variant configuration, which encodes the phylogenetic tree structure. This is the output Z of Canopy

n_clone

integer(1), the number of clones to reconstruct. This is used only if Config is NULL

Psi

A vector of floats. The fraction of each clone; this is the output P of Canopy

relax_Config

logical(1), if TRUE, relax the clone Configuration: instead of being held at fixed values, the input Config acts as a prior that can be updated at a relax rate.

relax_rate_fixed

numeric(1), if the value is between 0 and 1, the relax rate is held at this fixed value while updating the clone Config. If NULL, the relax rate is learned automatically using relax_rate_prior.

inference

character(1), the method to use for inference, either "sampling" to use Gibbs sampling (default) or "EM" to use expectation-maximization (faster)

n_chain

integer(1), the number of chains to run; the output result is the average over chains

n_proc

integer(1), the number of processors to use. Running in parallel can substantially reduce run time when using multiple chains

verbose

logical(1), should the function output verbose information as it runs?

...

arguments passed to clone_id_Gibbs or clone_id_EM (as appropriate)

min_iter

An integer. The minimum number of iterations in the Gibbs sampling. Sampling may run for longer, until convergence is reached.

max_iter

An integer. The maximum number of iterations in the Gibbs sampling; sampling stops at this point even if the convergence diagnostic has not been passed

logLik_threshold

A float. The threshold on the increase in log-likelihood used to detect convergence.

relax_rate_prior

numeric(2), the two parameters of the Beta prior distribution on the relax rate used when relaxing the clone Configuration. This mode is used when relax_rate_fixed is NULL.

keep_base_clone

bool(1), if TRUE, keep the base clone of Config to its input values when relax mode is used.

prior0

numeric(2), alpha and beta parameters for the Beta prior distribution on the inferred false positive rate.

prior1

numeric(2), alpha and beta parameters for the Beta prior distribution on the inferred (1 - false negative) rate.

buin_frac

numeric(1), the fraction of each chain treated as the burn-in period

wise

A string, the level at which theta1 is parameterized: "global", "variant", or "element".

relabel

bool(1), if TRUE, relabel the samples of both Config and prob during the Gibbs sampling.
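
To get a feel for what the default priors above imply, the mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta). The sketch below applies that formula to the default values shown under Usage. The helper beta_mean is hypothetical and not part of cardelino.

```r
# Mean of a Beta(alpha, beta) distribution: alpha / (alpha + beta).
# beta_mean is a hypothetical helper, not a cardelino function.
beta_mean <- function(shape) shape[1] / sum(shape)

prior0 <- c(0.2, 99.8)        # default prior on the false positive rate
prior1 <- c(0.45, 0.55)       # default prior on (1 - false negative rate)
relax_rate_prior <- c(1, 9)   # default prior on the relax rate

prior_means <- c(
  prior0 = beta_mean(prior0),
  prior1 = beta_mean(prior1),
  relax_rate_prior = beta_mean(relax_rate_prior)
)
print(prior_means)
```

The sum of the two shape parameters can be read as a pseudo-count: a larger sum gives a more concentrated prior, so the default prior0 (sum 100) is much more informative than the default prior1 (sum 1).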

Details

The two Bernoulli components correspond to false positive and false negative rates. The two binomial components correspond to the read distributions with and without the mutation present.
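
The binomial version of this model can be illustrated with a minimal sketch. It is not the package's implementation: the toy matrices, the rates in theta, and the uniform Psi are all assumed values chosen for illustration. For each cell and clone it sums the binomial log-likelihood over variants, using one rate where Config says the variant is absent and another where it is present, then normalises across clones.

```r
# Toy inputs (hypothetical values): 3 variants x 2 cells, 2 clones
A <- matrix(c(0, 4,
              0, 5,
              0, 3), nrow = 3, byrow = TRUE)   # alternative allele reads
D <- matrix(c(10, 10,
               8, 10,
               6,  9), nrow = 3, byrow = TRUE) # total reads
Config <- matrix(c(0, 1,
                   0, 1,
                   0, 1), nrow = 3, byrow = TRUE) # clone 1 carries no variants

theta <- c(0.01, 0.45)  # assumed binomial rates: variant absent, variant present
Psi <- c(0.5, 0.5)      # assumed clone fractions, used here as a prior

n_cell <- ncol(A)
n_clone <- ncol(Config)
log_post <- matrix(0, n_cell, n_clone)
for (j in seq_len(n_cell)) {
  for (k in seq_len(n_clone)) {
    p <- theta[Config[, k] + 1]  # per-variant rate implied by clone k
    log_post[j, k] <- log(Psi[k]) +
      sum(dbinom(A[, j], D[, j], p, log = TRUE))
  }
}

# Normalise per cell; subtracting the row maximum avoids underflow
prob <- exp(log_post - apply(log_post, 1, max))
prob <- prob / rowSums(prob)
print(round(prob, 4))
```

In this toy case the cell with no alternative reads is assigned almost entirely to the clone without variants, and the cell with many alternative reads to the clone that carries them.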

Value

If inference method is "EM", a list containing theta, a vector of two floats denoting the parameters of the two components of the base model, i.e., mean of Bernoulli or binomial model given variant exists or not, prob, the matrix of posterior probabilities of each cell belonging to each clone with fitted parameters, and logLik, the log likelihood of the final parameters.

If inference method is "sampling", a list containing: theta0, the mean of sampled false positive parameter values; theta1, the mean of sampled (1 - false negative rate) parameter values; theta0_all, all sampled false positive parameter values; theta1_all, all sampled (1 - false negative rate) parameter values; element; logLik_all, log-likelihood for model for all sampled parameter sets; prob_all; prob, matrix with mean of sampled cell-clone assignment posterior probabilities (the key output of the model); prob_variant.
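
The relationship between a vector of sampled values and its reported mean can be sketched as follows. This is an illustration only, assuming that the burn-in fraction of the chain is dropped before averaging. The draws are simulated stand-ins, not output from clone_id, and the variable names simply mirror the argument and output names above.

```r
set.seed(1)
n_iter <- 1000
buin_frac <- 0.5   # same spelling as the clone_id_Gibbs argument

# Stand-in for sampled false positive rates (simulated, not real output)
theta0_all <- rbeta(n_iter, 2, 98)

# Keep only the iterations after the burn-in period
keep <- seq(floor(n_iter * buin_frac) + 1, n_iter)
theta0 <- mean(theta0_all[keep])
print(theta0)
```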

a list containing theta, a vector of two floats denoting the binomial rates given variant exists or not, prob, the matrix of posterior probabilities of each cell belonging to each clone with fitted parameters, and logLik, the log likelihood of the final parameters.
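
Since prob is a cell x clone matrix of posterior probabilities, a hard assignment can be derived from it with base R. The sketch below uses a hypothetical prob matrix and an assumed cutoff of 0.5; cells whose best clone does not exceed the cutoff are left unassigned.

```r
# Hypothetical posterior matrix: 4 cells x 3 clones, each row sums to 1
prob <- matrix(c(0.90, 0.05, 0.05,
                 0.10, 0.85, 0.05,
                 0.40, 0.35, 0.25,
                 0.02, 0.03, 0.95),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("cell", 1:4), paste0("clone", 1:3)))

threshold <- 0.5  # assumed cutoff for calling an assignment

best <- max.col(prob, ties.method = "first")         # most probable clone per cell
best_prob <- prob[cbind(seq_len(nrow(prob)), best)]  # its posterior probability
assigned <- ifelse(best_prob > threshold, colnames(prob)[best], "unassigned")
names(assigned) <- rownames(prob)
print(assigned)
```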

Author(s)

Yuanhua Huang and Davis McCarthy

Yuanhua Huang

Examples

data(example_donor)
assignments <- clone_id(A_clone, D_clone,
    Config = tree$Z,
    min_iter = 800, max_iter = 1200
)
prob_heatmap(assignments$prob)

assignments_EM <- clone_id(A_clone, D_clone,
    Config = tree$Z,
    inference = "EM"
)
prob_heatmap(assignments_EM$prob)
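
When running these examples on other data, it may help to first check that the inputs have the shapes described under Arguments. The following is a minimal sketch: check_clone_inputs is a hypothetical helper, not part of cardelino, and the toy matrices are made-up values.

```r
# Hypothetical helper: checks the shapes described under Arguments
check_clone_inputs <- function(A, D, Config) {
  stopifnot(
    is.matrix(A), is.matrix(D), is.matrix(Config),
    identical(dim(A), dim(D)),   # both are variant x cell
    nrow(Config) == nrow(A),     # one Config row per variant
    all(Config %in% c(0, 1)),    # binary clone-variant configuration
    all(A <= D, na.rm = TRUE)    # alternative reads cannot exceed total reads
  )
  invisible(TRUE)
}

# Toy inputs (made-up values): 3 variants, 2 cells, 2 clones
A_toy <- matrix(c(0, 4, 0, 5, 0, 3), nrow = 3, byrow = TRUE)
D_toy <- matrix(c(10, 10, 8, 10, 6, 9), nrow = 3, byrow = TRUE)
Config_toy <- matrix(c(0, 1, 0, 1, 0, 1), nrow = 3, byrow = TRUE)

ok <- check_clone_inputs(A_toy, D_toy, Config_toy)

# A Config with the wrong number of variants is rejected
bad <- tryCatch(
  check_clone_inputs(A_toy, D_toy, Config_toy[1:2, , drop = FALSE]),
  error = function(e) FALSE
)
```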

PMBio/cardelino documentation built on Nov. 21, 2022, 4:52 a.m.