vst: Variance stabilizing transformation for UMI count data

View source: R/vst.R

vstR Documentation

Variance stabilizing transformation for UMI count data


Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This will remove unwanted effects from UMI data and return Pearson residuals. Uses future_lapply; you can set the number of cores it will use to n with plan(strategy = "multicore", workers = n). If n_genes is set, only a (somewhat-random) subset of genes is used for estimating the initial model parameters. For details see \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1186/s13059-019-1874-1")}.


  cell_attr = NULL,
  latent_var = c("log_umi"),
  batch_var = NULL,
  latent_var_nonreg = NULL,
  n_genes = 2000,
  n_cells = NULL,
  method = "poisson",
  do_regularize = TRUE,
  theta_regularization = "od_factor",
  res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
  bin_size = 500,
  min_cells = 5,
  residual_type = "pearson",
  return_cell_attr = FALSE,
  return_gene_attr = TRUE,
  return_corrected_umi = FALSE,
  min_variance = -Inf,
  bw_adjust = 3,
  gmean_eps = 1,
  theta_estimation_fun = "theta.ml",
  theta_given = NULL,
  exclude_poisson = FALSE,
  use_geometric_mean = TRUE,
  use_geometric_mean_offset = FALSE,
  fix_intercept = FALSE,
  fix_slope = FALSE,
  scale_factor = NA,
  vst.flavor = NULL,
  verbosity = 2,
  verbose = NULL,
  show_progress = NULL



A matrix of UMI counts with genes as rows and cells as columns


A data frame containing the dependent variables; if omitted a data frame with umi and gene will be generated


The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi")


The dependent variables indicating which batch a cell belongs to; no batch interaction terms used if omiited


The non-regularized dependent variables to regress out as a character vector; must match column names in cell_attr; default is NULL


Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes)


Number of cells to use when estimating parameters (default uses all cells)


Method to use for initial parameter estimation; one of 'poisson', 'qpoisson', 'nb_fast', 'nb', 'nb_theta_given', 'glmGamPoi', 'offset', 'offset_shared_theta_estimate', 'glmGamPoi_offset'; default is 'poisson'


Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in first step (ignoring n_genes); default is FALSE


Method to use to regularize theta; use 'log_theta' for the behavior prior to version 0.3; default is 'od_factor'


Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi)))


Number of genes to process simultaneously; this will determine how often the progress bars are updated and how much memory is being used; default is 500


Only use genes that have been detected in at least this many cells; default is 5


What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson'


Make cell attributes part of the output; default is FALSE


Calculate gene attributes and make part of output; default is TRUE


If set to TRUE output will contain corrected UMI matrix; see correct function


Lower bound for the estimated variance for any gene in any cell when calculating pearson residual; one of 'umi_median', 'model_median', 'model_mean' or a numeric. default is -Inf. When set to 'umi_median' uses (median of non-zero UMIs / 5)^2 as the minimum variance so that a median UMI (often 1) results in a maximum pearson residual of 5. When set to 'model_median' or 'model_mean' uses the mean/median of the model estimated mu per gene as the minimum_variance.#'


Kernel bandwidth adjustment factor used during regurlarization; factor will be applied to output of bw.SJ; default is 3


Small value added when calculating geometric mean of a gene to avoid log(0); default is 1


Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative


If method is set to nb_theta_given, this should be a named numeric vector of fixed theta values for the genes; if method is offset, this should be a single value; default is NULL


Exclude poisson genes (i.e. mu < 0.001 or mu > variance) from regularization; default is FALSE


Use geometric mean instead of arithmetic mean for all calculations ; default is TRUE


Use geometric mean instead of arithmetic mean in the offset model; default is FALSE


Fix intercept as defined in the offset model; default is FALSE


Fix slope to log(10) (equivalent to using library size as an offset); default is FALSE


Replace all values of UMI in the regression model by this value instead of the median UMI; default is NA


When set to 'v2' sets method = glmGamPoi_offset, n_cells=2000, and exclude_poisson = TRUE which causes the model to learn theta and intercept only besides excluding poisson genes from learning and regularization; default is NULL which uses the original sctransform model


An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2


Deprecated; use verbosity instead


Deprecated; use verbosity instead


A list with components


Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if residual_type = 'none'


Matrix of corrected UMI counts (optional)


Character representation of the model formula


Matrix of estimated model parameters per gene (theta and regression coefficients)


Vector indicating whether a gene was considered to be an outlier


Matrix of fitted / regularized model parameters


Character representation of model for non-regularized variables


Model parameters for non-regularized variables


log-geometric mean of genes used in initial step of parameter estimation


Cells used in initial step of parameter estimation


List of function call arguments


Data frame of cell meta data (optional)


Data frame with gene attributes such as mean, detection rate, etc. (optional)


Time stamps at various points in the function


In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done on a subset of genes and/or cells to speed things up. If method is set to 'poisson', a poisson regression is done and the negative binomial theta parameter is estimated using the response residuals in theta_estimation_fun. If method is set to 'qpoisson', coefficients and overdispersion (phi) are estimated by quasi poisson regression and theta is estimated based on phi and the mean fitted value - this is currently the fastest method with results very similar to 'glmGamPoi' If method is set to 'nb_fast', coefficients and theta are estimated as in the 'poisson' method, but coefficients are then re-estimated using a proper negative binomial model in a second call to glm with family = MASS::negative.binomial(theta = theta). If method is set to 'nb', coefficients and theta are estimated by a single call to MASS::glm.nb. If method is set to 'glmGamPoi', coefficients and theta are estimated by a single call to glmGamPoi::glm_gp.

A special case is method = 'offset'. Here no regression parameters are learned, but instead an offset model is assumed. The latent variable is set to log_umi and a fixed slope of log(10) is used (offset). The intercept is given by log(gene_mean) - log(avg_cell_umi). See Lause et al. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1186/s13059-021-02451-7")} for details. Theta is set to 100 by default, but can be changed using the theta_given parameter (single numeric value). If the offset method is used, the following parameters are overwritten: cell_attr <- NULL, latent_var <- c('log_umi'), batch_var <- NULL, latent_var_nonreg <- NULL, n_genes <- NULL, n_cells <- NULL, do_regularize <- FALSE. Further, method = 'offset_shared_theta_estimate' exists where the 250 most highly expressed genes with detection rate of at least 0.5 are used to estimate a theta that is then shared across all genes. Thetas are estimated per individual gene using 5000 randomly selected cells. The final theta used for all genes is then the average.


vst_out <- vst(pbmc)

sctransform documentation built on Oct. 19, 2023, 9:08 a.m.