vst: Variance stabilizing transformation for UMI count data
In sctransform: Variance Stabilizing Transformations for Single Cell UMI Data

View source: R/vst.R

vst	R Documentation

Variance stabilizing transformation for UMI count data

Description

Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This will remove unwanted effects from UMI data and return Pearson residuals. Uses future_lapply; you can set the number of cores it will use to n with plan(strategy = "multicore", workers = n). If n_genes is set, only a (somewhat-random) subset of genes is used for estimating the initial model parameters. For details see \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1186/s13059-019-1874-1")}.

Usage

vst(
  umi,
  cell_attr = NULL,
  latent_var = c("log_umi"),
  batch_var = NULL,
  latent_var_nonreg = NULL,
  n_genes = 2000,
  n_cells = NULL,
  method = "poisson",
  do_regularize = TRUE,
  theta_regularization = "od_factor",
  res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
  bin_size = 500,
  min_cells = 5,
  residual_type = "pearson",
  return_cell_attr = FALSE,
  return_gene_attr = TRUE,
  return_corrected_umi = FALSE,
  min_variance = -Inf,
  bw_adjust = 3,
  gmean_eps = 1,
  theta_estimation_fun = "theta.ml",
  theta_given = NULL,
  exclude_poisson = FALSE,
  use_geometric_mean = TRUE,
  use_geometric_mean_offset = FALSE,
  fix_intercept = FALSE,
  fix_slope = FALSE,
  scale_factor = NA,
  vst.flavor = NULL,
  verbosity = 2,
  verbose = NULL,
  show_progress = NULL
)

Arguments

`umi`	A matrix of UMI counts with genes as rows and cells as columns
`cell_attr`	A data frame containing the dependent variables; if omitted a data frame with umi and gene will be generated
`latent_var`	The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi")
`batch_var`	The dependent variables indicating which batch a cell belongs to; no batch interaction terms used if omiited
`latent_var_nonreg`	The non-regularized dependent variables to regress out as a character vector; must match column names in cell_attr; default is NULL
`n_genes`	Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes)
`n_cells`	Number of cells to use when estimating parameters (default uses all cells)
`method`	Method to use for initial parameter estimation; one of 'poisson', 'qpoisson', 'nb_fast', 'nb', 'nb_theta_given', 'glmGamPoi', 'offset', 'offset_shared_theta_estimate', 'glmGamPoi_offset'; default is 'poisson'
`do_regularize`	Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in first step (ignoring n_genes); default is FALSE
`theta_regularization`	Method to use to regularize theta; use 'log_theta' for the behavior prior to version 0.3; default is 'od_factor'
`res_clip_range`	Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi)))
`bin_size`	Number of genes to process simultaneously; this will determine how often the progress bars are updated and how much memory is being used; default is 500
`min_cells`	Only use genes that have been detected in at least this many cells; default is 5
`residual_type`	What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson'
`return_cell_attr`	Make cell attributes part of the output; default is FALSE
`return_gene_attr`	Calculate gene attributes and make part of output; default is TRUE
`return_corrected_umi`	If set to TRUE output will contain corrected UMI matrix; see `correct` function
`min_variance`	Lower bound for the estimated variance for any gene in any cell when calculating pearson residual; one of 'umi_median', 'model_median', 'model_mean' or a numeric. default is -Inf. When set to 'umi_median' uses (median of non-zero UMIs / 5)^2 as the minimum variance so that a median UMI (often 1) results in a maximum pearson residual of 5. When set to 'model_median' or 'model_mean' uses the mean/median of the model estimated mu per gene as the minimum_variance.#'
`bw_adjust`	Kernel bandwidth adjustment factor used during regurlarization; factor will be applied to output of bw.SJ; default is 3
`gmean_eps`	Small value added when calculating geometric mean of a gene to avoid log(0); default is 1
`theta_estimation_fun`	Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative
`theta_given`	If method is set to nb_theta_given, this should be a named numeric vector of fixed theta values for the genes; if method is offset, this should be a single value; default is NULL
`exclude_poisson`	Exclude poisson genes (i.e. mu < 0.001 or mu > variance) from regularization; default is FALSE
`use_geometric_mean`	Use geometric mean instead of arithmetic mean for all calculations ; default is TRUE
`use_geometric_mean_offset`	Use geometric mean instead of arithmetic mean in the offset model; default is FALSE
`fix_intercept`	Fix intercept as defined in the offset model; default is FALSE
`fix_slope`	Fix slope to log(10) (equivalent to using library size as an offset); default is FALSE
`scale_factor`	Replace all values of UMI in the regression model by this value instead of the median UMI; default is NA
`vst.flavor`	When set to 'v2' sets method = glmGamPoi_offset, n_cells=2000, and exclude_poisson = TRUE which causes the model to learn theta and intercept only besides excluding poisson genes from learning and regularization; default is NULL which uses the original sctransform model
`verbosity`	An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2
`verbose`	Deprecated; use verbosity instead
`show_progress`	Deprecated; use verbosity instead

Value

A list with components

`y`	Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if `residual_type = 'none'`
`umi_corrected`	Matrix of corrected UMI counts (optional)
`model_str`	Character representation of the model formula
`model_pars`	Matrix of estimated model parameters per gene (theta and regression coefficients)
`model_pars_outliers`	Vector indicating whether a gene was considered to be an outlier
`model_pars_fit`	Matrix of fitted / regularized model parameters
`model_str_nonreg`	Character representation of model for non-regularized variables
`model_pars_nonreg`	Model parameters for non-regularized variables
`genes_log_gmean_step1`	log-geometric mean of genes used in initial step of parameter estimation
`cells_step1`	Cells used in initial step of parameter estimation
`arguments`	List of function call arguments
`cell_attr`	Data frame of cell meta data (optional)
`gene_attr`	Data frame with gene attributes such as mean, detection rate, etc. (optional)
`times`	Time stamps at various points in the function

Details

In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done on a subset of genes and/or cells to speed things up. If method is set to 'poisson', a poisson regression is done and the negative binomial theta parameter is estimated using the response residuals in theta_estimation_fun. If method is set to 'qpoisson', coefficients and overdispersion (phi) are estimated by quasi poisson regression and theta is estimated based on phi and the mean fitted value - this is currently the fastest method with results very similar to 'glmGamPoi' If method is set to 'nb_fast', coefficients and theta are estimated as in the 'poisson' method, but coefficients are then re-estimated using a proper negative binomial model in a second call to glm with family = MASS::negative.binomial(theta = theta). If method is set to 'nb', coefficients and theta are estimated by a single call to MASS::glm.nb. If method is set to 'glmGamPoi', coefficients and theta are estimated by a single call to glmGamPoi::glm_gp.

A special case is method = 'offset'. Here no regression parameters are learned, but instead an offset model is assumed. The latent variable is set to log_umi and a fixed slope of log(10) is used (offset). The intercept is given by log(gene_mean) - log(avg_cell_umi). See Lause et al. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1186/s13059-021-02451-7")} for details. Theta is set to 100 by default, but can be changed using the theta_given parameter (single numeric value). If the offset method is used, the following parameters are overwritten: cell_attr <- NULL, latent_var <- c('log_umi'), batch_var <- NULL, latent_var_nonreg <- NULL, n_genes <- NULL, n_cells <- NULL, do_regularize <- FALSE. Further, method = 'offset_shared_theta_estimate' exists where the 250 most highly expressed genes with detection rate of at least 0.5 are used to estimate a theta that is then shared across all genes. Thetas are estimated per individual gene using 5000 randomly selected cells. The final theta used for all genes is then the average.