vst: Variance stabilizing transformation for UMI count data



Description

Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This will remove unwanted effects from UMI data and return Pearson residuals. Uses future_lapply; you can set the number of cores it will use to n with plan(strategy = "multicore", workers = n). If n_genes is set, only a (somewhat-random) subset of genes is used for estimating the initial model parameters. For details see doi: 10.1186/s13059-019-1874-1.

Usage

vst(
  umi,
  cell_attr = NULL,
  latent_var = c("log_umi"),
  batch_var = NULL,
  latent_var_nonreg = NULL,
  n_genes = 2000,
  n_cells = NULL,
  method = "poisson",
  do_regularize = TRUE,
  theta_regularization = "od_factor",
  res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
  bin_size = 500,
  min_cells = 5,
  residual_type = "pearson",
  return_cell_attr = FALSE,
  return_gene_attr = TRUE,
  return_corrected_umi = FALSE,
  min_variance = -Inf,
  bw_adjust = 3,
  gmean_eps = 1,
  theta_estimation_fun = "theta.ml",
  theta_given = NULL,
  verbosity = 2,
  verbose = NULL,
  show_progress = NULL
)

Arguments

umi

A matrix of UMI counts with genes as rows and cells as columns

cell_attr

A data frame containing the independent variables (cell-level covariates); if omitted, a data frame with umi and gene will be generated

latent_var

The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi")

batch_var

The independent variables indicating which batch a cell belongs to; no batch interaction terms are used if omitted

latent_var_nonreg

The non-regularized independent variables to regress out as a character vector; must match column names in cell_attr; default is NULL

n_genes

Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes)

n_cells

Number of cells to use when estimating parameters (default uses all cells)

method

Method to use for initial parameter estimation; one of 'poisson', 'qpoisson', 'nb_fast', 'nb', 'nb_theta_given', 'glmGamPoi', 'offset', 'offset_shared_theta_estimate'; default is 'poisson'

do_regularize

Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in the first step (ignoring n_genes); default is TRUE

theta_regularization

Method to use to regularize theta; use 'log_theta' for the behavior prior to version 0.3; default is 'od_factor'

res_clip_range

Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi)))

bin_size

Number of genes to process simultaneously; this will determine how often the progress bars are updated and how much memory is being used; default is 500

min_cells

Only use genes that have been detected in at least this many cells; default is 5

residual_type

What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson'

return_cell_attr

Make cell attributes part of the output; default is FALSE

return_gene_attr

Calculate gene attributes and make part of output; default is TRUE

return_corrected_umi

If set to TRUE output will contain corrected UMI matrix; see correct function

min_variance

Lower bound for the estimated variance for any gene in any cell when calculating Pearson residuals; default is -Inf

bw_adjust

Kernel bandwidth adjustment factor used during regularization; factor will be applied to output of bw.SJ; default is 3

gmean_eps

Small value added when calculating geometric mean of a gene to avoid log(0); default is 1

theta_estimation_fun

Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative

theta_given

If method is set to nb_theta_given, this should be a named numeric vector of fixed theta values for the genes; if method is offset, this should be a single value; default is NULL

verbosity

An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2

verbose

Deprecated; use verbosity instead

show_progress

Deprecated; use verbosity instead
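
As a rough illustration of two of the numeric arguments above, the following base R sketch shows what clipping to res_clip_range and the gmean_eps offset amount to. The toy data, the helper names clip_to_range and gmean_with_eps, and the exact geometric mean formula (exp(mean(log(x + eps))) - eps) are assumptions made for illustration; this is not code taken from sctransform.

```r
# Toy residual matrix: 3 genes (rows) x 4 cells (columns); values are made up.
res <- matrix(c(-5.0,  0.5,  1.2,  3.0,
                 0.1, -0.3,  9.0, -2.5,
                 1.0,  1.0, -1.0,  0.0),
              nrow = 3, byrow = TRUE)

# Default clip range from the Usage section: +/- sqrt(number of cells).
clip_range <- c(-sqrt(ncol(res)), sqrt(ncol(res)))

# Hypothetical helper: clip values to [range[1], range[2]], keeping dimensions.
clip_to_range <- function(x, range) {
  x[x < range[1]] <- range[1]
  x[x > range[2]] <- range[2]
  x
}
res_clipped <- clip_to_range(res, clip_range)

# Hypothetical helper: geometric mean with a small offset eps to avoid log(0).
# The formula is an assumption about what gmean_eps controls.
gmean_with_eps <- function(x, eps = 1) {
  exp(mean(log(x + eps))) - eps
}
counts <- c(0, 0, 3, 8)
g <- gmean_with_eps(counts, eps = 1)
```

With four cells the default clip range is c(-2, 2), so a wider range has to be passed explicitly when only a few cells are present and large residuals should be kept.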

Value

A list with components

y

Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if residual_type = 'none'

umi_corrected

Matrix of corrected UMI counts (optional)

model_str

Character representation of the model formula

model_pars

Matrix of estimated model parameters per gene (theta and regression coefficients)

model_pars_outliers

Vector indicating whether a gene was considered to be an outlier

model_pars_fit

Matrix of fitted / regularized model parameters

model_str_nonreg

Character representation of model for non-regularized variables

model_pars_nonreg

Model parameters for non-regularized variables

genes_log_gmean_step1

log-geometric mean of genes used in initial step of parameter estimation

cells_step1

Cells used in initial step of parameter estimation

arguments

List of function call arguments

cell_attr

Data frame of cell meta data (optional)

gene_attr

Data frame with gene attributes such as mean, detection rate, etc. (optional)

times

Time stamps at various points in the function

Details

In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done on a subset of genes and/or cells to speed things up. If method is set to 'poisson', a Poisson regression is done and the negative binomial theta parameter is estimated using the response residuals in theta_estimation_fun. If method is set to 'qpoisson', coefficients and overdispersion (phi) are estimated by quasi-Poisson regression and theta is estimated based on phi and the mean fitted value; this is currently the fastest method, with results very similar to 'glmGamPoi'. If method is set to 'nb_fast', coefficients and theta are estimated as in the 'poisson' method, but coefficients are then re-estimated using a proper negative binomial model in a second call to glm with family = MASS::negative.binomial(theta = theta). If method is set to 'nb', coefficients and theta are estimated by a single call to MASS::glm.nb. If method is set to 'glmGamPoi', coefficients and theta are estimated by a single call to glmGamPoi::glm_gp.
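
The 'poisson' variant of this first step can be sketched for a single gene as follows. This is a minimal illustration on simulated data, not the package's internal code: the simulation settings, the variable names, and the choice of limit = 100 for MASS::theta.ml are all assumptions. It fits a Poisson regression on log_umi, estimates theta from the fitted means, and then computes Pearson residuals under the negative binomial variance mu + mu^2 / theta.

```r
library(MASS)  # provides theta.ml

set.seed(42)
n_cells <- 500

# Simulated sequencing depth per cell and its log10 (the default latent variable).
cell_umi <- round(10^rnorm(n_cells, mean = 3.5, sd = 0.3))
log_umi <- log10(cell_umi)

# Simulated counts for one gene: negative binomial with true theta (size) of 5.
mu_true <- exp(-7 + log(10) * log_umi)
y <- rnbinom(n_cells, mu = mu_true, size = 5)

# Poisson regression of the counts on the latent variable.
fit <- glm(y ~ log_umi, family = poisson)
mu_hat <- fit$fitted.values

# Maximum likelihood estimate of theta given the Poisson fitted means.
theta_hat <- as.numeric(theta.ml(y = y, mu = mu_hat, limit = 100))

# Pearson residuals under the negative binomial variance mu + mu^2 / theta.
pearson_res <- (y - mu_hat) / sqrt(mu_hat + mu_hat^2 / theta_hat)
```

In vst the per-gene parameters obtained this way are subsequently regularized across genes (unless do_regularize = FALSE), which this single-gene sketch does not attempt.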

A special case is method = 'offset'. Here no regression parameters are learned, but instead an offset model is assumed. The latent variable is set to log_umi and a fixed slope of log(10) is used (offset). The intercept is given by log(gene_mean) - log(avg_cell_umi). See Lause et al. (bioRxiv 2020.12.01.405886) for details. Theta is set to 100 by default, but can be changed using the theta_given parameter (single numeric value). If the offset method is used, the following parameters are overwritten: cell_attr <- NULL, latent_var <- c('log_umi'), batch_var <- NULL, latent_var_nonreg <- NULL, n_genes <- NULL, n_cells <- NULL, do_regularize <- FALSE. Further, method = 'offset_shared_theta_estimate' exists where the 250 most highly expressed genes with detection rate of at least 0.5 are used to estimate a theta that is then shared across all genes. Thetas are estimated per individual gene using 5000 randomly selected cells. The final theta used for all genes is then the average.
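
The offset model described above can be written out in a few lines of base R. The sketch below uses a made-up 2 x 4 count matrix and illustrative variable names; it follows the formulas stated in the text (intercept log(gene_mean) - log(avg_cell_umi), slope log(10) on log_umi, theta of 100) and assumes that log_umi is the log10 of the total UMI count per cell.

```r
# Toy UMI matrix: 2 genes (rows) x 4 cells (columns); values are made up.
umi <- matrix(c(0, 2, 4, 10,
                1, 1, 3, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))

cell_umi <- colSums(umi)        # total UMI per cell
log_umi <- log10(cell_umi)      # latent variable (assumed to be log10)
gene_mean <- rowMeans(umi)      # mean count per gene
avg_cell_umi <- mean(cell_umi)  # average total UMI per cell

# Fixed model parameters: nothing is learned by regression.
intercept <- log(gene_mean) - log(avg_cell_umi)
slope <- log(10)
theta <- 100

# Expected count for every gene/cell pair: exp(intercept + slope * log_umi).
mu <- exp(outer(intercept, slope * log_umi, "+"))

# Pearson residuals under the negative binomial variance mu + mu^2 / theta.
pearson_res <- (umi - mu) / sqrt(mu + mu^2 / theta)
```

Because the slope of log(10) cancels the log10 in log_umi, the expected count reduces to gene_mean * cell_umi / avg_cell_umi, so each gene's expected counts sum to its observed total.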

Examples

vst_out <- vst(pbmc)

sctransform documentation built on Jan. 13, 2021, 10:48 p.m.