vst | R Documentation |
Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This will remove unwanted effects from UMI data and return Pearson residuals. Uses future_lapply; you can set the number of cores it will use to n with plan(strategy = "multicore", workers = n). If n_genes is set, only a (somewhat-random) subset of genes is used for estimating the initial model parameters. For details see \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1186/s13059-019-1874-1")}.
vst(
umi,
cell_attr = NULL,
latent_var = c("log_umi"),
batch_var = NULL,
latent_var_nonreg = NULL,
n_genes = 2000,
n_cells = NULL,
method = "poisson",
do_regularize = TRUE,
theta_regularization = "od_factor",
res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
bin_size = 500,
min_cells = 5,
residual_type = "pearson",
return_cell_attr = FALSE,
return_gene_attr = TRUE,
return_corrected_umi = FALSE,
min_variance = -Inf,
bw_adjust = 3,
gmean_eps = 1,
theta_estimation_fun = "theta.ml",
theta_given = NULL,
exclude_poisson = FALSE,
use_geometric_mean = TRUE,
use_geometric_mean_offset = FALSE,
fix_intercept = FALSE,
fix_slope = FALSE,
scale_factor = NA,
vst.flavor = NULL,
verbosity = 2,
verbose = NULL,
show_progress = NULL
)
umi |
A matrix of UMI counts with genes as rows and cells as columns |
cell_attr |
A data frame containing the dependent variables; if omitted a data frame with umi and gene will be generated |
latent_var |
The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi") |
batch_var |
The dependent variables indicating which batch a cell belongs to; no batch interaction terms used if omiited |
latent_var_nonreg |
The non-regularized dependent variables to regress out as a character vector; must match column names in cell_attr; default is NULL |
n_genes |
Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes) |
n_cells |
Number of cells to use when estimating parameters (default uses all cells) |
method |
Method to use for initial parameter estimation; one of 'poisson', 'qpoisson', 'nb_fast', 'nb', 'nb_theta_given', 'glmGamPoi', 'offset', 'offset_shared_theta_estimate', 'glmGamPoi_offset'; default is 'poisson' |
do_regularize |
Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in first step (ignoring n_genes); default is FALSE |
theta_regularization |
Method to use to regularize theta; use 'log_theta' for the behavior prior to version 0.3; default is 'od_factor' |
res_clip_range |
Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi))) |
bin_size |
Number of genes to process simultaneously; this will determine how often the progress bars are updated and how much memory is being used; default is 500 |
min_cells |
Only use genes that have been detected in at least this many cells; default is 5 |
residual_type |
What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson' |
return_cell_attr |
Make cell attributes part of the output; default is FALSE |
return_gene_attr |
Calculate gene attributes and make part of output; default is TRUE |
return_corrected_umi |
If set to TRUE output will contain corrected UMI matrix; see |
min_variance |
Lower bound for the estimated variance for any gene in any cell when calculating pearson residual; one of 'umi_median', 'model_median', 'model_mean' or a numeric. default is -Inf. When set to 'umi_median' uses (median of non-zero UMIs / 5)^2 as the minimum variance so that a median UMI (often 1) results in a maximum pearson residual of 5. When set to 'model_median' or 'model_mean' uses the mean/median of the model estimated mu per gene as the minimum_variance.#' |
bw_adjust |
Kernel bandwidth adjustment factor used during regurlarization; factor will be applied to output of bw.SJ; default is 3 |
gmean_eps |
Small value added when calculating geometric mean of a gene to avoid log(0); default is 1 |
theta_estimation_fun |
Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative |
theta_given |
If method is set to nb_theta_given, this should be a named numeric vector of fixed theta values for the genes; if method is offset, this should be a single value; default is NULL |
exclude_poisson |
Exclude poisson genes (i.e. mu < 0.001 or mu > variance) from regularization; default is FALSE |
use_geometric_mean |
Use geometric mean instead of arithmetic mean for all calculations ; default is TRUE |
use_geometric_mean_offset |
Use geometric mean instead of arithmetic mean in the offset model; default is FALSE |
fix_intercept |
Fix intercept as defined in the offset model; default is FALSE |
fix_slope |
Fix slope to log(10) (equivalent to using library size as an offset); default is FALSE |
scale_factor |
Replace all values of UMI in the regression model by this value instead of the median UMI; default is NA |
vst.flavor |
When set to 'v2' sets method = glmGamPoi_offset, n_cells=2000, and exclude_poisson = TRUE which causes the model to learn theta and intercept only besides excluding poisson genes from learning and regularization; default is NULL which uses the original sctransform model |
verbosity |
An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2 |
verbose |
Deprecated; use verbosity instead |
show_progress |
Deprecated; use verbosity instead |
A list with components
y |
Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if |
umi_corrected |
Matrix of corrected UMI counts (optional) |
model_str |
Character representation of the model formula |
model_pars |
Matrix of estimated model parameters per gene (theta and regression coefficients) |
model_pars_outliers |
Vector indicating whether a gene was considered to be an outlier |
model_pars_fit |
Matrix of fitted / regularized model parameters |
model_str_nonreg |
Character representation of model for non-regularized variables |
model_pars_nonreg |
Model parameters for non-regularized variables |
genes_log_gmean_step1 |
log-geometric mean of genes used in initial step of parameter estimation |
cells_step1 |
Cells used in initial step of parameter estimation |
arguments |
List of function call arguments |
cell_attr |
Data frame of cell meta data (optional) |
gene_attr |
Data frame with gene attributes such as mean, detection rate, etc. (optional) |
times |
Time stamps at various points in the function |
In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done
on a subset of genes and/or cells to speed things up.
If method
is set to 'poisson', a poisson regression is done and
the negative binomial theta parameter is estimated using the response residuals in
theta_estimation_fun
.
If method
is set to 'qpoisson', coefficients and overdispersion (phi) are estimated by quasi
poisson regression and theta is estimated based on phi and the mean fitted value - this is currently
the fastest method with results very similar to 'glmGamPoi'
If method
is set to 'nb_fast', coefficients and theta are estimated as in the
'poisson' method, but coefficients are then re-estimated using a proper negative binomial
model in a second call to glm with family = MASS::negative.binomial(theta = theta)
.
If method
is set to 'nb', coefficients and theta are estimated by a single call to
MASS::glm.nb
.
If method
is set to 'glmGamPoi', coefficients and theta are estimated by a single call to
glmGamPoi::glm_gp
.
A special case is method = 'offset'
. Here no regression parameters are learned, but
instead an offset model is assumed. The latent variable is set to log_umi and a fixed
slope of log(10) is used (offset). The intercept is given by log(gene_mean) - log(avg_cell_umi).
See Lause et al. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1186/s13059-021-02451-7")} for details.
Theta is set
to 100 by default, but can be changed using the theta_given
parameter (single numeric value).
If the offset method is used, the following parameters are overwritten:
cell_attr <- NULL, latent_var <- c('log_umi'), batch_var <- NULL, latent_var_nonreg <- NULL,
n_genes <- NULL, n_cells <- NULL, do_regularize <- FALSE
. Further, method = 'offset_shared_theta_estimate'
exists where the 250 most highly expressed genes with detection rate of at least 0.5 are used
to estimate a theta that is then shared across all genes. Thetas are estimated per individual gene
using 5000 randomly selected cells. The final theta used for all genes is then the average.
vst_out <- vst(pbmc)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.