vst | R Documentation |

Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This will remove unwanted effects from UMI data and return Pearson residuals. Uses future_lapply; you can set the number of cores it will use to n with plan(strategy = "multicore", workers = n). If n_genes is set, only a (somewhat-random) subset of genes is used for estimating the initial model parameters. For details see doi:10.1186/s13059-019-1874-1.

```
vst(
umi,
cell_attr = NULL,
latent_var = c("log_umi"),
batch_var = NULL,
latent_var_nonreg = NULL,
n_genes = 2000,
n_cells = NULL,
method = "poisson",
do_regularize = TRUE,
theta_regularization = "od_factor",
res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
bin_size = 500,
min_cells = 5,
residual_type = "pearson",
return_cell_attr = FALSE,
return_gene_attr = TRUE,
return_corrected_umi = FALSE,
min_variance = -Inf,
bw_adjust = 3,
gmean_eps = 1,
theta_estimation_fun = "theta.ml",
theta_given = NULL,
exclude_poisson = FALSE,
use_geometric_mean = TRUE,
use_geometric_mean_offset = FALSE,
fix_intercept = FALSE,
fix_slope = FALSE,
scale_factor = NA,
vst.flavor = NULL,
verbosity = 2,
verbose = NULL,
show_progress = NULL
)
```

`umi` |
A matrix of UMI counts with genes as rows and cells as columns |

`cell_attr` |
A data frame containing the independent variables (cell attributes); if omitted, a data frame with umi and gene will be generated |

`latent_var` |
The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi") |

`batch_var` |
The variables indicating which batch a cell belongs to; no batch interaction terms are used if omitted |

`latent_var_nonreg` |
The non-regularized independent variables to regress out as a character vector; must match column names in cell_attr; default is NULL |

`n_genes` |
Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes) |

`n_cells` |
Number of cells to use when estimating parameters (default uses all cells) |

`method` |
Method to use for initial parameter estimation; one of 'poisson', 'qpoisson', 'nb_fast', 'nb', 'nb_theta_given', 'glmGamPoi', 'offset', 'offset_shared_theta_estimate', 'glmGamPoi_offset'; default is 'poisson' |

`do_regularize` |
Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in the first step (ignoring n_genes); default is TRUE |

`theta_regularization` |
Method to use to regularize theta; use 'log_theta' for the behavior prior to version 0.3; default is 'od_factor' |

`res_clip_range` |
Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi))) |

`bin_size` |
Number of genes to process simultaneously; this will determine how often the progress bars are updated and how much memory is being used; default is 500 |

`min_cells` |
Only use genes that have been detected in at least this many cells; default is 5 |

`residual_type` |
What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson' |

`return_cell_attr` |
Make cell attributes part of the output; default is FALSE |

`return_gene_attr` |
Calculate gene attributes and make part of output; default is TRUE |

`return_corrected_umi` |
If set to TRUE output will contain corrected UMI matrix; see |

`min_variance` |
Lower bound for the estimated variance for any gene in any cell when calculating Pearson residuals; one of 'umi_median', 'model_median', 'model_mean' or a numeric; default is -Inf. When set to 'umi_median', uses (median of non-zero UMIs / 5)^2 as the minimum variance so that a median UMI (often 1) results in a maximum Pearson residual of 5. When set to 'model_median' or 'model_mean', uses the median or mean, respectively, of the model-estimated mu per gene as the minimum variance. |

`bw_adjust` |
Kernel bandwidth adjustment factor used during regularization; factor will be applied to output of bw.SJ; default is 3 |

`gmean_eps` |
Small value added when calculating geometric mean of a gene to avoid log(0); default is 1 |

`theta_estimation_fun` |
Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative |

`theta_given` |
If method is set to nb_theta_given, this should be a named numeric vector of fixed theta values for the genes; if method is offset, this should be a single value; default is NULL |

`exclude_poisson` |
Exclude poisson genes (i.e. mu < 0.001 or mu > variance) from regularization; default is FALSE |

`use_geometric_mean` |
Use geometric mean instead of arithmetic mean for all calculations; default is TRUE |

`use_geometric_mean_offset` |
Use geometric mean instead of arithmetic mean in the offset model; default is FALSE |

`fix_intercept` |
Fix intercept as defined in the offset model; default is FALSE |

`fix_slope` |
Fix slope to log(10) (equivalent to using library size as an offset); default is FALSE |

`scale_factor` |
Replace all values of UMI in the regression model by this value instead of the median UMI; default is NA |

`vst.flavor` |
When set to 'v2' sets method = glmGamPoi_offset, n_cells=2000, and exclude_poisson = TRUE which causes the model to learn theta and intercept only besides excluding poisson genes from learning and regularization; default is NULL which uses the original sctransform model |

`verbosity` |
An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2 |

`verbose` |
Deprecated; use verbosity instead |

`show_progress` |
Deprecated; use verbosity instead |
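
To make the `min_variance` and `res_clip_range` arguments more concrete, here is a minimal base R sketch of the arithmetic they describe. It does not call sctransform; the counts, the model means `mu`, and the value of `theta` are made-up toy values, and the median is taken over this single toy gene purely for illustration.

```r
# Toy data for one gene across six cells (hypothetical numbers)
y     <- c(0, 0, 1, 1, 2, 40)               # observed UMI counts
mu    <- c(0.02, 0.05, 0.3, 0.4, 0.6, 0.8)  # hypothetical model means
theta <- 10                                  # hypothetical NB dispersion

# Negative binomial variance implied by mu and theta
model_var <- mu + mu^2 / theta

# 'umi_median'-style floor: (median of non-zero UMIs / 5)^2
median_umi <- median(y[y > 0])
min_var    <- (median_umi / 5)^2

# Pearson residuals with the variance bounded from below
res <- (y - mu) / sqrt(pmax(model_var, min_var))

# Clip to +/- sqrt(number of cells), mirroring the res_clip_range default
n_cells     <- length(y)
clip_range  <- c(-sqrt(n_cells), sqrt(n_cells))
res_clipped <- pmin(pmax(res, clip_range[1]), clip_range[2])

# With the floor in place, a count equal to the median UMI cannot
# produce a residual above 5, because median_umi / sqrt(min_var) is 5
median_umi / sqrt(min_var)
```

The floor only matters for genes whose model variance is very small; the clipping step then bounds whatever extreme residuals remain.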

A list with components

`y` |
Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if |

`umi_corrected` |
Matrix of corrected UMI counts (optional) |

`model_str` |
Character representation of the model formula |

`model_pars` |
Matrix of estimated model parameters per gene (theta and regression coefficients) |

`model_pars_outliers` |
Vector indicating whether a gene was considered to be an outlier |

`model_pars_fit` |
Matrix of fitted / regularized model parameters |

`model_str_nonreg` |
Character representation of model for non-regularized variables |

`model_pars_nonreg` |
Model parameters for non-regularized variables |

`genes_log_gmean_step1` |
log-geometric mean of genes used in initial step of parameter estimation |

`cells_step1` |
Cells used in initial step of parameter estimation |

`arguments` |
List of function call arguments |

`cell_attr` |
Data frame of cell meta data (optional) |

`gene_attr` |
Data frame with gene attributes such as mean, detection rate, etc. (optional) |

`times` |
Time stamps at various points in the function |

In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done on a subset of genes and/or cells to speed things up.

If `method` is set to 'poisson', a poisson regression is done and the negative binomial theta parameter is estimated using the response residuals in `theta_estimation_fun`.

If `method` is set to 'qpoisson', coefficients and overdispersion (phi) are estimated by quasi poisson regression and theta is estimated based on phi and the mean fitted value. This is currently the fastest method, with results very similar to 'glmGamPoi'.

If `method` is set to 'nb_fast', coefficients and theta are estimated as in the 'poisson' method, but coefficients are then re-estimated using a proper negative binomial model in a second call to glm with `family = MASS::negative.binomial(theta = theta)`.

If `method` is set to 'nb', coefficients and theta are estimated by a single call to `MASS::glm.nb`.

If `method` is set to 'glmGamPoi', coefficients and theta are estimated by a single call to `glmGamPoi::glm_gp`.
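
The per-gene estimation strategies described above can be sketched for a single simulated gene with base R and MASS. This is an illustration under stated assumptions, not sctransform's internal code: the simulated data, the variable names, and in particular the relation `theta = mean(fitted) / (phi - 1)` (obtained by matching the quasi-Poisson variance `phi * mu` to the negative binomial variance `mu + mu^2 / theta`) are assumptions chosen to be consistent with the description.

```r
library(MASS)

set.seed(42)
n_cells <- 500
log_umi <- rnorm(n_cells, mean = 3.5, sd = 0.2)   # log10 of total UMI per cell
mu_true <- exp(-7 + log(10) * log_umi)            # simulated expected counts
y       <- rnbinom(n_cells, mu = mu_true, size = 2)  # one gene, true theta = 2

# 'poisson'-style: Poisson regression, then theta from the fitted means
fit_pois <- glm(y ~ log_umi, family = poisson)
theta_ml <- as.numeric(MASS::theta.ml(y = y, mu = fitted(fit_pois)))

# 'nb_fast'-style: re-estimate coefficients with theta held fixed
fit_nb_fast <- glm(y ~ log_umi,
                   family = MASS::negative.binomial(theta = theta_ml))

# 'qpoisson'-style: overdispersion phi from a quasi-Poisson fit,
# then a moment-matching estimate of theta (assumed relation, see lead-in)
fit_qp   <- glm(y ~ log_umi, family = quasipoisson)
phi      <- summary(fit_qp)$dispersion
theta_qp <- mean(fitted(fit_qp)) / (phi - 1)

# 'nb'-style: coefficients and theta in a single call
fit_nb <- MASS::glm.nb(y ~ log_umi)

c(theta_ml = theta_ml, theta_qp = theta_qp, theta_nb = fit_nb$theta)
```

The three theta estimates will generally differ somewhat; the moment-matching estimate is only defined when `phi` exceeds 1, that is, when the gene is overdispersed relative to a Poisson model.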

A special case is `method = 'offset'`. Here no regression parameters are learned, but instead an offset model is assumed. The latent variable is set to log_umi and a fixed slope of log(10) is used (offset). The intercept is given by log(gene_mean) - log(avg_cell_umi). See Lause et al. doi:10.1186/s13059-021-02451-7 for details. Theta is set to 100 by default, but can be changed using the `theta_given` parameter (single numeric value). If the offset method is used, the following parameters are overwritten:
```
cell_attr <- NULL, latent_var <- c('log_umi'), batch_var <- NULL, latent_var_nonreg <- NULL,
n_genes <- NULL, n_cells <- NULL, do_regularize <- FALSE
```

Further, `method = 'offset_shared_theta_estimate'` exists, where the 250 most highly expressed genes with a detection rate of at least 0.5 are used to estimate a theta that is then shared across all genes. Thetas are estimated per individual gene using 5000 randomly selected cells. The final theta used for all genes is then the average.
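
The offset model can be written out by hand in a few lines of base R. The sketch below is a minimal illustration, not the package's implementation: the simulated count matrix is made up, `log_umi` is assumed to be the log10 of each cell's total UMI count, and `avg_cell_umi` is assumed to be the arithmetic mean of those totals.

```r
set.seed(1)
# Toy UMI matrix: 5 genes (rows) x 8 cells (columns)
umi <- matrix(rpois(5 * 8, lambda = 3), nrow = 5,
              dimnames = list(paste0("gene", 1:5), paste0("cell", 1:8)))

cell_umi  <- colSums(umi)     # total UMI per cell
log_umi   <- log10(cell_umi)  # latent variable (assumed log10 scale)
gene_mean <- rowMeans(umi)    # arithmetic mean count per gene

# Fixed parameters of the offset model
intercept <- log(gene_mean) - log(mean(cell_umi))
slope     <- log(10)
theta     <- 100

# Expected counts: exp(intercept + slope * log_umi), one row per gene
mu <- exp(outer(intercept, slope * log_umi, "+"))

# Pearson residuals under a negative binomial with fixed theta
pearson <- (umi - mu) / sqrt(mu + mu^2 / theta)

# Because slope * log10(x) equals log(x), mu reduces to
# (gene_mean / mean(cell_umi)) * cell_umi, i.e. a library-size offset
mu_direct <- outer(gene_mean / mean(cell_umi), cell_umi)
```

With this parameterization the expected counts of each gene sum to its observed total, which is why no per-gene regression needs to be fitted.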

```
library(sctransform)
vst_out <- vst(pbmc)
```
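
A slightly fuller sketch of the same call shows how the parallel backend and the returned list might be used. It assumes that the sctransform and future packages are installed and uses the `pbmc` example count matrix that ships with sctransform; the "multisession" strategy and the choice of two workers are illustrative assumptions, not recommendations.

```r
library(sctransform)
library(future)

# Optional: let future_lapply use two background R sessions.
# "multisession" is used here because "multicore" is not available
# on every platform (for example Windows).
plan(strategy = "multisession", workers = 2)

# Run the transformation quietly and also return the cell attributes
vst_out <- vst(pbmc, return_cell_attr = TRUE, verbosity = 0)

# Switch back to sequential processing
plan(strategy = "sequential")

dim(vst_out$y)                # Pearson residuals, genes x cells
vst_out$model_str             # model formula as a character string
head(vst_out$model_pars_fit)  # regularized per-gene parameters
head(vst_out$gene_attr)       # per-gene summary statistics
head(vst_out$cell_attr)       # present because return_cell_attr = TRUE
```

Genes detected in fewer than `min_cells` cells are dropped, so `vst_out$y` can have fewer rows than the input matrix.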
