vst: Variance stabilizing transformation for UMI count data


View source: R/vst.R

Description

Apply a variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model. This removes unwanted effects from the UMI data and returns Pearson residuals. Computation is parallelized with mclapply; to use n cores, set options(mc.cores = n). If n_genes is set, only a (somewhat random) subset of genes is used for estimating the initial model parameters.
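For orientation, the Pearson residual of a count y under a negative binomial model with mean mu and dispersion theta is (y - mu) / sqrt(mu + mu^2 / theta). The following is a minimal sketch of that formula only; the helper name pearson_residual_nb and the toy values are illustrative assumptions and are not part of sctransform.

```r
# Pearson residual for a negative binomial model.
# The NB variance with mean mu and dispersion theta is mu + mu^2 / theta.
# NOTE: pearson_residual_nb is a hypothetical helper, not an sctransform function.
pearson_residual_nb <- function(y, mu, theta) {
  (y - mu) / sqrt(mu + mu^2 / theta)
}

y     <- c(0, 3, 10)  # observed counts (toy values)
mu    <- c(1, 3, 4)   # fitted means (toy values)
theta <- 2            # dispersion (toy value)

res <- pearson_residual_nb(y, mu, theta)
print(res)
```

A count equal to its fitted mean gives a residual of zero; counts above the mean give positive residuals, scaled by the model's expected standard deviation.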

Usage

vst(umi, cell_attr = NULL, latent_var = c("log_umi"),
  batch_var = NULL, latent_var_nonreg = NULL, n_genes = 2000,
  n_cells = NULL, method = "poisson", do_regularize = TRUE,
  res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
  bin_size = 256, min_cells = 5, return_cell_attr = FALSE,
  return_gene_attr = FALSE, return_dev_residuals = FALSE,
  return_corrected_umi = FALSE, bw_adjust = 3, theta_given = NULL,
  show_progress = TRUE)

Arguments

umi

A matrix of UMI counts with genes as rows and cells as columns

cell_attr

A data frame containing the independent variables; if omitted, a data frame with umi and gene will be generated

latent_var

The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi")

batch_var

The variable indicating which batch a cell belongs to; no batch interaction terms are used if omitted

latent_var_nonreg

The non-regularized independent variables to regress out as a character vector; must match column names in cell_attr; default is NULL

n_genes

Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes)

n_cells

Number of cells to use when estimating parameters (default uses all cells)

method

Method to use for initial parameter estimation; one of 'poisson', 'nb_fast', 'nb', 'nb_theta_given'; default is 'poisson'

do_regularize

Boolean that, if set to FALSE, will bypass parameter regularization

res_clip_range

Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi)))

bin_size

Number of genes to put in each bin (to show progress)

min_cells

Only use genes that have been detected in at least this many cells

return_cell_attr

Make cell attributes part of the output

return_gene_attr

Calculate gene attributes and make part of output

return_dev_residuals

If set to TRUE output will be deviance residuals, NOT Pearson residuals; default is FALSE

return_corrected_umi

If set to TRUE output will contain corrected UMI matrix; see denoise function

bw_adjust

Kernel bandwidth adjustment factor used during regularization; factor will be applied to output of bw.SJ; default is 3

theta_given

Named numeric vector of fixed theta values for the genes; will only be used if method is set to nb_theta_given; default is NULL

show_progress

Whether to print progress bar
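Several of the arguments above can be illustrated with base R alone. The sketch below builds a cell attribute data frame with umi and gene columns, applies a min_cells style filter, and clips values to the default res_clip_range. It is an approximation under stated assumptions: the toy matrix, the clip helper, the use of log10 for log_umi, and the reading of "detected" as a count greater than zero are assumptions, not confirmed details of the package internals.

```r
set.seed(1)

# Toy UMI matrix: 20 genes (rows) x 50 cells (columns)
umi <- matrix(rpois(20 * 50, lambda = 2), nrow = 20,
              dimnames = list(paste0("gene", 1:20), paste0("cell", 1:50)))

# Cell attributes with the two columns the documentation mentions:
# total UMI per cell and number of detected genes per cell
cell_attr <- data.frame(umi = colSums(umi), gene = colSums(umi > 0))
# ASSUMPTION: log_umi is the base-10 log of the total UMI count
cell_attr$log_umi <- log10(cell_attr$umi)

# min_cells: keep genes detected (count > 0, an assumption) in enough cells
min_cells <- 5
keep      <- rowSums(umi > 0) >= min_cells
umi_kept  <- umi[keep, , drop = FALSE]

# Default res_clip_range from the Usage section
res_clip_range <- c(-sqrt(ncol(umi)), sqrt(ncol(umi)))

# Hypothetical helper: clip values to a two-element range
clip <- function(x, range) pmin(pmax(x, range[1]), range[2])
clipped <- clip(c(-100, 0.5, 100), res_clip_range)
print(clipped)
```

With 50 cells the default clip range is roughly -7.07 to 7.07, so residuals far outside that interval are truncated while moderate values pass through unchanged.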

Value

A list with components

y

Matrix of transformed data, i.e. Pearson residuals (or deviance residuals if return_dev_residuals is TRUE)

umi_corrected

Matrix of corrected UMI counts (optional)

model_str

Character representation of the model formula

model_pars

Matrix of estimated model parameters per gene (theta and regression coefficients)

model_pars_outliers

Vector indicating whether a gene was considered to be an outlier

model_pars_fit

Matrix of fitted / regularized model parameters

model_str_nonreg

Character representation of model for non-regularized variables

model_pars_nonreg

Model parameters for non-regularized variables

genes_log_mean_step1

log-mean of genes used in initial step of parameter estimation

cells_step1

Cells used in initial step of parameter estimation

arguments

List of function call arguments

cell_attr

Data frame of cell meta data (optional)

gene_attr

Data frame with gene attributes such as mean, detection rate, etc. (optional)

Details

In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done on a subset of genes and/or cells to speed things up. If method is set to 'poisson', glm will be called with family = poisson and the negative binomial theta parameter will be estimated using the response residuals in MASS::theta.ml. If method is set to 'nb_fast', glm coefficients and theta are estimated as in the 'poisson' method, but coefficients are then re-estimated using a proper negative binomial model in a second call to glm with family = MASS::negative.binomial(theta = theta). If method is set to 'nb', coefficients and theta are estimated by a single call to MASS::glm.nb.
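The three estimation strategies described above can be sketched for a single gene using glm and MASS directly. This is an illustration on simulated data, not the package's internal code: the data frame cells, the simulated coefficients, and the formula y ~ log_umi are assumptions chosen for the example.

```r
library(MASS)  # provides theta.ml, negative.binomial, glm.nb

set.seed(42)
n_cells <- 1000

# Hypothetical covariate: log10 of total UMI per cell
cells <- data.frame(log_umi = runif(n_cells, min = 3, max = 4))

# Simulate counts for one gene from a negative binomial model (true theta = 2)
mu_true <- exp(-7 + 2.3 * cells$log_umi)
cells$y <- rnbinom(n_cells, mu = mu_true, size = 2)

# method = 'poisson': Poisson regression, then theta by maximum likelihood
fit_pois   <- glm(y ~ log_umi, data = cells, family = poisson)
theta_pois <- as.numeric(theta.ml(y = cells$y, mu = fit_pois$fitted.values))

# method = 'nb_fast': keep that theta, re-estimate the coefficients
# with a negative binomial family in a second glm call
fit_nb_fast <- glm(y ~ log_umi, data = cells,
                   family = negative.binomial(theta = theta_pois))

# method = 'nb': coefficients and theta from a single glm.nb call
fit_nb <- glm.nb(y ~ log_umi, data = cells)

# Compare the estimates side by side
estimates <- rbind(
  poisson = c(coef(fit_pois),    theta = theta_pois),
  nb_fast = c(coef(fit_nb_fast), theta = theta_pois),
  nb      = c(coef(fit_nb),      theta = fit_nb$theta)
)
print(estimates)
```

The 'poisson' route is the cheapest because it needs one Poisson fit plus a one-dimensional optimization for theta, while 'nb' alternates between coefficient and theta updates inside glm.nb and is correspondingly slower per gene.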

Examples

vst_out <- vst(pbmc)
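A slightly longer variant of the example, sketched under the assumption that the sctransform package and its bundled pbmc data set are installed. It sets the number of cores, requests the optional outputs, and inspects a few components listed under Value. All argument and component names are taken from this page; the choice of two cores is arbitrary.

```r
# mclapply relies on forking, which is unavailable on Windows,
# so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
options(mc.cores = n_cores)

# Only run the transformation when the package is available
if (requireNamespace("sctransform", quietly = TRUE)) {
  library(sctransform)

  vst_out <- vst(pbmc,
                 return_gene_attr = TRUE,   # include gene_attr in the output
                 return_cell_attr = TRUE,   # include cell_attr in the output
                 show_progress = FALSE)

  print(dim(vst_out$y))              # genes x cells matrix of residuals
  print(vst_out$model_str)           # model formula as a character string
  print(head(vst_out$model_pars_fit))
  print(head(vst_out$gene_attr))
}
```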

sctransform documentation built on Nov. 18, 2018, 5:04 p.m.