In ChristophH/sctransform: Variance Stabilizing Transformations for Single Cell UMI Data

library('Matrix')
library('ggplot2')
library('reshape2')
library('sctransform')
library('knitr')
library('dplyr')
library('GGally')
knit_hooks$set(optipng = hook_optipng)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  digits = 2,
  tidy = TRUE,
  tidy.opts = list(width.cutoff=80),
  optipng = '-o 5 -strip all -quiet',
  fig.width=6, fig.height=4, dpi=100, out.width = '70%'
)
old_theme <- theme_set(theme_classic(base_size=11))
# some of the vst steps can use multiple cores
# We use the Future API for parallel processing; set parameters here
future::plan(strategy = 'multicore', workers = 1)
options(future.globals.maxSize = 8 * 1024 ^ 3)
options(future.fork.enable = FALSE)

Introduction

In sctransform::vst we support several methods to estimate the parameters of the per-gene linear models. Here we briefly go over the methods and compare their results and runtime.

Methods

poisson - does poisson regression and the negative binomial theta parameter is estimated using the response residuals. By default theta is estimated using MASS::theta.ml, but MASS::theta.mm can also be used by changing the theta_estimation_fun parameter.
qpoisson - does quasi poisson regression to obtain coefficients and overdispersion (phi) and theta is estimated based on phi and the mean fitted value - this is currently the fastest method with results very similar to glmGamPoi
nb_fast - coefficients and theta are estimated as in the poisson method, but coefficients are then re-estimated using a proper negative binomial model in a call to glm with family = MASS::negative.binomial(theta = theta).
nb - coefficients and theta are estimated by MASS::glm.nb.
glmGamPoi - coefficients and theta are estimated by glmGamPoi::glm_gp.
offset - no regression parameters are learned, but instead an offset model is assumed. The latent variable is set to log_umi and a fixed slope of log(10) is used (offset). The intercept is given by log(gene_mean) - log(avg_cell_umi). Theta is set to 100 by default, but can be changed using the theta_given parameter
offset_shared_theta_estimate - like offset above, but the 250 most highly expressed genes with detection rate of at least 0.5 are used to estimate a theta that is then shared across all genes. Thetas are estimated per individual gene using 5000 randomly selected cells. The final theta used for all genes is then the average.

Comparison

We are going to process a PBMC dataset with all the methods listed above. The dataset is available from 10x Genomics.

cm <- Seurat::Read10X_h5(file = '~/Projects/data_warehouse/raw_public_10x/pbmc_10k_v3_filtered_feature_bc_matrix.h5')
# downsample to speed up compilation of this vignette
set.seed(42)
cm <- cm[, sample(x = ncol(cm), size = 5000)]
message(nrow(cm), ' genes across ', ncol(cm), ' cells')

result_lst <- list()
estimation_methods <- c('poisson', 'qpoisson', 'nb_fast', 'nb', 'glmGamPoi', 'offset', 'offset_shared_theta_estimate')
for (estimation_method in estimation_methods) {
  if (estimation_method %in% c('poisson', 'nb_fast')) {
    for (theta_estimation_fun in c('theta.ml', 'theta.mm')) {
      method_name <- paste(estimation_method, theta_estimation_fun, sep = '-')
      message(method_name)
      set.seed(33)
      vst_out <- vst(umi = cm, method = estimation_method, 
                     theta_estimation_fun = theta_estimation_fun, 
                     verbosity = 0)
      vst_out$y <- NULL
      result_lst[[method_name]] <- vst_out
    }
  } else {
    method_name <- estimation_method
    message(method_name)
    set.seed(33)
    vst_out <- vst(umi = cm, method = estimation_method, 
                   verbosity = 0)
    vst_out$y <- NULL
    result_lst[[method_name]] <- vst_out
  }
}

Show how the residual variances compare between methods (plot below shows log10-transformed values)

mat <- sapply(result_lst, function(x) x$gene_attr$residual_variance)
colnames(mat) <- names(result_lst)
ggpairs(data.frame(log10(mat)), progress = FALSE)

Show how the residual means compare between methods

mat <- sapply(result_lst, function(x) x$gene_attr$residual_mean)
colnames(mat) <- names(result_lst)
ggpairs(data.frame(mat), progress = FALSE)

Show the model parameters for all methods

plot_lst <- lapply(names(result_lst), function(method_name) {
  plot_model_pars(result_lst[[method_name]], show_theta = TRUE) + ggtitle(method_name)
})
plot(cowplot::plot_grid(plotlist = plot_lst, ncol = 2))

Overall, the regularized parameters and the resulting residuals are very similar across the methods. It is unlikely that the small differences we see above would lead to big differences in downstream analyses like dimensionality reduction and clustering.

How do the methods compare with respect to runtime?

tmp <- lapply(names(result_lst), function(method_name) {
  times <- result_lst[[method_name]]$times
  delta_t_total <- as.numeric(times$done) - as.numeric(times$start_time)
  delta_t_model <- as.numeric(times$reg_model_pars) - as.numeric(times$get_model_pars)
  data.frame(method_name, delta_t_total, delta_t_model)
})
df <- do.call(rbind, tmp) %>% mutate(delta_t_rest = delta_t_total - delta_t_model)
df$method_name <- factor(df$method_name, levels = df$method_name, ordered = TRUE)
df <- arrange(df, delta_t_total) %>% mutate(method_name = factor(method_name, unique(method_name)))
melt(df, id.vars = 'method_name', measure.vars = c('delta_t_model', 'delta_t_rest')) %>%
  ggplot(aes(method_name, value, fill = variable)) +
  geom_bar(position = 'stack', stat = 'identity') +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  xlab('Method') + ylab('Wall clock time in seconds') +
  scale_fill_discrete(name = 'Part of algorithm', labels = c('Model fitting', 'Rest'))

df

The runtime comparison above is not an in-depth benchmark, but it gives a good idea about the relative speed of the different methods. All times above were obtained using a single CPU core. However, the model fitting step supports the Future API for parallel processing.

ChristophH/sctransform documentation built on June 11, 2025, 1:17 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ChristophH/sctransform
Variance Stabilizing Transformations for Single Cell UMI Data

In ChristophH/sctransform: Variance Stabilizing Transformations for Single Cell UMI Data

Introduction

Methods

Comparison

R Package Documentation

Browse R Packages

We want your feedback!

ChristophH/sctransform Variance Stabilizing Transformations for Single Cell UMI Data

In ChristophH/sctransform: Variance Stabilizing Transformations for Single Cell UMI Data

Introduction

Methods

Comparison

R Package Documentation

Browse R Packages

We want your feedback!

ChristophH/sctransform
Variance Stabilizing Transformations for Single Cell UMI Data