library('Matrix')
library('ggplot2')
library('reshape2')
library('sctransform')
library('knitr')
knit_hooks$set(optipng = hook_optipng)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  digits = 2,
  tidy = TRUE,
  tidy.opts = list(width.cutoff=80),
  optipng = '-o 5 -strip all -quiet',
  fig.width=6, fig.height=4, dpi=100, out.width = '70%'
)
old_theme <- theme_set(theme_classic(base_size=8))
# some of the vst steps can use multiple cores
# We use the Future API for parallel processing; set parameters here
future::plan(strategy = 'multicore', workers = 3)
options(future.globals.maxSize = 8 * 1024 ^ 3)
options(future.fork.enable = TRUE)

In this vignette we show how the regression model in the variance stabilizing transformation can be used to output corrected data. Implicitly there are two levels of smoothing/de-noising when we apply the standard workflow using vst. First we specify latent variables that are used in the regression model - their contribution to the overall variance in the data will be removed. Second, we usually perform dimensionality reduction, which acts like a smoothing operation. Here we show how to reverse these two operations to obtain corrected UMI counts.

Load data and transform

We use data from a recent publication: Mayer, Hafemeister, Bandler et al., Nature 2018 (free read-only version). We load a subset of the cells, namely one of the CGE E12.5 dropseq samples with contaminating cell populations removed. These cells come from a developing continuum and provide a nice example for de-noising and count correction.

First load the data and run variance stabilizing transformation.

cm <- readRDS('~/Projects/data/in-lineage_dropseq_CGE3_digitial_expression.Rds')

set.seed(42)
vst_out <- sctransform::vst(cm, latent_var = 'log_umi_per_gene', return_cell_attr = TRUE, verbosity = 1)

To place the cells on a maturation trajectory, we perform PCA and fit a principal curve to the first two dimensions, similar to the approach used in the original publication.

pca <- irlba::prcomp_irlba(t(vst_out$y), n = 2)

# fit principal curve through first two PCs
pricu <- princurve::principal_curve(pca$x, smoother='lowess', f=0.5, stretch=333)
# cell projection onto curve is maturation score
maturation_score <- pricu$lambda/max(pricu$lambda)

We will now smooth the Pearson residual by PCA. Internally smooth_via_pca performs PCA, then keeps only significant dimensions and back-rotates to the original data space. The number of significant dimensions is determined using the 'elbow' method. To speed things up we could also use only a subset of the genes for this operation (e.g. variable genes).

y_smooth <- sctransform::smooth_via_pca(vst_out$y, do_plot = TRUE)

The data matrix y_smooth is in Pearson residual space. Based on these values we can reverse the negative binomial regression model to derive UMI counts per gene. To remove the variability from the latent factor (here log_umi_per_gene as a proxy of sequencing depth), we can use a fixed value for all cells. The next step uses the smoothed Pearson residual and the median of all latent factors to obtain corrected UMI counts.

cm_corrected <- sctransform::correct(vst_out, data = y_smooth, verbosity = 1)

To give a better idea of what the data really looks like we will plot UMI, Pearson residual, corrected Pearson residual, and corrected UMI counts for some key genes related to neuronal development.

goi <- c('Nes', 'Ccnd2', 'Tuba1a')
df <- list()
df[[1]] <- melt(t(as.matrix(cm[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[1]]$type <- 'UMI'
df[[1]]$maturation_rank <- rank(maturation_score)
df[[2]] <- melt(t(as.matrix(vst_out$y[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[2]]$type <- 'Pearson residual'
df[[2]]$maturation_rank <- rank(maturation_score)
df[[3]] <- melt(t(as.matrix(y_smooth[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[3]]$type <- 'corrected Pearson residual'
df[[3]]$maturation_rank <- rank(maturation_score)
df[[4]] <- melt(t(as.matrix(cm_corrected[goi, ])), varnames = c('cell', 'gene'), value.name = 'value')
df[[4]]$type <- 'corrected UMI'
df[[4]]$maturation_rank <- rank(maturation_score)
df <- do.call(rbind, df)
df$gene <- factor(df$gene, ordered=TRUE, levels=unique(df$gene))
df$type <- factor(df$type, ordered=TRUE, levels=unique(df$type))
ggplot(df, aes(maturation_rank, value)) + 
  geom_point(alpha=0.5, shape=16) + 
  geom_density_2d(contour_var = 'ndensity', size=0.5, bins = 5) + 
  facet_grid(type ~ gene, scales = 'free')


ChristophH/sctransform documentation built on Oct. 22, 2023, 3:28 p.m.