library('Matrix')
library('ggplot2')
library('reshape2')
library('sctransform')
library('knitr')
knit_hooks$set(optipng = hook_optipng)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  digits = 2,
  tidy = TRUE,
  tidy.opts = list(width.cutoff=80),
  optipng = '-o 5 -strip all -quiet',
  fig.width=4, fig.height=2.5, dpi=100, out.width = '70%'
)
old_theme <- theme_set(theme_classic(base_size=10))

With this vignette we show how data that has been transformed using the vst function can be tested for differentially expressed genes across subsets of cells.

Load some data

First we will follow the Seurat clustering tutorial and load a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here. We will use the cell type identies from the Seurat tutorial in this vignette.

pbmc_clusters <- readRDS(file = "~/Projects/data/pbmc3k_celltypes.rds")
pbmc_data <- readRDS(file = "~/Projects/data/pbmc3k_umi_counts.Rds")
pbmc_data <- pbmc_data[, names(pbmc_clusters)]

class(pbmc_data)
dim(pbmc_data)

pbmc_data is a sparse matrix of UMI counts (32,738 genes as rows and 2,638 cells as columns). Perform the variance stabilizing transformation:

set.seed(43)
vst_out <- sctransform::vst(pbmc_data, latent_var = c('log_umi_per_gene'), return_gene_attr = TRUE, return_cell_attr = TRUE, verbosity = 1)

Perform differential expression test between the two monocyte clusters.

res1 <- sctransform:::compare_expression(x = vst_out, umi = pbmc_data, group = pbmc_clusters, 
                                        val1 = 'CD14+ Monocytes', 
                                        val2 = 'FCGR3A+ Monocytes', 
                                        verbosity = 1)

By default, for every gene compare_expression uses a likelihood ratio test between two models. The first model has only an intercept term, while the second model also includes a group indicator variable, effectively fitting one intercept per group. Both models include an offset term to account for the expected UMI counts given the vst model. Model 1: $\log(\mu) = \beta_0 + \log(o)$, model 2: $\log(\mu) = \beta_1 + \beta_2 x + \log(o)$, where $\mu$ is the expected number of UMI counts of the given gene, $o$ the offset term (expected value under the regularized negative binomial regression model - see variance stabilizing transformation vignette), $x$ indicator variable that is 1 for cells belonging to group 2 and 0 otherwise. A negative binomial error distribution is assumed and both models use the regularized theta from the vst model to specify variance. The fold-change can be determined directly from the model 2 coefficient: $\log_2(e^{\beta_2})$

The results are ordered by p-value and we show the top ranking genes for each group.

head(subset(res1, log_fc < 0), 10)
head(subset(res1, log_fc > 0), 10)

Generate a volcano-plot

ggplot(res1, aes(log_fc, -log10(p_value))) + geom_point(alpha=0.4, shape=16)

Genes only detected in one group receive large absolute values for fold change, but not necessarily the smallest p-values.

We can also test one cluster vs all others. For example CD8 T cells:

res2 <- sctransform:::compare_expression(x = vst_out, umi = pbmc_data, group = pbmc_clusters, val1 = setdiff(pbmc_clusters, 'CD8 T cells'), val2 = 'CD8 T cells', verbosity = 1)
head(subset(res2, log_fc > 0), 10)

And plot the distribution in raw UMI space and Pearson residual space.

goi <- rownames(subset(res2, log_fc > 0))[1:3]
df <- melt(t(as.matrix(pbmc_data[goi, ])), varnames = c('cell', 'gene'), value.name = 'UMI')
df$cluster <- pbmc_clusters
ggplot(df, aes(x = gene, y = log10(UMI + 1), color = cluster == 'CD8 T cells')) + geom_violin(scale = 'width')
df <- melt(t(as.matrix(vst_out$y[goi, ])), varnames = c('cell', 'gene'), value.name = 'Pearson_residual')
df$cluster <- pbmc_clusters
ggplot(df, aes(x = gene, y = Pearson_residual, color = cluster == 'CD8 T cells')) + geom_violin(scale = 'width')


ChristophH/sctransform documentation built on Oct. 22, 2023, 3:28 p.m.