In GabrielHoffman/dreamlet: Scalable differential expression analysis of single cell transcriptomics datasets with complex study designs

knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  error = FALSE,
  tidy = FALSE,
  dev = c("png"),
  cache = TRUE
)

Introduction

As the scale of single cell/nucleus RNA-seq has increased, so has the complexity of study designs. Analysis of datasets with simple study designs can be performed using linear model as in the muscat package. Yet analysis of datsets with complex study designs such as repeated measures or many technical batches can benefit from linear mixed model analysis to model to correlation structure between samples. We previously developed dream to apply linear mixed models to bulk RNA-seq data using a limma-style workflow. Dreamlet extends the previous work of dream and muscat to apply linear mixed models to pseudobulk data. Dreamlet also supports linear models and facilitates application of 1) variancePartition to quantify the contribution of multiple variables to expression variation, and 2) zenith to perform gene set analysis on the differential expression signatures.

Installation

To install this package, start R (version "4.3") and enter:

if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Select release #1 or #2

# 1) Bioconductor release
BiocManager::install("dreamlet")

# 2) Latest stable release
devtools::install_github("DiseaseNeurogenomics/dreamlet")

Process single cell count data

Here we perform analysis of PBMCs from 8 individuals stimulated with interferon-β (Kang, et al, 2018, Nature Biotech). This is a small dataset that does not have repeated measures or high dimensional batch effects, so the sophisticated features of dreamlet are not strictly necessary. But this gives us an opportunity to walk through a standard dreamlet workflow.

Preprocess data

Here, single cell RNA-seq data is downloaded from ExperimentHub.

library(dreamlet)
library(muscat)
library(ExperimentHub)
library(zenith)
library(scater)

# Download data, specifying EH2259 for the Kang, et al study
eh <- ExperimentHub()
sce <- eh[["EH2259"]]

sce$ind = as.character(sce$ind)

# only keep singlet cells with sufficient reads
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce <- sce[, colData(sce)$multiplets == "singlet"]

# compute QC metrics
qc <- perCellQCMetrics(sce)

# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]

# set variable indicating stimulated (stim) or control (ctrl)
sce$StimStatus <- sce$stim

Aggregate to pseudobulk

Dreamlet, like muscat, performs analysis at the pseudobulk-level by summing raw counts across cells for a given sample and cell type. aggregateToPseudoBulk is substantially faster for large on-disk datasets than muscat::aggregateData.

# Since 'ind' is the individual and 'StimStatus' is the stimulus status,
# create unique identifier for each sample
sce$id <- paste0(sce$StimStatus, sce$ind)

# Create pseudobulk data by specifying cluster_id and sample_id
# Count data for each cell type is then stored in the `assay` field
# assay: entry in assayNames(sce) storing raw counts
# cluster_id: variable in colData(sce) indicating cell clusters
# sample_id: variable in colData(sce) indicating sample id for aggregating cells
pb <- aggregateToPseudoBulk(sce,
  assay = "counts",
  cluster_id = "cell",
  sample_id = "id",
  verbose = FALSE
)

# one 'assay' per cell type
assayNames(pb)

Voom for pseudobulk

Apply voom-style normalization for pseudobulk counts within each cell cluster using voomWithDreamWeights to handle random effects (if specified).

# Normalize and apply voom/voomWithDreamWeights
res.proc <- processAssays(pb, ~StimStatus, min.count = 5)

# the resulting object of class dreamletProcessedData stores
# normalized data and other information
res.proc

processAssays() retains samples with at least min.cells in a given cell type. While dropping a few samples usually is not a problem, in some cases dropping sames can mean that a variable included in the regression formula no longer has any variation. For example, dropping all stimulated samples from analysis of a given cell type would be mean the variable StimStatus has no variation and is perfectly colinear with the intercept term. This colinearity issue is detected internally and variables with these problem are dropped from the regression formula for that particular cell type. The number of samples retained and the resulting formula used in each cell type can be accessed as follows. In this analysis, samples are dropped from 3 cell types but the original formula remains valid in each case.

# view details of dropping samples
details(res.proc)

Here the mean-variance trend from voom is shown for each cell type. Cell types with sufficient number of cells and reads show a clear mean-variance trend. While in rare cell types like megakaryocytes, fewer genes have sufficient reads and the trend is less apparent.

# show voom plot for each cell clusters
plotVoom(res.proc)

# Show plots for subset of cell clusters
# plotVoom( res.proc[1:3] )

# Show plots for one cell cluster
# plotVoom( res.proc[["B cells"]])

Variance partitioning

The variancePartition package uses linear and linear mixed models to quanify the contribution of multiple sources of expression variation at the gene-level. For each gene it fits a linear (mixed) model and evalutes the fraction of expression variation explained by each variable.

Variance fractions can be visualized at the gene-level for each cell type using a bar plot, or genome-wide using a violin plot.

# run variance partitioning analysis
vp.lst <- fitVarPart(res.proc, ~StimStatus)

# Show variance fractions at the gene-level for each cell type
genes <- vp.lst$gene[2:4]
plotPercentBars(vp.lst[vp.lst$gene %in% genes, ])

# Summarize variance fractions genome-wide for each cell type
plotVarPart(vp.lst, label.angle = 60)

Differential expression

Since the normalized expression data and metadata are stored within res.proc, only the regression formula remains to be specified. Here we only included the stimulus status, but analyses of larger datasets can include covariates and random effects. With formula ~ StimStatus, an intercept is fit and coefficient StimStatusstim log fold change between simulated and controls.

# Differential expression analysis within each assay,
# evaluated on the voom normalized data
res.dl <- dreamlet(res.proc, ~StimStatus)

# names of estimated coefficients
coefNames(res.dl)

# the resulting object of class dreamletResult
# stores results and other information
res.dl

Volcano plots

The volcano plot can indicate the strength of the differential expression signal with each cell type. Red points indicate FDR < 0.05.

plotVolcano(res.dl, coef = "StimStatusstim")

Gene-level heatmap

For each cell type and specified gene, show z-statistic from dreamlet analysis. Grey indicates that insufficient reads were observed to include the gene in the analysis.

genes <- c("ISG20", "ISG15")
plotGeneHeatmap(res.dl, coef = "StimStatusstim", genes = genes)

Extract results

Each entry in res.dl stores a model fit by dream(), and results can be extracted using topTable() as in limma by specifying the coefficient of interest. The results shows the gene name, log fold change, average expression, t-statistic, p-value, FDR (i.e. adj.P.Val).

# results from full analysis
topTable(res.dl, coef = "StimStatusstim")

# only B cells
topTable(res.dl[["B cells"]], coef = "StimStatusstim")

Forest plot

A forest plot shows the log fold change and standard error of a given gene across all cell types. The color indicates the FDR.

plotForest(res.dl, coef = "StimStatusstim", gene = "ISG20")

Box plot

Examine the expression of ISG20 between stimulation conditions within CD14+ Monocytes. Use extractData() to create a tibble with gene expression data and metadata from colData() from one cell type.

# get data
df <- extractData(res.proc, "CD14+ Monocytes", genes = "ISG20")

# set theme
thm <- theme_classic() +
  theme(aspect.ratio = 1, plot.title = element_text(hjust = 0.5))

# make plot
ggplot(df, aes(StimStatus, ISG20)) +
  geom_boxplot() +
  thm +
  ylab(bquote(Expression ~ (log[2] ~ CPM))) +
  ggtitle("ISG20")

Advanced used of contrasts

A hypothesis test of the difference between two or more coefficients can be performed using contrasts. The contrast matrix is evaluated for each cell type in the backend, so only the contrast string must be supplied to dreamlet().

# create a contrasts called 'Diff' that is the difference between expression
# in the stimulated and controls.
# More than one can be specified
contrasts <- c(Diff = "StimStatusstim - StimStatusctrl")

# Evalaute the regression model without an intercept term.
# Instead estimate the mean expression in stimulated, controls and then
# set Diff to the difference between the two
res.dl2 <- dreamlet(res.proc, ~ 0 + StimStatus, contrasts = contrasts)

# see estimated coefficients
coefNames(res.dl2)

# Volcano plot of Diff
plotVolcano(res.dl2[1:2], coef = "Diff")

This new Diff variable can then be used downstream for analysis asking for a coefficient. But note that since there is no intercept term in this model, the meaning of StimStatusstim changes here. When the formula is 0 + StimStatus then StimStatusstim is the mean expression in stimulated samples.

For further information about using contrasts see makeContrastsDream() and vignette.

Gene set analysis

While standard enrichment methods like Fishers exact test, requires specifying a FDR cutoff to identify differentially expressed genes. However, dichotomizing differential expression results is often too simple and ignores the quantitative variation captured by the differential expression test statistics. Here we use zenith, a wrapper for limma::camera, to perform gene set analysis using the full spectrum of differential expression test statistics. zenith/camera is a competetive test that compares the mean test statistic for genes in a given gene set, to genes not in that set while accounting for correlation between genes.

Here, zenith_gsa takes a dreamletResult object, the coefficient of interest, and gene sets as a GeneSetCollection object from GSEABase.

# Load Gene Ontology database
# use gene 'SYMBOL', or 'ENSEMBL' id
# use get_MSigDB() to load MSigDB
# Use Cellular Component (i.e. CC) to reduce run time here
go.gs <- get_GeneOntology("CC", to = "SYMBOL")

# Run zenith gene set analysis on result of dreamlet
res_zenith <- zenith_gsa(res.dl, coef = "StimStatusstim", go.gs)

# examine results for each ell type and gene set
head(res_zenith)

Heatmap of top genesets

# for each cell type select 5 genesets with largest t-statistic
# and 1 geneset with the lowest
# Grey boxes indicate the gene set could not be evaluted because
#    to few genes were represented
plotZenithResults(res_zenith, 5, 1)

All gene sets with FDR < 30%

Here, show all genes with FDR < 5% in any cell type

# get genesets with FDR < 30%
# Few significant genesets because uses Cellular Component (i.e. CC)
gs <- unique(res_zenith$Geneset[res_zenith$FDR < 0.3])

# keep only results of these genesets
df <- res_zenith[res_zenith$Geneset %in% gs, ]

# plot results, but with no limit based on the highest/lowest t-statistic
plotZenithResults(df, Inf, Inf)

Comparing expression in clusters

Identifying genes that are differentially expressed between cell clusters incorporates a paired analysis design, since each individual is observed for each cell cluster.

# test differential expression between B cells and the rest of the cell clusters
ct.pairs <- c("CD4 T cells", "rest")

fit <- dreamletCompareClusters(pb, ct.pairs, method = "fixed")

# The coefficient 'compare' is the value logFC between test and baseline:
# compare = cellClustertest - cellClusterbaseline
df_Bcell <- topTable(fit, coef = "compare")

head(df_Bcell)

Gene-cluster specificity

Evaluate the specificity of each gene for each cluster, retaining only highly expressed genes:

df_cts <- cellTypeSpecificity(pb)

# retain only genes with total CPM summed across cell type > 100
df_cts <- df_cts[df_cts$totalCPM > 100, ]

# Violin plot of specificity score for each cell type
plotViolin(df_cts)

Highlight expression fraction for most specific gene from each cell type:

genes <- rownames(df_cts)[apply(df_cts, 2, which.max)]
plotPercentBars(df_cts, genes = genes)

dreamlet::plotHeatmap(df_cts, genes = genes)

Session Info

sessionInfo()

GabrielHoffman/dreamlet documentation built on April 5, 2025, 5:07 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

GabrielHoffman/dreamlet
Scalable differential expression analysis of single cell transcriptomics datasets with complex study designs

In GabrielHoffman/dreamlet: Scalable differential expression analysis of single cell transcriptomics datasets with complex study designs

Introduction

Installation

Process single cell count data

Preprocess data

Aggregate to pseudobulk

Voom for pseudobulk

Variance partitioning

Differential expression

Volcano plots

Gene-level heatmap

Extract results

Forest plot

Box plot

Advanced used of contrasts

Gene set analysis

Heatmap of top genesets

All gene sets with FDR < 30%

Comparing expression in clusters

Gene-cluster specificity

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

GabrielHoffman/dreamlet Scalable differential expression analysis of single cell transcriptomics datasets with complex study designs

In GabrielHoffman/dreamlet: Scalable differential expression analysis of single cell transcriptomics datasets with complex study designs

Introduction

Installation

Process single cell count data

Preprocess data

Aggregate to pseudobulk

Voom for pseudobulk

Variance partitioning

Differential expression

Volcano plots

Gene-level heatmap

Extract results

Forest plot

Box plot

Advanced used of contrasts

Gene set analysis

Heatmap of top genesets

All gene sets with FDR < 30%

Comparing expression in clusters

Gene-cluster specificity

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

GabrielHoffman/dreamlet
Scalable differential expression analysis of single cell transcriptomics datasets with complex study designs