knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE, error = FALSE, tidy = FALSE, dev = c("png"), cache = TRUE )
As the scale of single cell/nucleus RNA-seq has increased, so has the complexity of study designs. Analysis of datasets with simple study designs can be performed using linear model as in the muscat package. Yet analysis of datsets with complex study designs such as repeated measures or many technical batches can benefit from linear mixed model analysis to model to correlation structure between samples. We previously developed dream to apply linear mixed models to bulk RNA-seq data using a limma-style workflow. Dreamlet extends the previous work of dream and muscat to apply linear mixed models to pseudobulk data. Dreamlet also supports linear models and facilitates application of 1) variancePartition to quantify the contribution of multiple variables to expression variation, and 2) zenith to perform gene set analysis on the differential expression signatures.
To install this package, start R (version "4.3") and enter:
if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } # Select release #1 or #2 # 1) Bioconductor release BiocManager::install("dreamlet") # 2) Latest stable release devtools::install_github("DiseaseNeurogenomics/dreamlet")
Here we perform analysis of PBMCs from 8 individuals stimulated with interferon-β (Kang, et al, 2018, Nature Biotech). This is a small dataset that does not have repeated measures or high dimensional batch effects, so the sophisticated features of dreamlet are not strictly necessary. But this gives us an opportunity to walk through a standard dreamlet workflow.
Here, single cell RNA-seq data is downloaded from ExperimentHub.
library(dreamlet) library(muscat) library(ExperimentHub) library(zenith) library(scater) # Download data, specifying EH2259 for the Kang, et al study eh <- ExperimentHub() sce <- eh[["EH2259"]] sce$ind = as.character(sce$ind) # only keep singlet cells with sufficient reads sce <- sce[rowSums(counts(sce) > 0) > 0, ] sce <- sce[, colData(sce)$multiplets == "singlet"] # compute QC metrics qc <- perCellQCMetrics(sce) # remove cells with few or many detected genes ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE) sce <- sce[, !ol] # set variable indicating stimulated (stim) or control (ctrl) sce$StimStatus <- sce$stim
Dreamlet, like muscat, performs analysis at the pseudobulk-level by summing raw counts across cells for a given sample and cell type. aggregateToPseudoBulk
is substantially faster for large on-disk datasets than muscat::aggregateData
.
# Since 'ind' is the individual and 'StimStatus' is the stimulus status, # create unique identifier for each sample sce$id <- paste0(sce$StimStatus, sce$ind) # Create pseudobulk data by specifying cluster_id and sample_id # Count data for each cell type is then stored in the `assay` field # assay: entry in assayNames(sce) storing raw counts # cluster_id: variable in colData(sce) indicating cell clusters # sample_id: variable in colData(sce) indicating sample id for aggregating cells pb <- aggregateToPseudoBulk(sce, assay = "counts", cluster_id = "cell", sample_id = "id", verbose = FALSE ) # one 'assay' per cell type assayNames(pb)
Apply voom-style normalization for pseudobulk counts within each cell cluster using voomWithDreamWeights to handle random effects (if specified).
# Normalize and apply voom/voomWithDreamWeights res.proc <- processAssays(pb, ~StimStatus, min.count = 5) # the resulting object of class dreamletProcessedData stores # normalized data and other information res.proc
processAssays()
retains samples with at least min.cells
in a given cell type. While dropping a few samples usually is not a problem, in some cases dropping sames can mean that a variable included in the regression formula no longer has any variation. For example, dropping all stimulated samples from analysis of a given cell type would be mean the variable StimStatus
has no variation and is perfectly colinear with the intercept term. This colinearity issue is detected internally and variables with these problem are dropped from the regression formula for that particular cell type. The number of samples retained and the resulting formula used in each cell type can be accessed as follows. In this analysis, samples are dropped from 3 cell types but the original formula remains valid in each case.
# view details of dropping samples details(res.proc)
Here the mean-variance trend from voom is shown for each cell type. Cell types with sufficient number of cells and reads show a clear mean-variance trend. While in rare cell types like megakaryocytes, fewer genes have sufficient reads and the trend is less apparent.
# show voom plot for each cell clusters plotVoom(res.proc) # Show plots for subset of cell clusters # plotVoom( res.proc[1:3] ) # Show plots for one cell cluster # plotVoom( res.proc[["B cells"]])
The variancePartition package uses linear and linear mixed models to quanify the contribution of multiple sources of expression variation at the gene-level. For each gene it fits a linear (mixed) model and evalutes the fraction of expression variation explained by each variable.
Variance fractions can be visualized at the gene-level for each cell type using a bar plot, or genome-wide using a violin plot.
# run variance partitioning analysis vp.lst <- fitVarPart(res.proc, ~StimStatus)
# Show variance fractions at the gene-level for each cell type genes <- vp.lst$gene[2:4] plotPercentBars(vp.lst[vp.lst$gene %in% genes, ])
# Summarize variance fractions genome-wide for each cell type plotVarPart(vp.lst, label.angle = 60)
Since the normalized expression data and metadata are stored within res.proc
, only the regression formula remains to be specified. Here we only included the stimulus status, but analyses of larger datasets can include covariates and random effects. With formula ~ StimStatus
, an intercept is fit and coefficient StimStatusstim
log fold change between simulated and controls.
# Differential expression analysis within each assay, # evaluated on the voom normalized data res.dl <- dreamlet(res.proc, ~StimStatus) # names of estimated coefficients coefNames(res.dl) # the resulting object of class dreamletResult # stores results and other information res.dl
The volcano plot can indicate the strength of the differential expression signal with each cell type. Red points indicate FDR < 0.05.
plotVolcano(res.dl, coef = "StimStatusstim")
For each cell type and specified gene, show z-statistic from dreamlet
analysis. Grey indicates that insufficient reads were observed to include the gene in the analysis.
genes <- c("ISG20", "ISG15") plotGeneHeatmap(res.dl, coef = "StimStatusstim", genes = genes)
Each entry in res.dl
stores a model fit by dream()
, and results can be extracted using topTable()
as in limma
by specifying the coefficient of interest. The results shows the gene name, log fold change, average expression, t-statistic, p-value, FDR (i.e. adj.P.Val).
# results from full analysis topTable(res.dl, coef = "StimStatusstim") # only B cells topTable(res.dl[["B cells"]], coef = "StimStatusstim")
A forest plot shows the log fold change and standard error of a given gene across all cell types. The color indicates the FDR.
plotForest(res.dl, coef = "StimStatusstim", gene = "ISG20")
Examine the expression of ISG20 between stimulation conditions within CD14+ Monocytes. Use extractData()
to create a tibble
with gene expression data and metadata from colData()
from one cell type.
# get data df <- extractData(res.proc, "CD14+ Monocytes", genes = "ISG20") # set theme thm <- theme_classic() + theme(aspect.ratio = 1, plot.title = element_text(hjust = 0.5)) # make plot ggplot(df, aes(StimStatus, ISG20)) + geom_boxplot() + thm + ylab(bquote(Expression ~ (log[2] ~ CPM))) + ggtitle("ISG20")
A hypothesis test of the difference between two or more coefficients can be performed using contrasts. The contrast matrix is evaluated for each cell type in the backend, so only the contrast string must be supplied to dreamlet()
.
# create a contrasts called 'Diff' that is the difference between expression # in the stimulated and controls. # More than one can be specified contrasts <- c(Diff = "StimStatusstim - StimStatusctrl") # Evalaute the regression model without an intercept term. # Instead estimate the mean expression in stimulated, controls and then # set Diff to the difference between the two res.dl2 <- dreamlet(res.proc, ~ 0 + StimStatus, contrasts = contrasts) # see estimated coefficients coefNames(res.dl2) # Volcano plot of Diff plotVolcano(res.dl2[1:2], coef = "Diff")
This new Diff
variable can then be used downstream for analysis asking for a coefficient. But note that since there is no intercept term in this model, the meaning of StimStatusstim
changes here. When the formula is 0 + StimStatus
then StimStatusstim
is the mean expression in stimulated samples.
For further information about using contrasts see makeContrastsDream() and vignette.
While standard enrichment methods like Fishers exact test, requires specifying a FDR cutoff to identify differentially expressed genes. However, dichotomizing differential expression results is often too simple and ignores the quantitative variation captured by the differential expression test statistics. Here we use zenith
, a wrapper for limma::camera
, to perform gene set analysis using the full spectrum of differential expression test statistics. zenith/camera
is a competetive test that compares the mean test statistic for genes in a given gene set, to genes not in that set while accounting for correlation between genes.
Here, zenith_gsa
takes a dreamletResult
object, the coefficient of interest, and gene sets as a GeneSetCollection
object from GSEABase.
# Load Gene Ontology database # use gene 'SYMBOL', or 'ENSEMBL' id # use get_MSigDB() to load MSigDB # Use Cellular Component (i.e. CC) to reduce run time here go.gs <- get_GeneOntology("CC", to = "SYMBOL") # Run zenith gene set analysis on result of dreamlet res_zenith <- zenith_gsa(res.dl, coef = "StimStatusstim", go.gs) # examine results for each ell type and gene set head(res_zenith)
# for each cell type select 5 genesets with largest t-statistic # and 1 geneset with the lowest # Grey boxes indicate the gene set could not be evaluted because # to few genes were represented plotZenithResults(res_zenith, 5, 1)
Here, show all genes with FDR < 5% in any cell type
# get genesets with FDR < 30% # Few significant genesets because uses Cellular Component (i.e. CC) gs <- unique(res_zenith$Geneset[res_zenith$FDR < 0.3]) # keep only results of these genesets df <- res_zenith[res_zenith$Geneset %in% gs, ] # plot results, but with no limit based on the highest/lowest t-statistic plotZenithResults(df, Inf, Inf)
Identifying genes that are differentially expressed between cell clusters incorporates a paired analysis design, since each individual is observed for each cell cluster.
# test differential expression between B cells and the rest of the cell clusters ct.pairs <- c("CD4 T cells", "rest") fit <- dreamletCompareClusters(pb, ct.pairs, method = "fixed") # The coefficient 'compare' is the value logFC between test and baseline: # compare = cellClustertest - cellClusterbaseline df_Bcell <- topTable(fit, coef = "compare") head(df_Bcell)
Evaluate the specificity of each gene for each cluster, retaining only highly expressed genes:
df_cts <- cellTypeSpecificity(pb) # retain only genes with total CPM summed across cell type > 100 df_cts <- df_cts[df_cts$totalCPM > 100, ] # Violin plot of specificity score for each cell type plotViolin(df_cts)
Highlight expression fraction for most specific gene from each cell type:
genes <- rownames(df_cts)[apply(df_cts, 2, which.max)] plotPercentBars(df_cts, genes = genes) dreamlet::plotHeatmap(df_cts, genes = genes)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.