knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  cache = T
)
set.seed(1234)
library(ComplexHeatmap)
library(circlize)
library(Seurat, warn.conflicts = F)
library(SeuratObject)
library(CellMembrane, warn.conflicts = F)

What is Pseudobulking?

Pseudobulking is a technique used in single cell RNA Sequencing to simplify the data from many single cells into a (pseudo) bulked sample by summing all of the gene expression from each cell in the group. Groupings can be determined several different ways.

Pseudobulking approaches should include groupings closely aligned with your experimental variables (e.g. pseudobulking on Subject, Treatment Status, and Timepoint) and generally not be used downstream of unsupervised clustering for reasons discussed below.

Why should we use Pseudobulking?

In scRNASeq analyses, it is very common to use unsupervised clustering approaches to create clusters of cells for which their gene expression is grossly different. These cell groupings sometimes correspond to distinct cellular phenotypes, but other times clusters capture transient cell states such as division or metabolic shifts. Once these clusters are obtained, one generally runs a variation of Seurat's FindMarkers() to perform a wilcoxon test between two clusters, or a one-versus-all test to compare a single cluster to the rest of the data grouped together. This approach works well for determining the top 10/50/100 genes that define the population, but falls short when one asks "What are the differentially expressed genes?".

Differential gene expression usually considers the "strength" of expression $Log_2(\dfrac{Expression_{experimental}}{Expression_{control}})$ and the p value as computed via a hypothesis test versus a null distribution. Frequently, you will discover that the p values obtained via the clustering and FindMarkers() method will be infinitesimally small. This is due to a phenomenon known as double dipping, wherein the clustering was determined through differences in gene expression, and then the genes were tested for "how different is the gene's expression between clusters" using a null hypothesis that is no longer feasible (e.g. using $Expression_{cluster_1} = Expression_{cluster_2}$ as a null hypothesis when cluster membership maximizes $Expression_{cluster_1} \neq Expression_{cluster_2}$).

Pseudobulking offers a way to statistically determine gene sets that are generally congruent with one's expectations of type I and type II error control.

For more on double dipping and approaches to circumvent it, please read https://arxiv.org/abs/2301.07276 and related works from the Witten lab.

How to Pseudobulk

#Load seurat object. GITHUB_WORKSPACE is set in the GitHub actions build, and if this is not set, try to build relative to this file:
seuratObj <- suppressMessages(suppressWarnings(
    Seurat::UpdateSeuratObject(readRDS(paste0(Sys.getenv('GITHUB_WORKSPACE', unset = '..'), "/tests/testdata/seuratOutput.rds")))
))

#assign some random experimental variables: vaccine groups, experimental time points, and subject Ids. 
seuratObj@meta.data[,"vaccine_cohort"] <- base::rep(c("control", "vaccineOne", "vaccineTwo", "unvax"), length.out = length(colnames(seuratObj)))
  seuratObj@meta.data[,"timepoint"] <- base::rep(c("baseline", "necropsy", "day4"), length.out = length(colnames(seuratObj)))
  seuratObj@meta.data[,"subject"] <- base::sample(c(1,2,3,4), size = 1557, replace = T)

  seuratObj@meta.data$sample_identifier <- paste0("Subject-", seuratObj$subject, '-', seuratObj$vaccine_cohort, "-", seuratObj$timepoint)

#set up pseudobulking
pbulk <- suppressWarnings(PseudobulkSeurat(seuratObj, groupFields = c("vaccine_cohort", "timepoint", "sample_identifier")))

Inspecting the Pseudobulked Object.

The pseudobulked Seurat object has two layers of interest: the counts layer and the pct.expression layer. Within the counts layer, you will find the sum of the gene expression for each gene across all cells from an individual (grouped on sample_identifier). The pct.expression layer is the percentage of cells from that individual that express the gene. For instance, if 1 out of 10 cells from a subject express a gene, the pct.expression will be 0.1.

#inspecting counts. The matrices are sparse, so zeroes are represented with the '.' character.
GetAssayData(pbulk, layer = "counts")[1:3,1:3]

#inspecting pct.expression. This is also sparse, so unexpressed genes in a sample are represented with the '.' character.
GetAssayData(pbulk, layer = "pct.expression")[1:3,1:3]

Differential Expression: What is a 'Contrast'?

Contrasts are comparisons of samples. When one discusses differential expression between two groups, that comparison is performed via a "contrast" between those two groups. To link which groups will be compared to one another, we use a design matrix, which allows us to impose our experimental design onto the samples and group them appropriately to determine gene expression differences between the groups.

design <- DesignModelMatrix(pbulk, contrast_columns = c("vaccine_cohort"), sampleIdCol = "sample_identifier")

ComplexHeatmap::Heatmap(design, 
                        cluster_columns = F, 
                        cluster_rows = F, 
                        show_row_names = T,
                        column_title = "Experimental Groups",
                        row_title = "Individual Sample Labels",
                        col = colorRamp2(c(0, 1), c("cadetblue2", "coral")),
                        rect_gp = gpar(col = "black", lwd = 2),
                        heatmap_legend_param = list(title = "Group Membership", 
                                                    at = c(0,1), 
                                                    color_bar = "discrete", 
                                                    labels = c("Not in Group", "In Group")
                                                    ))

However, you can make more complicated designs that incorporate multiple levels of experimental variables. Fitting a model downstream will require at minimum more than 1 data point per group, but more biological replication is better.

design <- DesignModelMatrix(pbulk, contrast_columns = c("vaccine_cohort", "timepoint"), sampleIdCol = "sample_identifier")



ComplexHeatmap::Heatmap(design, 
                        cluster_columns = F, 
                        cluster_rows = F, 
                        show_row_names = T,
                        column_title = "Experimental Groups",
                        row_title = "Individual Sample Labels",
                        col = colorRamp2(c(0, 1), c("cadetblue2", "coral")),
                        rect_gp = gpar(col = "black", lwd = 2),
                        heatmap_legend_param = list(title = "Group Membership", 
                                                    at = c(0,1), 
                                                    color_bar = "discrete", 
                                                    labels = c("Not in Group", "In Group")
                                                    ))

Running a Contrast: vaccineOne vs vaccineTwo

Now that you have the sample grouping established, we can fit models to our data and compare those to calculate log fold changes and statistical significance. Normally, these models are regressions using some kind of count-based distribution (poisson, quasipoisson, negative binomial) using packages such as DESeq2 or edgeR.

Let's define a contrast using just vaccine information and compare vaccine one and vaccine two:

design <- DesignModelMatrix(pbulk, contrast_columns = c("vaccine_cohort"), sampleIdCol = "sample_identifier")


ComplexHeatmap::Heatmap(design, 
                        cluster_columns = F, 
                        cluster_rows = F, 
                        show_row_names = T,
                        column_title = "Experimental Groups",
                        row_title = "Individual Sample Labels",
                        col = colorRamp2(c(0, 1), c("cadetblue2", "coral")),
                        rect_gp = gpar(col = "black", lwd = 2),
                        heatmap_legend_param = list(title = "Group Membership", 
                                                    at = c(0,1), 
                                                    color_bar = "discrete", 
                                                    labels = c("Not in Group", "In Group")
                                                    ))
#fit a GLM using edgeR
fit <- PerformGlmFit(pbulk, design = design, test.use = "QLF")

# Define Contrast using the notation Group1 - Group2. 
# Genes that are upregulated in group 1 will have a positive log fold change
# while genes upregulated in group 2 will have a negative log fold change.
contrast_name <- "vaccineOne - vaccineTwo"
contrast <-
  limma::makeContrasts(contrasts = contrast_name, levels = colnames(design))

result <- PerformDifferentialExpression(
  fit,
  contrast,
  contrast_name,
  logFC_threshold = 1,
  FDR_threshold = 0.05,
  test.use = "QLF",
  showPlots = T
)

Great, now let us automate it.

In larger studies where you have several experimental variables, the number of possible comparisons grow combinatorially.

#single level design matrix
design <- DesignModelMatrix(pbulk, contrast_columns = c("vaccine_cohort"), sampleIdCol = "sample_identifier")

#6 possible contrasts with a single level design
ncol(combn(x = colnames(design),m = 2))

#two-level design matrix
design <- DesignModelMatrix(pbulk, contrast_columns = c("vaccine_cohort", "timepoint"), sampleIdCol = "sample_identifier")

#66 possible contrasts by including timepoint in our contrasts...
ncol(combn(x = colnames(design),m = 2))

However, you probably don't care about all of the possible comparisons. If your design matrix includes cell type, timepoint, and vaccine, there will be many differentially expressed genes between B cells and Myeloid cells, but not due to timepoint or vaccine cohort. Let's use some logic to specify exactly what we want to measure across all possible pseudobulking contrasts.

Here, I'll use the two-level study design, but I've learned that vaccineTwo recently failed its clinical trial, so I only want to investigate vaccineOne over my time course. However, I also know that the effect of the vaccine could vary over time, so I want to make sure that I only compare the vaccineOne to either unvax or control within a given timepoint.

logicList <- list( list("vaccine_cohort", "xor", "vaccineOne"), 
                   list("vaccine_cohort", "nor", "vaccineTwo"))
#We can easily visualize the list of lists as a dataframe to inspect the logic gates.
do.call("rbind", logicList)

This list of lists sets up our logic gates for filtering contrasts. One of the sides of the contrast needs to be vaccineOne (XOR gate on vaccineOne), and neither of the contrasts can have vaccineTwo in them (NOR gate on vaccineTwo).

To accomplish ensuring that the timepoint variable remains fixed across the contrasts, we can pass 'timepoint' to the 'requireIdenticalFields' argument in the FilterPseudobulkContrasts.

filtered_contrasts <- FilterPseudobulkContrasts(logicList = logicList, 
                                                          design = design, 
                                                          useRequireIdenticalLogic = T, 
                                                          requireIdenticalFields = c("timepoint"), 
                                                          filteredContrastsOutputFile = tempfile())

filtered_contrasts

Now we can see that we have six contrasts (down from 66) that compare vaccineOne to either control or unvax, within a fixed timepoint.

Run Filtered Contrasts

We can use the RunFilteredContrasts function to execute the basic edgeR QLF differential expression function on each of these contrasts. This will print the biological coefficient of variation plot for each subset so you can grossly interpret whether or not the model fitting is suitable.

DE_results <- RunFilteredContrasts(seuratObj = pbulk, 
                       filteredContrastsDataframe = filtered_contrasts, 
                       design = design,
                       test.use = "QLF", 
                       logFC_threshold = 1,
                       FDR_threshold = 0.05,
                       minCountsPerGene = 1, 
                       assayName = "RNA")

Thresholds are for optional plotting in RunFilteredContrasts, but are used for filtering in PseudobulkingBarPlot

Please note that the logFC and FDR thresholds here are for displaying the volcano plots (if desired) only, and they perform no filtering on genes in the DE_results list.

Permissive thresholds (on one contrast):

permissive_contrast <- RunFilteredContrasts(seuratObj = pbulk, 
                       filteredContrastsDataframe = filtered_contrasts[4,], 
                       design = design,
                       test.use = "QLF", 
                       logFC_threshold = 0,
                       FDR_threshold = 0.5,
                       minCountsPerGene = 1, 
                       assayName = "RNA", 
                       showPlots = TRUE)

Stricter thresholds (on one contrast):

strict_contrast <- RunFilteredContrasts(seuratObj = pbulk, 
                       filteredContrastsDataframe = filtered_contrasts[4,], 
                       design = design,
                       test.use = "QLF", 
                       logFC_threshold = 1,
                       FDR_threshold = 1e-30,
                       minCountsPerGene = 1, 
                       assayName = "RNA", 
                       showPlots = TRUE)

The results of these fits are collected in an indexed list.

head(DE_results$`1`)

Pseudobulking Bar Plot

We can visualize the results of RunFilteredContrasts in a bar plot, showing the magnitude of up- and down-regulated genes over our contrasts that address our hypothesis. The colors indicate whether or not the genes are unique (red or dark blue) to their contrast or shared across the contrasts (orange, light blue). Please note, the thresholds are used to determine if the genes are differentially expressed or not within each contrast.

barplot_results <- PseudobulkingBarPlot(DE_results, 
                                        title = "Vaccine One vs Unvaxxed/Control", 
                                        logFC_threshold = 0, 
                                        FDR_threshold = .5)
barplot_results <- PseudobulkingBarPlot(DE_results, 
                                        title = "Vaccine One vs Unvaxxed/Control", 
                                        logFC_threshold = 0.1, 
                                        FDR_threshold = 0.1)

Further annotation of the bar plot requires some custom code, but can be formatted like so using the data frame returned from PseudobulkingBarPlot:

library(dplyr)
library(ggplot2)

#inspect dataframe column names to determine how the fill aesthetic needs to be specified for the geom_tile() calls. 
barplot_results$dataframe %>% colnames()

metadata_tiles <- ggplot(barplot_results$dataframe) + 
  #define a custom scale for timepoint metadata
  scale_fill_manual(values = rev(grDevices::hcl.colors(n_distinct(pbulk$timepoint), palette = "Purples 3")), guide = guide_legend(order = 1))+
  labs(fill = "Timepoint")+
  geom_tile(aes(x = reorder(contrast_name, -abs(total_DEGs)), y = 1, fill = barplot_results$dataframe$negative_contrast_timepoint)) +
  geom_tile(aes(x = reorder(contrast_name, -abs(total_DEGs)), y = 2, fill = barplot_results$dataframe$positive_contrast_timepoint)) +
  ggnewscale::new_scale_fill() +

  #define a custom scale for vaccine_cohort metadata
  scale_fill_manual(values = rev(grDevices::hcl.colors(n_distinct(pbulk$vaccine_cohort), palette = "Inferno")), guide = guide_legend(order = 1))+
  labs(fill = "Vaccine")+
  geom_tile(aes(x = reorder(contrast_name, -abs(total_DEGs)), y = 3, fill = barplot_results$dataframe$negative_contrast_vaccine_cohort)) +
  geom_tile(aes(x = reorder(contrast_name, -abs(total_DEGs)), y = 4, fill = barplot_results$dataframe$positive_contrast_vaccine_cohort)) +
  ggnewscale::new_scale_fill() +

  #match the formatting of the Pseudobulking Bar Plot
  egg::theme_article() +
  ggplot2::theme(axis.text.x = ggplot2::element_blank())  + 
  xlab("Contrast")+ 
  ylab("Metadata Variables")+
  #add some dashed lines to delineate metadata groupings
  geom_hline(yintercept =c(2.5, 4.5), color = "black", linetype = "dashed")+
  geom_vline(xintercept = seq_along(unique(barplot_results$dataframe$contrast_name))+0.5) +
  scale_y_discrete(limits = rev(c("Positive Contrast Vaccine", "Negative Contrast Vaccine","Positive Contrast Timepoint","Negative Contrast Timepoint"))) + facet_wrap(~DEG_Magnitude, scales = "free_x")

metadata_tiles

Then, you can add these plots together using patchwork to annotate the contrasts shown in the Pseudobulking Bar Plot.

library(patchwork)

(barplot_results$barPlot + xlab("")) / (metadata_tiles + xlab("Differential Expression Contrasts") + facet_wrap(~DEG_Magnitude, scales = "free_x")) + plot_layout(guides = 'collect')

Differential Expression Heatmap

Very often, we would like to interpret a group of genes' (module) expression relative to our experimental design. We can visualize gene groupings using a carefully formatted heatmap. For this heatmap, we'd like to show genes that are upregulated in our experimental condition compared to our control.

First, let's reuse the results of FilterPseudobulkingContrasts and apply some differential gene expression cutoffs so we're working with a smaller gene space. You can also use other feature selection algorithms such as CellMembrane's FitRegularizedClassificationGlm() or other machine learning approaches to determine appropriate gene sets.

#bind the list into a dataframe using thresholds to determine the differentially expressed genes
DEGs <- dplyr::bind_rows(DE_results) %>% 
  dplyr::filter(abs(logFC) > 1) %>% 
  dplyr::filter(FDR < 0.05)

#select all of the genes that are differentially expressed in any contrast across the transcriptome
genes <- unique(DEGs$gene)

One can also filter based on the number of cells that express the gene in each contrast. You can use this to bias towards genes that are expressed in a larger proportion of cells in one of the experimental conditions (i.e. "marker genes"), or to ensure that the gene is expressed in many cells, rather than acutely in a few cells. Within the DE_results dataframes, pct.1 refers to the proportion of cells that express the gene in the first group of the contrast (i.e. positive contrast), and pct.2 refers to the proportion of cells that express the gene in the second group of the contrast (i.e. negative contrast).

DEGs_no_rare_cells <- dplyr::bind_rows(DE_results) %>% 
  dplyr::filter(abs(logFC) > 1) %>% 
  dplyr::filter(FDR < 0.05) %>% 
  dplyr::filter(pct.1 > 0.1 & pct.2 > 0.1) # an example of filtering for genes that have at least 10% of cells expressing the gene in each group.

DEGs_more_common_in_positive_contrast <- dplyr::bind_rows(DE_results) %>% 
  dplyr::filter(abs(logFC) > 1) %>% 
  dplyr::filter(FDR < 0.05) %>% 
  dplyr::filter(pct.1 - pct.2 > 0.25) # an example of filtering for genes that have at least 25% more cells expressing the gene in the positive contrast than the negative contrast.

Create the heatmap on just the DEGs

Then, we can make a heatmap on just the differentially expressed genes so we can interpret their expression in relation to our vaccine groups.

heatmap_list <- PseudobulkingDEHeatmap(seuratObj = pbulk, 
                       geneSpace = genes, 
                       contrastField = "vaccine_cohort", 
                       negativeContrastValue = "control", 
                       sampleIdCol = 'sample_identifier'
                       )

We can compare the gene log fold changes over the vaccine cohorts and find gene modules that are consistently upregulated.

heatmap_list$heatmap

DE Heatmap Using Two Variables

Often, you'll have a treatment and time course, where you measure transcriptomics in subjects over time after treatment. We can pass that information as a subgroupingVariable to PseudobulkingDEHeatmap.

heatmap_list <- PseudobulkingDEHeatmap(seuratObj = pbulk, 
                       geneSpace = genes, 
                       contrastField = "vaccine_cohort", 
                       negativeContrastValue = "control", 
                       subgroupingVariable = 'timepoint',
                       sampleIdCol = 'sample_identifier'
                       )

Now, we can interpret the changes within vaccine cohorts and over time!

heatmap_list$heatmap


bimberlabinternal/CellMembrane documentation built on Oct. 16, 2024, 6:53 a.m.