Introduction

Celda is a Bayesian hierarchical model that can perform bi-clustering of features into modules and observations into subpopulations. In this tutorial, we will apply Celda to a real-world single-cell RNA sequencing (scRNA-seq) dataset of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor. This dataset (PBMC3K) is available from 10X Genomics and can be found on the 10X website.

The celda package uses the SingleCellExperiment (SCE) object for management of expression matrices, feature/cell annotation data, and metadata. All of the functions have an SCE object as the first input parameter. The functions operate on a matrix stored in the assay slot of the SCE object. The parameter useAssay can be used to specify which matrix to use (the default is "counts"). Matrices can be of class matrix or dgCMatrix from the Matrix package. While the primary clustering is performed with functions from the celda package, the singleCellTK package is used for some other tasks such as importing data, quality control, and marker identification with differential expression.
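To make these conventions concrete, below is a minimal sketch (using a made-up toy matrix rather than the PBMC3K data, so the object names here are hypothetical) of an SCE carrying a "counts" assay of the kind that celda functions expect:

library(SingleCellExperiment)

# Hypothetical toy example: a 100-gene x 50-cell counts matrix (a dense base
# matrix; a dgCMatrix from the Matrix package would also work)
toyCounts <- matrix(rpois(100 * 50, lambda = 1), nrow = 100,
                    dimnames = list(paste0("gene", 1:100), paste0("cell", 1:50)))
toySce <- SingleCellExperiment(assays = list(counts = toyCounts))

# celda functions take the SCE as the first argument and read the matrix
# named by `useAssay` from the assay slot
assayNames(toySce)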

Importing data

The PBMC3K data can be easily loaded via the Bioconductor package TENxPBMCData. TENxPBMCData is an experiment package that provides resources for various PBMC datasets generated by 10X Genomics. When using this package, the column names of the returned SCE object are NULL by default. For this example, we paste together the name of the sample with the cell barcode to generate column names for the SCE object. Additionally, the count matrix within the SCE object is converted from a DelayedMatrix object to a sparse dgCMatrix object.

library(TENxPBMCData)
sce <- TENxPBMCData("pbmc3k")
colnames(sce) <- paste0("pbmc3k_", colData(sce)$Sequence)
counts(sce) <- as(counts(sce), "dgCMatrix")

If you have the singleCellTK package installed, then this dataset can be imported and converted with a single command:

library(singleCellTK)
sce <- importExampleData("pbmc3k")

To get your own data into a SingleCellExperiment object, the singleCellTK package has several importing functions for different preprocessing tools including CellRanger, STARsolo, BUStools, Optimus, DropEST, SEQC, and Alevin/Salmon. For example, the following code can be used as a template to read in multiple samples processed with CellRanger:

library(singleCellTK)
sce <- importCellRanger(sampleDirs = c("path/to/sample1/", "path/to/sample2/"))

Note: As a reminder, you can view the assays, column annotation, and row annotation stored in the SCE with the commands assays(sce), colData(sce), and rowData(sce), respectively.
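For example, running these accessors on the object imported above shows what is currently stored:

# Inspect the contents of the SCE
assays(sce)    # available expression matrices
colData(sce)   # per-cell annotations
rowData(sce)   # per-feature annotations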

Finally, we set the rownames of the SCE to the gene symbol:

rownames(sce) <- rowData(sce)$Symbol_TENx

Quality Control

Quality control and filtering of cells is often needed before downstream analyses such as dimensionality reduction and clustering. Typical filtering procedures include exclusion of poor quality cells with low numbers of counts/UMIs, estimation and removal of ambient RNA, and identification of potential doublets/multiplets. Many tools and packages are available to perform these operations, and users are free to apply their tool(s) of choice as the celda clustering functions will work with any matrix stored in an SCE object. The celda package does contain a Bayesian method called decontX to estimate and remove transcript contamination in individual cells in a scRNA-seq dataset.

To perform QC, we suggest using the runCellQC function in the singleCellTK package. This is a wrapper for several methods for calculation of QC metrics, doublet detection, and estimation of ambient RNA (including decontX). Below is a quick example of how to perform standard QC before applying celda. If you have another preferred approach or your data has already been QC'ed, you can move to the Feature selection section. For this tutorial, we will only run one doublet detection algorithm and one decontamination algorithm. For a full list of algorithms that this function runs by default, see ?runCellQC. We will also quantify the percentage of counts mapping to mitochondrial genes in each cell, as this is often used as a measure of cell viability.

library(singleCellTK)

# Get list of mitochondrial genes
mito.genes <- grep("^MT-", rownames(sce), value = TRUE)

# Run QC
sce <- runCellQC(sce, sample = NULL, algorithms = c("QCMetrics", "scDblFinder", "decontX"), geneSetList = list(mito=mito.genes), geneSetListLocation = "rownames")

Note: If you have cells from multiple samples stored in the SCE object, make sure to supply the sample parameter as the QC tools need to be applied to cells from each sample individually.
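For example, if cells from several samples had been combined into one SCE, a call along these lines would apply the QC tools per sample; the colData column name "sample" used here is hypothetical, so substitute whatever column stores the sample of origin in your object:

# Hypothetical multi-sample QC call; "sample" is a placeholder colData column
sce <- runCellQC(sce, sample = colData(sce)$sample,
                 algorithms = c("QCMetrics", "scDblFinder", "decontX"),
                 geneSetList = list(mito = mito.genes),
                 geneSetListLocation = "rownames")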

Individual sets of QC metrics can be plotted with specific functions. For example, to plot distributions of total numbers of UMIs derived from runPerCellQC, doublet scores from runScDblFinder, and contamination scores from runDecontX (all of which were run by the runCellQC function), the following plotting functions can be used:

plotRunPerCellQCResults(sce)
plotScDblFinderResults(sce, reducedDimName = "decontX_UMAP")
plotDecontXResults(sce, reducedDimName = "decontX_UMAP")

A comprehensive HTML report can be generated to visualize and explore the QC metrics in greater detail:

reportCellQC(sce)

After examining the distributions of various QC metrics, poor quality cells will need to be removed. Typically, thresholds for QC metrics should exclude cells that are outliers of the distribution (i.e. long tails in the violin or density plots). Cells can be removed using the subsetSCECols function. Metrics stored in the colData of the SCE object can be filtered using the colData parameter. Here we will limit to cells with at least 600 counts and 300 genes detected:

# Filter SCE
sce <- subsetSCECols(sce, colData = c("total > 600", "detected > 300"))

# See number of cells after filtering
ncol(sce)

Other common metrics to filter on include subsets_mito_percent for removal of cells with a high mitochondrial percentage, decontX_contamination for removal of cells with higher levels of contamination from ambient RNA, and scDblFinder_class to remove doublets (or calls from any of the other doublet detection algorithms). See the singleCellTK documentation for more information on performing comprehensive QC and filtering.
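As a sketch, such filters can be appended to the subsetSCECols call; the thresholds below are purely illustrative and should be chosen only after reviewing the QC plots for your own data:

# Illustrative thresholds only; check colnames(colData(sce)) for the exact
# column names produced by your QC run
sce <- subsetSCECols(sce, colData = c("subsets_mito_percent < 20",
                                      "decontX_contamination < 0.5"))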

Feature selection

In general, removing features with low numbers of counts across all cells is recommended to reduce computational run time. A simple selection can be performed by keeping only features with at least a minimum number of counts in a minimum number of cells using the selectFeatures function:

# Select features with at least 3 counts in at least 3 cells
library(celda)
useAssay <- "counts"
altExpName <- "featureSubset"
sce <- selectFeatures(sce, minCount = 3, minCell = 3, useAssay = useAssay, altExpName = altExpName)

# See number of features after filtering
nrow(altExp(sce, altExpName))

The useAssay parameter is used to denote which assay/matrix within the SCE to use for filtering. The default raw counts matrix is traditionally stored in the "counts" assay. If decontX was previously run during QC, then the decontaminated counts can be used by setting this parameter to "decontXcounts". We will save this parameter in a variable called useAssay which will be used as input in several downstream functions.

Note: The subsetted matrix is stored in the "alternative experiment" slot (altExp) within the SCE. This allows for a matrix with a different number of rows to be stored within the same SCE object (rather than creating two SCE objects). The celda functions described in the next several sections operate on a matrix stored in the altExp slot. The default name given to the alternative experiment and used in all downstream celda functions is "featureSubset". If the altExpName parameter is changed here, then it will need to be supplied to downstream plotting functions as well. The list of alternative experiments in an SCE can be viewed with altExpNames(sce). If you already have an SCE with selected features or do not want to perform feature selection, then you need to set the alternative experiment directly with a command like altExp(sce, "featureSubset") <- assay(sce, "counts"). In the future, this will be simplified by utilizing the ExperimentSubset package.
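For instance, the accessors mentioned in the note can be run directly to check which alternative experiments exist and the size of the one celda will use:

# List alternative experiments and check the dimensions of the feature subset
altExpNames(sce)
dim(altExp(sce, altExpName))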

If the number of features is still relatively large (e.g. >5000), an alternative approach is to select highly variable features that can be used in the downstream clustering. The advantage of this approach is that it can greatly speed up celda and can improve module detection among highly variable features with overall lower expression. The disadvantage is that features that do not fall into the highly variable group will not be clustered into modules. The celda package does not include methods for selection of highly variable genes (HVGs). However, the singleCellTK package provides wrappers for methods used in Seurat and Scran. We recommend keeping at least 2,000-5,000 HVGs for clustering. Here is some example code for selecting the top 5,000 most variable genes and storing them back in the SCE as an altExp:

library(singleCellTK)
sce <- seuratFindHVG(sce, useAssay = useAssay, hvgMethod = "vst")
g <- getTopHVG(sce, method = "vst", n = 5000)
altExp(sce, altExpName) <- sce[g, ]

For the rest of the analysis with the PBMC3K data, we will use the first approach where features with at least 3 counts in 3 cells were included.

Analysis with Celda

Bi-clustering with known numbers of clusters

As mentioned earlier, celda is a discrete Bayesian model that is able to simultaneously bi-cluster features into modules and cells into cell clusters. The primary bi-clustering model can be accessed with the function celda_CG. This function operates on a matrix stored as an alternative experiment in the altExp slot. If you did not perform feature selection as recommended in the previous section and your matrix of interest is not currently located in an altExp slot, the following code can be used to copy a matrix in the main assay slot to the altExp slot:

useAssay <- "counts"
altExpName <- "featureSubset"
altExp(sce, altExpName) <- assay(sce, useAssay)

The two major adjustable parameters in this model are L, the number of modules, and K, the number of cell populations. The following code bi-clusters the PBMC3K dataset into 100 modules and 15 cell populations:

sce <- celda_CG(sce, L = 100, K = 15, useAssay = useAssay, altExpName = altExpName)

However, in most cases, the number of feature modules (L) and the number of cell clusters (K) are not known beforehand. In the next sections, we outline procedures that can be used to suggest reasonable choices for these parameters. If the data is clustered with the code above by supplying K and L directly to the celda_CG function, then you can skip the next section and proceed to Creating 2-D embeddings.

Finding the number of modules

To help choose reasonable values for L and K, celda provides step-wise splitting procedures along with measurements of perplexity. First, the function recursiveSplitModule can be used to cluster features into modules for a range of L. Within each step, the best split of an existing module into 2 new modules is chosen to create the L-th module. The module labels of the previous model with L - 1 modules are used as the initial starting values in the next model with L modules. Note that the initialization step may take longer with larger numbers of cells in the dataset, and the splitting procedure will take longer with larger numbers of features in the dataset. Celda models with an L range between initialL = 10 and maxL = 150 are tested in the example below.

moduleSplit <- recursiveSplitModule(sce, useAssay = useAssay, altExpName = altExpName, initialL = 10, maxL = 150)

Perplexity has been commonly used in topic models to measure how well a probabilistic model predicts observed samples (Blei et al., 2003). Here, we use perplexity to evaluate the performance of individual models by calculating the probability of observing the expression counts given an estimated Celda model. Rather than performing cross-validation, which is computationally expensive, a series of test sets are created by sampling the counts from each cell according to a multinomial distribution defined by dividing the counts for each gene in the cell by the total number of counts for that cell. Perplexity is then calculated on each test set and can be visualized using the function plotGridSearchPerplexity. A lower perplexity indicates a better model fit.

plotGridSearchPerplexity(moduleSplit, altExpName = altExpName, sep = 10)

The perplexity alone often does not show a clear elbow or "leveling off". However, the rate of perplexity change (RPC) can be more informative for determining when adding new modules does not add much additional information (Zhao et al., 2015). An RPC closer to zero indicates that the addition of new modules or cell clusters is not substantially decreasing the perplexity. The RPC of the models can be visualized using the function plotRPC:

plotRPC(moduleSplit, altExpName = altExpName)

In this case, we will choose an L of 80 as the RPC curve tends to level off at this point:

L <- 80

Note: Perplexity and RPC are meant to be guides to give a sense of a possible starting point for L. However, they may not always give a clear "leveling off" depending on the complexity and quality of the dataset. Do not give up if the choice of L is unclear or imperfect! If the choice of L is unclear from these plots, then you can set a somewhat high number (e.g. 75) and move to the next step of selecting K. Later on, manual review of modules using functions such as moduleHeatmap can give a sense of whether individual modules should be further split up by selecting a higher L. For example, you can start exploring the cell populations and modules with L = 75. If some modules need to be further split, you can then try L = 100, L = 125, and so on.
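As a sketch of that iterative workflow (assuming, as hedged here, that moduleHeatmap accepts the module-only models returned by recursiveSplitModule), a trial value of L can be pulled from the grid and a few modules reviewed before committing:

# Sketch: extract a trial L from the module-split grid and review a few modules
trialModel <- subsetCeldaList(moduleSplit, list(L = 75))
moduleHeatmap(trialModel, featureModule = 1:4, useAssay = useAssay, altExpName = altExpName)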

Finding the number of cell subpopulations

Now we extract the Celda model with L = 80 using the function subsetCeldaList and run recursiveSplitCell to fit models with a range of K between 3 and 25:

temp <- subsetCeldaList(moduleSplit, list(L = L))
sce <- recursiveSplitCell(sce, useAssay = useAssay, altExpName = altExpName, initialK = 3, maxK = 25, yInit = celdaModules(temp))

The perplexities and RPC of models can be visualized using the same functions plotGridSearchPerplexity and plotRPC.

plotGridSearchPerplexity(sce)
plotRPC(sce, altExpName = altExpName)

The perplexity continues to decrease with larger values of K. The RPC generally levels off between 13 and 16, and we choose the model with K = 14 for downstream analysis. The following code selects the final celda_CG model with L = 80 and K = 14:

K <- 14
sce <- subsetCeldaList(sce, list(L = L, K = K))

Note: Similar to choosing L, you can guess an initial value of K based off of the perplexity and RPC plots and then move to the downstream exploratory analyses described in the next several sections. After reviewing the cell clusters on 2-D embeddings and module heatmaps, you may have to come back and tweak the choice of K until you have something that captures the cellular heterogeneity within the data without "over-clustering" cells into too many subpopulations. This may be an iterative procedure of going back and forth between choices of K and plotting the results, so do not let imperfect perplexity/RPC plots prevent you from moving on to the rest of the analysis. Oftentimes, using an initial guess for K will allow you to move on in the analysis to get a sense of the major sources of biological heterogeneity present in the data.
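One practical tip, sketched below under the assumption that the full grid of models returned by recursiveSplitCell was saved in its own object (hypothetically named cellSplitGrid) before the subsetCeldaList call above: an alternative K can then be extracted at any time without refitting.

# Hypothetical: cellSplitGrid holds the full output of recursiveSplitCell,
# saved before the subsetting step, so other K values can be pulled out quickly
sceK16 <- subsetCeldaList(cellSplitGrid, list(L = L, K = 16))
table(celdaClusters(sceK16))   # number of cells per cluster under the alternative K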

Exploring cell populations

Creating 2-D embeddings

After selecting a celda model with specific values of L and K, we can then perform additional exploratory and downstream analyses to understand the biology of the transcriptional modules and cell populations. We can start by generating a dimension reduction plot with the Uniform Manifold Approximation and Projection (UMAP) method to visualize the relationships between the cells in a 2-D embedding. This can be done with the function celdaUmap.

sce <- celdaUmap(sce, useAssay = useAssay, altExpName = altExpName)

Alternatively, a t-distributed stochastic neighbor embedding (t-SNE) can be generated using the function celdaTsne. The UMAP and t-SNE plots generated by celdaUmap and celdaTsne are computed based on the module probabilities (analogous to using PCs from PCA). The calculated dimension reduction coordinates for the cells are stored under the reducedDim slot of the altExp slot in the original SCE object. The following command lists the names of the dimensionality reductions that can be used in downstream plotting functions in the next few sections:

reducedDimNames(altExp(sce, altExpName))
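If you prefer a t-SNE embedding, the analogous celdaTsne call is shown below; like celdaUmap, it is computed from the module probabilities:

# Optional: t-SNE embedding computed from the celda module probabilities
sce <- celdaTsne(sce, useAssay = useAssay, altExpName = altExpName)
reducedDimNames(altExp(sce, altExpName))   # the t-SNE embedding is added here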

Plotting cell population cluster labels

The function plotDimReduceCluster can be used to plot the cluster labels for cell populations identified by celda on the UMAP:

plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", labelClusters = TRUE)

Plotting expression of specific features

Usually, biological features of some cell populations are known a priori and can be identified with known marker genes. The expression of selected marker genes can be plotted on the UMAP with the function plotDimReduceFeature.

markers <- c("CD3D", "IL7R", "CD4", "CD8B", "CD19", "FCGR3A", "CD14", "FCER1A", "PF4")

plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = useAssay, altExpName = altExpName, normalize = TRUE)

The parameter displayName can be used to switch between IDs stored in the rownames of the SCE and columns of the rowData of the SCE. If the assay denoted by useAssay is a raw counts matrix, then setting normalize = TRUE is recommended (otherwise the z-score of the raw counts will be plotted). When set to TRUE, each count will be normalized by dividing by the total number of counts in each cell. An alternative approach is to perform normalization with another method and then point to the normalized assay with the useAssay parameter. For example, normalization can be performed with the scater package:

library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")
plotDimReduceFeature(x = sce, features = markers, reducedDimName = "celda_UMAP", useAssay = "logcounts", altExpName = altExpName, normalize = FALSE)

This second approach may be faster if plotting a lot of marker genes or if the dataset is relatively large.

Plotting cell subpopulations with labels

Once we have identified the various cell subpopulations using known marker genes, these custom labels can be added to the UMAP colored by cluster:

g <- plotDimReduceCluster(sce, reducedDimName = "celda_UMAP", altExpName = altExpName, labelClusters = TRUE)

labels <- c("1: Megakaryocytes",
    "2: CD14+ Monocytes 1",
    "3: CD14+ Monocytes 2",
    "4: FCGR3A (CD16+) Monocytes",
    "5: CD14+ Monocytes 3",
    "6: CD8+ Cytotoxic T-cells",
    "7: CD4+ T-cells",
    "8: CD8+ Cytotoxic T-cells",
    "9: B-cells",
    "10: Naive CD8+ T-cells",
    "11: Naive CD4+ T-cells",
    "12: NK-cells",
    "13: Unknown T-cells",
    "14: Dendritic cells")

library(ggplot2)
g <- g + scale_color_manual(labels = labels,
    values = distinctColors(length(labels)))
print(g)

Exploring relationships between modules and cell populations

Celda has the ability to identify modules of co-expressed features and quantify the probability of these modules in each cell population. An overview of the relationships between modules and cell subpopulations can be explored with the function celdaProbabilityMap. The "Absolute probability" heatmap on the left shows the proportion of counts in each module for each cell population; it gives insights into the absolute abundance of a module within a given cell subpopulation and can be used to explore which modules are higher than other modules within a cell population. The "Relative expression" map shows the standardized (z-scored) module probabilities across cell subpopulations and can be used to explore which modules are relatively higher than other modules across cell populations.

celdaProbabilityMap(sce, useAssay = useAssay, altExpName = altExpName)

In this plot, we can see a variety of patterns. Modules 15 - 20 are highly expressed across most cell populations, indicating that they may contain housekeeping genes (e.g. ribosomal genes). Other modules are specific to a cell population or groups of cell populations. For example, module 35 is only active in population 1, while module 70 is expressed across populations 2 and 3 and to some degree in population 5. The unknown T-cell population 13 has highly specific levels of module 30. In the next section, we look at the genes in these modules to gain insights into the biological properties of each of these cell populations.

Exploring feature modules

The primary advantage of celda over other tools is that it can cluster features that are co-expressed across cells into modules. These modules are often more biologically coherent than features correlated with principal components from PCA. Below are several ways in which modules can be explored and visualized.

Table of features in each module

The function featureModuleTable can be used to get the names of all features in each module into a data.frame.

# Save to a data.frame
ta <- featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName)
dim(ta)
head(ta[,"L70"])

The parameter displayName can be used to switch between IDs stored in the rownames of the SCE and columns of the rowData of the SCE. If the outputFile parameter is set, the table will be saved to a tab-delimited text file instead of to a data.frame:

# Save to file called "modules.txt"
featureModuleTable(sce, useAssay = useAssay, altExpName = altExpName, outputFile = "modules.txt")

The modules for this model are shown below:

library(knitr)
library(kableExtra)
table <- featureModuleTable(sce, useAssay = "counts", altExpName = "featureSubset")
kb <- kable(table, style = 'html', row.names = FALSE)
kb <- kable_styling(kb, bootstrap_options = c("striped", "condensed"))
kb <- scroll_box(kb, width = "100%", height = "550px")
kb

Module lookup

If you want to quickly find which module a particular feature was assigned to, the featureModuleLookup function can be used. Here we will look up the T-cell marker gene CD3E along with S100A8:

mod <- featureModuleLookup(sce, feature = c("CD3E", "S100A8"))
mod

Module heatmaps

The function moduleHeatmap can be used to view the expression of features across cells for a specific module. The featureModule parameter denotes the module(s) to be displayed. Cells are ordered from those with the lowest probability of the module on the left to the highest probability on the right. Similarly, features are ordered from those with the highest probability within the module on the top to the lowest probability on the bottom.

moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName)

The parameter topCells can be used to control the number of cells included in the heatmap. By default (topCells = 100), only the 100 cells with the lowest probabilities and the 100 cells with the highest probabilities for each selected module are included. To display all cells, this parameter can be set to NULL:

moduleHeatmap(sce, featureModule = 27, topCells = NULL, useAssay = useAssay, altExpName = altExpName)

Note: Multiple modules can be displayed by giving a vector of module indices to the parameter featureModule. If featureModule is not specified, then all modules will be plotted.
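For example, modules 27 and 30 can be displayed together in a single call:

# Display two modules in one heatmap
moduleHeatmap(sce, featureModule = c(27, 30), useAssay = useAssay, altExpName = altExpName)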

Module probabilities on 2-D embeddings

The function plotDimReduceModule can be used to visualize the probabilities of a particular module or set of modules on a reduced dimensional plot such as a UMAP. This can be another quick method to see how modules are expressed across various cells in 2-D space. As an example, we can look at module 70, which contained S100A8:

plotDimReduceModule(sce, modules = 70, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

Similarly, multiple modules can be plotted in a grid of UMAPs:

plotDimReduceModule(sce, modules = 70:78, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

In this grid, we can see that module 70 (which has high levels of S100A8 and S100A9) is highly expressed in cell populations 2 and 3, module 71 (which contains CD14) can be used to identify all CD14+ monocytes, module 72 (which contains CST3) is expressed across both CD14 and FCGR3A (CD16) expressing monocytes, and module 73 (which contains CD4) is expressed broadly across both monocytes and dendritic cells as well as some T-cell populations. If we were interested in defining transcriptional programs active across all monocytes, we could examine the genes found in module 72. If we were interested in defining transcriptional programs for all CD14+ monocytes, we could examine the genes in module 71. These patterns can also be observed in the celda probability map.

In the celda probability map, we saw that the unknown T-cell population 13 had high levels of module 30. We can examine both module heatmaps and module probability maps to further explore this:

moduleHeatmap(sce, featureModule = 30, useAssay = useAssay, altExpName = altExpName)

plotDimReduceModule(sce, modules = 30, useAssay = useAssay, altExpName = altExpName, reducedDimName = "celda_UMAP")

Module 30 has high levels of genes associated with proliferation including HMGA1, STMN1, PCNA, HMGB2, and TUBA1B. We can therefore re-label these cells as "Proliferating T-cells".

Identification and plotting of marker genes

In addition to examining modules, differential expression can be used to identify potential marker genes up-regulated in specific cell populations. The function findMarkerDiffExp in the singleCellTK package will find markers up-regulated in each cell population compared to all the others.

Differential expression to identify marker genes

# Normalize counts (if not performed previously)
library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")

# Run differential expression analysis
sce <- findMarkerDiffExp(sce, useAssay = "logcounts", method = "wilcox", cluster = celdaClusters(sce), minMeanExpr = 0, fdrThreshold = 0.05, log2fcThreshold = 0, minClustExprPerc = 0, maxCtrlExprPerc = 1)

The function plotMarkerDiffExp can be used to plot the results in a heatmap. The topN parameter will plot the top N ranked genes for each cluster.

# Plot differentially expressed genes that pass additional thresholds 'minClustExprPerc' and 'maxCtrlExprPerc'
plotMarkerDiffExp(sce, topN = 5, log2fcThreshold = 0, rowLabel = TRUE, fdrThreshold = 0.05, minClustExprPerc = 0.6, maxCtrlExprPerc = 0.4, minMeanExpr = 0)

Other parameters such as minClustExprPerc (the minimum percentage of cells expressing the marker gene within the cluster) and maxCtrlExprPerc (the maximum percentage of cells expressing the marker gene in other clusters) can be used to control how specific each marker gene is to each cell population. Similarly, adding a log2 fold-change cutoff (e.g. 1) can select for markers that are more strongly up-regulated in a cell population.
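For instance, the same plot with a log2 fold-change cutoff of 1 (all other thresholds kept from above) restricts the heatmap to more strongly up-regulated markers:

# Stricter marker selection: require a log2 fold-change of at least 1
plotMarkerDiffExp(sce, topN = 5, log2fcThreshold = 1, rowLabel = TRUE,
    fdrThreshold = 0.05, minClustExprPerc = 0.6, maxCtrlExprPerc = 0.4,
    minMeanExpr = 0)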

Violin plots for marker genes

The plotCeldaViolin function can be used to examine the distribution of expression of various features across cell population clusters derived from celda. Here we can see that the gene CD79A has high expression in the B-cell cluster and HMGB2 has high expression in the proliferating T-cell population.

# Normalize counts if not performed in previous steps
library(scater)
sce <- logNormCounts(sce, exprs_values = useAssay, name = "logcounts")

# Make violin plots for marker genes
plotCeldaViolin(sce, useAssay = "logcounts", features = c("CD79A", "HMGB2"))

Generating HTML reports

The celda package comes with two functions for generating comprehensive HTML reports that 1) capture the process of selecting K/L for a celda_CG model and 2) plot the results from the downstream analysis. The first report runs both recursiveSplitModule and recursiveSplitCell for selection of L and K, respectively. To recapitulate the complete analysis presented in this tutorial in the HTML report, the following command can be used:

sce <- reportCeldaCGRun(sce, sampleLabel = NULL, useAssay = useAssay, altExpName = altExpName, minCell = 3, minCount = 3, initialL = 10, maxL = 150, initialK = 3, maxK = 25, L = 80, K = 14)

All of the parameters in this function are the same that were used throughout this tutorial in the selectFeatures, recursiveSplitModule, and recursiveSplitCell functions. Note that this report does not do cell filtering, so that must be completed before running this function. The returned SCE object will have the celda_CG model with selected K and L which can be used in any of the downstream plotting functions as well as input into the second plotting report described next.

The second report takes in as input an SCE object with a fitted celda_CG model and systematically generates several plots that facilitate exploratory analysis including cell subpopulation cluster labels on 2-D embeddings, user-specified annotations on 2-D embeddings, module heatmaps, module probabilities, expression of marker genes on 2-D embeddings, and the celda probability map. The report can be generated with the following code:

reportCeldaCGPlotResults(sce, reducedDimName = "celda_UMAP", features = markers, useAssay = useAssay, altExpName = altExpName, cellAnnot = c("total", "detected", "decontX_contamination", "subsets_mito_percent"), cellAnnotLabel = "scDblFinder_doublet_call")

User-supplied annotations to plot on the 2-D embedding can be specified through the cellAnnot and cellAnnotLabel variables. Both parameters will allow for plotting of variables stored in the colData of the SCE on the 2-D embedding plot specified by the reducedDimName parameter. For cellAnnot, integer and numeric variables will be plotted as continuous variables, while factors and characters will be plotted as categorical variables. For cellAnnotLabel, all variables will be coerced to a factor and the labels of the categories will be plotted on the scatter plot.

Other useful functions

Matrix factorization

The celda model factorizes the original matrix into three matrices:

1) module - The probability of each feature in each module (Psi)

2) cellPopulation - The probability of each module in each cell population (Phi)

3) sample - The probability of each cell population in each sample (Theta)

Additionally, we can calculate the probability of each module within each cell (cell). The cell matrix can essentially be used to replace PCs from PCA and is useful for downstream visualization (e.g. generating 2-D embeddings). All of these matrices can be retrieved with the factorizeMatrix function. The matrices are returned in three different versions: unnormalized counts, proportions (normalized by the total), or posterior estimates (where the Dirichlet concentration parameter is added in before normalization).

# Factorize the original counts matrix
fm <- factorizeMatrix(sce)

# Three different version of each matrix:
names(fm)

# Get normalized proportional matrices
dim(fm$proportions$cell) # Matrix of module probabilities for each cell
dim(fm$proportions$module) # Matrix of feature probabilities for each module
dim(fm$proportions$cellPopulation) # Matrix of module probabilities for each cell population
dim(fm$proportions$sample) # Matrix of cell population probabilities in each sample
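As a small illustration (a sketch, not part of the original tutorial), the module matrix can be used to rank the features within a single module by their probability; the exact dimension names may differ, so check dimnames(fm$proportions$module) first:

# Sketch: rank the features of one module (here the 70th column) by their
# within-module probability
modProbs <- fm$proportions$module[, 70]
head(sort(modProbs, decreasing = TRUE), n = 10)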

Changing the feature display name

The parameter displayName can be used to change the labels of the rows from the rownames to a column in the rowData of the SCE object. This parameter is available in plotDimReduceFeature and moduleHeatmap. For example, if we did not change the rownames to Symbol_TENx at the beginning of the tutorial, the following code could still be run in moduleHeatmap to display the gene symbol even if the rownames were set to the original Ensembl IDs:

moduleHeatmap(sce, featureModule = 27, useAssay = useAssay, altExpName = altExpName, displayName = "Symbol_TENx")

Session information

sessionInfo()
