This vignette continues from Part 1, where we introduced the basic steps of a topic modeling analysis for single-cell data. Here we give more detailed guidance and discuss complications that may arise.
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold", fig.align = "center",dpi = 120)
We begin our analysis by loading the packages. Then we set the seed so that the results can be reproduced.
library(Matrix) library(fastTopics) library(ggplot2) library(cowplot) set.seed(1)
Now we load the data set and the $K = 6$ pre-fitted topic model.
data(pbmc_facs) counts <- pbmc_facs$counts fit <- pbmc_facs$fit
The topic model was fitted using the fit_topic_model
function. This
function, as we mentioned, hides most of the complexities of model
fitting. Nevertheless, it is a good idea to check that the model
fitting has converged to a maximum-likelihood solution. This is
easily done with 'fastTopics':
plot_progress(fit,x = "iter",add.point.every = 10,colors = "black") + theme_cowplot(font_size = 10)
Internally, fit_topic_model
first fits a non-negative matrix
factorization (NMF) to the count data, then converts the fitted NMF to
a topic model, so the plot shows the improvement in the model fit over
time. Judging by this plot, the parameter estimates get close to a
maxium-likelihood solution after about 150 updates.
For larger or more complex data sets, more updates may be needed to
obtain a good fit. For guidance, see help(fit_topic_model)
and
help(fit_poisson_nmf)
.
The topic model can be used to calculate a likelihood for each cell:
loglik <- loglik_multinom_topic_model(counts,fit)
This can be used to assess how well the topic model "fits" each cell.
pdat <- data.frame(loglik) ggplot(pdat,aes(loglik)) + geom_histogram(bins = 64,color = "white",fill = "black",size = 0.25) + labs(y = "number of cells") + theme_cowplot(font_size = 10)
Most of the poorly fit cells are in the CD34+ subpopulation:
subpop_colors <- c("dodgerblue","forestgreen","darkmagenta","skyblue","gold") pdat <- data.frame(loglik = loglik,subpop = pbmc_facs$samples$subpop) ggplot(pdat,aes(x = loglik,fill = subpop)) + geom_histogram(bins = 64,color = "white",size = 0.25) + scale_fill_manual(values = subpop_colors) + labs(y = "number of cells") + theme_cowplot(font_size = 10)
In [Part 1][single_cell_rnaseq_practical], we made use of additional information about the cells to help us interpret the topics. Even without the cell labels, the Structure plot can still be effective:
topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue", "gold","darkorange") structure_plot(fit,colors = topic_colors)
From this Structure plot, the topics clearly distinguish the B cells (dark blue), CD14+ monocytes (green) and CD34+ cells (purple). The NK cells (light blue) and T cells (cells high proportions of yellow and orange) are harder to distinguish, and this is reflected in sharing of topics (mostly topic 6) among NK and T cells.
Of course, without the cell labels, we cannot know that the topics correspond to these cell types without further downstream analyses---for this, we performed a GoM DE analysis (see Part 1).
In more complex data sets, some structure may not show up well without additional fine-tuning of the plot.
One simple strategy that often works well is to first identify clusters in the topic proportions matrix, $L$. Here we use $k$-means to identify clusters, but other methods could be used.
set.seed(1) pca <- prcomp(fit$L)$x clusters <- kmeans(pca,centers = 6,iter.max = 100)$cluster summary(factor(clusters))
We chose 6 clusters, but to be clear the best number of clusters may not be the same as the number of topics.
Now we create another Structure plot in which the cells are grouped according to the clusters:
structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25, grouping = clusters) + theme(axis.text.x = element_blank())
k-means somewhat arbitrarily split the T cells (cells with high proportions of topics 5 and 6) into clusters 1 and 3. Therefore, we merge these two clusters:
clusters[clusters == 3] <- 1 clusters <- factor(clusters)
We have found that visual inspection of the clusters followed by manual refinement is often be needed to get the "right" clustering.
Now we can label the clusters by these cell types in the plot. (Noting that this labeling is usually not possible until downstream analyses have been performed.)
levels(clusters) <- c("T","CD14+","B","CD34+","NK") structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25, grouping = clusters)
It is also sometimes helpful to visualize the structure in PCA plots which show the projection of the cells onto PCs of the topic proportions:
cluster_colors <- c("gold","forestgreen","dodgerblue","darkmagenta","skyblue") pca_plot(fit,fill = clusters) + scale_fill_manual(values = cluster_colors)
When there are many overlapping points, plotting the density can sometimes show the clusters more clearly:
pca_hexbin_plot(fit,bins = 24)
In Part 1, we performed a DE analysis using the
topic model. (Note that since the de_analysis
output is quite large,
we have made the results of the DE analysis available outside the
package.)
pbmc_facs_file <- tempfile(fileext = ".RData") pbmc_facs_url <- "https://stephenslab.github.io/fastTopics/pbmc_facs.RData" download.file(pbmc_facs_url,pbmc_facs_file) load(pbmc_facs_file)
When the volcano plot shows many overlapping differentially expressed
genes, it can be helpful to explore the results interactively. The
function volcano_plotly
creates an interactive volcano plot that can
be viewed in a Web browser. For example, here is an interactive
volcano plot for the T cells topic:
genes <- pbmc_facs$genes p <- volcano_plotly(de,k = 6,file = "volcano_plot_t_cells.html", labels = genes$symbol,ymax = 100)
This call creates a new file volcano_plot_t_cells.html
. The
interactive volcano plot can also be viewed here, or
by calling print(p)
in R.
This is the version of R and the packages that were used to generate these results.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.