knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.width = 6, fig.height = 4.5 )
We assume that the reader is familiar with the main MetaMarker vignette, which shows how to select meta-markers from a compendium of pancreas dataset. In this vignette, we will focus on using markers for cell type annotation, which is an extremely simple, fast and transparent procedure. In the second part of the vignette, we show how to build a benchmark to identify an optimal number of highly informative markers based on cell type annotation performance.
As in the main vignette, the requirements are: - a list of annotated datasets. - matching cell type names across datasets. - all packages installed in the main vignette.
We load the packages needed for our analysis: MetaMarkers, scRNAseq (package used to access datasets), dplyr (data manipulation, part of the tidyverse), and ggplot2 (figures, part of the tidyverse).
library(MetaMarkers) library(scRNAseq) library(dplyr) library(ggplot2)
We will perform our annotation task on pancreatic cell types. For simplicity, we only consider 3 datasets here, but the analysis can easily be extended to include even more datasets.
First, we load the datasets (in the SingleCellExperiment format) and make sure that gene names are identical across datasets. Here, the Baron dataset uses gene symbols, while the Muraro dataset uses names that combine gene symbols and chromosome locations (to avoid duplicate names). We change gene names to gene symbols, which are contained in the "rowData" slot of the Muraro dataset. Note that MetaMarkers will automatically remove duplicate gene names, so we don’t need to worry.
dataset1 = scRNAseq::BaronPancreasData() dataset2 = scRNAseq::MuraroPancreasData() dataset3 = scRNAseq::SegerstolpePancreasData() rownames(dataset2) = rowData(dataset2)$symbol
Next we match cell type names across datasets.
table(dataset1$label) table(dataset2$label) table(dataset3$`cell type`)
Some cell types are only present in one dataset, but there is a good overlap for the main cell types. We update names of cell types that overlap but have different names.
new_label = dataset1$label new_label[new_label=="activated_stellate"] = "mesenchymal" new_label[new_label=="quiescent_stellate"] = "mesenchymal" dataset1$new_label = new_label new_label = dataset2$label new_label[new_label=="duct"] = "ductal" new_label[new_label=="pp"] = "gamma" dataset2$new_label = new_label new_label = dataset3$`cell type` new_label = gsub(" cell", "", new_label) new_label[new_label=="PSC"] = "mesenchymal" dataset3$new_label = new_label common_labels = Reduce(intersect, list( unique(dataset1$new_label), unique(dataset2$new_label), unique(dataset3$new_label) )) common_labels
For simplicity, we’ll group all the datasets in a named list:
datasets = list(baron = dataset1, muraro = dataset2, seger = dataset3)
Finally, we restrict datasets to common cell types and normalize data. Here we use MetaMarkers’ simple library size normalization procedure, but any normalization procedure may be used:
for (dataset in names(datasets)) { keep_cell = datasets[[dataset]]$new_label %in% common_labels datasets[[dataset]] = datasets[[dataset]][, keep_cell] cpm(datasets[[dataset]]) = convert_to_cpm(counts(datasets[[dataset]])) }
Note that restricting to common cell types is not necessary: if we kept cell types present in only one dataset, the pipeline would run normally but the "meta-markers" for these cell types would be based on a single dataset.
The marker procedure follows the two steps presented in the "MetaMarkers" vignette. First, we select markers for individual datasets:
markers = lapply(datasets, function(dataset) { compute_markers(cpm(dataset), dataset$new_label) })
Second, we build meta-markers by picking replicable markers. Here, we use leave-one-dataset-out cross-validation: we build meta-markers from 2 training datasets and make predictions on the third dataset
meta_markers = lapply(names(markers), function(leave_out) { make_meta_markers(markers[names(markers) != leave_out]) }) names(meta_markers) = names(markers)
Note that the names for the meta-markers list correspond to the dataset that has been left out.
As an example, we start by predicting cell types from the Segerstolpe dataset using the top 100 Baron+Muraro meta-markers. The annotation process typically follows three steps: computing marker scores, deduce cell types and plot results.
First we select the top 100 meta-markers (leaving Segerstolpe out) for each cell type. Here we simply select markers based on their rank (based on DE recurrence across datasets, then AUROC).
test_dataset = "seger" top_markers = filter(meta_markers[[test_dataset]], rank<=100)
To predict and assess cell types in the Segerstolpe dataset, we use the 3 following functions:
ct_scores = score_cells(log1p(cpm(datasets[[test_dataset]])), top_markers) ct_enrichment = compute_marker_enrichment(ct_scores) ct_pred = assign_cells(ct_scores)
"score_cells" simply computes the average marker expression for each marker set in each cell:
t(ct_scores[, 1:3])
Each cell in the Segerstople dataset receives a score for each cell type, corresponding to the average expression of the selected markers. The scores vary from cell to cell and dataset to dataset because of differences in coverage, which is why we recommend also computing marker enrichment, which normalizes marker scores in each cell (by dividing by average marker set expression, which corresponds to enrichment under the null that all sets are expressed equally):
t(ct_enrichment[, 1:3])
Finally, the scores can be used to make simple cell type annotations by assigning cells to the highest marker score (or, equivalently, the highest enrichment):
head(ct_pred, 3)
Results are presented in a table, along with the marker score and enrichment that led to the predictions. We will see how these scores can be used to QC the annotation.
We can compute the accuracy of this simple assignment procedure by comparing to author annotations:
ct_pred = ct_pred %>% mutate(true_label = datasets[[test_dataset]]$new_label) mean(ct_pred$predicted == ct_pred$true_label)
The top 100 markers thus provide very robust annotations, with an accuracy of 95.6%. We can further focus on high-quality predictions by only considering cells with high enrichment values:
enrichment_threshold = seq(1,4,0.1) accuracy = sapply(enrichment_threshold, function(t) { keep_cell = ct_pred$enrichment > t mean(ct_pred$predicted[keep_cell] == ct_pred$true_label[keep_cell]) }) plot(enrichment_threshold, accuracy, type="l")
Next, we can visualize the predictions in UMAP space. If the target dataset does not have a carefully curated UMAP space associated with it (as is the case here), we can use the scran package to select highly variable genes and MetaMarker’s "compute_umap" function to build it:
logcounts(datasets[[test_dataset]]) = log1p(cpm(datasets[[test_dataset]])) dec = scran::modelGeneVar(datasets[[test_dataset]]) hvg = scran::getTopHVGs(dec, n=2000) umap = compute_umap(logcounts(datasets[[test_dataset]])[hvg,])
We now use the "plot_assignments" function to plot the predictions. Since we have the author annotations, we can plot the predictions and the author annotations side-by-side.
plot_assignments(ct_pred, umap[,2:3], 1) ct_true = mutate(ct_pred, predicted = true_label) plot_assignments(ct_true, umap[,2:3], 1)
As we already know from the accuracy, there is excellent correspondance between predictions and author annotations. There are however some notable mistypings, particularly the acinar/ductal boundary. The reason for these systematic mistypings is usually that some marker set (presumable the ductal marker set) contains higher expressed genes, which will naturally favor annotations towards that cell type with our simple assignment procedure. This is usually not an issue when trying to obtain an overview of which cell types are present or assigning a cell type to a cluster, but becomes problematic when accurate annotations for every individual cell in the dataset.
In the following, we focus on a slightly less obvious but more important example, the boundary around the rare epsilon cells:
ct_pred %>% filter(true_label == "epsilon" | predicted == "epsilon") %>% arrange(desc(enrichment))
11 cells are called as epsilon, but only 7 cells were annotated as epsilon. However "true" epsilon cells have markedly higher marker scores and enrichment values, which can help to hand curate cell-level annotations. One way to pick a threshold for scores and enrichment is to look for multimodality in the score and enrichment distributions:
ct_enrichment = compute_marker_enrichment(ct_scores) hist(ct_scores["all|epsilon",], breaks = 20) hist(ct_enrichment["all|epsilon",])
However, a better alternative is to assess cell type predictions across multiple training datasets to compute the optimal enrichment threshold above which a cell should be annotated as epsilon (2.2), as we will show below.
To evaluate cell typing performance, we can use the Segerstolpe annotation as a ground truth. We can ask two separate questions: - how good are markers at separating populations? - how easily can we find a threshold to predict the correct cell type?
For the first question, we don’t need to predict cell types. Instead, for each cell type, we ask how far apart...
true_labels = datasets[[test_dataset]]$new_label auroc = summarize_auroc(ct_enrichment, true_labels) fc = summarize_fold_change(ct_enrichment, true_labels) gplots::heatmap.2(auroc) gplots::heatmap.2(log2(fc))
For the second question, we need to convert enrichment scors to cell type predictions. Here we use the following assignment method: for a given cell, if the enrichment is above a given threshold, the cell is assigned to that cell type. With the following function, we can evaluate the annotation performance at various enrichment thresholds, then compare it to the ground truth annotation:
pr = summarize_precision_recall(ct_enrichment, true_labels, seq(1,5,0.1)) %>% filter(get_cell_type(marker_set) == true_label) pr %>% ggplot(aes(x = score_threshold, y = f1, col = true_label)) + geom_line() + theme_bw()
We repeat the same assessment using each dataset as test dataset:
pr = list() for (test_dataset in c("baron", "muraro", "seger")) { top_markers = filter(meta_markers[[test_dataset]], rank<=100) ct_scores = score_cells(log1p(cpm(datasets[[test_dataset]])), top_markers) ct_enrichment = compute_marker_enrichment(ct_scores) true_labels = datasets[[test_dataset]]$new_label pr[[test_dataset]] = summarize_precision_recall(ct_enrichment, true_labels, seq(1,4,0.1)) %>% filter(get_cell_type(marker_set) == true_label) } pr = bind_rows(pr, .id = "test_dataset")
Now we can compute the average F1 over all test datasets:
mean_pr = pr %>% group_by(true_label, score_threshold) %>% summarize(f1 = mean(f1, na.rm=TRUE)) mean_pr %>% ggplot(aes(x = score_threshold, y = f1, col = true_label)) + geom_line() + theme_bw()
And extract optimal performance and enrichment thresholds:
mean_pr %>% group_by(true_label) %>% filter(f1 == max(f1, na.rm = TRUE)) %>% summarize(score_threshold = median(score_threshold), f1 = first(f1))
For example epsilon 2.2.
To estimate the number of ideal markers, we can repeat the previous assessment at different gene numbers:
n_genes = c(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000) mean_pr = list() for (n in n_genes) { pr = list() for (test_dataset in c("baron", "muraro", "seger")) { true_labels = datasets[[test_dataset]]$new_label top_markers = filter(meta_markers[[test_dataset]], rank<=n) ct_scores = score_cells(log1p(cpm(datasets[[test_dataset]])), top_markers) ct_enrichment = compute_marker_enrichment(ct_scores) pr[[test_dataset]] = summarize_precision_recall(ct_enrichment, true_labels, seq(1,5,0.1)) %>% filter(get_cell_type(marker_set) == true_label) } pr = bind_rows(pr, .id = "test_dataset") mean_pr[[as.character(n)]] = pr %>% group_by(true_label, score_threshold) %>% summarize(f1 = mean(f1, na.rm=TRUE)) %>% group_by(true_label) %>% filter(f1 == max(f1, na.rm = TRUE)) %>% summarize(score_threshold = median(score_threshold), f1 = first(f1)) } mean_pr = bind_rows(mean_pr, .id = "n_genes") %>% mutate(n_genes = as.integer(n_genes))
And plot the results:
mean_pr %>% ggplot(aes(x = n_genes, y = f1, col = true_label)) + geom_line() + theme_bw() + scale_x_log10()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.