knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", fig.width = 6, fig.height = 4.5 )
We assume that the reader is familiar with the main "MetaMarkers" vignette, which shows how to select meta-markers from a compendium of pancreas dataset. We also assume that the reader is somewhat familiar with the "Annotation" vignette, which performs cell type annotation using a flat hierarchy of cell types. In this vignette, we will focus on using hierarchical markers for cell type annotation, and show how to build a benchmark to find highly informative markers.
As in the main vignette, the requirements are: - a list of annotated datasets. - matching cell type names across datasets. - all packages installed in the main vignette.
We load the packages needed for our analysis: MetaMarkers, scRNAseq (package used to access datasets), dplyr (data manipulation, part of the tidyverse), and ggplot2 (figures, part of the tidyverse).
library(MetaMarkers) library(scRNAseq) library(dplyr) library(ggplot2)
The 3 pancreas atlas datasets are built in a similar fashion to the "Annotation" vignette (see vignette for details). Briefly, we start by loading the data:
dataset1 = scRNAseq::BaronPancreasData() dataset2 = scRNAseq::MuraroPancreasData() dataset3 = scRNAseq::SegerstolpePancreasData() rownames(dataset2) = rowData(dataset2)$symbol
We normalize cell type names across datasets:
new_label = dataset1$label new_label[new_label=="activated_stellate"] = "mesenchymal" new_label[new_label=="quiescent_stellate"] = "mesenchymal" dataset1$new_label = new_label new_label = dataset2$label new_label[new_label=="duct"] = "ductal" new_label[new_label=="pp"] = "gamma" dataset2$new_label = new_label new_label = dataset3$`cell type` new_label = gsub(" cell", "", new_label) new_label[new_label=="PSC"] = "mesenchymal" dataset3$new_label = new_label common_labels = Reduce(intersect, list( unique(dataset1$new_label), unique(dataset2$new_label), unique(dataset3$new_label) ))
We restrict datasets to common cell types and normalize data to CPM format:
datasets = list(baron = dataset1, muraro = dataset2, seger = dataset3) for (dataset in names(datasets)) { keep_cell = datasets[[dataset]]$new_label %in% common_labels datasets[[dataset]] = datasets[[dataset]][, keep_cell] cpm(datasets[[dataset]]) = convert_to_cpm(counts(datasets[[dataset]])) }
Finally, we group cell types into 2 classes: endocrine and non-endocrine.
for (dataset in names(datasets)) { class_label = rep("non-endocrine", ncol(datasets[[dataset]])) is_endocrine = datasets[[dataset]]$new_label %in% c("alpha", "beta", "gamma", "delta", "epsilon") class_label[is_endocrine] = "endocrine" datasets[[dataset]]$class = class_label }
We proceed to marker selection for each individual dataset.
class_markers = lapply(datasets, function(dataset) { compute_markers(cpm(dataset), dataset$class) }) names(class_markers) = names(datasets)
We use leave-one-dataset-out cross-validation: we build meta-markers from 2 datasets and predict on the third dataset. Then build meta-markers from invidual datasets marker lists, leaving one dataset out.
class_meta_markers = lapply(names(class_markers), function(leave_out) { make_meta_markers(class_markers[names(class_markers) != leave_out]) }) names(class_meta_markers) = names(class_markers)
ct_markers = lapply(datasets, function(dataset) { compute_markers(cpm(dataset), dataset$new_label, dataset$class) }) names(ct_markers) = names(datasets) ct_meta_markers = lapply(names(ct_markers), function(leave_out) { make_meta_markers(ct_markers[names(ct_markers) != leave_out]) }) names(ct_meta_markers) = names(ct_markers)
As an example, we start by predicting cell types from the Segerstolpe dataset using the top 100 Baron+Muraro meta-markers. First we select the top 100 meta-markers (leaving Segerstolpe out) for each cell type, then predict cell types in the Segerstolpe dataset:
test_dataset = "seger" top_markers = filter(class_meta_markers[[test_dataset]], rank<=100) class_scores = score_cells(log1p(cpm(datasets[[test_dataset]])), top_markers) class_enrichment = compute_marker_enrichment(class_scores) class_pred = assign_cells(class_scores)
To find the ideal enrichment value, we could use the full cell type assessment, but for the class level we will use a simpler method:
hist(class_enrichment["all|endocrine",]) hist(class_enrichment["all|non-endocrine",])
All class assignments are relatively clear, we’ll proceed with predictions at the cell type level assuming that we got all class-level predictions correct.
top_markers = filter(ct_meta_markers[[test_dataset]], rank<=100) ct_scores = score_cells(log1p(cpm(datasets[[test_dataset]])), top_markers) ct_scores[1:6, 1:4]
Note that scores encode group info, e.g. "endocrine|alpha".
ct_enrichment = compute_marker_enrichment(ct_scores, by_group = TRUE) ct_pred = assign_cells(ct_scores, group_assignment = class_pred$predicted)
ct_pred = ct_pred %>% mutate(true_label = datasets[[test_dataset]]$new_label) mean(ct_pred$predicted == ct_pred$true_label)
Accuracy of 98.3%, almost 3 points higher than flat annotation (95.6%).
logcounts(datasets[[test_dataset]]) = log1p(cpm(datasets[[test_dataset]])) dec = scran::modelGeneVar(datasets[[test_dataset]]) hvg = scran::getTopHVGs(dec, n=2000) umap = compute_umap(logcounts(datasets[[test_dataset]])[hvg,])
plot_assignments(class_pred, umap[,2:3], 1)
plot_marker_scores(class_enrichment, umap[,2:3], TRUE)
Small group of cells that shows suspicious co-expression of endocrine and non-endocrine markers.
plot_assignments(ct_pred, umap[,2:3], 1) ct_true = mutate(ct_pred, predicted = true_label) plot_assignments(ct_true, umap[,2:3], 1)
Excellent correspondance, small differences: acinar/ductal, epsilon.
plot_marker_scores(ct_enrichment, umap[,2:3], TRUE)
A potential issue that arises using the naive assignment method is that some marker sets contain higher expressing genes, which may lead to some cell types being overcalled. This is the case for the epsilon cell type for example:
ct_pred %>% filter(true_label == "epsilon" | predicted == "epsilon") %>% arrange(desc(enrichment))
31 cells are called as epsilon, but only 7 cells were annotated as epsilon. However "true" epsilon cells have much higher marker scores and enrichment values, which can help to hand curate cell-level annotations. One way to pick a threshold for scores and enrichment is to look for multimodality in the score and enrichment distributions:
hist(ct_scores["endocrine|epsilon",], breaks = 20) hist(ct_enrichment["endocrine|epsilon",])
However, a better alternative is to assess cell type predictions across multiple training datasets to compute the optimal enrichment threshold above which a cell should be annotated as epsilon (1.8), as we will show below.
true_labels = datasets[[test_dataset]]$new_label keep_cell = datasets[[test_dataset]]$class == "endocrine" keep_marker_set = get_group(rownames(ct_enrichment)) == "endocrine" pr = summarize_precision_recall( ct_enrichment[keep_marker_set, keep_cell], true_labels[keep_cell], seq(1,3,0.1) ) %>% filter(get_cell_type(marker_set) == true_label) pr %>% ggplot(aes(x = score_threshold, y = f1, col = true_label)) + geom_line() + theme_bw()
We repeat the same assessment using each dataset as the test dataset:
test_class = "endocrine" pr = list() for (test_dataset in c("baron", "muraro", "seger")) { ct_markers = filter(ct_meta_markers[[test_dataset]], rank<=100) ct_scores = score_cells(log1p(cpm(datasets[[test_dataset]])), ct_markers) ct_enrichment = compute_marker_enrichment(ct_scores, by_group = TRUE) true_labels = datasets[[test_dataset]]$new_label keep_cell = datasets[[test_dataset]]$class == test_class keep_marker_set = get_group(rownames(ct_enrichment)) == test_class pr[[test_dataset]] = summarize_precision_recall( ct_enrichment[keep_marker_set, keep_cell], true_labels[keep_cell], seq(1,3,0.1) ) %>% filter(get_cell_type(marker_set) == true_label) } pr = bind_rows(pr, .id = "test_dataset")
Now we can compute the average F1 over all test datasets:
mean_pr = pr %>% group_by(true_label, score_threshold) %>% summarize(f1 = mean(f1, na.rm=TRUE)) mean_pr %>% ggplot(aes(x = score_threshold, y = f1, col = true_label)) + geom_line() + theme_bw()
And extract optimal performance and enrichment thresholds:
mean_pr %>% group_by(true_label) %>% filter(f1 == max(f1, na.rm = TRUE)) %>% summarize(score_threshold = median(score_threshold), f1 = first(f1))
For example, for epsilon cells, the optimal threshold is at 1.8, with an expected performance of F1=0.86.
To estimate the number of ideal markers, we can repeat the previous assessment at different gene numbers:
test_class = "endocrine" n_genes = c(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000) mean_pr = list() for (n in n_genes) { pr = list() for (test_dataset in c("baron", "muraro", "seger")) { ct_markers = filter(ct_meta_markers[[test_dataset]], rank<=n) ct_scores = score_cells(log1p(cpm(datasets[[test_dataset]])), ct_markers) ct_enrichment = compute_marker_enrichment(ct_scores, by_group = TRUE) true_labels = datasets[[test_dataset]]$new_label keep_cell = datasets[[test_dataset]]$class == test_class keep_marker_set = get_group(rownames(ct_enrichment)) == test_class pr[[test_dataset]] = summarize_precision_recall( ct_enrichment[keep_marker_set, keep_cell], true_labels[keep_cell], seq(1,3,0.1) ) %>% filter(get_cell_type(marker_set) == true_label) } pr = bind_rows(pr, .id = "test_dataset") mean_pr[[as.character(n)]] = pr %>% group_by(true_label, score_threshold) %>% summarize(f1 = mean(f1, na.rm=TRUE)) %>% group_by(true_label) %>% filter(f1 == max(f1, na.rm = TRUE)) %>% summarize(score_threshold = median(score_threshold), f1 = first(f1)) } mean_pr = bind_rows(mean_pr, .id = "n_genes") %>% mutate(n_genes = as.integer(n_genes))
And plot the results:
mean_pr %>% ggplot(aes(x = n_genes, y = f1, col = true_label)) + geom_line() + theme_bw() + scale_x_log10()
The results suggest that: - 50 to 100 markers yield optimal performance for all cell types. - hierarchical annotation works much better for all cell types (F1 > 0.95) except epsilon cells (F1 ~ 0.85).
Epsilon cells are mostly misannotated in the Baron dataset because of low quality cells.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.