Automated marker-based annotation of cell types
This package uses Conos label propagation, so you need to install Conos as a dependency:
devtools::install_github("kharchenkolab/conos")
Next, install CellAnnotatoR:
devtools::install_github("khodosevichlab/CellAnnotatoR")
NOTE: this package is still under development, and some functionality may change.
The example below assumes that you have the path to your marker file in marker_path, the gene count matrix in cm, the cell graph in graph, the clustering in clusters and the embedding in emb. Examples of graphs are Seurat so@graphs[[1]], Pagoda 2 p2$graphs[[1]] or Conos con$graph.
clf_data <- getClassificationData(cm, marker_path, data.gene.id.type="SYMBOL", marker.gene.id.type="SYMBOL")
ann_by_level <- assignCellsByScores(graph, clf_data, clusters=clusters)
plotAnnotationByLevels(emb, ann_by_level$annotation, clusters=clusters, size=0.2, font.size=c(2, 4), shuffle.colors=TRUE)
For a quick start, see the vignettes for Seurat PBMC3k, Pagoda 2 BM or Conos BM+CB alignment.
Also see the Reference Manual for an (almost) full list of functions.
If you have an annotated dataset from the same tissue but no existing markers, see "Automated marker selection based on provided annotation (MCA Lung data)". Please be aware that the marker selection algorithm is under development and will be improved. If you already have some markers, check the section "Improving known list of markers".
Creating an annotation de novo depends a lot on the type of your data, and the more prior knowledge you have about the markers, the better. Here are some sources where you can get some markers to start with:
After getting some markers for your data, use the Garnett specification to create a markup file. From there, the annotation process mostly follows this workflow:
One round of this workflow is shown in the QC vignette. And here are some more tips for these steps:
How to find candidate markers depends a lot on your problem and on the packages you use to work with scRNA-seq data, so little can be said for the general case. The one general tip is that the Specificity and ROC AUC metrics really help to select good markers (see the Conos walkthrough for an example).
Seurat has its own functions for plotting gene expression, but for the general case CellAnnotatoR provides the function plotGeneExpression(genes, embedding, cm, ...). It returns a list of plots for individual genes. Note: the matrix cm must be transposed, i.e. have genes as columns and cells as rows.
Example with a Pagoda 2 object p2:
library(magrittr)  # provides the %>% pipe
c("Cldn10", "Igfbpl1", "Ccnd2", "Nes") %>%
  plotGeneExpression(p2$embeddings$PCA$UMAP, p2$counts)
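Since plotGeneExpression expects cells as rows and genes as columns, a quick orientation check before plotting can save a confusing error. The following is a minimal base-R sketch and not part of CellAnnotatoR: the helper name ensureCellsAsRows and the toy matrix are hypothetical, and it assumes that the genes you plan to plot appear in the dimnames of the matrix.

```r
# Hypothetical helper (not part of CellAnnotatoR): return the matrix with
# genes as columns, transposing it if the genes are found in the row names.
ensureCellsAsRows <- function(cm, genes) {
  if (all(genes %in% colnames(cm))) return(cm)     # already cells x genes
  if (all(genes %in% rownames(cm))) return(t(cm))  # genes x cells: transpose
  stop("Some of the requested genes are missing from the matrix")
}

# Toy genes x cells matrix, used only for illustration
cm_toy <- matrix(1:6, nrow=2,
                 dimnames=list(c("Nes", "Dcx"), paste0("cell", 1:3)))
cm_fixed <- ensureCellsAsRows(cm_toy, c("Nes", "Dcx"))
dim(cm_fixed)  # rows are cells, columns are genes
```

The sketch uses a dense base-R matrix. For sparse matrices from the Matrix package, Matrix::t would be the transposition to use.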
If you want to use a panel of violin plots instead, you can use plotExpressionViolinMap(genes, cm, annotation). It is better suited to large panels of markers:
c("Cldn10", "Igfbpl1", "Ccnd2", "Nes", "Id4", "Ascl1", "Egfr", "Serpine2", "Dcx", "Tubb3",
  "Slc1a3", "Slc1a2", "Meis2", "Dlx5", "Dlx6") %>%
  plotExpressionViolinMap(p2$counts, p2$clusters$PCA$leiden)
The two functions above let you plot markers that you just want to test on the dataset. If you need to plot markers that are already in the markup file, two more functions are provided:
plotTypeMarkers(embedding, count.matrix, cell.type, marker.list, ...): plots markers for a specific cell.type from the marker.list.
plotSubtypeMarkers(embedding, count.matrix, parent.type="root", marker.list=NULL, ...): plots markers that separate subtypes within a given cell type parent.type, using the provided marker.list. Setting parent.type to "root" plots markers for all subtypes. Setting max.depth restricts the maximal depth of the subbranch for which markers are plotted.
Adding markers is trivial and is described in the Garnett specification. However, there are some tricks for running classification.
The simplest way to get the annotation is using the following code:
clf_data <- getClassificationData(cm, marker_path)
ann_by_level <- assignCellsByScores(graph, clf_data, clusters=clusters)
However, if you need to re-run the annotation multiple times (and you probably will), some steps can be optimized.
First, getClassificationData performs TF-IDF normalization internally, and this step doesn't depend on the marker list. So we can do it only once:
cm_norm <- normalizeTfIdfWithFeatures(cm)
clf_data <- getClassificationData(cm_norm, marker_path, prenormalized=TRUE)
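To see why this normalization can be cached, here is a minimal sketch of a textbook TF-IDF transform in base R. It is an illustration only: the exact formula inside normalizeTfIdfWithFeatures may differ, and the function name tfIdfSketch and the toy matrix are hypothetical. The point is that the result is computed from the count matrix alone, so it does not need to be recomputed when the marker file changes.

```r
# Textbook TF-IDF on a cells x genes count matrix (an illustrative assumption,
# not the package's actual implementation). Assumes every cell has a nonzero
# total count and every gene is detected in at least one cell.
tfIdfSketch <- function(counts) {
  tf <- counts / rowSums(counts)                      # per-cell gene frequency
  idf <- log(1 + nrow(counts) / colSums(counts > 0))  # rarer genes weigh more
  sweep(tf, 2, idf, `*`)                              # scale each gene column
}

# Toy cells x genes matrix, used only for illustration
counts_toy <- matrix(c(4, 0, 1,
                       0, 2, 1,
                       4, 2, 2), nrow=3, byrow=TRUE,
                     dimnames=list(paste0("cell", 1:3), c("Nes", "Dcx", "Actb")))
tfidf_toy <- tfIdfSketch(counts_toy)
```

Note that no marker information enters tfIdfSketch, which is the property that makes computing the normalized matrix once and reusing it safe.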
Second, the most time-consuming step of classification is label propagation on the graph. It improves classification quality in many cases, but we can skip it when approximate results are enough.
To do so, it's enough to pass NULL instead of the graph object:
ann_by_level <- assignCellsByScores(NULL, clf_data, clusters=clusters)
So, during marker selection it's recommended to re-run the following code every time you update markers:
clf_data <- getClassificationData(cm_norm, marker_path, prenormalized=TRUE)
ann_by_level <- assignCellsByScores(NULL, clf_data, clusters=clusters)
Validation of the results is crucial for high-quality annotation, so this part is described in the QC vignette. Here is just a list of functions that can be useful:
plotAssignmentScores(embedding, scores, classification.tree, parent.node)
plotAnnotationByLevels(embedding, annotation.by.level)
plotUncertaintyPerCell(embedding, uncertainty.info)
plotUncertaintyPerClust(uncertainty.per.clust, clusters)
plotAssignmentConfusion(scores, annotation)