README.md

CellAnnotatoR

Automated marker-based annotation of cell types

Installation

This package uses Conos label propagation, so you need to install Conos as a dependency:

devtools::install_github("kharchenkolab/conos")

Next, install CellAnnotatoR:

devtools::install_github("khodosevichlab/CellAnnotatoR")

Usage

NOTE: this package is still under development, and some functionality may change.

The example below assumes that you have the path to your marker file in marker_path, a gene count matrix cm, a cell graph graph, a clustering clusters and an embedding emb. Examples of graphs are: Seurat so@graphs[[1]], Pagoda 2 p2$graphs[[1]] or Conos con$graph.

clf_data <- getClassificationData(cm, marker_path, data.gene.id.type="SYMBOL", marker.gene.id.type="SYMBOL")
ann_by_level <- assignCellsByScores(graph, clf_data, clusters=clusters)

plotAnnotationByLevels(emb, ann_by_level$annotation, clusters=clusters, size=0.2, font.size=c(2, 4), shuffle.colors=TRUE)

For a quick start, see the vignettes for Seurat PBMC3k, Pagoda 2 BM or Conos BM+CB alignment.

Also see the Reference Manual for an (almost) full list of functions.

Creating annotation file

Extracting markers from a provided annotation

If you have an annotated dataset from the same tissue, but no existing markers, see the vignette "Automated marker selection based on provided annotation (MCA Lung data)". Please be aware that the marker selection algorithm is under development and will be improved. If you already have some markers, check the section "Improving known list of markers".

De-novo annotation

Creating an annotation de novo depends a lot on the type of your data, and the more prior knowledge you have about the markers, the better. Here are some sources where you can get some markers to start with:

After getting some markers for your data, use the Garnett specification to create a markup file. The annotation process mostly follows this workflow:

  1. Find marker candidates either with differential expression or using prior knowledge
  2. Plot the markers on your data and ensure that they suit your case
  3. Add the markers to the file and re-run the classification
  4. Check the results and find cell types that are not well separated or for which you want to increase the annotation resolution
  5. Go to step 1 if there is something to improve
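As a rough illustration of step 3, the sketch below writes a small markup file in the Garnett-style syntax and reads the cell type names back. The cell types, the marker genes and the temporary file location are illustrative assumptions and are not taken from this package; the Garnett specification remains the authoritative description of the syntax.

```r
# Hypothetical marker definitions in Garnett-style syntax (illustration only)
marker_lines <- c(
  ">T cells",
  "expressed: CD3D, CD3E",
  "not expressed: MS4A1",
  "",
  ">CD4 T cells",
  "expressed: CD4",
  "subtype of: T cells",
  "",
  ">B cells",
  "expressed: MS4A1, CD79A"
)

# Write the definitions to a temporary file; its path plays the role of marker_path
marker_path <- tempfile(fileext = ".txt")
writeLines(marker_lines, marker_path)

# Quick sanity check: list the cell types defined in the file
cell_types <- sub("^>", "", grep("^>", readLines(marker_path), value = TRUE))
print(cell_types)
```

After each edit of such a file, the classification can be re-run with the same marker_path.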

One round of this workflow is shown in the QC vignette. Here are some more tips for these steps:

Step 1

This step depends a lot on your problem and on the packages you use to work with scRNA-seq data, so little can be said here for the general case. One general tip is that the Specificity and ROC AUC metrics really help to select good markers (see the Conos walkthrough for an example).
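To make the two metrics more concrete, here is a minimal base R sketch for a single gene and a single target cell type. The helper names rocAuc and markerSpecificity are hypothetical, and the definitions are assumptions: specificity is taken as the fraction of non-target cells with zero expression, and ROC AUC is computed from ranks (the Mann-Whitney formulation). Conos may compute these metrics differently.

```r
# Toy expression of one gene in 8 cells; the first 3 cells belong to the target type
expr <- c(5, 3, 4, 0, 1, 0, 0, 2)
is_target <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# ROC AUC via the rank-sum (Mann-Whitney) formula; ties get average ranks
rocAuc <- function(score, is_target) {
  r <- rank(score)
  n_pos <- sum(is_target)
  n_neg <- sum(!is_target)
  (sum(r[is_target]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Assumed definition: fraction of non-target cells that do not express the gene
markerSpecificity <- function(expr, is_target) {
  mean(expr[!is_target] == 0)
}

auc <- rocAuc(expr, is_target)
spec <- markerSpecificity(expr, is_target)
print(c(auc = auc, specificity = spec))
```

In this toy case the gene separates the target cells perfectly by expression level, yet some non-target cells still express it at a low level, so the two metrics capture different aspects of marker quality.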

Step 2

Seurat has its own functions for plotting gene expression, but for the general case CellAnnotatoR provides the function plotGeneExpression(genes, embedding, cm, ...). It returns a list of plots for individual genes. Note: the matrix cm must be transposed, i.e. have genes as columns and cells as rows.
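If your count matrix is stored with genes as rows (as Seurat stores it, for example), it has to be transposed first. A small sketch with a toy base R matrix follows; the object name cm_genes_by_cells and its values are made up for illustration, and t() from the Matrix package works the same way for sparse matrices.

```r
# Toy genes-by-cells matrix (hypothetical values)
cm_genes_by_cells <- matrix(
  c(0, 2, 1, 0, 3, 5), nrow = 2,
  dimnames = list(c("Nes", "Dcx"), c("cell1", "cell2", "cell3"))
)

# Transpose so that cells are rows and genes are columns
cm <- t(cm_genes_by_cells)

# Check the orientation: the genes to plot must be among the column names
genes <- c("Nes", "Dcx")
stopifnot(all(genes %in% colnames(cm)))
print(dim(cm))
```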

Example with Pagoda 2 object p2:

c("Cldn10", "Igfbpl1", "Ccnd2", "Nes") %>% 
  plotGeneExpression(p2$embeddings$PCA$UMAP, p2$counts)

If you want to use a panel of violin plots instead, you can use plotExpressionViolinMap(genes, cm, annotation). It is better suited for large panels of markers:

c("Cldn10", "Igfbpl1", "Ccnd2", "Nes", "Id4", "Ascl1", "Egfr", "Serpine2", "Dcx", "Tubb3",
  "Slc1a3", "Slc1a2", "Meis2", "Dlx5", "Dlx6") %>% 
  plotExpressionViolinMap(p2$counts, p2$clusters$PCA$leiden)

The two functions above let you plot markers that you just want to test on the dataset. If you need to plot markers that are already in the markup file, two more functions are provided:

Step 3

Adding markers is trivial and is described in the Garnett specification. However, there are some tricks for running the classification.

The simplest way to get the annotation is using the following code:

clf_data <- getClassificationData(cm, marker_path)
ann_by_level <- assignCellsByScores(graph, clf_data, clusters=clusters)

However, if you need to re-run the annotation multiple times (and you probably will), some steps can be optimized.

First, getClassificationData performs TF-IDF normalization inside, which doesn't depend on the marker list. So we can do it only once:

cm_norm <- normalizeTfIdfWithFeatures(cm)
clf_data <- getClassificationData(cm_norm, marker_path, prenormalized=TRUE)
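To see why this normalization does not depend on the marker list, consider a generic TF-IDF transformation, sketched below in base R. This is a textbook variant, not the package's implementation: the exact formula used by normalizeTfIdfWithFeatures may differ, and the genes-by-cells orientation and toy values are assumptions. Every quantity is computed from the count matrix alone.

```r
# Toy genes-by-cells count matrix (hypothetical values)
counts <- matrix(
  c(4, 0, 0, 2,    # geneA
    1, 1, 1, 1,    # geneB
    0, 3, 0, 0),   # geneC
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:4))
)

# Term frequency: each count divided by the total counts of its cell
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: genes expressed in fewer cells get larger weights
idf <- log(1 + ncol(counts) / rowSums(counts > 0))

# idf has one value per gene (row), so it is recycled down each column
tfidf <- tf * idf
print(round(tfidf, 3))
```

Since no marker information enters the computation, the normalized matrix can be cached and reused across marker file edits.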

Second, the most time-consuming step of classification is label propagation on the graph. It improves classification quality in many cases, but it can be skipped when approximate results are enough. To do so, pass NULL instead of the graph object:

ann_by_level <- assignCellsByScores(NULL, clf_data, clusters=clusters)

So, during marker selection it's recommended to re-run the following code every time you update markers:

clf_data <- getClassificationData(cm_norm, marker_path, prenormalized=TRUE)
ann_by_level <- assignCellsByScores(NULL, clf_data, clusters=clusters)
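For intuition about what is lost when the graph is omitted, the toy sketch below smooths per-cell scores over a small graph with a generic diffusion-style scheme. It is only an illustration under stated assumptions: the graph, the scores and the parameter alpha are made up, and the algorithm used by Conos and CellAnnotatoR may differ.

```r
# Toy graph: two triangles (cells 1-3 and 4-6) joined by the edge 3-4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5), c(4, 6), c(5, 6))
adj <- matrix(0, 6, 6)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1           # make the adjacency matrix symmetric
trans <- adj / rowSums(adj)      # row-normalized transition matrix

# Initial per-cell scores for types A and B; cell 2 has noisy scores favoring B
scores <- rbind(c(1, 0), c(0.4, 0.6), c(1, 0), c(0, 1), c(0, 1), c(0, 1))
colnames(scores) <- c("A", "B")

# Repeatedly mix neighbor scores with the initial scores
alpha <- 0.8                     # hypothetical weight of the neighbor scores
smoothed <- scores
for (i in 1:50) {
  smoothed <- alpha * (trans %*% smoothed) + (1 - alpha) * scores
}

labels_before <- colnames(scores)[apply(scores, 1, which.max)]
labels_after <- colnames(smoothed)[apply(smoothed, 1, which.max)]
print(rbind(before = labels_before, after = labels_after))
```

Here cell 2 is assigned to type B from its own scores, but to type A once the scores of its neighbors are taken into account.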

Step 4

Validation of the results is crucial for high-quality annotation, so this part is described in the QC vignette. Here is just a list of functions that can be useful:



khodosevichlab/CellAnnotatoR documentation built on June 29, 2022, 9:12 p.m.