In dviraran/SingleR: Reference-based single-cell RNA-seq annotation

knitr::opts_chunk$set(echo = TRUE)

Introduction

SingleR is a package that performs reference-based annotation of single-cell RNA-seq data. Here we show different ways to create SingleR objects. These objects can then be used with visualization functions available in the SingleR package and the SingleR web app (http://comphealth.ucsf.edu/SingleR/).

Basic SingleR function

SingleR provides built-in wrapper functions to run a complete pipeline with one function. SingleR provides support to Seurat (http://satijalab.org/seurat/), but any other scRNA-seq package can be used. These functions are explained in Case 1 and 2. These functions assist in reading the single-cell data, calculating labels using different references and creating an object that can be used by SungleR plotting functions. However, to run SingleR and retrieve labels for each cell the following function can be used:

singler = SingleR(method = "single", sc_data, ref_data, types, clusters = NULL,
  genes = "de", quantile.use = 0.8, p.threshold = 0.05,
  fine.tune = TRUE, fine.tune.thres = 0.05, sd.thres = 1,
  do.pvals = T, numCores = SingleR.numCores)

method can be 'single' or 'cluster'. 'cluster' will annotate each cluster instead of each single cell. The cluster expression is the average of the expression of all the cells in the given cluster. If 'cluster' than ids must be given in the 'clusters' parameters.
sc_data is the single cell matrix. If the data is from full-length method than the counts must be normalized to gene length (this can be achieved by using the built-in TPM function).

warning('Do not use the scaled.data field in Seurat as input. This field represents relative expression across cells, and is not appropriate as input for SingleR. The raw.data and data field are ok, but only if from a non full-length method.')

ref_data is the reference expression matrix. If using one of the built-in reference datasets than will be ref$data (ref is the reference dataset object). The built-in references are immgen or mouse.rnaseq (for mouse) and blueprint_encode or hpca (for human). See below for creating a new reference dataset.
types is a vector of labels for each reference sample. If using one of the built-in reference datasets than will be ref$types or ref$main_types (ref is the reference dataset object).
genes can be 'de', 'sd' or a set of genes, see supplementary information 1 for more details.
fine.tune is TRUE runs fine-tuning. This process may take very long.
fine.tune.thres is the threshold to use for choosing top cell types for each fine-tuning iteration. Reducing this parameter will make the fine-tuning process run faster.

This is the basic SingleR function. To use wrapper functions see the cases below.

Case 1: Counts data, no previous analysis

In this case we have counts data, and don't have any prefered previous analysis of the data.

To create the SingleR object simply run the following function:

singler = CreateSinglerSeuratObject(counts, annot = NULL, project.name,
  min.genes = 200, technology = "10X", species = "Human" (or "Mouse"), citation = "",
  ref.list = list(), normalize.gene.length = F, variable.genes = "de",
  fine.tune = T, reduce.file.size = T, do.signatures = T, min.cells = 2,
  npca = 10, regress.out = "nUMI", do.main.types = T,
  reduce.seurat.object = T, numCores = SingleR.numCores)
save(singler,file='singler_object.RData')

counts.file may be a tab delimited text file (with the prefix '.txt'), a matrix of the counts or a 10X directory. Importantly, the rownames must be gene symbols. To combine multiple 10X datasets we provide the function Combine.Multiple.10X.Datasets.
annot can be a tab delimited text file or a data.frame. Rownames correspond to column names in the counts data.
min.genes is a filter on samples with low number of non-zero genes.
ref.list is the reference that will be used for the annotation. If not supplied, this wrapper function will use predefined reference objects depending on the specie - Mouse: ImmGen and Mouse.RNAseq, Human: HPCA and Blueprint+Encode. It is probably best to start with these references before using more specific references. See below for explanation on how to generate a reference data set object.
normalize.gene.length - set to true if the data is from a full-length method (i.e. Smart-Seq), or FALSE is a 3' method (i.e. Drop-seq).
variable.genes - the method for choosing the genes used for the correlations. 'de' uses pairwise difference between the cell types, 'sd' uses a general standard variation.
fine.tune - performs the fine-tuning step. This step may take long for big datasets, but can improve results significantly if the data contains subtle differences.
do.signatures - this step runs a single-sample gene set enrichment analysis (ssGSEA) for a set of predefined signatures (see the object human.egc or mouse.egc). This step may also take long, and can be set to FALSE to shorten computation time.
min.cells, npca and regress.out are all passed directly to Seurat to create a Seurat object.
do.main.types - compute the main types scores as well.
reduce.seurat.object - removes the raw.data and calc.params from the Seurat object. The size of the object will be significantly smaller (~10-fold).
numCores - number of cores to use in parallel. The default is the number of cores in the system minus 1.

The returned singler object is a list that can be used for further analyses. See below.

Case 2: Already have a single-cell object

In this case we already have a single-cell object with tSNE coordinates and clusters. We want to annotate this object and use those parameters.

To create the SingleR object simply run the following function:

singler = CreateSinglerObject(counts, annot = NULL, project.name, min.genes = 0,
  technology = "10X", species = "Human", citation = "",
  ref.list = list(), normalize.gene.length = F, variable.genes = "de",
  fine.tune = T, do.signatures = T, clusters = NULL, do.main.types = T, 
  reduce.file.size = T, numCores = SingleR.numCores)

singler$seurat = seurat.object # (optional)
singler$meta.data$orig.ident = seurat.object@meta.data$orig.ident # the original identities, if not supplied in 'annot'

## if using Seurat v3.0 and over use:
singler$meta.data$xy = seurat.object@reductions$tsne@cell.embeddings # the tSNE coordinates
singler$meta.data$clusters = seurat.object@active.ident # the Seurat clusters (if 'clusters' not provided)

## if using a previous Seurat version use:
singler$meta.data$xy = seurat.object@dr$tsne@cell.embeddings # the tSNE coordinates
singler$meta.data$clusters = seurat.object@ident # the Seurat clusters (if 'clusters' not provided)

# this example is of course if the previous analysis was performed with Seurat, but any other previous coordinates and clusters can be used.

save(singler,file='singler_object.RData')

All the parameters are similar to case 1.

Creating a new reference data set

We have a reference dataset that we want to use. It contains N samples that can be annotated to n1 main cell types (i.e. macrophages or DCs) and n2 cell states (i.e. alveolar macrophages, interstitial macrophages, pDCs and cDCs).

The gene expression data should be gene-length normalized (TPM, FPKM etc.) and in log2 scale. The rownames must be gene symbols.

This is how we define the reference object:

  name = 'My_reference'
  expr = as.matrix(expr) # the expression matrix
  types = as.character(types) # a character list of the types. Samples from the same type should have the same name.
  main_types = as.character(main_types) # a character list of the main types. 
  ref = list(name=name,data = expr, types=types, main_types=main_types)

  # if using the de method, we can predefine the variable genes
  ref$de.genes = CreateVariableGeneSet(expr,types,200)
  ref$de.genes.main = CreateVariableGeneSet(expr,main_types,300)

  # if using the sd method, we need to define an sd threshold
  sd = apply(expr,1,sd)
  sd.thres = sort(sd, decreasing = T)[4000] # or any other threshold
  ref$sd.thres = sd.thres

  save(ref,file='ref.RData') # it is best to name the object and the file with the same name.

  # we can then use this reference in the previous functions. Multiple references can used.
  singler = CreateSinglerObject(... ref.list = list(immgen, ref, mouse.rnaseq)