In dviraran/SingleR: Reference-based single-cell RNA-seq annotation

knitr::opts_chunk$set(echo = TRUE)
library(SingleR)
library(ggplot2)
library(plotly)
library(ggrastr)
path.data = '~/Documents/SingleR/package/SingleR/manuscript_figures/FiguresData/'

UPDATE

SingleR now includes a function, CreateBigSingleRObject, that wraps the steps described in this tutorial:

CreateBigSingleRObject = function(counts,annot=NULL,project.name,xy,clusters,N=10000,
                                     min.genes=200,technology='10X',
                                     species='Human',citation='',
                                     ref.list=list(),normalize.gene.length=F,
                                     variable.genes='de',fine.tune=T,
                                     reduce.file.size=T,do.signatures=F,
                                     do.main.types=T,
                                     temp.dir=getwd(), numCores = SingleR.numCores) {

Analyzing a big data file with SingleR

A typical single-cell RNA-seq experiment now yields many thousands of cells. The current SingleR implementation requires reading the whole count matrix into memory, which is often impossible for datasets of this size.

Below we describe our analysis of the mouse cell atlas (MCA) single-cell data [@Han2018]. This dataset contains 250,000 cells. We ran SingleR on 20,000 cells at a time and combined the resulting objects at the end.
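As a rough illustration of this chunking scheme, the sketch below computes the start and end index of each block of cells. The variable names (`n.cells`, `chunk.size`, `chunks`) are hypothetical and not part of SingleR, and `n.cells` uses the rounded figure quoted above rather than the exact size of the dataset.

```r
# Toy illustration of the chunk boundaries; no SingleR call is made here.
n.cells    <- 250000   # rounded cell count quoted in the text (assumption)
chunk.size <- 20000    # number of cells processed per SingleR run

# First index of every chunk, and the matching last index capped at n.cells
starts <- seq(1, n.cells, by = chunk.size)
ends   <- pmin(starts + chunk.size - 1, n.cells)

chunks <- data.frame(start = starts, end = ends, size = ends - starts + 1)

print(nrow(chunks))     # number of SingleR runs needed
print(tail(chunks, 2))  # the last chunk is smaller than chunk.size
```

Capping the end index with `pmin` plays the same role as the `min(i+20000-1, ...)` expression used in the loop in this tutorial: the final chunk simply contains whatever cells remain.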

We started by following the excellent Seurat tutorial for analyzing the MCA data https://satijalab.org/seurat/mca.html. After the analysis was performed, we saved the Seurat object as mca.rds.

load ('~/GSE108097/mca.rds')

sc.datasets <- list.files('~/GSE108097/', pattern='dge')

s = seq(1,length(mca@cell.names),by=20000)
for (i in s) {
  print(i)
  A = seq(i,min(i+20000-1,length(mca@cell.names)))
  annot=mca@meta.data$Tissue[A]
  names(annot) = rownames(mca@meta.data)[A]

  singler = CreateSinglerObject(mca@raw.data[,A], annot = annot, project.name='MCA', 
                                min.genes = 0,  technology = "Microwell-Seq", 
                                species = "Mouse", citation = "Han et al. 2018",
                                do.signatures = F, clusters = mca@ident[A])

  save(singler,file=paste0('~/GSE108097/SingleR/singler.mca.',i,'.RData'))
}

singler.objects.file <- list.files('~/GSE108097/SingleR/', 
                                   pattern='RData',full.names=T)

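# load() returns the names of the restored objects, not the objects themselves,
# so return the restored `singler` object explicitly
singler.objects = lapply(singler.objects.file,FUN=function(x) {
  load(x)
  singler
})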

singler = SingleR.Combine(singler.objects,order = mca@cell.names,
                          clusters=mca@ident,xy=mca@dr$tsne@cell.embeddings)
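One detail of this save-and-reload pattern is worth spelling out: `load()` restores objects into an environment and returns only their names. The sketch below, which uses small placeholder lists instead of real SingleR objects and a hypothetical helper named `read.chunk`, shows one way to collect the per-chunk results into a list. The file names and the temporary directory are made up for the example.

```r
# Hypothetical stand-ins: each "result" is a small list, not a SingleR object.
tmp.dir <- file.path(tempdir(), "singler_chunks_demo")
dir.create(tmp.dir, showWarnings = FALSE)

starts <- c(1, 4, 7)
for (i in starts) {
  singler <- list(first.cell = i, cells = seq(i, i + 2))
  save(singler, file = file.path(tmp.dir, paste0("singler.demo.", i, ".RData")))
}

files <- list.files(tmp.dir, pattern = "RData$", full.names = TRUE)

# load() returns the *names* of the restored objects, so fetch the object
# itself from a private environment rather than using load()'s return value.
read.chunk <- function(path) {
  env <- new.env()
  obj.names <- load(path, envir = env)
  get(obj.names[1], envir = env)
}

chunk.objects <- lapply(files, read.chunk)
loaded.names  <- load(files[1], envir = new.env())

print(loaded.names)           # a character vector of object names
print(length(chunk.objects))  # one restored object per file
```

Loading into a fresh environment also avoids overwriting a `singler` variable that may already exist in the workspace. Note that `list.files` returns names in alphabetical rather than numerical order, which is why the combining step is given an explicit cell order.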

This SingleR object is fully functional, but it is too large to work with comfortably. We therefore created a small data frame containing only the annotations and the tSNE coordinates:

load(file.path(path.data,'mca.singler.rds'))
mca.singler = as.data.frame(mca.singler)
mca.singler$labels = paste0('Annotation: ',mca.singler$Types,'\nTissue: ',mca.singler$Orig.Ident)
mca.singler$FItSNE_1 = as.numeric(levels(mca.singler$FItSNE_1))[mca.singler$FItSNE_1]
mca.singler$FItSNE_2 = as.numeric(levels(mca.singler$FItSNE_2))[mca.singler$FItSNE_2]
p = ggplot(mca.singler)+geom_point_rast(aes(x=FItSNE_1,y=FItSNE_2,color=Main.Types),
                                        size=0.1,alpha=0.5)+
  guides(color=guide_legend(override.aes = list(size=2,alpha=1)))+
  scale_color_manual(values = singler.colors)+
  theme_classic()
p
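The two `FItSNE` lines above convert factor columns to numbers by indexing the numeric levels with the factor. A minimal sketch on a toy factor (the values are invented for illustration) shows why this idiom is used instead of calling `as.numeric` on the factor directly:

```r
# Toy factor standing in for a coordinate column that was stored as a factor.
f <- factor(c("10.5", "-3.2", "10.5", "0.7"))

wrong <- as.numeric(f)             # integer level codes, not the coordinates
right <- as.numeric(levels(f))[f]  # the numbers the level labels spell out

print(wrong)
print(right)
```

An equivalent, slightly slower alternative is `as.numeric(as.character(f))`.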