pipeline

In this tutorial, we annotate cell type for single cell RNA-seq based on bulk RNA-seq reference data.

require(Lightbulb)

load Seurat object

the single cell RNA-seq data should be stored in a Seurat object, with @dr$tsne, @ident and @meta.data filled. For generating your own Seurat object, please refere to the Seurat mannual (https://satijalab.org/seurat/get_started.html)

Exp_Seurat=readRDS("Lightbulb_example_ExpSeurat.rds")

load super cell data

We use super-cell matrix generated by lightbulb's Super_cell_creation, please refer to the super cell creation example for details (https://github.com/Arthurhe/Lightbulb/blob/master/Examples/Lightbulb_example_super_cell_creation_V1.ipynb). You can use other super-cell matrix for input, but make sure that the coloumn of genes are in the same order as bulk samples, and row names of the matrix indicates the cell position in Seurat object.

super_cell_matrix=readRDS(paste0("Lightbulb_example_supercell.rds"))
super_cell_matrix[1:5,1:5]

1110002L01Rik1190002N15Rik1500009L16Rik1700066B19Rik1810058I24Rik 8553.8070402.1943292.9079650.0000006.127443 6120.0000002.1943290.0000000.0000005.716956 18493.0964472.1943290.0000000.0000006.807217 6290.0000002.1943292.9079650.0000006.660153 18000.0000003.4747250.0000003.6965636.253754

load bulk data

bulk RNA-seq data in table form

LLTE_Tcm_Tem=readRDS("Lightbulb_example_d20_Tcell_bulk.rds")
TE_MP=readRDS("Lightbulb_example_d7_Tcell_bulk.rds")

#D30 bulk sample
LLTE_Tcm_Tem[1:5,]

NAMESp.LLTE#1Sp.LLTE#2Sp.TEM#1Sp.TEM#2Sp.TCM#1Sp.TCM#2 171110002L01Rik 15.10543 13.48353 12.23527 17.83553 22.822690 14.491386 451190002N15Rik 12.03903 26.50983 39.89131 34.67106 28.278363 22.302188 551500009L16Rik 18.17182 13.48353 12.23527 20.76345 6.455673 8.810802 2871700066B19Rik 32.27725 15.11182 19.14928 22.95939 25.550526 17.331678 4071810058I24Rik415.57691 401.55861 363.98557 406.51666 339.933657 370.237927

#D7 bulk sample
TE_MP[1:5,]

DESCRIPTIOND7_TED7_MP 9Rb1cc1 21.90129 21.64276 21Cspp1 11.72776 12.16485 22Arfgef1 20.60308 20.29343 28Ncoa2 27.08860 23.19000 29Tram1 541.67319472.38567

get genes to use for annotation

# get the highly variable gene from single cells (need to run FindVariableGenes on Seurat object first)
HVG=head(rownames(Exp_Seurat@hvg.info), 1500)

# which gene to use
candidate_g=Reduce(intersect, list(LLTE_Tcm_Tem$NAME,TE_MP$DESCRIPTION))
common_gene=intersect(candidate_g,HVG)

bulk_list is a 2 level list:

bulk_list ---sample_set1 ------sample1.1=data.frame (each row is one replicate, each column is a gene) ------sample1.2=vector (if no replicate) ---sample_set2 ------sample2.1=data.frame ------sample2.2=data.frame ------sample2.3=data.frame

all the sample dataframe or vector should have the same gene order

bulk_list=list(Spleen_D7=list(TE=TE_MP$D7_TE[match(common_gene,TE_MP$DESCRIPTION)],
                              MP=TE_MP$D7_MP[match(common_gene,TE_MP$DESCRIPTION)]),
               Spleen_D30=list(LLE=t(LLTE_Tcm_Tem[match(common_gene,LLTE_Tcm_Tem$NAME),2:3]),
                               TEM=t(LLTE_Tcm_Tem[match(common_gene,LLTE_Tcm_Tem$NAME),4:5]),
                               TCM=t(LLTE_Tcm_Tem[match(common_gene,LLTE_Tcm_Tem$NAME),6:7])))

#normalize bulk list, and format the data to right structure
bulk_list=bulk_list_normalize(bulk_list)

#We are only interested in the cells from spleen in the single cell data. Select cells for mapping in Seurat object
tag_cell_for_all_cells=which(Exp_Seurat@meta.data$tissue == "Spleen")

#Select super cells that have the spleen label
tag_cell=as.numeric(rownames(super_cell_matrix)) %in% tag_cell_for_all_cells

#create the super cell matrix
sc_mat=super_cell_matrix[tag_cell,common_gene]

dataset=data_preprocessing(bulk_list,sc_mat,common_gene)

Test whether the bulk sample set classification make sense. In the following plot, the color bar on the top and left indicate the bulk sample sets. Bulk samples shoule be highly similar within one set.

options(repr.plot.width=6, repr.plot.height=6)
plot_bulk_bulk_cor(dataset)

png

We need to know how similar are the super cells compare to bulk sets. We can plot the correlation distribution by plot_SC_sampleSet_cor_distribution, and set threshold to pick the right most peak

options(repr.plot.width=6, repr.plot.height=3)
par(mfrow=c(1,2))
plot_SC_sampleSet_cor_distribution(dataset)

png

it seems that 0.75 is a good threshold for both bulk sets, because the right most peak of both distribution start at around 0.7~0.75.

threshold=c(0.75,0.75)

To futhur illustrate how the correlation distribute, we can plot the correlation on TSNE by using plot_SC_sampleSet_cor_TSNE. By setting the thresholds parameter, all cells fall below thresholds in correlation will be shown as gray

options(repr.plot.width=9, repr.plot.height=4.5)
par(mfrow=c(1,2))
plot_SC_sampleSet_cor_TSNE(dataset,Exp_Seurat,thresholds = threshold,color_contrast=1.5)

png

With the threshold picked, we can run the 2nd round of correlation to identify the cell types.

dataset=secondary_cor(dataset,threshold)

We can plot the annotation on TSNE by running plot_group_annotation_TSNE

options(repr.plot.width=9, repr.plot.height=4.5)
par(mfrow=c(1,2))
plot_group_annotation_TSNE(dataset,Exp_Seurat)

png

We can also plot the score for each sample in their corresponding sample set with plot_SC_sampleSet_cor_2nd_TSNE

options(repr.plot.width=12, repr.plot.height=4)
par(mfrow=c(1,3))
plot_SC_sampleSet_cor_2nd_TSNE(dataset,Exp_Seurat,plot_sets=1)
par(mfrow=c(1,3))
plot_SC_sampleSet_cor_2nd_TSNE(dataset,Exp_Seurat,plot_sets=2)

png

We can furthur conduct cross analysis between annotated cells and the clustering result (stored in Exp_Seurat@ident) and other metadata (stored in Exp_Seurat@meta.data) Because we were using the super-cells for annotation, to annotate the single cells we need to label them according to clusters. We can use metadata_reprocessing to get the cells that's relevant for annotation, for example, all the spleen cells in this data set. This function will also store the metadata information into the dataset object.

dataset=metadata_reprocessing(dataset,Exp_Seurat,relavant_cells = which(Exp_Seurat@meta.data$tissue == "Spleen"))

With metadata preocessing done. We can annoate the single cells with the bulk annotation by using dataset. We will annotate an ident cluster by looking at what percentage of the cells in given cluster have been annotated by what bulk samples. If more than min_percent portion of a cluster is similar to a bulk sample, then we associate the name of that bulk sample with the cluster name.

dataset=celltype_annotation_by_cluster(dataset,min_percent = 0.3)

summary(dataset)

                     Length Class      Mode     
bulk_list               2   -none-     list     
bulk_list_centered      2   -none-     list     
bulk_df                 2   -none-     list     
bulk_df_centered        2   -none-     list     
sample_set_mean      3000   -none-     numeric  
sc_mat               1500   data.frame list     
SC_sampleSet_cor     8930   -none-     numeric  
bulk_bulk_cor          64   -none-     numeric  
replicate_df            2   -none-     list     
SC_sampleSet_cor_2nd    2   -none-     list     
celltype_assignment     2   data.frame list     
sc_metadata            11   data.frame list     
cluster_annotation     23   -none-     character

dataset object explained: dataset\$bulk_xx & dataset\$replicate_df stored the bulk RNA-seq data dataset\$sample_set_mean stored the mean expression of each sample set dataset\$sc_mat stored the super-cell matrix dataset\$SC_sampleSet_cor stored the 1st super-cell sample set correlation dataset\$bulk_bulk_cor stored the correlation between bulk samples dataset\$sc_metadata stored the metadata for super-cell dataset\$SC_sampleSet_cor_2nd stored the 2nd super-cell bulk sample correltion within each sample set dataset\$celltype_assignment stored the cell type assignment for each super cell dataset\$cluster_annotation stored the new bulk annotated name for each cluster in Seurat@ident

#options(repr.plot.width=12, repr.plot.height=6)
par(mfrow=c(1,3))
identMatch=names(dataset$cluster_annotation)
names(identMatch)=names(dataset$cluster_annotation)
plot_scMeta_NotOnlySuperCell_byIdent(dataset,Exp_Seurat,identMatch = identMatch,main="cluster id",addtext=T)

identMatch=paste0(names(dataset$cluster_annotation),":",dataset$cluster_annotation)
names(identMatch)=names(dataset$cluster_annotation)
plot_scMeta_NotOnlySuperCell_byIdent(dataset,Exp_Seurat,identMatch = identMatch,main="cluster id",addtext=F)

plot_scMeta_NotOnlySuperCell_byIdent(dataset,Exp_Seurat,main="annotated cluster",addtext=T)

png

par(mfrow=c(1,3))
identMatch=paste0("cluster",names(dataset$cluster_annotation),": ",dataset$cluster_annotation)
names(identMatch)=names(dataset$cluster_annotation)
plot_scMeta_NotOnlySuperCell_by_category(dataset,Exp_Seurat,identMatch)

png