In this tutorial, we annotate cell type for single cell RNA-seq based on bulk RNA-seq reference data.
require(Lightbulb)
the single cell RNA-seq data should be stored in a Seurat object, with @dr$tsne, @ident and @meta.data filled. For generating your own Seurat object, please refere to the Seurat mannual (https://satijalab.org/seurat/get_started.html)
Exp_Seurat=readRDS("Lightbulb_example_ExpSeurat.rds")
We use super-cell matrix generated by lightbulb's Super_cell_creation, please refer to the super cell creation example for details (https://github.com/Arthurhe/Lightbulb/blob/master/Examples/Lightbulb_example_super_cell_creation_V1.ipynb). You can use other super-cell matrix for input, but make sure that the coloumn of genes are in the same order as bulk samples, and row names of the matrix indicates the cell position in Seurat object.
super_cell_matrix=readRDS(paste0("Lightbulb_example_supercell.rds"))
super_cell_matrix[1:5,1:5]
1110002L01Rik1190002N15Rik1500009L16Rik1700066B19Rik1810058I24Rik
8553.8070402.1943292.9079650.0000006.127443
6120.0000002.1943290.0000000.0000005.716956
18493.0964472.1943290.0000000.0000006.807217
6290.0000002.1943292.9079650.0000006.660153
18000.0000003.4747250.0000003.6965636.253754
bulk RNA-seq data in table form
LLTE_Tcm_Tem=readRDS("Lightbulb_example_d20_Tcell_bulk.rds")
TE_MP=readRDS("Lightbulb_example_d7_Tcell_bulk.rds")
#D30 bulk sample
LLTE_Tcm_Tem[1:5,]
NAMESp.LLTE#1Sp.LLTE#2Sp.TEM#1Sp.TEM#2Sp.TCM#1Sp.TCM#2
171110002L01Rik 15.10543 13.48353 12.23527 17.83553 22.822690 14.491386
451190002N15Rik 12.03903 26.50983 39.89131 34.67106 28.278363 22.302188
551500009L16Rik 18.17182 13.48353 12.23527 20.76345 6.455673 8.810802
2871700066B19Rik 32.27725 15.11182 19.14928 22.95939 25.550526 17.331678
4071810058I24Rik415.57691 401.55861 363.98557 406.51666 339.933657 370.237927
#D7 bulk sample
TE_MP[1:5,]
DESCRIPTIOND7_TED7_MP
9Rb1cc1 21.90129 21.64276
21Cspp1 11.72776 12.16485
22Arfgef1 20.60308 20.29343
28Ncoa2 27.08860 23.19000
29Tram1 541.67319472.38567
# get the highly variable gene from single cells (need to run FindVariableGenes on Seurat object first)
HVG=head(rownames(Exp_Seurat@hvg.info), 1500)
# which gene to use
candidate_g=Reduce(intersect, list(LLTE_Tcm_Tem$NAME,TE_MP$DESCRIPTION))
common_gene=intersect(candidate_g,HVG)
bulk_list is a 2 level list:
bulk_list ---sample_set1 ------sample1.1=data.frame (each row is one replicate, each column is a gene) ------sample1.2=vector (if no replicate) ---sample_set2 ------sample2.1=data.frame ------sample2.2=data.frame ------sample2.3=data.frame
all the sample dataframe or vector should have the same gene order
bulk_list=list(Spleen_D7=list(TE=TE_MP$D7_TE[match(common_gene,TE_MP$DESCRIPTION)],
MP=TE_MP$D7_MP[match(common_gene,TE_MP$DESCRIPTION)]),
Spleen_D30=list(LLE=t(LLTE_Tcm_Tem[match(common_gene,LLTE_Tcm_Tem$NAME),2:3]),
TEM=t(LLTE_Tcm_Tem[match(common_gene,LLTE_Tcm_Tem$NAME),4:5]),
TCM=t(LLTE_Tcm_Tem[match(common_gene,LLTE_Tcm_Tem$NAME),6:7])))
#normalize bulk list, and format the data to right structure
bulk_list=bulk_list_normalize(bulk_list)
#We are only interested in the cells from spleen in the single cell data. Select cells for mapping in Seurat object
tag_cell_for_all_cells=which(Exp_Seurat@meta.data$tissue == "Spleen")
#Select super cells that have the spleen label
tag_cell=as.numeric(rownames(super_cell_matrix)) %in% tag_cell_for_all_cells
#create the super cell matrix
sc_mat=super_cell_matrix[tag_cell,common_gene]
dataset=data_preprocessing(bulk_list,sc_mat,common_gene)
Test whether the bulk sample set classification make sense. In the following plot, the color bar on the top and left indicate the bulk sample sets. Bulk samples shoule be highly similar within one set.
options(repr.plot.width=6, repr.plot.height=6)
plot_bulk_bulk_cor(dataset)
We need to know how similar are the super cells compare to bulk sets. We can plot the correlation distribution by plot_SC_sampleSet_cor_distribution, and set threshold to pick the right most peak
options(repr.plot.width=6, repr.plot.height=3)
par(mfrow=c(1,2))
plot_SC_sampleSet_cor_distribution(dataset)
it seems that 0.75 is a good threshold for both bulk sets, because the right most peak of both distribution start at around 0.7~0.75.
threshold=c(0.75,0.75)
To futhur illustrate how the correlation distribute, we can plot the correlation on TSNE by using plot_SC_sampleSet_cor_TSNE. By setting the thresholds parameter, all cells fall below thresholds in correlation will be shown as gray
options(repr.plot.width=9, repr.plot.height=4.5)
par(mfrow=c(1,2))
plot_SC_sampleSet_cor_TSNE(dataset,Exp_Seurat,thresholds = threshold,color_contrast=1.5)
With the threshold picked, we can run the 2nd round of correlation to identify the cell types.
dataset=secondary_cor(dataset,threshold)
We can plot the annotation on TSNE by running plot_group_annotation_TSNE
options(repr.plot.width=9, repr.plot.height=4.5)
par(mfrow=c(1,2))
plot_group_annotation_TSNE(dataset,Exp_Seurat)
We can also plot the score for each sample in their corresponding sample set with plot_SC_sampleSet_cor_2nd_TSNE
options(repr.plot.width=12, repr.plot.height=4)
par(mfrow=c(1,3))
plot_SC_sampleSet_cor_2nd_TSNE(dataset,Exp_Seurat,plot_sets=1)
par(mfrow=c(1,3))
plot_SC_sampleSet_cor_2nd_TSNE(dataset,Exp_Seurat,plot_sets=2)
We can furthur conduct cross analysis between annotated cells and the clustering result (stored in Exp_Seurat@ident) and other metadata (stored in Exp_Seurat@meta.data) Because we were using the super-cells for annotation, to annotate the single cells we need to label them according to clusters. We can use metadata_reprocessing to get the cells that's relevant for annotation, for example, all the spleen cells in this data set. This function will also store the metadata information into the dataset object.
dataset=metadata_reprocessing(dataset,Exp_Seurat,relavant_cells = which(Exp_Seurat@meta.data$tissue == "Spleen"))
With metadata preocessing done. We can annoate the single cells with the bulk annotation by using dataset. We will annotate an ident cluster by looking at what percentage of the cells in given cluster have been annotated by what bulk samples. If more than min_percent portion of a cluster is similar to a bulk sample, then we associate the name of that bulk sample with the cluster name.
dataset=celltype_annotation_by_cluster(dataset,min_percent = 0.3)
summary(dataset)
Length Class Mode
bulk_list 2 -none- list
bulk_list_centered 2 -none- list
bulk_df 2 -none- list
bulk_df_centered 2 -none- list
sample_set_mean 3000 -none- numeric
sc_mat 1500 data.frame list
SC_sampleSet_cor 8930 -none- numeric
bulk_bulk_cor 64 -none- numeric
replicate_df 2 -none- list
SC_sampleSet_cor_2nd 2 -none- list
celltype_assignment 2 data.frame list
sc_metadata 11 data.frame list
cluster_annotation 23 -none- character
dataset object explained: dataset\$bulk_xx & dataset\$replicate_df stored the bulk RNA-seq data dataset\$sample_set_mean stored the mean expression of each sample set dataset\$sc_mat stored the super-cell matrix dataset\$SC_sampleSet_cor stored the 1st super-cell sample set correlation dataset\$bulk_bulk_cor stored the correlation between bulk samples dataset\$sc_metadata stored the metadata for super-cell dataset\$SC_sampleSet_cor_2nd stored the 2nd super-cell bulk sample correltion within each sample set dataset\$celltype_assignment stored the cell type assignment for each super cell dataset\$cluster_annotation stored the new bulk annotated name for each cluster in Seurat@ident
#options(repr.plot.width=12, repr.plot.height=6)
par(mfrow=c(1,3))
identMatch=names(dataset$cluster_annotation)
names(identMatch)=names(dataset$cluster_annotation)
plot_scMeta_NotOnlySuperCell_byIdent(dataset,Exp_Seurat,identMatch = identMatch,main="cluster id",addtext=T)
identMatch=paste0(names(dataset$cluster_annotation),":",dataset$cluster_annotation)
names(identMatch)=names(dataset$cluster_annotation)
plot_scMeta_NotOnlySuperCell_byIdent(dataset,Exp_Seurat,identMatch = identMatch,main="cluster id",addtext=F)
plot_scMeta_NotOnlySuperCell_byIdent(dataset,Exp_Seurat,main="annotated cluster",addtext=T)
par(mfrow=c(1,3))
identMatch=paste0("cluster",names(dataset$cluster_annotation),": ",dataset$cluster_annotation)
names(identMatch)=names(dataset$cluster_annotation)
plot_scMeta_NotOnlySuperCell_by_category(dataset,Exp_Seurat,identMatch)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.