knitr::opts_chunk$set(echo = TRUE)
Simpler single cell RNAseq data clustering (sscClust
) is a package implement the clustering pipeline for simple usage. The single cell RNAseq data clustering is usually consist of variable genes selection, dimension reduction, clustering on reduced data. Currently, this package also wrap other clustering method designed specifically for single cell RNAseq data, including SC3
, ZinbWave
etc.
Here, we would use the Pollen dataset which is included in this package. This data set contains expression data of 249 cells by 6,982 gene, with eleven clusters corresponding to skin cells, pluripotent stem cells, blood cells, neural cells, etc. To calculate the ARI index and record the excuting time of functions, we also load two packages, r CRANpkg("igraph")
and r CRANpkg("tictoc")
.
library("SingleCellExperiment") library("sscClust") library("igraph") library("tictoc") data("sce.Pollen") sce.Pollen
We selected the top 1500 genes with higheast sd for clustering. There are many variable genes identification methods in the single cell RNAseq research community. For example, fit CV^2~mean relationship using a GLM model then the genes significantly higher than the fitted value are defined as highly variable genes. Here the sd method is sufficient.
sce.Pollen <- ssc.variableGene(sce.Pollen,method="sd",sd.n=1500) str(metadata(sce.Pollen)$ssc$variable.gene$sd)
Before clustering, we can use tSNE to visualize the expression profile of the data. This step also facilitate the selection of clustering parameters such number of clusters.
First, we perform PCA and keep only top PCs. By default, top 30 PCs will be kept. On the scree plot which dipict the variation exlained by each PC, sscClust
can automatically find a knee point which can be used to set the number of PC to keep.
sce.Pollen <- ssc.reduceDim(sce.Pollen,method="pca",seed = 9997) ssc.plot.pca(sce.Pollen)
The scree plot suggest top 12 PCs should be kept. If the automatically selected knee point is not optimal, the pca.npc
can be set explicitly.
sce.Pollen <- ssc.reduceDim(sce.Pollen,method="pca",seed = 9997,pca.npc = 12)
By default, when PCA is finished, a t-SNE map is also generated internally. We can use ssc.plot.tsne
to visualize the data.
ssc.plot.tsne(sce.Pollen,columns = "true_labs",reduced.name = "pca.tsne")
Note, a SingleCellExperiment object can contain multile reduced dimension respresentation of the data, which are stored in the reducedDim slot under different names. The reduced data by PCA is stored with name "pca", and the automatically generated t-SNE map is stored with name "pca.tsne".
We will use kmeans algorithm (implemented in function ssc.clust
) to cluster the cells into 11 groups, and use the labels from original analysis of the dataset as true labels. The parameters method
and k.batch
is set to kmeans
and 11
respectively. The dimension reduction by PCA has been done already, so method.reduction
of ssc.clust
is be pca
:
tic("clustering") sce.Pollen <- ssc.clust(sce.Pollen, method.reduction="pca", method="kmeans", k.batch=11,seed = 9997) toc()
We can compare the clustering result with that of original study by ARI index:
compare(as.factor(sce.Pollen$true_labs), as.factor(sce.Pollen$pca.kmeans.k11), method="adjusted.rand")
Also we compare the clustering result by t-SNE:
ssc.plot.tsne(sce.Pollen,columns = c("pca.kmeans.k11","true_labs"),reduced.name = "pca.tsne")
We can run all the steps above in one step, using function ssc.run
:
sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca",method.clust = "kmeans", k.batch=11,seed = 9997)
sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "iCor",method.clust = "kmeans", k.batch=11,seed = 9997) compare(as.factor(sce.Pollen$true_labs), as.factor(sce.Pollen$iCor.kmeans.k11), method="adjusted.rand")
sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca", method.clust = "SNN", SNN.k=10,SNN.method="eigen",seed=9997) compare(as.factor(sce.Pollen$true_labs), as.factor(sce.Pollen$pca.SNN.kauto), method="adjusted.rand")
Use ADPclust, automatically determine peaks
sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca", method.clust = "adpclust", seed=9997) compare(as.factor(sce.Pollen$true_labs), as.factor(sce.Pollen$pca.tsne.adpclust.kauto), method="adjusted.rand")
manually specified peaks
sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca", method.clust = "dpclust", parfile = system.file("extdata/Pollen.par.r",package = "sscClust"), out.prefix = "./Pollen.dpclust", seed=9997) compare(as.factor(sce.Pollen$true_labs), as.factor(sce.Pollen$pca.tsne.dpclust.kauto), method="adjusted.rand")
manually specified peaks, perform second round of clustering on C5:
sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca", method.clust = "dpclust",nIter = 2, parfile = system.file("extdata/Pollen.par.r",package = "sscClust"), out.prefix = "./Pollen.dpclust", seed=9997) compare(as.factor(sce.Pollen$true_labs), as.factor(sce.Pollen$pca.tsne.dpclust.kauto), method="adjusted.rand")
sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,k.batch = 9:13,method.clust = "SC3",seed=9997) compare(as.factor(sce.Pollen$true_labs), as.factor(sce.Pollen$sc3_11_clusters), method="adjusted.rand")
ssc.plot.tsne(sce.Pollen,columns = c("true_labs","pca.kmeans.k11", "iCor.kmeans.k11","pca.SNN.kauto", "pca.tsne.dpclust.kauto","sc3_11_clusters"),reduced.name = "pca.tsne")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.