knitr::opts_chunk$set(echo = TRUE)

Introduction

Simpler single cell RNAseq data clustering (sscClust) is a package implement the clustering pipeline for simple usage. The single cell RNAseq data clustering is usually consist of variable genes selection, dimension reduction, clustering on reduced data. Currently, this package also wrap other clustering method designed specifically for single cell RNAseq data, including SC3, ZinbWave etc.

data description

Here, we would use the Pollen dataset which is included in this package. This data set contains expression data of 249 cells by 6,982 gene, with eleven clusters corresponding to skin cells, pluripotent stem cells, blood cells, neural cells, etc. To calculate the ARI index and record the excuting time of functions, we also load two packages, r CRANpkg("igraph") and r CRANpkg("tictoc").

library("SingleCellExperiment")
library("sscClust")
library("igraph")
library("tictoc")
data("sce.Pollen")
sce.Pollen

Step by step usage

variable gene selection

We selected the top 1500 genes with higheast sd for clustering. There are many variable genes identification methods in the single cell RNAseq research community. For example, fit CV^2~mean relationship using a GLM model then the genes significantly higher than the fitted value are defined as highly variable genes. Here the sd method is sufficient.

sce.Pollen <- ssc.variableGene(sce.Pollen,method="sd",sd.n=1500)
str(metadata(sce.Pollen)$ssc$variable.gene$sd)

dimension reduction

Before clustering, we can use tSNE to visualize the expression profile of the data. This step also facilitate the selection of clustering parameters such number of clusters.

First, we perform PCA and keep only top PCs. By default, top 30 PCs will be kept. On the scree plot which dipict the variation exlained by each PC, sscClust can automatically find a knee point which can be used to set the number of PC to keep.

sce.Pollen <- ssc.reduceDim(sce.Pollen,method="pca",seed = 9997)
ssc.plot.pca(sce.Pollen)

The scree plot suggest top 12 PCs should be kept. If the automatically selected knee point is not optimal, the pca.npc can be set explicitly.

sce.Pollen <- ssc.reduceDim(sce.Pollen,method="pca",seed = 9997,pca.npc = 12)

By default, when PCA is finished, a t-SNE map is also generated internally. We can use ssc.plot.tsne to visualize the data.

ssc.plot.tsne(sce.Pollen,columns = "true_labs",reduced.name = "pca.tsne")

Note, a SingleCellExperiment object can contain multile reduced dimension respresentation of the data, which are stored in the reducedDim slot under different names. The reduced data by PCA is stored with name "pca", and the automatically generated t-SNE map is stored with name "pca.tsne".

clustering

We will use kmeans algorithm (implemented in function ssc.clust) to cluster the cells into 11 groups, and use the labels from original analysis of the dataset as true labels. The parameters method and k.batch is set to kmeans and 11 respectively. The dimension reduction by PCA has been done already, so method.reduction of ssc.clust is be pca:

tic("clustering")
sce.Pollen <- ssc.clust(sce.Pollen, method.reduction="pca", method="kmeans", k.batch=11,seed = 9997)
toc()

We can compare the clustering result with that of original study by ARI index:

compare(as.factor(sce.Pollen$true_labs),
        as.factor(sce.Pollen$pca.kmeans.k11),
        method="adjusted.rand")

Also we compare the clustering result by t-SNE:

ssc.plot.tsne(sce.Pollen,columns = c("pca.kmeans.k11","true_labs"),reduced.name = "pca.tsne")

All in One

We can run all the steps above in one step, using function ssc.run:

sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca",method.clust = "kmeans", k.batch=11,seed = 9997)

various clustering pipelines

data transformation by spearman correlation, than kmeans clustering

sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "iCor",method.clust = "kmeans", k.batch=11,seed = 9997)
compare(as.factor(sce.Pollen$true_labs),
        as.factor(sce.Pollen$iCor.kmeans.k11),
        method="adjusted.rand")

dimension reduction by pca, than clustering based on SNN

sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca",
                      method.clust = "SNN", SNN.k=10,SNN.method="eigen",seed=9997)
compare(as.factor(sce.Pollen$true_labs),
        as.factor(sce.Pollen$pca.SNN.kauto),
        method="adjusted.rand")

dimension by pca and tSNE, run density based clustering on the tSNE map

Use ADPclust, automatically determine peaks

sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca",
                      method.clust = "adpclust", seed=9997)
compare(as.factor(sce.Pollen$true_labs),
        as.factor(sce.Pollen$pca.tsne.adpclust.kauto),
        method="adjusted.rand")

manually specified peaks

sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca",
                      method.clust = "dpclust", 
                      parfile = system.file("extdata/Pollen.par.r",package = "sscClust"),
                      out.prefix = "./Pollen.dpclust", seed=9997)
compare(as.factor(sce.Pollen$true_labs),
        as.factor(sce.Pollen$pca.tsne.dpclust.kauto),
        method="adjusted.rand")

manually specified peaks, perform second round of clustering on C5:

sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,method.reduction = "pca",
                      method.clust = "dpclust",nIter = 2, 
                      parfile = system.file("extdata/Pollen.par.r",package = "sscClust"),
                      out.prefix = "./Pollen.dpclust", seed=9997)
compare(as.factor(sce.Pollen$true_labs),
        as.factor(sce.Pollen$pca.tsne.dpclust.kauto),
        method="adjusted.rand")

wrapper for SC3:

sce.Pollen <- ssc.run(sce.Pollen,method.vgene = "sd",sd.n = 1500,k.batch = 9:13,method.clust = "SC3",seed=9997)
compare(as.factor(sce.Pollen$true_labs),
        as.factor(sce.Pollen$sc3_11_clusters),
        method="adjusted.rand")

visualize the result from various methods

ssc.plot.tsne(sce.Pollen,columns = c("true_labs","pca.kmeans.k11",
                                     "iCor.kmeans.k11","pca.SNN.kauto",
                                     "pca.tsne.dpclust.kauto","sc3_11_clusters"),reduced.name = "pca.tsne")

Session info {.unnumbered}

sessionInfo()


Japrin/sscClust documentation built on May 23, 2024, 7:12 a.m.