In smorabit/scicat:

Cluster stability analysis

In this tutorial, we will analyze and validate cluster assignments in the 10X Genomics 3k PBMC dataset. We will be using Seurat for primary data processing, and scicat for cross validation of cluster assignments.

knitr::include_graphics("~/Desktop/MOBP.png")

Installing scicat

Install the scicat package from github:

library(Seurat)
library(dplyr)

# install scicat from github:
# library(devtools)
# devtools::install_github(repo='smorabit/scicat', force=T)
library(scicat)

primary data processing

First, we are just going to follow the Seurat clustering tutorial so we can have some clustered cells to look at. Download the PBMC dataset from 10X here.

Load the PBMC dataset

pbmc.data <- Read10X(data.dir = "../filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

filter low quality cells, based on mitochondrial reads and number of counts / number of features

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

# filter cells based on QC plots
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Now that we have filtered out low quality data, we can do the rest of the primary data processing. This includes normalizing and scaling the data, finding the most variable features, and running dimensionality reduction.

# processing:
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, nfeatures=2000)
pbmc <- ScaleData(pbmc, features=rownames(pbmc))
pbmc <- RunPCA(pbmc, features=VariableFeatures(pbmc), verbose=F)
pbmc <- FindNeighbors(pbmc, dims=1:10)
pbmc <- FindClusters(pbmc, resolution=0.5)
pbmc <- RunUMAP(pbmc, dims=1:10)

Plot UMAP to show clusters

DimPlot(pbmc, reduction = "umap")

Cluster validation

Here we do a little bit of statistics to validate our clustering. The statistical analysis schema is the following:

Take a random sample of the dataset.
Perform clustering on the sampled dataset.
Compute the intersection between the sampled clusters and the original clusters. Store the highest overlap.
Repeat n times.
Compute the average of the maximum overlaps of each cluster.

The metric for cluster stability that we are computing is "mean scaled intersection" (MSI), which is computed by averaging the maximum overlap between sampled clusters and original clusters for each iteration. We can think about the stability of each cluster using the following rule of thumb:

MSI >= 0.85: For sure a real cluster
0.85 > MSI >= 0.65 Clustering picks up pattern in the data, but not sure if this is the best clustering.
0.50 > MSI: Probably not a real cluster

First we will compute the stability of each cluster and plot the results.

library(reshape2)
library(scales)

bl <- sc_cluster_stability(pbmc, n_cells=round(ncol(pbmc)*9/10), n_bootstrap=15, verbose=F)
cluster_stability_barplot(bl)

A smarter way to select the resolution of your clusters is to check the stability of each cluster across several values for cluster resolution. The sc_cluster_res_stability() function will test a vector of cluster stabilities, which we will show as a heatmap using cluster_res_stability_heatmap.

cluster_resolutions = seq(1:10)/5
n_bootstrap = 5
n_cells = round(ncol(pbmc)*9/10)
resm <- sc_cluster_res_stability(pbmc, cluster_resolutions, n_cells=n_cells, n_bootstrap=n_bootstrap, verbose=F)
cluster_res_stability_heatmap(resm)