In smorabit/scicat:

Installation and data processing {.tabset .tabset-fade .tabset-pills}

In this tutorial, we will analyze and validate cluster assignments in the 10X Genomics 3k PBMC dataset. We will be using Seurat for primary data processing, and scicat for cross validation of cluster assignments.

installing scicat

Install the scicat package from github:

library(Seurat)
library(dplyr)

# install scicat from github:
# library(devtools)
# devtools::install_github(repo='smorabit/scicat', force=T)
library(scicat)

primary data processing

First, we are just going to follow the Seurat clustering tutorial so we can have some clustered cells to look at. Download the PBMC dataset from 10X here.

Load the PBMC dataset

pbmc.data <- Read10X(data.dir = "../filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

filter low quality cells, based on mitochondrial reads and number of counts / number of features

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Primary data processing

# processing:
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, nfeatures=2000)
pbmc <- ScaleData(pbmc, features=rownames(pbmc))
pbmc <- RunPCA(pbmc, features=VariableFeatures(pbmc), verbose=F)
pbmc <- FindNeighbors(pbmc, dims=1:10)
pbmc <- FindClusters(pbmc, resolution=0.5)
pbmc <- RunUMAP(pbmc, dims=1:10)

Plot UMAP to show clusters

DimPlot(pbmc, reduction = "umap")

Cluster validation {.tabset .tabset-fade .tabset-pills}

Here we do a little bit of statistics to validate our clustering. The statistical analysis schema is the following:

Take a random sample of the dataset.
Perform clustering on the sampled dataset.
Compute the intersection between the sampled clusters and the original clusters. Store the highest overlap.
Repeat n times.
Compute the average of the maximum overlaps of each cluster.

The metric for cluster stability that we are computing is "mean scaled intersection" (MSI), which is computed by averaging the maximum overlap between sampled clusters and original clusters for each iteration. We can think about the stability of each cluster using the following rule of thumb:

MSI >= 0.85: For sure a real cluster
0.85 > MSI >= 0.65 Clustering picks up pattern in the data, but not sure if this is the best clustering.
0.50 > MSI: Probably not a real cluster

library(reshape2)
library(scales)

# let's check the stability of the current clustering:
bl <- sc_cluster_stability(pbmc, n_cells=round(ncol(pbmc)*9/10), n_bootstrap=15, verbose=F)
cluster_stability_barplot(bl)

# let's test the performance of a couple different cluster resolutions:

cluster_resolutions = seq(1:20)/10
n_bootstrap = 25
n_cells = round(ncol(pbmc)*9/10)
resm <- sc_cluster_res_stability(pbmc, cluster_resolutions, n_cells=n_cells, n_bootstrap=n_bootstrap, verbose=F)
cluster_res_stability_heatmap(resm)