In this tutorial, we will analyze and validate cluster assignments in the 10X Genomics 3k PBMC dataset. We will be using Seurat for primary data processing, and scicat for cross validation of cluster assignments.
Install the scicat package from github:
library(Seurat) library(dplyr) # install scicat from github: # library(devtools) # devtools::install_github(repo='smorabit/scicat', force=T) library(scicat)
First, we are just going to follow the Seurat clustering tutorial so we can have some clustered cells to look at. Download the PBMC dataset from 10X here.
Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../filtered_gene_bc_matrices/hg19/") pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
filter low quality cells, based on mitochondrial reads and number of counts / number of features
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") CombinePlots(plots = list(plot1, plot2)) pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
Primary data processing
# processing: pbmc <- NormalizeData(pbmc) pbmc <- FindVariableFeatures(pbmc, nfeatures=2000) pbmc <- ScaleData(pbmc, features=rownames(pbmc)) pbmc <- RunPCA(pbmc, features=VariableFeatures(pbmc), verbose=F) pbmc <- FindNeighbors(pbmc, dims=1:10) pbmc <- FindClusters(pbmc, resolution=0.5) pbmc <- RunUMAP(pbmc, dims=1:10)
Plot UMAP to show clusters
DimPlot(pbmc, reduction = "umap")
Here we do a little bit of statistics to validate our clustering. The statistical analysis schema is the following:
The metric for cluster stability that we are computing is "mean scaled intersection" (MSI), which is computed by averaging the maximum overlap between sampled clusters and original clusters for each iteration. We can think about the stability of each cluster using the following rule of thumb:
library(reshape2) library(scales) # let's check the stability of the current clustering: bl <- sc_cluster_stability(pbmc, n_cells=round(ncol(pbmc)*9/10), n_bootstrap=15, verbose=F) cluster_stability_barplot(bl)
# let's test the performance of a couple different cluster resolutions: cluster_resolutions = seq(1:20)/10 n_bootstrap = 25 n_cells = round(ncol(pbmc)*9/10) resm <- sc_cluster_res_stability(pbmc, cluster_resolutions, n_cells=n_cells, n_bootstrap=n_bootstrap, verbose=F) cluster_res_stability_heatmap(resm)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.