In this tutorial, we will analyze and validate cluster assignments in the 10X Genomics 3k PBMC dataset, using Seurat for primary data processing and scicat for cross-validation of the cluster assignments.
Install the scicat package from GitHub and load the required libraries:
```r
library(Seurat)
library(dplyr)

# install scicat from GitHub:
# library(devtools)
# devtools::install_github(repo = 'smorabit/scicat', force = TRUE)
library(scicat)
```
First, we will follow the Seurat clustering tutorial so that we have some clustered cells to work with. Download the PBMC dataset from 10X here.
Load the PBMC dataset:
```r
pbmc.data <- Read10X(data.dir = "../filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
```
Filter low-quality cells based on the percentage of mitochondrial reads, the number of counts, and the number of features:
```r
# compute the percentage of mitochondrial reads per cell
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# QC plots
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

# filter cells based on QC plots
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
```
Now that we have filtered out the low-quality cells, we can perform the rest of the primary data processing: normalizing and scaling the data, finding the most variable features, running dimensionality reduction, and clustering.
```r
# processing:
pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc, nfeatures = 2000)
pbmc <- ScaleData(pbmc, features = rownames(pbmc))
pbmc <- RunPCA(pbmc, features = VariableFeatures(pbmc), verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
```
Plot the UMAP to show the clusters:

```r
DimPlot(pbmc, reduction = "umap")
```
Here we perform some statistical analysis to validate our clustering. The general scheme is to repeatedly subsample the cells, re-cluster each subsample, and compare the resulting cluster assignments to the original ones.
The metric for cluster stability that we compute is the "mean scaled intersection" (MSI), obtained by averaging, over bootstrap iterations, the size-scaled maximum overlap between each original cluster and the sampled clusters. As a rule of thumb, the higher a cluster's MSI, the more stable it is.
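To make the metric concrete, here is a minimal sketch of a size-scaled intersection for a single bootstrap iteration. This is an illustrative assumption about how such a metric can be computed, not the actual scicat implementation, and the function name `scaled_intersection` is hypothetical:

```r
# Illustrative sketch only -- NOT the scicat implementation.
# For one bootstrap iteration: for each original cluster, find the
# bootstrap cluster with the largest overlap and scale by cluster size.
scaled_intersection <- function(orig_labels, boot_labels) {
  # the bootstrap sample only covers a subset of the cells
  cells <- intersect(names(orig_labels), names(boot_labels))
  clusters <- unique(unname(orig_labels[cells]))
  sapply(clusters, function(cl) {
    in_cl <- cells[orig_labels[cells] == cl]
    # largest overlap with any bootstrap cluster, scaled to [0, 1]
    max(table(boot_labels[in_cl])) / length(in_cl)
  })
}

# toy example: 6 cells in 2 original clusters; the bootstrap keeps 5 cells
orig <- c(a = "1", b = "1", c = "1", d = "2", e = "2", f = "2")
boot <- c(a = "A", b = "A", c = "B", d = "C", e = "C")
si <- scaled_intersection(orig, boot)
si        # cluster "1": 2/3, cluster "2": 1
mean(si)  # averaging these values across iterations gives the MSI
```

In the toy example, original cluster "2" lands entirely in one bootstrap cluster (scaled intersection of 1), while cluster "1" splits across two (2/3), so cluster "2" would be considered more stable.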
First we will compute the stability of each cluster and plot the results.
```r
library(reshape2)
library(scales)

bl <- sc_cluster_stability(pbmc, n_cells = round(ncol(pbmc) * 9/10), n_bootstrap = 15, verbose = FALSE)
cluster_stability_barplot(bl)
```
A smarter way to select your cluster resolution is to check the stability of each cluster across several resolution values. The sc_cluster_res_stability() function tests a vector of cluster resolutions, and we will visualize the results as a heatmap using cluster_res_stability_heatmap().
```r
cluster_resolutions <- (1:10) / 5
n_bootstrap <- 5
n_cells <- round(ncol(pbmc) * 9/10)

resm <- sc_cluster_res_stability(pbmc, cluster_resolutions, n_cells = n_cells, n_bootstrap = n_bootstrap, verbose = FALSE)
cluster_res_stability_heatmap(resm)
```