In ncchung/SeuratAddon: Unsupervised evaluation of cell identities in Seurat

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Single cell RNA sequencing (scRNA-seq) allows us to dissect cellular heterogeneity arising from tissue types, spatio-temporal contexts, and environmental stimuli. Cell identities of single cell samples derived from such heterogeneous populations are routinely determined by clustering of scRNA-seq data. Computationally defined cell identities are then used in downstream analysis, feature selection, and visualization. However, how can we examine if individual cell identities are accurately inferred? To this end, we introduce unsupervised non-parametric methods to probabilistically evaluate cell identities by testing cluster memberships of single cell samples. By learning uncertainty in clustering scRAN-seq data, the proposed methods help determine cell identities and dissect cellular heterogeneity.

Introduction

This SeuratAddon package helps the user to apply and visualize unsupervised evaluation of cell identities using a Seurat object, in its idiomatic manner.

The unsupervised evaluation of cell identities is enabled by the jackstraw methods for clustering. Given single cell samples are classified into K subpopulations by clustering, the jackstraw can estimate p-values and posterior inclusion probabilities for single cell samples. The main R package implementing a family of jackstraw methods is available on the jackstraw R package. This SeuratAddon package acts as a bridge between jackstraw and Seurat R packages, such that it extends clustering and visualization functions within its analysis pipeline. A user continues to work with a Seurat object.

The Seurat is a highly popular R package, for scRNA-seq quality control, processing, analysis, and more. From the authors:

A toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. 'Seurat' aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. See Satija R, Farrell J, Gennert D, et al (2015) , Macosko E, Basu A, Satija R, et al (2015) , and Butler A and Satija R (2017) for more details.

The Seurat toolkit has implemented a number of clustering and dimension reduction algorithms. We particularly focus on -- and provide modified functions for the new jackstraw methods -- dimension reduction using principal component analysis (PCA), unsupervised classification of samples using K-means clustering, and visualization using t-Distributed Stochastic Neighbor Embedding (t-SNE). Furthermore, we provide additional clustering algorithms such as K-means++ initialization (Arthur and Vassilvitskii http://dl.acm.org/citation.cfm?id=1283383.1283494) and mini-batch K-means clustering (Sculley 2010 ) for a Seurat object.

Prepare a Seurat object with PBMC data

This vignette closely follows the Guided Tutorial -- 2700 PBMCs, until the clustering steps. If you would like to learn about new functionalitites of SeuratAddon, jump to the sections on Cluster the cells, Evaluate the cell identities, and Visualize the cell identities.

Essentially, we load, process, and normalize a 10X Genomics dataset of 2700 Peripheral Blood Mononuclear Cells (PBMCs), which is publicly available to download. We assume that the user have used the Seurat package previously, and provide minimal commentaries regarding processing and normalization. For more information, see further tutorials and FAQs on Getting Started with Seurat.

Please note that you must place the PMBC dataset (called pbmc3k) into an appropriate directory. Download and uncompress pbmc3k filtered gene matrices. Accordingly, set data.dir in the following:

library(Seurat)
library(SeuratAddon)
library(jackstraw)
# Load the PBMC dataset, change the directory if necessary
pbmc.data <- Read10X(data.dir = "data/pbmc3k/filtered_gene_bc_matrices/hg19/")

# Initialize the Seurat object with the raw (non-normalized data)
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, project = "10X_PBMC")

QC and selecting cells for further analysis

While the CreateSeuratObject imposes a basic minimum gene-cutoff, you may want to filter out cells at this stage based on technical or biological parameters. Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria.

mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)

pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")

pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), 
    low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.05))

Normalizing the data

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.

pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

Detection of variable genes across the single cells

Seurat calculates highly variable genes and focuses on these for downstream analysis.

pbmc <- FindVariableGenes(object = pbmc, mean.function = ExpMean, dispersion.function = LogVMR, do.plot=FALSE, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)

Scaling the data and removing unwanted sources of variation

Your single cell dataset likely contains ‘uninteresting’ sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage).

pbmc <- ScaleData(object = pbmc, vars.to.regress = c("nUMI", "percent.mito"))

Perform linear dimensional reduction

Next we perform PCA on the scaled data. By default, the genes in object@var.genes are used as input, but can be defined using pc.genes.

pbmc <- RunPCA(object = pbmc, pc.genes = pbmc@var.genes, do.print = FALSE)

Cluster the cells {#cluster}

The SeuratAddon package provides new methods related to clustering, with an aim to streamline application of the jackstraw methods to evaluate cell identities in an unsupervised manner. With gene expression profiles of many single cells from modern scRNA-seq platforms, unsupervised clustering is routinely used to identify cell identities.

Currently, this package and vignette focus on K-means clustering and mini-batch K-means clustering. K-means clustering is well established, highly efficient, and extremely popular in scRNAseq analyses (e.g., Satija et al. 2015 , Zheng et al. 2017 ). A mini-batch approach renders classic K-means clustering scalable to millions of single cells. Lastly, we adapt K-means++ initilization by default which is shown to improve convergence. With a rapidly increasing number of single cells in a given study or analysis, hierarchical clustering, HDBSCAN, and graph-based community detection are shown to be orders of magnitude slower than K-means clustering. See Single Cell Clustering Comparison.

Generally, we recommend clustering the single cell samples with K-means clustering, even if you would like to use other algorithms later on. ClusterCellsKmeans is based on DoKMeans in the original Seurat package. Run K-means clustering on cell samples with default settings as follow:

# ClusterCellsKmeans is a modified version of Seurat::DoKMeans
pbmc <- ClusterCellsKmeans(pbmc, k.cells = 3)

If K-means clustering is too slow for your system, use a mini-batch algorithm. You can also provide optional arguments for fine controls, such as batch_size, num_init, and max_iters. For details on optional arguments (... to ClusterCellsKmeans), see ? ClusterR::MiniBatchKmeans.

# Use optional arguments, e.g.,batch_size=100, num_init=5, max_iters=100
pbmc <- ClusterCellsKmeans(pbmc, k.cells =  3, minibatch = TRUE)

Evaluate the cell identities {#evaluate}

We are ready to evaluate cell identities, that are estimated by clustering single cell samples. The main function EvaluateIdent, which is a thin wrapper for EvaluateIdentKmeans, requires that an appropriate clustering algorithm is already applied on a Seurat object. Then, use the same algorithm and initial settings for unsupervised evaluation. While some checks are conducted to ensure such compatibility, please note that a object@ident can be set or modified by a number of ways. When in doubt, we recommend running ClusterCellsKmeans on a Seurat object with desired parameters.

pbmc <- EvaluateIdent(pbmc, clustering = "kmeans")

In the default settings, this call would create a new meta data, object@meta.data["ident_prob"] that contains posterior inclusion probabilities (PIPs). This is a posterior probability that a cell is indeed included in its assigned subpopulation (cluster). Alternatively, you can specify an optional argument prob.use = c("pip","p.adjust","pvalue"), which would set object@meta.data["ident_prob"] to corresponding statistics. To adjust p-values for multiple hypotheses, an optional argument for adjustment methods is accessible via p.adjust.methods. For details see ? p.adjust.

For example, instead of PIPs, you may want to use p-values adjusted by a Benjamini-Hochberg procesdure:

pbmc <- EvaluateIdent(pbmc, clustering = "KMeans", prob.use = "p.adj", p.adjust.methods = "BH")

If you have used a minibatch K-means algorithm, please provide previously used parameters:

pbmc <- EvaluateIdent(pbmc, clustering = "MiniBatchKMeans", ...)

By default, the EvaluateIdent function returns a Seurat object which has additional meta data (object@meta.data["ident_prob"]). However, the jackstraw method has more details and controls that could be useful. Instead of returning a Seurat object, you can ask for a list containing observed and null F-statistics, p-values, adjusted p-values, and PIPs via return.jackstraw.

pbmc_jackstraw <- EvaluateIdent(pbmc, clustering = "KMeans", return.jackstraw = TRUE)

Lastly, as EvaluateIdent apply the jackstraw methods on cell identities, it requires a number of synthetic nulls s and a total iterations B. By default, they are set to s=round(0.10*m) and B=round(m*10/s)=100 where m is the number of cells. You can provide them as optional arguments to increase the resolution or to speed up the procedure.

Visualize the cell identities {#visualize}

After processing and clustering cells, we are often interested in visualizing scRNA-seq data using reduced dimensions (e.g. PCA or t-SNE). PIPs from unsupervised evaluation of cell identities can be used in conjunction to improve this visualization task.

The Seurat provide functions based on ggplot2 to create scatterplots, where the cells are colored by cell identities (stored in object@meta.data["ident"]), by default. Run t-SNE and visualize 2-dimensions using a default function from Seurat.

#### Run Non-linear dimensional reduction (tSNE)
pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE)
# note that you can set do.label=T to help label individual clusters
tsne_all <- TSNEPlot(object = pbmc, do.label=T)

The present SeuratAddon package extends the original functions DimPlots, PCAPlot, and TSNEPlot to DimPlots2, PCAPlot2, and TSNEPlot2. These new functions expect PIPs stored in object@meta.data["ident_prob"]. Examine a histogram of object@meta.data["ident_prob"]:

ggplot(pbmc@meta.data["ident_prob"]) + geom_histogram(aes(x=ident_prob))

Now, let's make a t-SNE projection as before, but only plotting cells whose PIPs are greater than 0.9. Out of 2456 cells, 85% (2082 cells) have PIPs > 0.9. You may notice that some cells that seem to be located between multiple clusters are removed:

tsne_piphard <- TSNEPlot2(object = pbmc, do.label=T, ident.threshold=0.9)

Instead of hard-thresholding single cells so that cells with low PIPs are completely removed, we can also soft-threshold single cells in t-SNE visualization. By setting ident.threshold = "soft", we use PIPs as alpha (transparency) levels in a scatterplot. For example, cells with PIP = 0.3 would have 30% transparency.

tsne_pipsoft <- TSNEPlot2(object = pbmc, do.label=T, ident.threshold="soft")

Analogously, we can use PCAPlot2 to make scatterplots using the top 2 PCs.

Resources

jackstraw R package: https://CRAN.R-project.org/package=jackstraw

SeuratAddon R package: http://github.com/ncchung/SeuratAddon

Seurat R package: https://satijalab.org/seurat/get_started.html

ncchung/SeuratAddon documentation built on May 3, 2019, 3:17 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ncchung/SeuratAddon
Unsupervised evaluation of cell identities in Seurat

In ncchung/SeuratAddon: Unsupervised evaluation of cell identities in Seurat

Introduction

Prepare a Seurat object with PBMC data

QC and selecting cells for further analysis

Normalizing the data

Detection of variable genes across the single cells

Scaling the data and removing unwanted sources of variation

Perform linear dimensional reduction

Cluster the cells {#cluster}

Evaluate the cell identities {#evaluate}

Visualize the cell identities {#visualize}

Resources

R Package Documentation

Browse R Packages

We want your feedback!

ncchung/SeuratAddon Unsupervised evaluation of cell identities in Seurat

In ncchung/SeuratAddon: Unsupervised evaluation of cell identities in Seurat

Introduction

Prepare a Seurat object with PBMC data

QC and selecting cells for further analysis

Normalizing the data

Detection of variable genes across the single cells

Scaling the data and removing unwanted sources of variation

Perform linear dimensional reduction

Cluster the cells {#cluster}

Evaluate the cell identities {#evaluate}

Visualize the cell identities {#visualize}

Resources

R Package Documentation

Browse R Packages

We want your feedback!

ncchung/SeuratAddon
Unsupervised evaluation of cell identities in Seurat