Assorted clustering diagnostics

knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)


The r Biocpkg("bluster") package provides a few diagnostics for quantitatively examining the cluster output. These diagnostics We will demonstrate on another dataset from the r Biocpkg("scRNAseq") package, clustered with graph-based methods via the clusterRows() generic as described in the previous vignette.

sce <- GrunPancreasData()

# Quality control to remove bad cells.
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, sub.fields="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]

# Normalization by library size.
sce <- logNormCounts(sce)

# Feature selection.
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

# Dimensionality reduction.
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)

# Clustering.
mat <- reducedDim(sce) <- clusterRows(mat, NNGraphParam(), full=TRUE)
clusters <-$clusters

Computing the silhouette width

The silhouette width is a standard metric to quantify the separation between clusters generated by any procedure. A cell with a large positive width is closer to other cells from the same cluster compared to cells from different clusters. On the other hand, low or negative widths indicate that cells from different clusters are not well separated.

The exact silhouette calculation is rather computationally intensive so r Biocpkg("bluster") implements an approximation instead. This is provided in the approxSilhouette() function, which returns the width for each cell and its closest (non-self) cluster. Clusters consisting of cells with lower widths may warrant some more care during interpretation.

sil <- approxSilhouette(mat, clusters)
boxplot(split(sil$width, clusters))

Computing the neighborhood purity

Another diagnostic uses the percentage of neighbors for each cell that belong to the same cluster. Well-separated clusters should exhibit high percentages (i.e., "purities") as cells from different clusters do not mix. Low purities are symptomatic of overclustering where cluster boundaries become more ambiguous.

r Biocpkg("bluster") computes purities through the neighborPurity() function. This returns the purity of the neighborhood for each cell and the identity of the cluster with the strongest presence. The choice of k determines the scope of the neighborhood; lower values will yield higher purities, so this may require some fiddling to find a value that distinguishes between clusters.

pure <- neighborPurity(mat, clusters)
table(Max=pure$maximum, Assigned=clusters)
boxplot(split(pure$purity, clusters))

Computing graph modularity

For graph-based methods, we can compute the cluster modularity within clusters and between pairs of clusters. Specifically, we examine the ratio of observed to expected edge weights for each pair of clusters (closely related to the modularity score used in many cluster_* functions from r CRANpkg("igraph")). We would usually expect to see high observed weights between cells in the same cluster with minimal weights between clusters, indicating that the clusters are well-separated. Large off-diagonal entries indicate that the corresponding pair of clusters are closely related.

g <-$objects$graph
ratio <- pairwiseModularity(g, clusters, as.ratio=TRUE)

pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,

This may be better visualized with a force-directed layout: <- igraph::graph_from_adjacency_matrix(log2(ratio+1), 
    mode="upper", weighted=TRUE, diag=FALSE)

# Increasing the weight to increase the visibility of the lines.
plot(, edge.width=igraph::E($weight*5,

We can also tune the resolution of the clustering post hoc with the mergeCommunities() function. This will iteratively merge the most closely related pair of clusters together until the desired number of clusters is reached. For example, if we wanted to whittle down the number of clusters to 10, we could do:

merged <- mergeCommunities(g, clusters, number=10)

Comparing two clusterings

To compare two clusterings, the pairwiseRand() function computes the adjusted Rand index (ARI). High ARIs indicate that the two clusterings are similar with respect to how they partition the observations, and an ARI of 1 means that the clusterings are identical.

hclusters <- clusterRows(mat, HclustParam(cut.dynamic=TRUE))
pairwiseRand(clusters, hclusters, mode="index")

Of course, a single number is not particularly useful, so clusterRand() also provides the capability to break down the ARI into its contributions from each cluster or cluster pair. Specifically, for each cluster or cluster pair in a "reference" clustering (here, clusters), we see whether it is preserved in the "alternative" clustering (here, hclusters). Large values on the diagonal indicate that the reference cluster is recapitulated; large values off the diagonal indicate that the separation between the corresponding pair of clusters is also maintained.

ratio <- pairwiseRand(clusters, hclusters, mode="ratio")

pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
    col=viridis::viridis(100), breaks=seq(-1, 1, length=101))

Bootstrapping cluster stability

We can use bootstrapping to evaluate the effect of sampling noise on the stability of a clustering procedure. The bootstrapStability() function will return the ARI of the original clusters against those generated from bootstrap replicates, averaged across multiple bootstrap iterations. High values indicate that the clustering is robust to sample noise.

ari <-bootstrapStability(mat, clusters=clusters, 
    mode="index", BLUSPARAM=NNGraphParam())

Advanced users may also set mode="ratio" to obtain a more detailed breakdown of the effect of noise on each cluster (pair).

ratio <-bootstrapStability(mat, clusters=clusters,
    mode="ratio", BLUSPARAM=NNGraphParam())

pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
    col=viridis::viridis(100), breaks=seq(-1, 1, length=101))

Session information {-}


Try the bluster package in your browser

Any scripts or data that you put into this service are public.

bluster documentation built on Nov. 8, 2020, 8:29 p.m.