Flexible clustering for Bioconductor

knitr::opts_chunk$set(error=FALSE, message=FALSE, warnings=FALSE)


The r Biocpkg("bluster") package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows. At its core is the clusterRows() generic that controls dispatch to different clustering algorithms. We will demonstrate on some single-cell RNA sequencing data from the r Biocpkg("scRNAseq") package; our aim is to cluster cells into cell populations based on their PC coordinates.

sce <- ZeiselBrainData()

# Trusting the authors' quality control, and going straight to normalization.
sce <- logNormCounts(sce)

# Feature selection based on highly variable genes.
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

# Dimensionality reduction for work (PCA) and pleasure (t-SNE).
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
sce <- runUMAP(sce, dimred="PCA")

mat <- reducedDim(sce, "PCA")

Hierarchical clustering

Our first algorithm is good old hierarchical clustering, as implemented using hclust() from the stats package. This automatically sets the cut height to half the dendrogram height.

hclust.out <- clusterRows(mat, HclustParam())
plotUMAP(sce, colour_by=I(hclust.out))

Advanced users can achieve greater control of the procedure by passing more parameters to the HclustParam() constructor. Here, we use Ward's criterion for the agglomeration with a dynamic tree cut from the r CRANpkg("dynamicTreeCut") package.

hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE)
hclust.out <- clusterRows(mat, hp2)
plotUMAP(sce, colour_by=I(hclust.out))

$k$-means clustering

Our next algorithm is $k$-means clustering, as implemented using the kmeans() function. This requires us to pass in the number of clusters, either as a number:

kmeans.out <- clusterRows(mat, KmeansParam(10))
plotUMAP(sce, colour_by=I(kmeans.out))

Or as a function of the number of observations, which is useful for vector quantization purposes:

kp <- KmeansParam(sqrt)
kmeans.out <- clusterRows(mat, kp)
plotUMAP(sce, colour_by=I(kmeans.out))

Graph-based clustering

We can build shared or direct nearest neighbor graphs and perform community detection with r CRANpkg("igraph"). Here, the number of neighbors k controls the resolution of the clusters.

set.seed(101) # just in case there are ties.
graph.out <- clusterRows(mat, NNGraphParam(k=10))
plotUMAP(sce, colour_by=I(graph.out))

It is again straightforward to tune the procedure by passing more arguments such as the community detection algorithm to use.

set.seed(101) # just in case there are ties.
np <- NNGraphParam(k=20, cluster.fun="louvain")
graph.out <- clusterRows(mat, np)
plotUMAP(sce, colour_by=I(graph.out))

Two-phase clustering

We also provide a wrapper for a hybrid "two-step" approach for handling larger datasets. Here, a fast agglomeration is performed with $k$-means to compact the data, followed by a slower graph-based clustering step to obtain interpretable meta-clusters. (This dataset is not, by and large, big enough for this approach to work particularly well.)

set.seed(100) # for the k-means
two.out <- clusterRows(mat, TwoStepParam())
plotUMAP(sce, colour_by=I(two.out))

Each step is itself parametrized by BlusterParam objects, so it is possible to tune them individually:

twop <- TwoStepParam(second=NNGraphParam(k=5))
set.seed(100) # for the k-means
two.out <- clusterRows(mat, TwoStepParam())
plotUMAP(sce, colour_by=I(two.out))

Obtaining full clustering statistics

Sometimes the vector of cluster assignments is not enough. We can obtain more information about the clustering procedure by setting full=TRUE in clusterRows(). For example, we could obtain the actual graph generated by NNGraphParam():

nn.out <- clusterRows(mat, NNGraphParam(), full=TRUE)

The assignment vector is still reported in the clusters entry:


Further comments

clusterRows() enables users or developers to easily switch between clustering algorithms by changing a single argument. Indeed, by passing the BlusterParam object across functions, we can ensure that the same algorithm is used through a workflow. It is also helpful for package functions as it provides diverse functionality without compromising a clean function signature. However, the true power of this framework lies in its extensibility. Anyone can write a clusterRows() method for a new clustering algorithm with an associated BlusterParam subclass, and that procedure is immediately compatible with any workflow or function that was already using clusterRows().

Session information {-}


Try the bluster package in your browser

Any scripts or data that you put into this service are public.

bluster documentation built on Nov. 8, 2020, 8:29 p.m.