knitr::opts_chunk$set(error=FALSE, message=FALSE, warnings=FALSE) library(BiocStyle)

The `r Biocpkg("bluster")`

package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows.
At its core is the `clusterRows()`

generic that controls dispatch to different clustering algorithms.
We will demonstrate on some single-cell RNA sequencing data from the `r Biocpkg("scRNAseq")`

package;
our aim is to cluster cells into cell populations based on their PC coordinates.

library(scRNAseq) sce <- ZeiselBrainData() # Trusting the authors' quality control, and going straight to normalization. library(scuttle) sce <- logNormCounts(sce) # Feature selection based on highly variable genes. library(scran) dec <- modelGeneVar(sce) hvgs <- getTopHVGs(dec, n=1000) # Dimensionality reduction for work (PCA) and pleasure (t-SNE). set.seed(1000) library(scater) sce <- runPCA(sce, ncomponents=20, subset_row=hvgs) sce <- runUMAP(sce, dimred="PCA") mat <- reducedDim(sce, "PCA") dim(mat)

Our first algorithm is good old hierarchical clustering, as implemented using `hclust()`

from the *stats* package.
This automatically sets the cut height to half the dendrogram height.

library(bluster) hclust.out <- clusterRows(mat, HclustParam()) plotUMAP(sce, colour_by=I(hclust.out))

Advanced users can achieve greater control of the procedure by passing more parameters to the `HclustParam()`

constructor.
Here, we use Ward's criterion for the agglomeration with a dynamic tree cut from the `r CRANpkg("dynamicTreeCut")`

package.

hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE) hp2 hclust.out <- clusterRows(mat, hp2) plotUMAP(sce, colour_by=I(hclust.out))

Our next algorithm is $k$-means clustering, as implemented using the `kmeans()`

function.
This requires us to pass in the number of clusters, either as a number:

set.seed(100) kmeans.out <- clusterRows(mat, KmeansParam(10)) plotUMAP(sce, colour_by=I(kmeans.out))

Or as a function of the number of observations, which is useful for vector quantization purposes:

kp <- KmeansParam(sqrt) kp set.seed(100) kmeans.out <- clusterRows(mat, kp) plotUMAP(sce, colour_by=I(kmeans.out))

We can build shared or direct nearest neighbor graphs and perform community detection with `r CRANpkg("igraph")`

.
Here, the number of neighbors `k`

controls the resolution of the clusters.

set.seed(101) # just in case there are ties. graph.out <- clusterRows(mat, NNGraphParam(k=10)) plotUMAP(sce, colour_by=I(graph.out))

It is again straightforward to tune the procedure by passing more arguments such as the community detection algorithm to use.

set.seed(101) # just in case there are ties. np <- NNGraphParam(k=20, cluster.fun="louvain") np graph.out <- clusterRows(mat, np) plotUMAP(sce, colour_by=I(graph.out))

We also provide a wrapper for a hybrid "two-step" approach for handling larger datasets. Here, a fast agglomeration is performed with $k$-means to compact the data, followed by a slower graph-based clustering step to obtain interpretable meta-clusters. (This dataset is not, by and large, big enough for this approach to work particularly well.)

set.seed(100) # for the k-means two.out <- clusterRows(mat, TwoStepParam()) plotUMAP(sce, colour_by=I(two.out))

Each step is itself parametrized by `BlusterParam`

objects, so it is possible to tune them individually:

twop <- TwoStepParam(second=NNGraphParam(k=5)) twop set.seed(100) # for the k-means two.out <- clusterRows(mat, TwoStepParam()) plotUMAP(sce, colour_by=I(two.out))

Sometimes the vector of cluster assignments is not enough.
We can obtain more information about the clustering procedure by setting `full=TRUE`

in `clusterRows()`

.
For example, we could obtain the actual graph generated by `NNGraphParam()`

:

nn.out <- clusterRows(mat, NNGraphParam(), full=TRUE) nn.out$objects$graph

The assignment vector is still reported in the `clusters`

entry:

table(nn.out$clusters)

`clusterRows()`

enables users or developers to easily switch between clustering algorithms by changing a single argument.
Indeed, by passing the `BlusterParam`

object across functions, we can ensure that the same algorithm is used through a workflow.
It is also helpful for package functions as it provides diverse functionality without compromising a clean function signature.
However, the true power of this framework lies in its extensibility.
Anyone can write a `clusterRows()`

method for a new clustering algorithm with an associated `BlusterParam`

subclass,
and that procedure is immediately compatible with any workflow or function that was already using `clusterRows()`

.

```
sessionInfo()
```

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.