In harrysouthworth/conclus: Consensus Clustering

set.seed(19561013)
library(knitr)
opts_chunk$set(out.extra = 'style="display:block; margin: auto"',
               fig.align = "center", dev = "png")

Introduction

According to Kaufman and Rousseeuw [-@kaufmanRousseeuw], "Cluster analysis is the art of finding groups in data." Sometimes known as unsupervised learning, cluster analysis is a common statistical task. We apply algorithms to a dataset in an attempt to find groups of objects that are similar to each other. Objects in the same cluster should be more similar to each other than to those in other clusters. Where it differs from more common statistical classification tasks, or supervised learning, is that we do not know in advance how many clusters exist (and there might be only 1!).

Unfortunately, the book by Kaufman and Rousseeuw is out of print at the time of writing. For an alternative introduction, see Chapter 14 of Hastie, Tibshirani and Friedman [-@esl].

A difficulty in practice is that at some point we need to make a decision about how many clusters we believe there to be (or at least suggest a small number of candidates). Tools such as the gap statistic (see the help file for clusGap in the cluster package [@cluster] in R [@R]) can help with this, but additional tools are welcome.

Another issue is that, because clustering methods are often model-free, many algorithms have been suggested over the decades, and there is no single obvious way to choose between them.

These kinds of considerations lead to the practice being as much art as science, as suggested in the opening quote, above.

A weakness of many clustering algorithms is that they will tend to produce quite different looking results from similar datasets: they are said to be unstable. Similar issues arise with classification trees, for example.

For unstable algorithms in general, a common approach is to take random samples from the available data, fit the model to each sample, and then average over the outputs of the models. Breiman named this approach 'Bootstrap aggregating' [-@bagging] or bagging for short.

The original bagging idea involved repeatedly taking samples, with replacement, of the same size as the original data. Various authors have since discovered that taking random subsamples of 50% (without replacement) leads to similar results at lower computational expense. Stability selection [@meinshausenBuhlmann] is uses the idea to inform variable selection, and the same idea is also part of the random forest [@rf] and stochastic gradient boosting [@stochasticGradientBoosting] algorithms.

Given the success of bagging and subsampling in classification, regression and variable selection, it will be no surprise that similar ideas work in clustering.

Consensus clustering

Consensus clustering was introduced by Monti et al in 2003 [-@consensusClustering]. The method involves the following:

Take a random subsample (say 50%) of the data;
Run a clustering algorithm assuming there to be k=2 clusters;
Save the class memberships of the observations;
Repeat lots of times;
Repeat with k=3, 4, ..., K clusters.

The results can then be summarized in order to provide insight on:

Whether any genuine clusters appear to exist;
How similar are items are within each cluster;
What is the best value of k (or a small number of candidate values of k).

Monti et al suggest various summaries and plots to address these questions.

Note that no particular clustering algorithm is specified: the user can use their own favourite, or try several algorithms to see what different properties they exhibit on the data to hand.

The conclus package

The conclus package for R [@R] is an implementation of consensus clustering that has the following features:

The user can pass in their own clustering function;
Various summaries and plots can be easily produced;
The user interface will seem natural to most R users;
The code forces the user to think about distance metrics (the pain will do you good!);
Additional information is easily extracted from the returned object.

Examples

Firstly, we'll illustrate the main features of conclus, then we'll reproduce some of the results in Monti et al. Finally, we'll take a look at a more interesting example.

Plutons

The pluton dataset is included in the R package cluster [@cluster]: see the help file for pluton for more information.

library(conclus)
?pluton
dim(pluton)
head(pluton)

gather(pluton, pluton, value) %>%
  ggplot(aes(value)) +
  geom_histogram(fill="blue", bins=15) +
  facet_wrap(~pluton, scales="free")

From looking at the histograms, and from reading the help file, it appears the measurement scales are not really comparable in that, for example, the range of observable values of Pu238 does not include any of the observable range of values of Pu239. As such, we might want to scale each variable before attempting any clustering. Note that this procedure might or might not aid the analysis.

As noted previously, conclus forces the user to think about distance metrics. (The main reason is that many applications work with mixed continuous and binomial data, and it's especially important to be clear about how distances between binomial features are to be defined.) Therefore, we can't just stuff the pluton data into conclus. We'll use the daisy function from the cluster package to create the distance matrix. First, we'll proceed without rescaling the data.

pd <- daisy(pluton)
ccp <- conclus(pd, K=10)
ggplot(ccp)

The conclus function defaults to taking 100 subsamples of the data for each choice of the number of clusters k, (k = 2, ..., 10 in this case). It then computes the consensus matrix for each value of k: the proportion of times items i and j were included in the subsample, and were clustered into the same group. In the figure above, deep blue indicates good consensus; values near to 1. Peachpuff (that's right, the color is called 'peachpuff') indicates consensus in the opposite direction; values close to 0 indicating neither items were in the cluster. Colors between deep blue and peachpuff indicate disagreement between the different runs of the algorithm.

The reason the panels in the above figure appear approximately block-diagonal is that the consensus matrices have been clustered themselves, to arrange similar items together. If perfect consensus were obtained, we'd see a perfectly block-diagonal structure. So to decide the number of real clusters in the data, we need to choose a value of k for which we see bunch of blue squares on the diagonal. Maybe 6, 7 or 8 looks ok, but that's quite a lot of clusters for a fairly small dataset. Let's get some more help

sp <- summary(ccp, plot.it=FALSE)
sp

The output of the summary function shows us the area under the cumulative distribution function (the AUC of the CDF). We also show summary statistic Delta, computed as the proportional increase in the AUC resulting from incrementing k. For example, at k=3, we have (0.642 - 0.473) / 0.475 = 0.358 (adjusting for rounding error). As we can see, Δ(k) is still quite large, and AUC is rising quite quickly, up till k=5 or 6, and then they appear to level off.

(Following Monti et al, Section 3.3, Δ(k) is actually calculated as the increase in AUC relative to the maximum AUC for lower values of k.)

The cluster consensus matrix shows the consensus across random subsamples for each cluster for each value of k. Values of 1 indicate perfect consensus - all subsamples identified all pairs of observations within that cluster to be in the same cluster.

We can get a better look at the CDF, its AUC and Δ(k) by plotting the summary object.

ggplot(sp)

Looking at the left-hand panel, the CDFs of the consensus index at each level of k, we see a big gap between the CDFs for k=2 and k=3, and similarly between k=3 and k=4. These features are evident in the right-hand panels representing the AUC and Δ at each k. Looking at Δ in particular, the number of clusters would look to be maybe 6.

Next, let's redo the whole thing, but scaling the margins of the data first. Note that you can do that in daisy using stand=TRUE in the call. However, that uses the mean to center the data and the mean absolute deviation (not standard deviation) to scale the data (if you didn't know that, that's the kind of reason conclus wants you to think about distance measures). I prefer to use the median and median absolute deviation (mad), so conclus provides function rscale as an alternative. For kicks, we'll also use Manhattan distance instead of Euclidean distance.

pp <- apply(pluton, 2, rscale)
ccpp <- conclus(daisy(pp, "manhattan"), K=10)
ggplot(ccpp)

summary(ccpp)

From the plots above, there's little difference between those and the versions resulting from unscaled data and Euclidean distance.

So let's look at what clusters we think we might have discovered, suppose there to be 6.

pluton$clusters <- membership(ccpp, k=6)
library(GGally) # For ggpairs
pluton$clusters <- factor(pluton$clusters) # to keep ggpairs happy
ggpairs(pluton, columns=1:4, aes(color=clusters), upper="blank")

In my humble opinion, that's quite a nice result.

If you've noticed that I've been heavy-handedly stressing the subjectivity involved in choosing the number of clusters, good! It's deliberate, of course. Having to apply subjectivity when we don't know what we're looking for is part of the process.

Finally, notice that earlier in this article, it was stated that there are lots of clustering algorithms available, but the analysis of the plutons data didn't even mention which was used! In fact, it used PAM - Partitioning Around Medoids. That's the default in conclus. Other options are made available, but in the next example, we'll write our own in order to follow Monti et al closely.

Let's now reproduce some of the results in Monti et al.

Uniform1

Monti et al describe various simulated datasets used for evaluating the consensus clustering approach. The first of these they label Uniform1 and is included in the conclus package, having been downloaded from the paper's website (whose URL has been known to change in the past, and don't blame me if it changes in the future). The data are observations from a 600 dimensional Uniform(0, 1) distribution.

Prior to anything else, we'll need to use the same clustering approach as they used: hierarchical clustering with average linkage.

If you look at the help file for conclus, you'll notice that the default clustering function is pamCons. Others available in the package are hclustCons and fannyCons. As the help file states, these are all just simple wrappers around their parent functions, modified to require precisely 2 arguments, x - a dissimilarity matrix, and k - the number of clusters to find, and to return nothing other than an integer vector of cluster memberships.

With that in mind, lets write our own function, then pass it into conclus, mimicking the analysis of Monti et al. Also following Monti et al, we'll use 80% subsampling and 500 samples.

aveHclustCons <- function(x, k){
  cutree(hclust(x, method="average"), k=k)
}

ccu <- conclus(daisy(Uniform1), K=6, cluster=aveHclustCons, subsample=0.8, R=500)
ggplot(ccu, low="white", high="red")

Similar to Figure 2 of Monti et al, the consensus matrix for k=3 is largely a mass of red mess, with a couple of smaller darker red blocks in the upper right corner (upper left corner in the original - the matrices are transposed is all).

Let's look at the summary measures and compare them with the top row of Figure 3 in the original paper.

summary(ccu)

The CDFs look similar to in Monti et al, as does the plot of Δ(k). We show the AUC here, which doesn't appear in Monti et al. Let's now create the histogram of the consensus distribution.

hist(ccu$M[[2]], col="green") # 2 because k=2 is in the first, so k=3 in the second

Again, the plot looks similar to the version in Monti et al.

Let's now move on to the Gaussian3 example.

Gaussian3

Following Monti et al, we'll use the same approach with Gaussian3 as we followed with Uniform1.

ccg <- conclus(daisy(Gaussian3), K=6, cluster=aveHclustCons, subsample=0.8, R=500)
ggplot(ccg, low="white", high="red")

For k=3, we clearly see the separation of the data into the known 3 groups.

summary(ccg)

Again, the plot of Δ(k), and of the CDFs, look like those in Monti et al. Notice that the CDF for k=3 is a perfect step function. In this case, the algorithm has achieved perfect consensus with 3 clusters.

summary(ccg)

From the textual summary, we see that with k=3, there is 100% consensus (i.e. all (i, j) pairs are clustered together in every subsample they both appear in), so the CDF appears to be correct.

We can achieve a similar conclusion by use of the gap statistic.

gs <- clusGap(as.matrix(daisy(Gaussian3)), pam, K.max=10)
plot(gs)

St Jude's leukemia data

The St Jude leukemia data consist of 983 measurements taken on each of 246 patients. There's no particular reason to cluster features - we can just as well cluster patients (and doing so will be faster).

We are told that there are 6 distinct groups in the data. With such a large number of dimensions, it will be pretty fruitless to look at simple plots, so we'll dive in and do the clustering. We'll also use this example to show off the parallelization of conclus (note though that using more than a single core (the default) does not necessarily speed things up).

sstjude <- apply(stjude, 2, rscale)
dj <- daisy(t(sstjude))

ccj <- conclus(dj, K=10, ncores=7)
summary(ccj)

From the output, 6 (or maybe 5 or 7) clusters looks reasonable. So, success!

It would be nice, though, to see if we can figure out how the features are defined. That's beyond the scope of this article, though. Good luck! And have fun with conclus.