set.seed(19561013) library(knitr) opts_chunk$set(out.extra = 'style="display:block; margin: auto"', fig.align = "center", dev = "png")
According to Kaufman and Rousseeuw [-@kaufmanRousseeuw], "Cluster analysis is the art of finding groups in data." Sometimes known as unsupervised learning, cluster analysis is a common statistical task. We apply algorithms to a dataset in an attempt to find groups of objects that are similar to each other. Objects in the same cluster should be more similar to each other than to those in other clusters. Where it differs from more common statistical classification tasks, or supervised learning, is that we do not know in advance how many clusters exist (and there might be only 1!).
Unfortunately, the book by Kaufman and Rousseeuw is out of print at the time of writing. For an alternative introduction, see Chapter 14 of Hastie, Tibshirani and Friedman [-@esl].
A difficulty in practice is that at some point we need to make a decision about
how many clusters we believe there to be (or at least suggest a small number
of candidates). Tools such as the gap statistic (see the help file for
clusGap
in the cluster package [@cluster] in R
[@R]) can help with this, but additional tools are welcome.
Another issue is that, because clustering methods are often model-free, many algorithms have been suggested over the decades, and there is no single obvious way to choose between them.
These kinds of considerations lead to the practice being as much art as science, as suggested in the opening quote, above.
A weakness of many clustering algorithms is that they will tend to produce quite different looking results from similar datasets: they are said to be unstable. Similar issues arise with classification trees, for example.
For unstable algorithms in general, a common approach is to take random samples from the available data, fit the model to each sample, and then average over the outputs of the models. Breiman named this approach 'Bootstrap aggregating' [-@bagging] or bagging for short.
The original bagging idea involved repeatedly taking samples, with replacement, of the same size as the original data. Various authors have since discovered that taking random subsamples of 50% (without replacement) leads to similar results at lower computational expense. Stability selection [@meinshausenBuhlmann] is uses the idea to inform variable selection, and the same idea is also part of the random forest [@rf] and stochastic gradient boosting [@stochasticGradientBoosting] algorithms.
Given the success of bagging and subsampling in classification, regression and variable selection, it will be no surprise that similar ideas work in clustering.
Consensus clustering was introduced by Monti et al in 2003 [-@consensusClustering]. The method involves the following:
The results can then be summarized in order to provide insight on:
Monti et al suggest various summaries and plots to address these questions.
Note that no particular clustering algorithm is specified: the user can use their own favourite, or try several algorithms to see what different properties they exhibit on the data to hand.
The conclus package for R [@R] is an implementation of consensus clustering that has the following features:
Firstly, we'll illustrate the main features of conclus, then we'll reproduce some of the results in Monti et al. Finally, we'll take a look at a more interesting example.
The pluton dataset is included in the R package cluster [@cluster]:
see the help file for pluton
for more information.
library(conclus) ?pluton dim(pluton) head(pluton) gather(pluton, pluton, value) %>% ggplot(aes(value)) + geom_histogram(fill="blue", bins=15) + facet_wrap(~pluton, scales="free")
From looking at the histograms, and from reading the help file, it appears the measurement scales are not really comparable in that, for example, the range of observable values of Pu238 does not include any of the observable range of values of Pu239. As such, we might want to scale each variable before attempting any clustering. Note that this procedure might or might not aid the analysis.
As noted previously, conclus forces the user to think about distance metrics.
(The main reason is that many applications work with mixed continuous and binomial data,
and it's especially important to be clear about how distances between binomial features
are to be defined.) Therefore, we can't just stuff the pluton
data into conclus
.
We'll use the daisy
function from the cluster package to create the distance
matrix. First, we'll proceed without rescaling the data.
pd <- daisy(pluton) ccp <- conclus(pd, K=10) ggplot(ccp)
The conclus
function defaults to taking 100 subsamples of the data for each choice
of the number of clusters k, (k = 2, ..., 10 in this case). It then computes the
consensus matrix for each value of k: the proportion of times items i and j were
included in the subsample,
and were clustered into the same group. In the figure above, deep blue indicates good
consensus; values near to 1. Peachpuff (that's right, the color is called 'peachpuff')
indicates consensus in the opposite direction; values close to 0 indicating neither
items were in the cluster. Colors between deep blue and peachpuff indicate disagreement
between the different runs of the algorithm.
The reason the panels in the above figure appear approximately block-diagonal is that the consensus matrices have been clustered themselves, to arrange similar items together. If perfect consensus were obtained, we'd see a perfectly block-diagonal structure. So to decide the number of real clusters in the data, we need to choose a value of k for which we see bunch of blue squares on the diagonal. Maybe 6, 7 or 8 looks ok, but that's quite a lot of clusters for a fairly small dataset. Let's get some more help
sp <- summary(ccp, plot.it=FALSE) sp
The output of the summary function shows us the area under the cumulative distribution function (the AUC of the CDF). We also show summary statistic Delta, computed as the proportional increase in the AUC resulting from incrementing k. For example, at k=3, we have (0.642 - 0.473) / 0.475 = 0.358 (adjusting for rounding error). As we can see, Δ(k) is still quite large, and AUC is rising quite quickly, up till k=5 or 6, and then they appear to level off.
(Following Monti et al, Section 3.3, Δ(k) is actually calculated as the increase in AUC relative to the maximum AUC for lower values of k.)
The cluster consensus matrix shows the consensus across random subsamples for each cluster for each value of k. Values of 1 indicate perfect consensus - all subsamples identified all pairs of observations within that cluster to be in the same cluster.
We can get a better look at the CDF, its AUC and Δ(k) by plotting the summary object.
ggplot(sp)
Looking at the left-hand panel, the CDFs of the consensus index at each level of k, we see a big gap between the CDFs for k=2 and k=3, and similarly between k=3 and k=4. These features are evident in the right-hand panels representing the AUC and Δ at each k. Looking at Δ in particular, the number of clusters would look to be maybe 6.
Next, let's redo the whole thing, but scaling the margins of the data first. Note that
you can do that in daisy
using stand=TRUE
in the call. However, that uses the mean
to center the data and the mean absolute deviation (not standard deviation) to scale
the data (if you didn't know that, that's the kind of reason conclus wants
you to think about distance measures). I prefer to use the median and median absolute
deviation (mad), so conclus
provides function rscale
as an alternative. For kicks, we'll also use Manhattan
distance instead of Euclidean distance.
pp <- apply(pluton, 2, rscale) ccpp <- conclus(daisy(pp, "manhattan"), K=10) ggplot(ccpp)
summary(ccpp)
From the plots above, there's little difference between those and the versions resulting from unscaled data and Euclidean distance.
So let's look at what clusters we think we might have discovered, suppose there to be 6.
pluton$clusters <- membership(ccpp, k=6) library(GGally) # For ggpairs pluton$clusters <- factor(pluton$clusters) # to keep ggpairs happy ggpairs(pluton, columns=1:4, aes(color=clusters), upper="blank")
In my humble opinion, that's quite a nice result.
If you've noticed that I've been heavy-handedly stressing the subjectivity involved in choosing the number of clusters, good! It's deliberate, of course. Having to apply subjectivity when we don't know what we're looking for is part of the process.
Finally, notice that earlier in this article, it was stated that there
are lots of clustering algorithms available, but the analysis of the plutons data
didn't even mention which was used! In fact, it used PAM - Partitioning Around Medoids.
That's the default in conclus
. Other options are
made available, but in the next example, we'll write our own in order to follow
Monti et al closely.
Let's now reproduce some of the results in Monti et al.
Monti et al describe various simulated datasets used for evaluating the consensus clustering approach. The first of these they label Uniform1 and is included in the conclus package, having been downloaded from the paper's website (whose URL has been known to change in the past, and don't blame me if it changes in the future). The data are observations from a 600 dimensional Uniform(0, 1) distribution.
Prior to anything else, we'll need to use the same clustering approach as they used: hierarchical clustering with average linkage.
If you look at the help file for conclus
, you'll notice that the default clustering
function is pamCons
. Others available in the package are hclustCons
and fannyCons
.
As the help file states, these are all just simple wrappers around their parent functions,
modified to require precisely 2 arguments, x - a dissimilarity matrix, and k - the number
of clusters to find, and to return nothing other than an integer vector of cluster memberships.
With that in mind, lets write our own function, then pass it into conclus
, mimicking
the analysis of Monti et al. Also following Monti et al, we'll use 80% subsampling
and 500 samples.
aveHclustCons <- function(x, k){ cutree(hclust(x, method="average"), k=k) } ccu <- conclus(daisy(Uniform1), K=6, cluster=aveHclustCons, subsample=0.8, R=500) ggplot(ccu, low="white", high="red")
Similar to Figure 2 of Monti et al, the consensus matrix for k=3 is largely a mass of red mess, with a couple of smaller darker red blocks in the upper right corner (upper left corner in the original - the matrices are transposed is all).
Let's look at the summary measures and compare them with the top row of Figure 3 in the original paper.
summary(ccu)
The CDFs look similar to in Monti et al, as does the plot of Δ(k). We show the AUC here, which doesn't appear in Monti et al. Let's now create the histogram of the consensus distribution.
hist(ccu$M[[2]], col="green") # 2 because k=2 is in the first, so k=3 in the second
Again, the plot looks similar to the version in Monti et al.
Let's now move on to the Gaussian3 example.
Following Monti et al, we'll use the same approach with Gaussian3 as we followed with Uniform1.
ccg <- conclus(daisy(Gaussian3), K=6, cluster=aveHclustCons, subsample=0.8, R=500) ggplot(ccg, low="white", high="red")
For k=3, we clearly see the separation of the data into the known 3 groups.
summary(ccg)
Again, the plot of Δ(k), and of the CDFs, look like those in Monti et al. Notice that the CDF for k=3 is a perfect step function. In this case, the algorithm has achieved perfect consensus with 3 clusters.
summary(ccg)
From the textual summary, we see that with k=3, there is 100% consensus (i.e. all (i, j) pairs are clustered together in every subsample they both appear in), so the CDF appears to be correct.
We can achieve a similar conclusion by use of the gap statistic.
gs <- clusGap(as.matrix(daisy(Gaussian3)), pam, K.max=10) plot(gs)
The St Jude leukemia data consist of 983 measurements taken on each of 246 patients. There's no particular reason to cluster features - we can just as well cluster patients (and doing so will be faster).
We are told that there are 6 distinct groups in the data. With such a large number
of dimensions, it will be pretty fruitless to look at simple plots, so we'll
dive in and do the clustering.
We'll also use this example to show off the parallelization
of conclus
(note though that using more than a single core (the default) does not
necessarily speed things up).
sstjude <- apply(stjude, 2, rscale) dj <- daisy(t(sstjude)) ccj <- conclus(dj, K=10, ncores=7) summary(ccj)
From the output, 6 (or maybe 5 or 7) clusters looks reasonable. So, success!
It would be nice, though, to see if we can figure out how the features are defined. That's beyond the scope of this article, though. Good luck! And have fun with conclus.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.