This vignette demonstrates the usage of the clusternomics
package for
context-dependent clustering on a small
simulated dataset.
The goal of context-dependendent clustering is to identify clusters in a set of related datasets. Clusternomics identifies both local clusters that exist at the level of individual datasets, and global clusters that appear across the datasets.
A typical application of the method is the task of cancer subtyping, where we analyse tumour samples. The individual datasets (contexts) are then various features of the tumour samples, such as gene expression data, DNA methylation measurements, miRNA expression etc. The assumption is that we have several measurements of different types describing the same set tumours. Each of the measurements then describes the tumour in a different context.
The clusternomics
algorithm identifies
We start by loading the packages.
# Load dependencies library(plyr) library(magrittr) library(ggplot2) library(clusternomics)
In the simulated scenario, we assume two contexts (datasets). In each context, the data are sampled from two clusters with the following distributions:
Cluster 1: $x \sim \mathcal{N}\left((-1.5, -1.5)^T, (1,1)^T\right)$
Cluster 2: $x \sim \mathcal{N}\left((1.5, 1.5)^T, (1,1)^T\right)$
Number of data points sampled from each cluster in each context is different:
| | Cluster 1 (context 2) | Cluster 2 (context 2) | |-----------|-----------|-----------| | Cluster 1 (context 1) | 50 | 10 | | Cluster 2 (context 1) | 40 | 60 |
Using this setup, there are 160 data points in total. There are two local clusters within each context. At the same time, there are four clusters on the global level, that appear when we combine observations from the local contexts.
For the algorithm setup, we use set the number of clusters to a larger number than the original sampling distribution to examine how the algorithm behaves in this situation. We use set the numbers of clusters to 3 local clusters within each context, and up to 10 global clusters overall.
The inference in the model is done via MCMC sampling. We run the Gibbs sampler for 500 iterations, out of which 200 iterations are the burn-in iterations. Note that in larger non-simulated scenarios, the number of iterations should be larger.
set.seed(1) # Number of elements in each cluster, follows the table given above groupCounts <- c(50, 10, 40, 60) # Centers of clusters means <- c(-1.5,1.5) # Helper function to generate test data testData <- generateTestData_2D(groupCounts, means) datasets <- testData$data
We look at the distribution of samples in the two contexts. The colours represent the four distinct clusters.
qplot(datasets[[1]][,1], datasets[[1]][,2], col=factor(testData$groups)) + geom_point(size=3) + ggtitle("Context 1") + xlab("x") + ylab("y") + scale_color_discrete(name="Cluster")
qplot(datasets[[2]][,1], datasets[[2]][,2], col=factor(testData$groups)) + geom_point(size=3) + ggtitle("Context 2") + xlab("x") + ylab("y") + scale_color_discrete(name="Cluster")
Now we set-up the algorithm. We assume Gaussian distribution with diagonal covariance function in each context. As mentioned above, we pre-specify a larger number of clusters for the algorithm. The number of clusters that we use is not necessarily the number of clusters that the algorithm is going to use to model the data. It only serves as an upper limit on the number of clusters.
# Setup of the algorithm dataDistributions <- 'diagNormal' # Pre-specify number of clusters clusterCounts <- list(global=10, context=c(3,3)) # Set number of iterations # The following is ONLY FOR SIMULATION PURPOSES # Use larger number of iterations for real-life data maxIter <- 300 burnin <- 200 lag <- 2 # Thinning of samples
Finally, we can run the context-dependent clustering algorihm.
# Run context-dependent clustering results <- contextCluster(datasets, clusterCounts, maxIter = maxIter, burnin = burnin, lag = lag, dataDistributions = 'diagNormal', verbose = F) # Extract resulting cluster assignments samples <- results$samples # Extract global cluster assignments for each MCMC sample clusters <- laply(1:length(samples), function(i) samples[[i]]$Global)
We can check convergence of the model by looking at log likelihood values over the MCMC iterations:
logliks <- results$logliks qplot(1:maxIter, logliks) + geom_line() + xlab("MCMC iterations") + ylab("Log likelihood")
As part of the training, we also compute the Deviance Information Criterion (DIC). The DIC is
a Bayesian model selection method that can be used to select the number of clusters.
The DIC value is returned in results$DIC
.
Models that better fit the data will result in lower values of DIC than worse models. For
example, if we fit a model with number of clusters that is too small, we get higher DIC
value than for the original result.
wrongClusterCounts <- list(global=2, context=c(2,1)) worseResults <- contextCluster(datasets, wrongClusterCounts, maxIter = maxIter, burnin = burnin, lag = lag, dataDistributions = 'diagNormal', verbose = F) print(paste('Original model has lower (better) DIC:', results$DIC)) print(paste('Worse model has higher (worse) DIC:', worseResults$DIC))
We can also look at the number of global clusters that were
identified in the datasets. The plot below shows the number of global
clusters across MCMC samples. This is the number of actually occupied global
clusters, which can be smaller than the number of global clusters specified
when running the contextCluster
function.
cc <- numberOfClusters(clusters) qplot(seq(from=burnin, to = maxIter, by=lag), cc) + geom_line() + xlab("MCMC iterations") + ylab("Number of clusters")
Here we look at the posterior sizes of the individual global clusters across the MCMC iterations and then we show a box plot with the estimated sizes. The labels of global clusters represent the corresponding combinations of local clusters.
clusterLabels <- unique(clusters %>% as.vector) sizes <- matrix(nrow=nrow(clusters), ncol=length(clusterLabels)) for (ci in 1:length(clusterLabels)) { sizes[,ci] <- rowSums(clusters == clusterLabels[ci]) } sizes <- sizes %>% as.data.frame colnames(sizes) <- clusterLabels boxplot(sizes,xlab="Global combined clusters", ylab="Cluster size")
There are several approaches to estimate hard cluster assignments from MCMC samples. If the MCMC chain converged to a stable results, it is possible to use one of the samples as the resulting cluster assignment.
clusteringResult <- samples[[length(samples)]]
A more principled approach is to look at which data points were assigned into the same cluster across the MCMC samples. We can explore this using the posterior co-clustering matrix, which estimates the posterior probability that two samples belong to the same cluster.
# Compute the co-clustering matrix from global cluster assignments coclust <- coclusteringMatrix(clusters) # Plot the co-clustering matrix as a heatmap require(gplots) mypalette <- colorRampPalette(rev(c('#d7191c','#fdae61','#ffffbf','#abd9e9','#4395d2')), space = "Lab")(100) h <- heatmap.2( coclust, col=mypalette, trace='none', dendrogram='row', labRow='', labCol='', key = TRUE, keysize = 1.5, density.info=c("none"), main="MCMC co-clustering matrix", scale = "none")
We can then use the posterior co-clustering matrix to compute the resulting hard clustering using hierarchical clustering. Note that for this step, we need to specify the number of clusters that we want to obtain. This step should be guided by the posterior number of clusters estimated by the algorithm (see above).
diag(coclust) <- 1 fit <- hclust(as.dist(1 - coclust)) hardAssignments <- cutree(fit, k=4)
Now we can check if the estimated global cluster assignments correspond to the true assignments that were used to generate the simulated dataset. We use the adjusted Rand index (ARI), which measures how well do two sets of assignments correspond to each other. The following plot shows the ARI across the MCMC iterations, and also the ARI of the result obtained from the co-clustering matrix. ARI equal to 1 represents complete agreement between the estimated assignments and the true clustering, ARI equal to 0 corresponds to random assignments.
aris <- laply(1:nrow(clusters), function(i) mclust::adjustedRandIndex(clusters[i,], testData$groups)) %>% as.data.frame colnames(aris) <- "ARI" aris$Iteration <- seq(from=burnin, to=maxIter, by=lag) coclustAri <- mclust::adjustedRandIndex(hardAssignments, testData$groups) aris$Coclust <- coclustAri ggplot(aris, aes(x=Iteration, y=ARI, colour="MCMC iterations")) + geom_point() + ylim(0,1) + geom_smooth(size=1) + theme_bw() + geom_line(aes(x=Iteration, y=Coclust, colour="Co-clustering matrix"), size=1) + scale_colour_discrete(name="Cluster assignments")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.