maskmeans is a package implementing an aggregation/splitting multi-view K-means algorithm, starting from an initial hard clustering partition or a matrix of posterior probabilities (a soft clustering). The goal is to refine the clustering obtained on the first, primary view by using the additional data views; views that contain only noise or only partially concordant information are down-weighted by the algorithm.

We begin by loading the necessary packages for this vignette.

knitr::opts_chunk$set(dev='png')
library(fclust)
library(mclust)
library(maskmeans)
library(ggraph)
library(viridis)

Data simulation

To illustrate, we begin by simulating multi-view data (made up of six views) using the mv_simulate function. In the first view, clusters are arranged in a circular pattern around the origin, with one cluster at the origin; a total of nine true clusters are present in these data. In the second view, a portion of the clustering structure is present, but there are now pairs of overlapping clusters. In the third view, a clustering structure is present that has nothing in common with that of the first view. The fourth and fifth views are similar to the second, with the exception of being univariate. Finally, the sixth view contains clustering information that is partially concordant with the first view. We also define a vector containing the dimension (number of variables) of each view.

set.seed(12345)
sim <- mv_simulate(beta=5, n=200, K=9, sigma=1.5)
X <- sim$data         ## Simulated data
mv <- c(2,2,2,1,1,2)  ## Size of each view
all.equal(sum(mv), ncol(X))  ## Check that view sizes match the data columns
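
As a quick check (assuming, as used throughout this vignette, that mv_simulate returns the simulated observations in its data component and a matrix of true per-view labels in its labels component):

dim(X)            ## 200 observations, sum(mv) = 10 variables
head(sim$labels)  ## true cluster labels; column 1 corresponds to the first view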

We can visualize these data, along with the true cluster labels corresponding to the first view, with the mv_plot function.

mv_plot(mv_data=X, mv=mv, labels=sim$labels[,1])

Before proceeding, we obtain initial hard and soft clusterings of the first view; for the former, we use a simple K-means algorithm, and for the latter, the FKM algorithm from the fclust package. We also create hard and soft initializations with a smaller number of clusters, for use as illustrations of the splitting algorithm.

aggregate_hard_init <- kmeans(X[,1:2], centers=15)$cluster    ## hard partition, K=15
aggregate_soft_init <- FKM(X[,1:2], k=15, maxit=10, RS=10)$U  ## fuzzy memberships, K=15
split_hard_init <- kmeans(X[,1:2], centers=2)$cluster         ## hard partition, K=2
split_soft_init <- FKM(X[,1:2], k=2, maxit=10, RS=10)$U       ## fuzzy memberships, K=2
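
As a sanity check, the fuzzy membership matrix returned by FKM should have one row per observation and one column per cluster, with each row summing to 1:

dim(aggregate_soft_init)  ## 200 x 15
all.equal(unname(rowSums(aggregate_soft_init)), rep(1, nrow(X)))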

Aggregation multi-view algorithm with hard clustering

In this section, we illustrate the output of the aggregation multi-view algorithm with a hard clustering initialization.

hard_aggreg <- maskmeans(mv_data=X, mv=mv, clustering_init=aggregate_hard_init, 
                         type = "aggregation", gamma=2) 

We can check the number of clusters selected via the DDSE algorithm and produce the relevant diagnostic plots from the maskmeans output.

hard_aggreg$final_K
p <- maskmeans_plot(hard_aggreg)
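
The full set of components returned by maskmeans, including the estimated view weights, can be inspected directly (rather than assuming particular component names):

str(hard_aggreg, max.level = 1)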

If desired, we can plot the original data with the new labels superimposed, and compute the adjusted Rand index (ARI) of both the final classification and the initial K-means labels against the true simulated labels of the first view.

mv_plot(mv_data=X, mv=mv, labels=hard_aggreg$final_classification) +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE)
adjustedRandIndex(hard_aggreg$final_classification, sim$labels[,1])
adjustedRandIndex(aggregate_hard_init, sim$labels[,1])
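
Beyond a single ARI value, a cross-tabulation of the refined labels against the true labels of the first view shows how the nine simulated clusters are recovered:

table(hard_aggreg$final_classification, sim$labels[,1])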

Aggregation multi-view algorithm with soft clustering

In this section, we illustrate the output of the aggregation multi-view algorithm with a soft clustering initialization.

soft_aggreg <- maskmeans(mv_data=X, mv=mv, clustering_init=aggregate_soft_init, 
                         type = "aggregation", gamma=2) 

We can check the number of clusters selected via the DDSE algorithm and produce the relevant diagnostic plots from the maskmeans output.

soft_aggreg$final_K
p <- maskmeans_plot(soft_aggreg)

If desired, we can plot the original data with the new labels superimposed, and compute the ARI of both the final classification and the initial fuzzy labels (assigning each observation to its cluster of maximal membership) against the true simulated labels.

mv_plot(mv_data=X, mv=mv, labels=soft_aggreg$final_classification) +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE)
adjustedRandIndex(soft_aggreg$final_classification, sim$labels[,1])
adjustedRandIndex(apply(aggregate_soft_init, 1, which.max), sim$labels[,1])

Splitting multi-view algorithm with hard clustering

In this section, we illustrate the output of the splitting multi-view algorithm with a hard clustering initialization.

hard_split <- maskmeans(mv_data=X, mv=mv, clustering_init=split_hard_init, Kmax=20,
                         type = "splitting", gamma=2, perCluster_mv_weights=FALSE) 

We can check the number of clusters selected via the DDSE algorithm and produce the relevant diagnostic plots from the maskmeans output.

hard_split$final_K
p <- maskmeans_plot(hard_split)

If desired, we can plot the original data with the new labels superimposed, and compute the ARI of both the final classification and the initial two-cluster K-means labels against the true simulated labels.

mv_plot(mv_data=X, mv=mv, labels=hard_split$final_classification) +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE)
adjustedRandIndex(hard_split$final_classification, sim$labels[,1])
adjustedRandIndex(split_hard_init, sim$labels[,1])

Splitting multi-view algorithm with hard clustering and per-cluster weights

In this section, we illustrate the output of the splitting multi-view algorithm with a hard clustering initialization and per-cluster weights.

hard_split_perCluster <- maskmeans(mv_data=X, mv=mv, clustering_init=split_hard_init, Kmax=20,
                         type = "splitting", gamma=2, perCluster_mv_weights=TRUE) 

We can check the number of clusters selected via the DDSE algorithm and produce the relevant diagnostic plots from the maskmeans output.

hard_split_perCluster$final_K
p <- maskmeans_plot(hard_split_perCluster, type = c("criterion", "tree"))

This algorithm also outputs per-cluster, per-view weights as splits occur; the per-cluster weights at each split can be visualized using the tree_perClusterWeights plot type:

p <- maskmeans_plot(hard_split_perCluster, type = "tree_perClusterWeights")

The full set of weights can also be plotted using a heatmap as follows:

p <- maskmeans_plot(hard_split_perCluster, type = c("weights"))

If desired, we can plot the original data with the new labels superimposed, and compute the ARI of both the final classification and the initial two-cluster K-means labels against the true simulated labels.

mv_plot(mv_data=X, mv=mv, labels=hard_split_perCluster$final_classification) +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE)
adjustedRandIndex(hard_split_perCluster$final_classification, sim$labels[,1])
adjustedRandIndex(split_hard_init, sim$labels[,1])

Splitting multi-view algorithm with soft clustering

In this section, we illustrate the output of the splitting multi-view algorithm with a soft clustering initialization. (TODO)

Splitting multi-view algorithm with soft clustering and per-cluster weights

In this section, we illustrate the output of the splitting multi-view algorithm with a soft clustering initialization and per-cluster weights. (TODO)

Other considerations

The maskmeans function can also take as input a list of data views, in which case the dimension of each view is calculated automatically.

Xlist <- list(X[,1:2], X[,3:4], X[,5:6], matrix(X[,7], ncol=1), 
              matrix(X[,8], ncol=1), X[,9:10])
hard_aggreg_list <- maskmeans(mv_data=Xlist, 
                              clustering_init=aggregate_hard_init, 
                              type = "aggregation", gamma=2) 
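
Assuming the algorithm is deterministic for a given initialization, the list input should yield the same result as the matrix input used earlier:

## Should match the earlier matrix-input run (same initialization and gamma)
all.equal(hard_aggreg_list$final_classification, hard_aggreg$final_classification)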

Reproducibility

sessionInfo()

