maskmeans is a package to perform an aggregation/splitting multi-view K-means algorithm, starting with an initial clustering partition or matrix of posterior probabilities. The goal is to refine/improve the clustering obtained on the first, primary view by using additional data views; in addition, views which contain only noise or partially concordant information are down-weighted by the algorithm.
We begin by loading the necessary packages for this vignette.
knitr::opts_chunk$set(dev='png')
library(fclust)
library(mclust)
library(maskmeans)
library(ggraph)
library(viridis)
To illustrate, we begin by simulating some multi-view data (made up of six views) using the mv_simulate function. In the first view of these data, clusters are found in a circular pattern around the origin, with one cluster at the origin. A total of nine true clusters are present in these data. In the second view, a portion of the clustering structure is present but there are now pairs of overlapping clusters. In the third view, a clustering structure is present that has nothing in common with that of the first view. The fourth and fifth views are similar to the second, except that they are univariate. Finally, the sixth view includes clustering information that is partially concordant with the first view. We also define a vector containing the dimension of each view.
set.seed(12345)
sim <- mv_simulate(beta=5, n=200, K=9, sigma=1.5)
X <- sim$data ## Simulated data
mv <- c(2,2,2,1,1,2) ## Size of each view
all.equal(sum(mv), ncol(X))
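To make the role of the view-size vector concrete, the following base R sketch shows how a vector like mv maps onto column ranges of a single data matrix. It uses a small toy matrix (X_toy) rather than the simulated data, and the helper names view_end, view_start, and view_cols are illustrative only; they are not part of maskmeans.

```r
# Toy stand-in for the simulated data: 5 rows, 10 columns (values are arbitrary)
X_toy <- matrix(seq_len(50), nrow = 5, ncol = 10)
mv <- c(2, 2, 2, 1, 1, 2)   # size of each view, as in the vignette

# Last and first column index of each view
view_end   <- cumsum(mv)
view_start <- view_end - mv + 1
view_cols  <- Map(seq, view_start, view_end)

view_cols[[1]]   # columns belonging to the first view
view_cols[[4]]   # the fourth view is a single column

# Extract views; drop = FALSE keeps a matrix even for a single column
first_view  <- X_toy[, view_cols[[1]], drop = FALSE]
fourth_view <- X_toy[, view_cols[[4]], drop = FALSE]
```

With this layout, the first view corresponds to columns 1 and 2, which is why the initial clusterings of the first view are computed on X[,1:2].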
We can visualize these data, along with the true cluster labels corresponding to the first view, with the mv_plot function.
mv_plot(mv_data=X, mv=mv, labels=sim$labels[,1])
Before proceeding, we obtain an initial hard and soft clustering of the first view. For the former, we use a simple K-means algorithm; for the latter, we use the FKM algorithm from the fclust package. We also create hard and soft initializations with a smaller number of clusters (two rather than fifteen) to illustrate the splitting algorithm.
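## Initializations with 15 clusters, for the aggregation algorithm
aggregate_hard_init <- kmeans(X[,1:2], centers=15)$cluster
aggregate_soft_init <- FKM(X[,1:2], k=15, maxit=10, RS=10)$U
## Initializations with 2 clusters, for the splitting algorithm
split_hard_init <- kmeans(X[,1:2], centers=2)$cluster
split_soft_init <- FKM(X[,1:2], k=2, maxit=10, RS=10)$U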
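The two kinds of initialization differ only in shape: a hard clustering is a vector of labels, while a soft clustering is a matrix with one row per observation and one column per cluster. As a minimal base R sketch of how the two relate, using a made-up membership matrix (U_toy and labels_toy are illustrative, not output of FKM or kmeans):

```r
# Toy membership matrix: 4 observations (rows), 3 clusters (columns); rows sum to 1
U_toy <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.3, 0.3, 0.4,
                  0.5, 0.4, 0.1),
                ncol = 3, byrow = TRUE)

# Soft -> hard: assign each observation to its highest-membership cluster
hard_from_soft <- apply(U_toy, 1, which.max)

# Hard -> soft: one-hot (0/1) membership matrix built from a label vector
labels_toy  <- c(1, 2, 3, 1)
U_from_hard <- diag(3)[labels_toy, , drop = FALSE]
```

The soft-to-hard step is the same conversion applied later in this vignette when the soft initialization is compared with the true labels.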
In this section, we illustrate the output of the aggregation multi-view algorithm with hard clustering initialization.
hard_aggreg <- maskmeans(mv_data=X, mv=mv, clustering_init=aggregate_hard_init, type = "aggregation", gamma=2)
We can check the number of clusters selected via the DDSE algorithm and draw the associated plots from the maskmeans output.
hard_aggreg$final_K
p <- maskmeans_plot(hard_aggreg)
If desired, we can plot the original data with the new labels superimposed, and calculate the adjusted Rand index of both the final classification and the initial K-means partition of the first view against the true simulated labels.
mv_plot(mv_data=X, mv=mv, labels=hard_aggreg$final_classification) +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE)
adjustedRandIndex(hard_aggreg$final_classification, sim$labels[,1])
adjustedRandIndex(aggregate_hard_init, sim$labels[,1])
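For readers unfamiliar with the adjusted Rand index, the sketch below computes it in base R from the contingency table of two label vectors, using the standard pair-counting formula. The function name ari_sketch and the toy label vectors are illustrative; this is not the mclust implementation, and it does not handle the degenerate case in which both partitions place all observations in a single cluster.

```r
# Adjusted Rand index from the contingency table of two label vectors
ari_sketch <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in partition a
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs grouped together in partition b
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

truth <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
ari_same      <- ari_sketch(truth, truth)                         # identical partitions
ari_relabeled <- ari_sketch(truth, c(3, 3, 3, 1, 1, 1, 2, 2, 2))  # same partition, new label names
ari_partial   <- ari_sketch(truth, c(1, 1, 2, 2, 2, 3, 3, 3, 1))  # partially matching partition
```

The index depends only on which observations are grouped together, not on the label values, so the initial and final classifications can be compared with the true labels even when their cluster numbering differs.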
In this section, we illustrate the output of the aggregation multi-view algorithm with soft clustering initialization.
soft_aggreg <- maskmeans(mv_data=X, mv=mv, clustering_init=aggregate_soft_init, type = "aggregation", gamma=2)
We can check the number of clusters selected via the DDSE algorithm and draw the associated plots from the maskmeans output.
soft_aggreg$final_K
p <- maskmeans_plot(soft_aggreg)
If desired, we can plot the original data with the new labels superimposed, and calculate the adjusted Rand index of both the final classification and the initial fuzzy clustering of the first view (with each observation assigned to its highest-membership cluster) against the true simulated labels.
# mv_plot(mv_data=X, mv=mv, labels=soft_aggreg$final_classification)
adjustedRandIndex(soft_aggreg$final_classification, sim$labels[,1])
adjustedRandIndex(apply(aggregate_soft_init, 1, which.max), sim$labels[,1])
In this section, we illustrate the output of the splitting multi-view algorithm with hard clustering initialization.
hard_split <- maskmeans(mv_data=X, mv=mv, clustering_init=split_hard_init, Kmax=20, type = "splitting", gamma=2, perCluster_mv_weights=FALSE)
We can check the number of clusters selected via the DDSE algorithm and draw the associated plots from the maskmeans output.
hard_split$final_K
p <- maskmeans_plot(hard_split)
If desired, we can plot the original data with the new labels superimposed, and calculate the adjusted Rand index of both the final classification and the initial K-means partition of the first view against the true simulated labels.
mv_plot(mv_data=X, mv=mv, labels=hard_split$final_classification) +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE)
adjustedRandIndex(hard_split$final_classification, sim$labels[,1])
adjustedRandIndex(split_hard_init, sim$labels[,1])
In this section, we illustrate the output of the splitting multi-view algorithm with hard clustering initialization and per-cluster weights.
hard_split_perCluster <- maskmeans(mv_data=X, mv=mv, clustering_init=split_hard_init, Kmax=20, type = "splitting", gamma=2, perCluster_mv_weights=TRUE)
We can check the number of clusters selected via the DDSE algorithm and draw the associated plots from the maskmeans output.
hard_split_perCluster$final_K
p <- maskmeans_plot(hard_split_perCluster, type = c("criterion", "tree"))
This algorithm also outputs per-cluster and per-view weights as splits occur; the per-cluster weights at each split can be visualized using the tree_perClusterWeights plot:
hard_split_perCluster$final_K
p <- maskmeans_plot(hard_split_perCluster, type = "tree_perClusterWeights")
The full set of weights can also be plotted using a heatmap as follows:
p <- maskmeans_plot(hard_split_perCluster, type = c("weights"))
If desired, we can plot the original data with the new labels superimposed, and calculate the adjusted Rand index of both the final classification and the initial K-means partition of the first view against the true simulated labels.
mv_plot(mv_data=X, mv=mv, labels=hard_split_perCluster$final_classification) +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE)
adjustedRandIndex(hard_split_perCluster$final_classification, sim$labels[,1])
adjustedRandIndex(split_hard_init, sim$labels[,1])
In this section, we illustrate the output of the splitting multi-view algorithm with soft clustering initialization. (TODO)
In this section, we illustrate the output of the splitting multi-view algorithm with soft clustering initialization and per-cluster weights. (TODO)
The maskmeans function can also take as input a list of data views, in which case the dimension of each view is calculated automatically.
Xlist <- list(X[,1:2], X[,3:4], X[,5:6],
              matrix(X[,7], ncol=1), matrix(X[,8], ncol=1), X[,9:10])
hard_aggreg_list <- maskmeans(mv_data=Xlist, clustering_init=aggregate_hard_init, type = "aggregation", gamma=2)
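As a rough illustration of why the view sizes need not be supplied in this case, the base R sketch below recovers them from a list of views built from a toy matrix. How maskmeans performs this step internally is not shown here; the names X_toy, Xlist_toy, mv_from_list, and X_back are illustrative only.

```r
# Toy stand-in for the simulated data: 5 rows, 10 columns of random values
X_toy <- matrix(rnorm(50), nrow = 5, ncol = 10)

# drop = FALSE keeps single-column views as matrices,
# which has the same effect as matrix(X[, 7], ncol = 1)
Xlist_toy <- list(X_toy[, 1:2], X_toy[, 3:4], X_toy[, 5:6],
                  X_toy[, 7, drop = FALSE], X_toy[, 8, drop = FALSE],
                  X_toy[, 9:10])

# View sizes recovered from the list itself
mv_from_list <- vapply(Xlist_toy, ncol, integer(1))

# Binding the views back together reproduces the original matrix
X_back <- do.call(cbind, Xlist_toy)
all.equal(X_back, X_toy)
```

Keeping the univariate views as one-column matrices matters: a plain X[,7] is a vector, for which ncol returns NULL rather than 1.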
sessionInfo()