K-means clustering using MUS and other pivotal algorithms"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

In this vignette we explore the K-means algorithm performed using the MUS algorithm and other pivotal methods through the function piv_KMeans of the pivmet package. First of all, we load the package:

library(pivmet)
library(mvtnorm)

Pivotal algorithms: how they works, and why

We present here a simulated case for applying our procedure. Given $n$ units $y_1,\ldots,y_n$:

We propose four alternative methods for achieving this task. Let $j, \ j=1,\ldots,k$ be the group containing units $\mathcal J_j$, the user may choose ${i^*}\in\mathcal J_j$ that maximizes one of the quantities:

\begin{align} & \sum_{p\in\mathcal J_j} c_{{i^}p} \ & \sum_{p\in\mathcal J_j} c_{{i^}p} - \sum_{p\not\in\mathcal J_j} c_{{i^}p}. \end{align*}

These methods give the unit that maximizes the global within similarity (maxsumint) and the unit that maximizes the difference between global within and between similarities (maxsumdiff), respectively. Alternatively, we may choose $i^{*} \in\mathcal J_j$, which minimizes:

$$\sum_{p\not\in\mathcal J_j} c_{i^{*}p},$$

obtaining the most distant unit among the members that minimize the global dissimilarity between one group and all the others (minsumnoint).

MUS algorithm described in @egidi2018mus is a sequential procedure for extracting identity submatrices of small rank and pivotal units from large and sparse matrices. The procedure has already been satisfactorily applied for solving the label switching problem in Bayesian mixture models [@egidi2018relabelling].

With the function MUS the user may detect pivotal units from a co-association matrix C, obtained through $H$ different partitions, whose units may belong to $k$ groups, expressed by the argument clusters. We remark here that MUS algorithm may be performed only when $k <5$.

#generate some data

set.seed(123)
n  <- 620
centers  <- 3
n1 <- 20
n2 <- 100
n3 <- 500
x  <- matrix(NA, n,2)
truegroup <- c( rep(1,n1), rep(2, n2), rep(3, n3))


x[1:n1,] <- rmvnorm(n1, c(1,5), sigma=diag(2))
x[(n1+1):(n1+n2),] <- rmvnorm(n2, c(4,0), sigma=diag(2))
x[(n1+n2+1):(n1+n2+n3),] <- rmvnorm(n3,c(6,6),sigma=diag(2))

H <- 1000
a <- matrix(NA, H, n)

  for (h in 1:H){
    a[h,] <- kmeans(x,centers)$cluster
  }

#build the similarity matrix
sim_matr <- matrix(NA, n,n)
 for (i in 1:(n-1)){
    for (j in (i+1):n){
      sim_matr[i,j] <- sum(a[,i]==a[,j])/H
      sim_matr[j,i] <- sim_matr[i,j]
    }
  }

cl <- kmeans(x, centers, nstart = 10)$cluster
mus_alg <- MUS(C = sim_matr, clusters = cl, prec_par = 5)

piv_KMeans: k-means clustering via pivotal units

In some situations, classical K-means fails in recognizing the true groups:

 kmeans_res <- kmeans(x, centers, nstart = 10)
colors_cluster <- c("grey", "darkolivegreen3", "coral")
colors_centers <- c("black", "darkgreen", "firebrick")

graphics::plot(x, col = colors_cluster[truegroup]
                 ,bg= colors_cluster[truegroup], pch=21,
                  xlab="y[,1]",
                  ylab="y[,2]", cex.lab=1.5,
                  main="True data", cex.main=1.5)

graphics::plot(x, col = colors_cluster[kmeans_res$cluster], 
      bg=colors_cluster[kmeans_res$cluster], pch=21, xlab="y[,1]",
   ylab="y[,2]", cex.lab=1.5,main="K-means",  cex.main=1.5)
 points(kmeans_res$centers, col = colors_centers[1:centers], 
   pch = 8, cex = 2)

For instance, when the groups are unbalanced or non-spherical shaped, we may need a more robust version of the classical K-means. The pivotal units may be used as initial seeds for K-means method [@kmeans]. The function piv_KMeans works as the kmeans function, with some optional arguments

piv_res <- piv_KMeans(x, centers)
#par(mfrow=c(1,2), pty="s")
colors_cluster <- c("grey", "darkolivegreen3", "coral")
colors_centers <- c("black", "darkgreen", "firebrick")
graphics::plot(x, col = colors_cluster[truegroup],
   bg= colors_cluster[truegroup], pch=21, xlab="y[,1]",
   ylab="y[,2]", cex.lab=1.5,
   main="True data", cex.main=1.5)

graphics::plot(x, col = colors_cluster[piv_res$cluster],
   bg=colors_cluster[piv_res$cluster], pch=21, xlab="y[,1]",
   ylab="y[,2]", cex.lab=1.5,
   main="piv_Kmeans", cex.main=1.5)
points(x[piv_res$pivots[1],1], x[piv_res$pivots[1],2],
   pch=24, col=colors_centers[1],bg=colors_centers[1],
   cex=1.5)
points(x[piv_res$pivots[2],1], x[piv_res$pivots[2],2],
   pch=24,  col=colors_centers[2], bg=colors_centers[2],
   cex=1.5)
points(x[piv_res$pivots[3],1], x[piv_res$pivots[3],2],
   pch=24, col=colors_centers[3], bg=colors_centers[3],
   cex=1.5)
points(piv_res$centers, col = colors_centers[1:centers],
   pch = 8, cex = 2)

The function piv_KMeans has optional arguments:

References



Try the pivmet package in your browser

Any scripts or data that you put into this service are public.

pivmet documentation built on March 7, 2023, 6:34 p.m.