In silverfoxxxx/package1: K Means Clustering on Data Matrix

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(package1)
library(matrixStats)
library(Rcpp)

To use the function k_means_cluster:

x = rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
k = 2
k_means_cluster(x, k)

Note that if the data matrix has more than three dimensions, visualizing the result requires additional work. If the input data matrix is two-dimensional, the clusters can be plotted directly:

output = k_means_cluster(x, k)
plot(output$data, col = output$clusters, main = "plot of clustering on data", xlab = "output$data[,1]", 
     ylab = "output$data[,2]")

The same dataset with a larger nstart value (this sacrifices efficiency for robustness):

output = k_means_cluster(x, k, nstart = 25)
plot(output$data, col = output$clusters, main = "plot of clustering with large nstart on data", xlab = "output$data[,1]", 
     ylab = "output$data[,2]")

The same dataset with a smaller nstart value (this sacrifices robustness and is not recommended):

output = k_means_cluster(x, k, nstart = 1)
plot(output$data, col = output$clusters, main = "plot of clustering with small nstart on data", xlab = "output$data[,1]", 
     ylab = "output$data[,2]")
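The trade-off behind nstart can be sketched with base R alone. The example below assumes that nstart means "number of random initializations, keeping the best one", which is how stats::kmeans defines it; whether k_means_cluster selects its best run by the same criterion is an assumption here. It runs 25 single-start fits and keeps the one with the smallest total within-cluster sum of squares.

```r
# Sketch only: uses stats::kmeans as a stand-in for k_means_cluster.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# 25 independent runs, each from a single random initialization
runs <- lapply(1:25, function(i) kmeans(x, centers = 2, nstart = 1))

# Total within-cluster sum of squares of every run
wss <- vapply(runs, function(r) r$tot.withinss, numeric(1))

# Keep the run with the smallest value, as a multi-start fit would
best <- runs[[which.min(wss)]]

range(wss)          # spread of solution quality across the starts
best$tot.withinss   # quality of the retained solution
```

If the range is wide, a single start can land in a poor local optimum, which is why a small nstart is less robust. The cost of a large nstart is roughly that many full fits.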

Another example, using simulated three-dimensional data:

x = rbind(matrix(rnorm(150, sd = 0.3), ncol = 3), matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3))
k_means_cluster(x, k)
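One possible way to plot data with more than two columns is to project it onto its first two principal components before plotting. The sketch below is not part of package1: it uses stats::prcomp for the projection and stats::kmeans as a stand-in for the cluster labels, and the names x3, labels and proj are introduced only for illustration.

```r
# Sketch only: project 3-D data to 2-D with PCA, then colour by cluster.
set.seed(1)
x3 <- rbind(matrix(rnorm(150, sd = 0.3), ncol = 3),
            matrix(rnorm(150, mean = 1, sd = 0.3), ncol = 3))

# Stand-in labels; k_means_cluster(x3, 2)$clusters could be used instead
labels <- kmeans(x3, centers = 2, nstart = 10)$cluster

# Principal component scores; keep the first two components
pca <- prcomp(x3)
proj <- pca$x[, 1:2]

plot(proj, col = labels,
     main = "clusters on the first two principal components",
     xlab = "PC1", ylab = "PC2")
```

The projection discards some variance, so clusters that look overlapping in the plot may still be separated in the original space.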

Another example, using simulated data with more than two clusters:

x = rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
k = 3
output = k_means_cluster(x, k)
plot(output$data, col = output$clusters, main = "plot of clustering on data", xlab = "output$data[,1]", 
     ylab = "output$data[,2]")

The same dataset when k is not the best choice (here k = 2 for data simulated from three groups):

k = 2
output = k_means_cluster(x, k)
plot(output$data, col = output$clusters, main = "plot of clustering on data", xlab = "output$data[,1]", 
     ylab = "output$data[,2]")
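A common heuristic for picking k is the "elbow" plot: compute the total within-cluster sum of squares for several values of k and look for the point where the curve flattens. The sketch below uses stats::kmeans rather than k_means_cluster, because the vignette does not show whether the package returns this quantity; the names ks and wss are illustrative.

```r
# Sketch only: elbow plot with stats::kmeans on the three-group data.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))

ks <- 1:6

# For k = 1 the within-cluster sum of squares is the total sum of squares
wss_k1 <- sum(scale(x, scale = FALSE)^2)

# For k >= 2, fit k-means with several starts and record tot.withinss
wss_rest <- vapply(ks[-1], function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
}, numeric(1))

wss <- c(wss_k1, wss_rest)

plot(ks, wss, type = "b",
     main = "total within-cluster sum of squares by k",
     xlab = "k", ylab = "tot.withinss")
```

For data simulated from three groups, the curve would be expected to drop sharply up to k = 3 and flatten afterwards, although the elbow is a visual heuristic and not a formal test.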

Comparison

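Note that k-means clustering depends on the initial centroids, and the cluster labels are assigned arbitrarily, so the same partition can be returned with different cluster numbers. The most reliable way to evaluate the package is therefore to compare its output centroids with those from stats::kmeans: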

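# Reference centroids from stats::kmeans, sorted by first coordinate so rows line up
c = kmeans(x, k)$centers
c = c[order(c[,1]),]
my_c = k_means_cluster(x, k)$centroids
# Compare absolute differences; a signed difference would also pass for large negative gaps
all(abs(c - my_c) < 1e-5)
# Rough timing of both implementations
system.time(kmeans(x, k))
system.time(k_means_cluster(x, k)$centroids)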
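Because cluster numbers are arbitrary, the cluster assignments themselves can also be compared in a way that ignores labels. The sketch below defines a hypothetical helper, same_partition, which is not part of package1: two label vectors describe the same partition when every cluster in one maps to exactly one cluster in the other. The two stats::kmeans fits stand in for the two implementations being compared.

```r
# Sketch only: label-invariant comparison of two clusterings.
# same_partition is a hypothetical helper, not a package1 function.
same_partition <- function(a, b) {
  tab <- table(a, b)
  # Each row and each column of the cross-table has exactly one non-zero cell
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Two independent fits; their cluster numbers may be swapped
fit_a <- kmeans(x, centers = 2, nstart = 10)
fit_b <- kmeans(x, centers = 2, nstart = 10)

same_partition(fit_a$cluster, fit_b$cluster)
```

With package1 loaded, fit_b$cluster could be replaced by the clusters element returned by k_means_cluster.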