README.md
In ktaucenters: Robust Clustering Procedures

ktaucenters package: Robust and efficient Clustering

Juan D. Gonzalez, Victor J. Yohai, Ruben H. Zamar and Douglas Carmona

This package implements a clustering algorithm similar to kmeans, it has two main advantages:

The estimator is resistant to outliers, that means that results of estimator are still correct when there are atypical values in the sample.
The estimator is efficient, roughly speaking, if there are not outliers in the sample (all data is good), results will be similar than those obtained by a classic algorithm (kmeans)

Clustering procedure is carried out by minimizing the overall robust scale so-called tau scale (see Yohai Gonzalez, Yohai and Zamar 2019 arxiv:1906.08198).

How to use the package ktaucenters

Load the package ktaucenters

rm(list=ls())
library(ktaucenters)

Generate synthetic data (three cluster well separated), and apply a classic algorithm (kmeans) and the robust ktaucenters.

# Generate synthetic data (three cluster well separated)
set.seed(1)
Z <- rnorm(600);
mues <- rep(c(-4, 0, 4), 200)
X <-  matrix(Z + mues, ncol=2)

# Applying the ROBUST algortihm
ktau_output <- ktaucenters(X, K = 3, nstart = 10)

# Applying the classic algortihm
kmeans_output <- kmeans(X, centers = 3, nstart = 10)

# Plotting the center results 
plot(X, main = "Efficiency")
points(ktau_output$centers, pch = 19, col = 2, cex = 2)
points(kmeans_output$centers, pch = 17, col = 3, cex = 2)
legend(-6, 6, pch = c(19,17), col = c(2,3), cex = 1, legend = c("ktau centers", "kmeans centers"))

Contaminate the previous data by replacing 60 observations to outliers located in a bounding box that contains the clean data. Then apply kmeans and ktaucenters algorithms.

# Generate 60 synthetic outliers (contamination level 20%)
X[sample(1:300,60), ] <- matrix(runif( 40, 2* min(X), 2 * max(X)),
                                ncol = 2, nrow = 60)

# Applying the ROBUST algortihm
ktau_output <- ktaucenters(X, K = 3, nstart = 10)

# Applying the classic algortihm

kmeans_output <- kmeans(X, centers = 3, nstart = 10)

# Plotting the center results 
plot(X, main = "Robustness")
points(ktau_output$centers, pch = 19, col = 2, cex = 2)
points(kmeans_output$centers, pch = 17, col = 3, cex = 2)
legend(-10, 10, pch = c(19, 17), col = c(2, 3), cex = 1, legend = c("ktau centers", "kmeans centers"))

After running the code, it can be observed that kmeans centers were very influenced by outliers, while ktaucenters results are still razonable.

Continuation from Example 2, for outliers recognition purposes we can see the ktau_output$outliers that indicates the indices that may be considered as outliers, on the other hand, the labels of each cluster found by the algorithm are coded with integers between 1 and K (in this case K=3), the variable ktau_output$clusters contains that information.

plot(X, main = "Estimated clusters and outliers detection")

# Plotting clusters 
for (j in 1:3){
  points(X[ktau_output$cluster==j, ], col=j+1)
}

# Plotting outliers 
points(X[ktau_output$outliers, ], pch = 19, col = 1, cex = 1)
legend(7, 15, pch = c(1,1,1,19), col = c(2,3,4,1), cex = 1,
       legend = c("cluster 1", "cluster 2", "cluster 3", "detected \n outliers"), bg = "gray")

The final figure contains clusters and outliers detected.

Improved-ktaucenters

The algorithm ktaucenters works well under noisy data, but fails when clusters have different size, shape and orientation, an algorithm suitable for this sort of data is improvektaucenters. To show how this algorithm works we use the data set so-called M5data from package tclust: tclust: Robust Trimmed Clustering, tclust. M5 data were generated by three normal bivariate distributions with different scales, one of the components is very overlapped with another one. A 10% background noise is added uniformly.

First, load the data and run the improvedktaucenters function.

# Load non spherical datadata 

library("tclust")
data("M5data")
X <- M5data[,1:2]
true.clusters <- M5data[,3]

# Estimate clusters
improved_output <- improvedktaucenters(X, K = 3, cutoff = 0.95)

Keep the results in the variable improved_output, that is a list that contains the fields outliers and cluster, among others. For example, we can have access to the cluster labeled as 2 by typing

X[improved_output$cluster==2, ].

If we want to know the values of outliers, type

X[improved_output$outliers, ].

By using these commands, it is easy to estimate the original clusters by means of improvedktaucenters routine.

The preprint arxiv:1906.08198 contains comparison with other robust clustering procedures as well as technical details and applications.