====================================================
This package implements a clustering algorithm similar to Expectation Maximization algorithm for multivariate Gaussian finite mixture models by using Robust S estimator. Its main advantage is that The estimator is resistant to outliers, that means that results of estimator are still correct when there are atipycal values in the sample.
The reference is Robust Model-Based Clustering, Juan D. Gonzalez, Ricardo Maronna , Victor J. Yohai, and Ruben H. Zamar arxiv:2102.06851.
Firstly, you should install the package from github, by typing
devtools::install_github("jdgonzalezwork/RMBC")
where the package devtools
must be previously installed.
The following routine generates synthetic data and runs the RMBC
algorithm.
library("RMBC")
# Generate Sintetic data (three normal cluster in two dimension)
# clusters have different shapes and orentation.
# The data is contaminated uniformly (level 20%).
################################################
#### Start data generating process ############
##############################################
# generates base clusters
set.seed(1)
Z1 <- c(rnorm(100,0),rnorm(100,0),rnorm(100,0))
Z2 <- rnorm(300);
X <- matrix(0, ncol=2,nrow=300);
X[,1]=Z1;X[,2]=Z2
true.cluster= c(rep(1,100),rep(2,100),rep(3,100))
# rotate, expand and translate base clusters
theta=pi/3;
aux1=matrix(c(cos(theta),-sin(theta),sin(theta),cos(theta)),nrow=2)
aux2=sqrt(4)*diag(c(1,1/4))
B=aux1%*%aux2%*%t(aux1)
X[true.cluster==3,]=X[true.cluster==3,]%*%aux2%*%aux1 +
matrix(c(15,2), byrow = TRUE,nrow=100,ncol=2)
X[true.cluster==2,2] = X[true.cluster==2,2]
X[true.cluster==1,2] = X[true.cluster==1,2]*0.1
X[true.cluster==1, ] = X[true.cluster==1,]+
matrix(c(-15,-1),byrow = TRUE,nrow=100,ncol=2)
### Generate 60 sintetic outliers (contamination level 20%)
outliers=sample(1:300,60)
X[outliers, ] <- matrix(runif( 40, 2 * min(X), 2 * max(X) ),
ncol = 2, nrow = 60)
###############################################
#### END data generating process ############
#############################################
### APLYING RMBC ALGORITHM
ret = RMBC(Y=X, K=3,max_iter = 82)
cluster = ret$cluster
#############################################
### plotting results ########################
#############################################
oldpar=par(mfrow=c(1,2))
plot(X, main="Actual Clusters")
for (j in 1:3){
points(X[true.cluster==j,],pch=19, col=j+1)
}
points(X[outliers,],pch=19,col=1)
plot(X,main="Clusters Estimation")
for (j in 1:3){
points(X[cluster==j,],pch=19, col=j+1)
}
points(X[ret$outliers,],pch=19,col=1)
par(oldpar)
The graphical result of this routine can be viewed below, on the left image, the actual 3 clusters are in different colors, whereas the atypical values are represented in black. The right image shows how the algorithm was able to detect the three clusters and outliers. Of course this is a small example, but in the reference [1], this routine is applied to several synthetic scenarios as well as to real data.
The preprint [1] contains theoretical results and comparisons with other robust model-based clustering procedures. It also has technical details and applications to real data.
References:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.