Robust Improper Maximum Likelihood Clustering

Share:

Description

rimle searches for G approximately Gaussian-shaped clusters with/without noise/outliers. The method's tuning controlling the noise level is fixed and is to be provided by the user or will be guessed by the function in a rather quick and dirty way (otrimle performs a more sophisticated data-driven choice).

Usage

1
2
3
rimle(data, G, initial=NULL, logicd=NULL,
      npr.max=0.5, erc=50, det.min=.Machine$double.eps, cmstep=TRUE,
      em.iter.max=500, em.tol=1e-6, monitor=1)

Arguments

data

A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables. Categorical variables and NA values are not allowed.

G

An integer specifying the number of clusters.

initial

An integer vector specifying the initial cluster assignment with 0=noise/outliers. If NULL (default), initialization is performed using InitClust.

logicd

A number log(icd), where 0 <= icd < Inf is the value of the improper constant density (icd). This is the RIMLE's tuning for controlling the size of the noise. If logicd=NULL (default), an icd value is guessed based on the data. A pure Gaussian Mixture Model fit is obtained with log(icd)=-Inf.

npr.max

A number in (0,1) specifying the maximum proportion of noise/outliers. This defines the noise proportion constraint.

erc

A number >=1 specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. This defines the eigenratio constraint. erc=1 enforces spherical clusters with equal covariance matrices. A large erc allows for large between-cluster covariance discrepancies. In order to facilitate the setting of erc, it is suggested to scale the columns of data (see scale) whenever measurement units of the different variables are grossly incompatible.

det.min

Lower bound for the minimum determinant of covariance matrices. This is only active if cmstep=FALSE (see Details).

cmstep

A logical value. When set to TRUE the eigenratio constraint is enforced at each M-step of the underlying EM-algorithm (see Details).

em.iter.max

An integer value specifying the maximum number of iterations allowed in the underlying EM-algorithm.

em.tol

Stopping criterion for the the underlying EM-algorithm. An EM iteration stops if two successive improper log-likelihood values are within em.tol.

monitor

Set the verbosity level of tracing messages. Possible values are monitor=0 (no messages), monitor=1 and monitor=2 for increased verbosity.

Details

The rimle function allows to approximate the RIMLE solution with two different versions of the underlying EM-type algorithm.

ECM-algorithm: cmstep=TRUE

The RIMLE solution is obtained based on the ECM-algorithm proposed in Coretto and Hennig (2016). In this case both the eigenratio constraint and the noise proportion constraint are enforced in each conditional M-step of the algorithm.

Approximate EM-algorithm: cmstep=FALSE

This corresponds to the algorithm proposed in Coretto and Hennig (2015). In this case covariance matrices are regularized in each step based on det.min, and the eigenratio constraint is applied only at the end of the EM iteration.

The ECM-algorithm is the default choice. The Approximate EM-algorithm is often slower than the ECM-algorithm by a factor of two. Furthermore the Approximate EM-algorithm is more prone to lead to problems indicated by code=0 (see Value-section below) because of numerical degeneracies connected to a low value of min.det.

There may be datasets for which the function does not provide a solution based on default arguments. This corresponds to code=0 and flag=1 or 2 or 3 in the output (see Value-section below). This usually happens when some (or all) of the following circumstances occur: (i) log(icd) is too large; (ii) erc is too large; (iii) npr.max is too large; (iv) choice of the initial partition. In these cases it is suggested to find a suitable interval of icd values by using the otrimle function. The Details section of otrimle suggests several actions to take whenever a code=0 non-solution occurs.

Value

An S3 object of class 'rimle'. Output components are as follows:

code

An integer indicator for the convergence. code=0 if no solution is found (see Details); code=1 if the EM-algorithm did not converge within em.iter.max; code=2 convergence is fully achieved.

flag

A character string containing one or more flags related to the EM iteration at the optimal icd. flag=1 if it was not possible to prevent the numerical degeneracy of improper posterior probabilities (tau value below). flag=2 if enforcement of the noise proportion constraint failed for numerical reasons. flag=3 if enforcement of the eigenratio constraint failed for numerical reasons. flag=4 if the noise proportion constraint has been successfully applied at least once. flag=5 if the eigenratio constraint has been successfully applied at least once.

iter

Number of iterations performed in the underlying EM-algorithm.

logicd

Value of the log(icd).

iloglik

Value of the improper likelihood.

criterion

Value of the OTRIMLE criterion.

npr

Estimated expected noise proportion.

cpr

Vector of estimated expected cluster proportions (notice that sum(cpr)=1-npr).

mean

A matrix of dimension ncol(data) x G containing the mean parameters of each cluster (column-wise).

cov

An array of size ncol(data) x ncol(data) x G containing the covariance matrices of each cluster.

tau

A matrix of dimension nrow(data) x {1+G} where tau[i, 1+j] is the estimated (improper) posterior probability that the ith observation belongs to the jth cluster. tau[i,1] is the estimated (improper) posterior probability that ith observation belongs to the noise component.

smd

A matrix of dimension nrow(data) x G where smd[i,j] is the squared Mahalanobis distance of data[i,] from mean[,j] according to cov[,,j].

cluster

A vector of integers denoting cluster assignments for each observation. It's 0 for observations assigned to noise/outliers.

size

A vector of integers with sizes (counts) of each cluster.

References

Coretto, P. and C. Hennig (2015). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. To appear on the Journal of the American Statistical Association. arXiv preprint at arXiv:1406.0808 with (supplement).

Coretto, P. and C. Hennig (2016). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. arXiv preprint at arXiv:1309.6895.

See Also

plot.rimle, InitClust, otrimle,

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## -----------------------------------------------------------------------------
## EXAMPLE 1:
## Perform RIMLE with default inputs
## -----------------------------------------------------------------------------
set.seed(1)
a <- rimle(data=x, G=2)
print(a)

## Plot clustering
plot(a, data=x, what="clustering")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(a, what="fit")
plot(a, what="fit", cluster=1)



## -----------------------------------------------------------------------------
## EXAMPLE 2:
## Compare solutions for different choices of logicd
## -----------------------------------------------------------------------------
set.seed(1)

## Case 1: noiseless solution, that is fit a pure Gaussian Mixture Model
b1 <- rimle(data=x, G=2, logicd=-Inf)
plot(b1, data=x, what="clustering")
plot(b1, what="fit")

## Case 2: low noise level
b2 <- rimle(data=x, G=2, logicd=-100)
plot(b2, data=x, what="clustering")
plot(b2, what="fit")

## Case 3: medium noise level
b3 <- rimle(data=x, G=2, logicd=-10)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")

## Case 3: large noise level
b3 <- rimle(data=x, G=2, logicd=5)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")