rimle: Robust Improper Maximum Likelihood Clustering

Description Usage Arguments Details Value References See Also Examples

View source: R/rimle.R

Description

rimle searches for G approximately Gaussian-shaped clusters with/without noise/outliers. The method's tuning controlling the noise level is fixed and is to be provided by the user or will be guessed by the function in a rather quick and dirty way (otrimle performs a more sophisticated data-driven choice).

Usage

1
rimle(data, G, initial=NULL, logicd=NULL, npr.max=0.5, erc=50, iter.max=500, tol=1e-6)

Arguments

data

A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables. Categorical variables and NA values are not allowed.

G

An integer specifying the number of clusters.

initial

An integer vector specifying the initial cluster assignment with 0 denoting noise/outliers. If NULL (default) initialization is performed using InitClust.

logicd

A number log(icd), where 0 <= icd < Inf is the value of the improper constant density (icd). This is the RIMLE's tuning for controlling the size of the noise. If logicd=NULL (default), an icd value is guessed based on the data. A pure Gaussian Mixture Model fit is obtained with log(icd)=-Inf.

npr.max

A number in (0,1) specifying the maximum proportion of noise/outliers. This defines the noise proportion constraint.

erc

A number >=1 specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. This defines the eigenratio constraint. erc=1 enforces spherical clusters with equal covariance matrices. A large erc allows for large between-cluster covariance discrepancies. In order to facilitate the setting of erc, it is suggested to scale the columns of data (see scale) whenever measurement units of the different variables are grossly incompatible.

iter.max

An integer value specifying the maximum number of iterations allowed in the ECM-algorithm (see Details).

tol

Stopping criterion for the underlying ECM-algorithm. An ECM iteration stops if two successive improper log-likelihood values are within tol.

Details

The rimle function computes the RIMLE solution using the ECM-algorithm proposed in Coretto and Hennig (2017).

There may be datasets for which the function does not provide a solution based on default arguments. This corresponds to code=0 and flag=1 or flag=2 in the output (see Value-section below). This usually happens when some (or all) of the following circumstances occur: (i) log(icd) is too large; (ii) erc is too large; (iii) npr.max is too large; (iv) choice of the initial partition. In these cases it is suggested to find a suitable interval of icd values by using the otrimle function. The Details section of otrimle suggests several actions to take whenever a code=0 non-solution occurs.

An earlier approximate version of the algorithm was originally proposed in Coretto and Hennig (2016). Software for the original version of the algorithm can be found in the supplementary materials of Coretto and Hennig (2016).

Value

An S3 object of class 'rimle'. Output components are as follows:

code

An integer indicator for the convergence. code=0 if no solution is found (see Details); code=1 if the EM-algorithm did not converge within em.iter.max; code=2 convergence is fully achieved.

flag

A character string containing one or more flags related to the EM iteration at the optimal icd. flag=1 if it was not possible to prevent the numerical degeneracy of improper posterior probabilities (tau value below). flag=2 if enforcement of the noise proportion constraint failed for numerical reasons. flag=3 if enforcement of the eigenratio constraint failed for numerical reasons. flag=4 if the noise proportion constraint has been successfully applied at least once. flag=5 if the eigenratio constraint has been successfully applied at least once.

iter

Number of iterations performed in the underlying EM-algorithm.

logicd

Value of the log(icd).

iloglik

Value of the improper likelihood.

criterion

Value of the OTRIMLE criterion.

npr

Estimated expected noise proportion.

cpr

Vector of estimated expected cluster proportions (notice that sum(cpr)=1-npr).

mean

A matrix of dimension ncol(data) x G containing the mean parameters of each cluster (column-wise).

cov

An array of size ncol(data) x ncol(data) x G containing the covariance matrices of each cluster.

tau

A matrix of dimension nrow(data) x {1+G} where tau[i, 1+j] is the estimated (improper) posterior probability that the ith observation belongs to the jth cluster. tau[i,1] is the estimated (improper) posterior probability that ith observation belongs to the noise component.

smd

A matrix of dimension nrow(data) x G where smd[i,j] is the squared Mahalanobis distance of data[i,] from mean[,j] according to cov[,,j].

cluster

A vector of integers denoting cluster assignments for each observation. It's 0 for observations assigned to noise/outliers.

size

A vector of integers with sizes (counts) of each cluster.

References

Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi: 10.1080/01621459.2015.1100996

Coretto, P. and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. arXiv preprint available at arXiv:1309.6895.

See Also

plot.rimle, InitClust, otrimle,

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
## Load  Swiss banknotes data
data(banknote)
x <- banknote[,-1]

## -----------------------------------------------------------------------------
## EXAMPLE 1:
## Perform RIMLE with default inputs
## -----------------------------------------------------------------------------
set.seed(1)
a <- rimle(data = x, G = 2)
print(a)

## Plot clustering
plot(a, data = x, what = "clustering")

## P-P plot of the clusterwise empirical weighted squared Mahalanobis
## distances against the target distribution pchisq(, df=ncol(data))
plot(a, what = "fit")
plot(a, what = "fit", cluster = 1)



## -----------------------------------------------------------------------------
## EXAMPLE 2:
## Compare solutions for different choices of logicd
## -----------------------------------------------------------------------------
set.seed(1)

## Case 1: noiseless solution, that is fit a pure Gaussian Mixture Model
b1 <- rimle(data = x, G = 2, logicd = -Inf)
plot(b1, data=x, what="clustering")
plot(b1, what="fit")

## Case 2: low noise level
b2 <- rimle(data = x, G = 2, logicd = -100)
plot(b2, data=x, what="clustering")
plot(b2, what="fit")

## Case 3: medium noise level
b3 <- rimle(data = x, G = 2, logicd = -10)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")

## Case 3: large noise level
b3 <- rimle(data = x, G = 2, logicd = 5)
plot(b3, data=x, what="clustering")
plot(b3, what="fit")

otrimle documentation built on July 4, 2017, 9:24 a.m.

Related to rimle in otrimle...