Robust Initialization for Model-based Clustering Methods

Description

Computes an initial cluster assignment by combining nearest-neighbor based clutter/noise detection with agglomerative hierarchical clustering based on maximum-likelihood criteria for Gaussian mixture models.

Usage

 InitClust(data, G, cpr.min={ncol(data)+1}/nrow(data),
           K=5, nstart.km=50, modelName="VVV", monitor=FALSE)
 

Arguments

data

A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables. Categorical variables and NA values are not allowed.

G

An integer specifying the number of clusters.

cpr.min

The minimum cluster proportion allowed in the initial clustering.

K

An integer specifying the number of nearest neighbors per point considered in the denoising step (see Details).

nstart.km

An integer specifying the number of random starts for the k-means step.

modelName

A character string indicating the covariance model to be used. Possible models are:
"E": equal variance (one-dimensional)
"V" : spherical, variable variance (one-dimensional)
"EII": spherical, equal volume
"VII": spherical, unequal volume
"EEE": ellipsoidal, equal volume, shape, and orientation
"VVV": ellipsoidal, varying volume, shape, and orientation (default).
See Details.

monitor

A logical value; TRUE means that tracing messages will be produced.

Details

The initialization is described in the supplementary material of Coretto and Hennig (2015). Noise/outliers are removed using the nearest-neighbor clutter/noise detection (NNC) of Byers and Raftery (1998). This step is performed with NNclean, and the input argument K is passed as k to NNclean. The result is a denoised version of data. The initial clustering is then obtained through the following steps. The step at which the initial clustering is found is reported in the code element of the output list (see Value).
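
The following is a minimal sketch of the denoising step, assuming the prabclus package (which provides NNclean), the mclust package (for the banknote data used in the Examples), and the default K = 5; it is an illustration, not the internal code of InitClust:

 library(prabclus)               # provides NNclean
 library(mclust)                 # provides the banknote data
 data(banknote)
 x  <- as.matrix(banknote[, -1])
 nn <- NNclean(x, k = 5)         # the K argument is passed as k to NNclean
 x.denoised <- x[nn$z == 1, ]    # nn$z == 0 flags noise/outliers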

Clustering steps:

Step 1: perform the model-based hierarchical clustering (MBHC) proposed in Fraley (1998). This step is carried out with hc, to which the input argument modelName is passed. See the Details section of hc for more information (a rough sketch follows this list).

Step 2: if clusters that are too small (cluster proportion < cpr.min) are found in the previous step, assign the small clusters to noise and perform MBHC again on the denoised data.

Step 3: if clusters that are too small are found in the previous step, assign the small clusters to noise and perform k-means on the denoised data.

Step 4: if clusters that are too small are found in the previous step, a completely random partition that satisfies cpr.min is returned.
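
As a rough illustration of Step 1, the cluster-size check, and the Step 3 fallback, continuing the sketch above with mclust's hc/hclass and base R's kmeans (again only an approximation under those assumptions, not the internal code of InitClust):

 ## Step 1: MBHC via hc from mclust; modelName is passed to hc
 tree <- hc(x.denoised, modelName = "VVV")
 cl   <- as.vector(hclass(tree, G = 2))      # cut the tree into G = 2 clusters

 ## Check for clusters smaller than cpr.min (default {ncol(data)+1}/nrow(data))
 cpr.min   <- (ncol(x.denoised) + 1) / nrow(x.denoised)
 too.small <- any(table(cl) / length(cl) < cpr.min)

 ## Step 3 fallback: k-means on the denoised data with nstart.km random starts
 if (too.small) cl <- kmeans(x.denoised, centers = 2, nstart = 50)$cluster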

Value

A list with the following components:

code

An integer indicating the step at which the initial clustering has been found (see Details).

cluster

A vector of integers denoting cluster assignments for each observation. cluster=0 for observations assigned to noise/outliers.
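
For a quick check of the returned components (assuming the init1 object created in the Examples below):

 init1$code            # step (1-4) at which the initial clustering was found
 table(init1$cluster)  # cluster sizes; label 0 counts noise/outliers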

References

Fraley, C. (1998). Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing 20:270-281.

Byers, S. and A. E. Raftery (1998). Nearest-neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association 93:577-584.

Coretto, P. and C. Hennig (2015). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. To appear in the Journal of the American Statistical Association. arXiv preprint arXiv:1406.0808 (with supplement).

See Also

NNclean, hc

Examples

 ## Load the Swiss banknotes data
 data(banknote)
 x <- banknote[,-1]

 ## Initial clusters with default arguments
 init1 <- InitClust(data=x, G=2)
 print(init1)

 ## Perform otrimle
 a <- otrimle(data=x, initial=init1$cluster)
 plot(a, what="clustering", data=x)