INCAindex

`INCAindex` helps to estimate the number of clusters in a dataset.

INCAindex(d, pert_clus)

`d` |
a distance matrix or a |

`pert_clus` |
an n-vector that indicates which group each unit belongs to. Note that the expected values of |

Returns an object of class `incaix`, which is a list containing the following components:

`well_class` |
a vector indicating the number of well classified units. |

`Ni_cluster` |
a vector indicating each cluster size. |

`Total` |
percentage of objects well classified in the partition defined by |

For a correct geometrical interpretation, it is advisable to verify that the distance matrix `d` is Euclidean. Objects of class `incaix` have associated `summary` and `plot` methods: `summary` returns the percentage of well-classified units, and `plot` draws a bar chart of the percentage of well-classified units for each group in the given partition.
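
One way to check the Euclidean property mentioned above is the classical multidimensional scaling criterion: a distance matrix is Euclidean exactly when the doubly centered matrix of squared distances has no negative eigenvalues. The sketch below uses base R only; the helper name `is_euclidean` and its tolerance `tol` are hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): checks whether a distance
# matrix or dist object is Euclidean via the eigenvalues of the doubly
# centered matrix B = -1/2 * J %*% D^2 %*% J, with J = I - (1/n) * 11'.
is_euclidean <- function(d, tol = 1e-8) {
  D2 <- as.matrix(d)^2                      # squared distances
  n <- nrow(D2)
  J <- diag(n) - matrix(1 / n, n, n)        # centering matrix
  B <- -0.5 * J %*% D2 %*% J
  B <- (B + t(B)) / 2                       # remove tiny asymmetries
  ev <- eigen(B, symmetric = TRUE, only.values = TRUE)$values
  # Euclidean if no eigenvalue is negative beyond numerical noise
  min(ev) >= -tol * max(abs(ev))
}

# Distances computed by dist() from coordinates are Euclidean.
set.seed(1)
x <- matrix(rnorm(60 * 5), ncol = 5)
is_euclidean(dist(x))    # TRUE

# Three points that violate the triangle inequality (5 > 1 + 1).
bad <- as.dist(matrix(c(0, 1, 5,
                        1, 0, 1,
                        5, 1, 0), nrow = 3))
is_euclidean(bad)        # FALSE
```

The relative tolerance is an arbitrary choice for this sketch; a stricter or looser value may suit distances computed with other metrics.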

Itziar Irigoien itziar.irigoien@ehu.eus; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.

Conchita Arenas carenas@ub.edu; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.

Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. *Contributions to Science*, **2**, 183–191.

Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. *Statistics in Medicine*, **27**(15), 2948–2973.

`estW`, `INCAtest`

    # Generate 3 clusters, each with 20 objects in dimension 5.
    mu1 <- sample(1:10, 5, replace = TRUE)
    x1 <- matrix(rnorm(20 * 5, mean = mu1, sd = 1), ncol = 5, byrow = TRUE)
    mu2 <- sample(1:10, 5, replace = TRUE)
    x2 <- matrix(rnorm(20 * 5, mean = mu2, sd = 1), ncol = 5, byrow = TRUE)
    mu3 <- sample(1:10, 5, replace = TRUE)
    x3 <- matrix(rnorm(20 * 5, mean = mu3, sd = 1), ncol = 5, byrow = TRUE)
    x <- rbind(x1, x2, x3)

    # Euclidean distance between units.
    d <- dist(x)

    # Given the right partition, calculate the percentage of well-classified objects.
    partition <- c(rep(1, 20), rep(2, 20), rep(3, 20))
    INCAindex(d, partition)

    # To estimate the number of clusters in the data, try several
    # partitions and compare the results.
    library(cluster)
    inca_total <- rep(NA, 5)  # not named T, which would mask the shorthand for TRUE
    for (l in 2:5) {
      part <- pam(d, l)$clustering
      inca_total[l] <- INCAindex(d, part)$Total
    }
    plot(inca_total, type = "b", xlab = "Number of clusters", ylab = "INCA",
         xlim = c(1.5, 5.5))
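
The `summary` and `plot` methods for `incaix` objects can be tried on a small simulated partition. This is a minimal sketch: it assumes `INCAindex` is provided by the ICGE package, and it skips the calls if that package is not installed. The two-cluster data and the seed are illustrative choices only.

```r
# Simulate two well-separated clusters of 20 units each in dimension 2.
set.seed(123)
x <- rbind(matrix(rnorm(20 * 2, mean = 0), ncol = 2),
           matrix(rnorm(20 * 2, mean = 6), ncol = 2))
d <- dist(x)                        # Euclidean distances between units
partition <- rep(1:2, each = 20)    # the true grouping

# Assumption: INCAindex comes from the ICGE package.
if (requireNamespace("ICGE", quietly = TRUE)) {
  res <- ICGE::INCAindex(d, partition)
  print(summary(res))   # percentage of well-classified units
  plot(res)             # bar chart of percentages per group
}
```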

