INCAnumclu: Estimation of Number of Clusters in Data
In ICGE: Estimation of Number of Clusters and Identification of Atypical Units


INCAnumclu

R Documentation


Description

INCAnumclu helps to estimate the number of clusters in a dataset. For each candidate number of clusters, a partition is obtained and its INCA index is calculated.

Usage

INCAnumclu(d, K, method = "pam", pert, L = NULL, noise = NULL)

Arguments

`d`	a distance matrix or a `dist` object with distance information between units.
`K`	the maximum number of clusters to be considered. For each value of k (k=2,...,K) a partition with k clusters is calculated.
`method`	character string defining the clustering method used to obtain the partitions. The hierarchical agglomerative clustering methods are performed via the `hclust` function in package fastcluster. Other clustering methods are performed via functions in package cluster, such as `pam`, `diana` and `fanny`. The available clustering methods are `pam` (default method), `average` (UPGMA), `single` (single linkage), `complete` (complete linkage), `ward.D2` (Ward's method), `ward.D`, `centroid`, `median`, `diana` (hierarchical divisive) and `fanny` (fuzzy clustering). Alternatively, the user can supply custom partitions by setting `method="partition"` and specifying the partitions in argument `pert`.
`pert`	only used when `method="partition"`; a matrix in which each column contains a partition of the units. That is, each column is an n-vector indicating the group to which each unit belongs. The values in each column of `pert` are expected to be numbers greater than or equal to 1 (for instance 1, 2, 3, 4, ..., k).
`L`	default value NULL. When the user considers some units to be noise units, `L` must be specified in one of two ways: (a) as a number greater than or equal to 1, in which case all units in clusters with cardinality <= L are considered noise units; (b) as `L="custom"`, when the user wants to specify which units are considered noise units. In case (b), these units must be given in argument `noise`.
`noise`	when `L="custom"`, it is a logical vector indicating the units considered by the user as noise units.
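
As a rough illustration of option (a) for `L`, the following base R sketch flags the units that fall in clusters of size at most `L` for a single partition. It is not the package's internal code: the partition vector `part` and the threshold `L_value` are made-up values chosen for illustration. The resulting logical vector has the shape that `noise` expects when `L="custom"`.

```r
# Hypothetical partition of 10 units into 3 clusters (labels >= 1)
part <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 1)

# Hypothetical threshold: clusters with at most L_value units count as noise
L_value <- 1

# Size of the cluster each unit belongs to (one value per unit)
cluster_size <- ave(part, part, FUN = length)

# Logical vector: TRUE for units in clusters with cardinality <= L_value
noise_flag <- cluster_size <= L_value

noise_flag
which(noise_flag)
```

Here only unit 9 sits alone in its cluster, so it is the only unit flagged.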

Value

Returns an object of class incanc, which is a numeric vector containing the INCA index associated with each of the k (k=2,...,K) partitions. When `noise` is not NULL, the function returns a list with the INCA index for each partition, calculated both without and with the noise units. The associated plot displays the INCA index, both with and without noise.

Author(s)

Itziar Irigoien itziar.irigoien@ehu.eus; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.

Conchita Arenas carenas@ub.edu; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.

References

Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27(15), 2948–2973.

Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.

Examples

#------- Example 1 --------------------------------------
library(ICGE)

# generate 3 clusters, each with 20 objects in dimension 5
mu1 <- sample(1:10, 5, replace=TRUE)
x1 <- matrix(rnorm(20*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
mu2 <- sample(1:10, 5, replace=TRUE)
x2 <- matrix(rnorm(20*5, mean = mu2, sd = 1),ncol=5, byrow=TRUE)
mu3 <- sample(1:10, 5, replace=TRUE)
x3 <- matrix(rnorm(20*5, mean = mu3, sd = 1),ncol=5, byrow=TRUE)
x <- rbind(x1,x2,x3)

# calculate the Euclidean distance between the units
d <- dist(x)

# calculate the INCA index associated with partitions with k=2, ..., k=5 clusters
out <- INCAnumclu(d, K=5)
out
plot(out)

#------- Example 1 cont. --------------------------------
# With hypothetical noise elements
noiseunits <- rep(FALSE, 60)
noiseunits[sample(1:60, 20)] <- TRUE
out <- INCAnumclu(d, K=5, L="custom", noise=noiseunits)
plot(out)
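
The `pert` argument can also be exercised with partitions obtained outside the package. The sketch below is not part of the original examples and rests on several assumptions: it simulates its own small dataset (the names `xx`, `dd`, `hc` and `pert` are hypothetical), cuts a base R `hclust` tree at k = 2, ..., 5, and passes the resulting matrix with `method="partition"`. It assumes that the columns of `pert` should correspond to k = 2, ..., K in that order, and the call to `INCAnumclu` is attempted only when ICGE is installed.

```r
# Hypothetical data: two well-separated groups of 15 units in dimension 2
set.seed(1)
xx <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
            matrix(rnorm(30, mean = 6), ncol = 2))
dd <- dist(xx)

# Average-linkage tree from base R; cutree() with a vector of k values
# returns a matrix with one column (one partition) per k
hc <- hclust(dd, method = "average")
pert <- unname(cutree(hc, k = 2:5))

dim(pert)     # 30 units in rows, 4 partitions in columns
range(pert)   # labels start at 1, as `pert` expects

# Only attempted if ICGE is installed; assumes the columns of `pert`
# correspond to k = 2, ..., K in that order
if (requireNamespace("ICGE", quietly = TRUE)) {
  out_custom <- ICGE::INCAnumclu(dd, K = 5, method = "partition", pert = pert)
  print(out_custom)
}
```

Any clustering routine that yields integer labels starting at 1 could be substituted for `hclust` and `cutree` here.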

ICGE documentation built on Oct. 17, 2022, 5:10 p.m.