INCAnumclu: Estimation of Number of Clusters in Data

View source: R/INCAnumclu.R

INCAnumclu R Documentation

Estimation of Number of Clusters in Data

Description

INCAnumclu helps to estimate the number of clusters in a dataset. It calculates the INCA index for a series of partitions with different numbers of clusters.

Usage

INCAnumclu(d, K, method = "pam", pert, L = NULL, noise = NULL)

Arguments

d

a distance matrix or a dist object with distance information between units.

K

the maximum number of clusters to be considered. For each value of k (k = 2, ..., K), a partition with k clusters is calculated.

method

character string defining the clustering method used to obtain the partitions. The hierarchical agglomerative clustering methods are performed via the hclust function in package fastcluster. Other clustering methods are performed via functions in package cluster, such as pam, diana and fanny. The available clustering methods are pam (default method), average (UPGMA), single (single linkage), complete (complete linkage), ward.D2 (Ward's method), ward.D, centroid, median, diana (hierarchical divisive) and fanny (fuzzy clustering). Alternatively, the user can supply custom partitions by setting method="partition" and specifying the partitions in argument pert.
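To make the idea of "one partition per k" concrete, the following sketch builds such partitions by hand for k = 2, ..., K. It is only an illustration of what the method argument selects, not the internal code of INCAnumclu. It assumes the cluster package is installed and uses stats::hclust for the hierarchical case; the toy data and the names pam_parts and hc_parts are made up for the example.

```r
library(cluster)  # provides pam(), diana() and fanny()

set.seed(1)
# Toy data (hypothetical): two well separated groups of 20 units in dimension 2.
x <- rbind(matrix(rnorm(20 * 2, mean = 0), ncol = 2),
           matrix(rnorm(20 * 2, mean = 6), ncol = 2))
d <- dist(x)
K <- 4

# One pam partition per k = 2, ..., K; each column holds the cluster labels.
pam_parts <- sapply(2:K, function(k) pam(d, k = k, cluster.only = TRUE))

# Hierarchical counterpart (average linkage, i.e. UPGMA), cut at the same k.
hc_parts <- cutree(hclust(d, method = "average"), k = 2:K)

dim(pam_parts)  # units in rows, one column per value of k
dim(hc_parts)
```

Each column of these matrices is a label vector of the kind for which an INCA index would be computed.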

pert

only used when method="partition". It is a matrix in which each column contains a partition of the units; that is, each column is an n-vector indicating which group each unit belongs to. The values in each column are expected to be numbers greater than or equal to 1 (for instance 1, 2, 3, ..., k).
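As an illustration of the expected layout of pert, the sketch below builds a partition matrix from a clustering method that is not in the list of built-in methods (k-means from package stats). The toy data, the object ks and the column names are hypothetical and chosen only for this example.

```r
set.seed(42)
# Toy data (hypothetical): three groups of 20 units in dimension 2.
x <- rbind(matrix(rnorm(20 * 2, mean = 0),  ncol = 2),
           matrix(rnorm(20 * 2, mean = 5),  ncol = 2),
           matrix(rnorm(20 * 2, mean = 10), ncol = 2))

# One k-means partition per k; column j holds the labels 1, ..., ks[j].
ks <- 2:4
pert <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$cluster)
colnames(pert) <- paste0("k", ks)

# Basic checks on the layout described for pert:
# one row per unit, and labels greater than or equal to 1.
stopifnot(nrow(pert) == nrow(x), all(pert >= 1))
head(pert)
```

With the ICGE package loaded, such a matrix could then be passed along the lines of INCAnumclu(dist(x), K = 4, method = "partition", pert = pert). Whether K is still consulted in this mode is not stated here, so treat that call as an assumption to check against the package.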

L

default value NULL. When some units are to be treated as noise units, L must be specified in one of two ways: (a) L is a number greater than or equal to 1, and all units in clusters with cardinality <= L are considered noise units; (b) L="custom", when the user wants to specify which units are considered noise units. In this case, the units must be specified in argument noise.

noise

when L="custom", it is a logical vector indicating the units considered by the user as noise units.
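The rule in (a) can be written out by hand for a single partition, which also shows the shape of the logical vector expected in noise. This is a sketch of the rule as described above, not the package's internal implementation; the partition and the value of L are made up.

```r
# Hypothetical partition of 12 units into 4 clusters.
partition <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4)
L <- 2

sizes <- table(partition)           # cluster cardinalities: 5, 4, 2, 1
small <- names(sizes)[sizes <= L]   # clusters with at most L units: "3" and "4"

# TRUE for every unit that lies in a cluster of cardinality <= L.
noise <- as.character(partition) %in% small

which(noise)  # units 10, 11 and 12
```

A vector built this way has one logical entry per unit, which is the form described for noise when L="custom".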

Value

Returns an object of class incanc, which is a numeric vector containing the INCA index associated with each of the k (k = 2, ..., K) partitions. When noise is not NULL, the function returns a list with the INCA index for each partition, calculated both without and with the noise units. The associated plot method draws the INCA index for both cases, with and without noise.

Author(s)

Itziar Irigoien itziar.irigoien@ehu.eus; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.

Conchita Arenas carenas@ub.edu; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.

References

Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27(15), 2948–2973.

Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.

See Also

INCAindex, estW

Examples

#------- Example 1 --------------------------------------
#generate 3 clusters, each of them with 20 objects in dimension 5.
mu1 <- sample(1:10, 5, replace=TRUE)
x1 <- matrix(rnorm(20*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
mu2 <- sample(1:10, 5, replace=TRUE)
x2 <- matrix(rnorm(20*5, mean = mu2, sd = 1),ncol=5, byrow=TRUE)
mu3 <- sample(1:10, 5, replace=TRUE)
x3 <- matrix(rnorm(20*5, mean = mu3, sd = 1),ncol=5, byrow=TRUE)
x <- rbind(x1,x2,x3)

# calculate the Euclidean distance between them
d <- dist(x)

# calculate the INCA index associated to partitions with k=2, ..., k=5 clusters.
INCAnumclu(d, K=5)
out <- INCAnumclu(d, K=5)
plot(out)

#------- Example 1 cont. --------------------------------
# With hypothetical noise elements
noiseunits <- rep(FALSE, 60)
noiseunits[sample(1:60, 20)] <- TRUE
out <- INCAnumclu(d, K=5, L="custom", noise=noiseunits)
plot(out)

ICGE documentation built on Oct. 17, 2022, 5:10 p.m.
