Gap: Gap statistics


View source: R/gap.R

Description

Tibshirani's gap statistic for determining the number of clusters. It computes the within-cluster dispersion of the partition and compares it with the within-cluster dispersion of generated reference datasets that have similar statistics to the original. The within-cluster dispersion is the normalized sum, over all clusters, of the pairwise distances between observations within each cluster.
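For reference, the quantities described above can be written as in Tibshirani et al. (2001). This is the paper's formulation; the normalization used by this package is assumed, not confirmed, to match it. Here C_r is the r-th cluster, n_r its size, d_{ii'} the distance between observations i and i', and W*_{kb} the dispersion obtained on the b-th of B reference datasets.

```latex
D_r = \sum_{i, i' \in C_r} d_{ii'}, \qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r, \qquad
\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb}^{*} - \log W_k
```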

Usage

Gap(X, maxK, clusterAlg = myKmean, B = 50, null_distrib = "gaussian",
  verbose = TRUE, ...)
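A minimal usage sketch on toy data, based only on the signature above. It assumes the package installs under the name "clusterAnalysis" (taken from the repository name) and exports Gap; the call is skipped if the package is not available.

```r
# Toy data: two well-separated Gaussian blobs in 2 dimensions (100 x 2).
set.seed(42)
X <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# Assumption: the package name is "clusterAnalysis" and Gap() is exported.
if (requireNamespace("clusterAnalysis", quietly = TRUE)) {
  res <- clusterAnalysis::Gap(X, maxK = 6, B = 50,
                              null_distrib = "gaussian", verbose = FALSE)
  print(res$kopt)  # estimated number of clusters
  print(res$gap)   # gap statistic values
  print(res$s)     # their empirical standard deviations
}
```

Setting verbose = FALSE suppresses the plots, which is convenient in scripts.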

Arguments

X

data matrix or data frame of size n x d, n observations and d features

maxK

maximum number of clusters to evaluate.

clusterAlg

clustering algorithm. Its output must be a list with a component "cluster" containing the cluster assignment of each observation. For more details, see the format of the function myKmean.

B

number of reference datasets to generate

null_distrib

type of the null hypothesis. Can be "gaussian", "uniform" or "uniformity". "gaussian" draws observations from a multidimensional normal distribution with the same mean and variance as the original dataset for each feature. "uniform" draws observations uniformly within the range of each feature. "uniformity" draws observations from a uniform distribution as in the gap statistic (Tibshirani et al. 2001).
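The "gaussian" and "uniform" options can be illustrated with the base R sketch below. The helper draw_reference is hypothetical and is not the package's code. It treats features as independent, which is one reading of "for each feature", and it leaves out "uniformity" because its exact construction is not specified here.

```r
# Hypothetical helper: draw one reference dataset with the same shape as X.
draw_reference <- function(X, null_distrib = c("gaussian", "uniform")) {
  null_distrib <- match.arg(null_distrib)
  X <- as.matrix(X)
  n <- nrow(X)
  switch(null_distrib,
    # Per-feature normal draws matching each column's mean and sd.
    gaussian = apply(X, 2, function(col) rnorm(n, mean = mean(col), sd = sd(col))),
    # Per-feature uniform draws within each column's observed range.
    uniform  = apply(X, 2, function(col) runif(n, min = min(col), max = max(col)))
  )
}

set.seed(1)
X <- matrix(rnorm(200), nrow = 100, ncol = 2)
ref_gauss <- draw_reference(X, "gaussian")
ref_unif  <- draw_reference(X, "uniform")
```

Each call returns an n x d matrix, so the same clustering algorithm can be applied to the reference data and to the original data.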

verbose

logical; if TRUE, plots the progress of the algorithm

...

additional parameters for the clustering algorithm

Value

list of 3 components

kopt

optimal number of clusters

gap

vector of values for the gap statistic

s

empirical standard deviation of the gap statistic
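How kopt is derived from gap and s is not stated on this page. In Tibshirani et al. (2001), the chosen number of clusters is the smallest k such that Gap(k) >= Gap(k+1) - s(k+1). The sketch below applies that rule to the returned vectors; the helper choose_k is hypothetical, and it assumes that gap[k] and s[k] correspond to k clusters for k = 1, ..., maxK.

```r
# Hypothetical helper applying the selection rule from Tibshirani et al. (2001).
# Assumption: gap[k] and s[k] refer to a partition into k clusters.
choose_k <- function(gap, s) {
  K <- length(gap)
  if (K < 2) return(K)
  # Compare Gap(k) with Gap(k + 1) - s(k + 1) for k = 1, ..., K - 1.
  ok <- gap[-K] >= gap[-1] - s[-1]
  # Smallest k satisfying the rule; fall back to K if none does.
  if (any(ok)) which(ok)[1] else K
}

k_example <- choose_k(gap = c(0.10, 0.50, 0.45, 0.60),
                      s   = c(0.02, 0.02, 0.02, 0.02))
```

If the package uses a different rule, the value from choose_k may differ from kopt.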

References

Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B, 63:411-423.


mattmail/clusterAnalysis documentation built on Nov. 4, 2019, 6:18 p.m.