gap: The gap statistic
In clusterGenomics: Identifying clusters in genomics data by recursive partitioning


Uses the Gap statistic to determine the best number of clusters in a data set.

gap(X, Kmax=10, B=100, ref.gen="PC", cl.lab=NULL, ...)

`X`	a numeric data matrix whose rows are to be clustered using a specified clustering algorithm (default is hierarchical clustering with average linkage and Euclidean distance, see below for other options)
`Kmax`	the maximum number of clusters to be evaluated.
`B`	the number of reference data sets to be generated in the calculation of the gap statistic.
`ref.gen`	a text string specifying how the reference data should be generated. Options are "PC" (reference data are generated uniformly over a box aligned with the principal components of the data) and "range" (reference data are generated uniformly over the range of the data). See the referenced paper for more details.
`cl.lab`	optional list of length `Kmax` giving vectors of cluster labels for the rows in `X` when partitioned into `1,..,Kmax` clusters.
`...`	other optional parameters, including:
	`cl.method`: the desired clustering method. Options currently include "hclust" (default) and "kmeans".
	`linkage`: the linkage to be applied if `cl.method="hclust"`. Default is "average"; see the parameter `method` in `hclust` for other options.
	`dist.method`: the distance measure to be applied if `cl.method="hclust"`. Default is "euclidean". Other options include those supported by `dist` (under `method`), "sq.euclidean" (squared Euclidean distance) and "cor" (1 minus the correlation).
	`cor.method`: the correlation measure to be used if `dist.method="cor"`. Default is "pearson"; see the parameter `method` in `cor` for other options.
	`nstart`: the number of initial center sets to be applied if `cl.method="kmeans"`. Default is 10. See `kmeans` for details.
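
To illustrate the `cl.lab` argument, the sketch below builds such a list with base R's `hclust` and `cutree`. The simulated matrix `X.sim` and the helper name `make.cl.lab` are assumptions made for this example and are not part of the package. The call to `gap` is guarded so that the code also runs when clusterGenomics is not installed.

```r
# Simulated data: 30 rows in 3 well-separated groups (illustrative only)
set.seed(1)
X.sim <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
               matrix(rnorm(40, mean = 5), ncol = 4),
               matrix(rnorm(40, mean = 10), ncol = 4))

# Hypothetical helper: one vector of cluster labels per k = 1,..,Kmax
make.cl.lab <- function(X, Kmax, linkage = "average", dist.method = "euclidean") {
  tree <- hclust(dist(X, method = dist.method), method = linkage)
  lapply(1:Kmax, function(k) cutree(tree, k = k))
}

Kmax <- 6
cl.lab <- make.cl.lab(X.sim, Kmax)

# Pass the precomputed partitions to gap() if the package is available
if (requireNamespace("clusterGenomics", quietly = TRUE)) {
  res.sim <- clusterGenomics::gap(X.sim, Kmax = Kmax, B = 10, cl.lab = cl.lab)
  print(res.sim$hatK)
}
```

Each element of `cl.lab` has one label per row of `X.sim`, and element `k` describes the partition into `k` clusters, which is the shape the argument description asks for.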

The rows in X are partitioned into k = 1,..,Kmax clusters, and the Gap statistic is calculated for each partition. The best partition, and hence the best number of clusters, is selected using the Gap criterion (see the reference below).
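
The selection rule in the referenced paper picks the smallest k for which Gap(k) >= Gap(k+1) - s(k+1). A minimal sketch of that rule follows, assuming `gap` and `sk` vectors of length `Kmax`. The function name `select.k`, the toy values, and the fallback to `Kmax` when no k satisfies the rule are assumptions for this illustration and not necessarily what the package does internally.

```r
# Hypothetical helper applying the Gap criterion of Tibshirani et al. (2001)
select.k <- function(gap, sk) {
  Kmax <- length(gap)
  if (Kmax < 2) return(Kmax)
  k <- 1:(Kmax - 1)
  # TRUE where Gap(k) >= Gap(k+1) - s(k+1)
  ok <- gap[k] >= gap[k + 1] - sk[k + 1]
  # Smallest k satisfying the rule; fall back to Kmax if none does (assumption)
  if (any(ok)) min(k[ok]) else Kmax
}

# Toy values (made up for illustration): the curve levels off at k = 3
gap.toy <- c(0.10, 0.35, 0.80, 0.82, 0.81)
sk.toy  <- c(0.05, 0.05, 0.05, 0.05, 0.05)
hatK.toy <- select.k(gap.toy, sk.toy)
hatK.toy
```

With these toy values the rule first holds at k = 3, since 0.80 >= 0.82 - 0.05, even though the raw Gap value is marginally higher at k = 4. The standard-error term is what favours the smaller number of clusters.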

`hatK`	the best number of clusters according to the Gap criterion.
`lab.hatK`	a vector with one entry per row of `X`, assigning a group label to each case (row) based on the best partition as evaluated by Gap.
`gap`	a vector of length `Kmax` giving the Gap statistic for each evaluated partition.
`sk`	a vector of length `Kmax` giving the standard errors of the Gap statistics.
`W`	a vector of length `Kmax` giving the total within-cluster dispersion for each evaluated partition.
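
For orientation on what `W` measures, the referenced paper defines the within-cluster dispersion of a partition as the sum over clusters of D_r / (2 n_r), where D_r is the sum of all pairwise distances within cluster r and n_r is its size. The sketch below computes that quantity from a distance matrix and a label vector. The function name `within.dispersion` and the data `X.toy` are assumptions for this example, and the package's own computation may differ in detail.

```r
# Hypothetical helper: W = sum over clusters r of D_r / (2 * n_r),
# where D_r sums the distances over all ordered pairs within cluster r
within.dispersion <- function(D, labels) {
  D <- as.matrix(D)
  sum(sapply(unique(labels), function(r) {
    idx <- which(labels == r)
    sum(D[idx, idx, drop = FALSE]) / (2 * length(idx))
  }))
}

# Toy data (illustrative only), clustered with average linkage
set.seed(2)
X.toy <- matrix(rnorm(60), ncol = 3)
tree <- hclust(dist(X.toy), method = "average")

# Dispersion for the partitions into k = 1,..,5 clusters
W.toy <- sapply(1:5, function(k) within.dispersion(dist(X.toy), cutree(tree, k)))
W.toy
```

When the distances are squared Euclidean, this quantity equals the pooled within-cluster sum of squares around the cluster means.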

Gro Nilsen

Tibshirani, R., Walther, G. and Hastie, T. (2001), "Estimating the number of clusters in a data set via the gap statistic", Journal of the Royal Statistical Society: Series B, 63(2), 411-423.

# Load the package and a simulated data set with 5 clusters
library(clusterGenomics)
data(exData1)
X <- exData1$X
groups <- exData1$groups

# Run gap (limit the number of reference data sets to decrease computing time)
res <- gap(X, B = 10)

# Compare predicted groups to true groups
cbind(res$lab.hatK, groups)

# Plot the total within-cluster dispersion and the gap curve +/- standard errors
par(mfrow = c(2, 1))
k <- seq_along(res$gap)
plot(k, res$W, type = "b")
plot(k, res$gap, type = "b", pch = 19,
     ylim = c(min(res$gap - res$sk), max(res$gap + res$sk)))
points(k, res$gap + res$sk, cex = 0.7, pch = 8)
points(k, res$gap - res$sk, cex = 0.7, pch = 8)
segments(x0 = k, y0 = res$gap - res$sk, y1 = res$gap + res$sk)