gap: The gap statistic

Description

Uses the Gap statistic to determine the best number of clusters in a data set.

Usage

gap(X,Kmax=10,B=100,ref.gen="PC",cl.lab=NULL,...)

Arguments

X

a numeric data matrix whose rows are to be clustered using a specified clustering algorithm (default is hierarchical clustering with average linkage and Euclidean distance, see below for other options)

Kmax

the maximum number of clusters to be evaluated.

B

the number of reference data sets to be generated in the calculation of the gap statistic.

ref.gen

a text string specifying how the reference data should be generated. Options are "PC" (reference data are generated uniformly over a box aligned with the principal components of the data) and "range" (reference data are generated uniformly over the range of the data). See the referenced paper for more details.

cl.lab

optional list of length Kmax giving vectors of cluster labels for the rows in X when partitioned into 1,..,Kmax clusters.

...

other optional parameters including:

cl.method:

the desired clustering method. Options currently include "hclust" (default) and "kmeans".

linkage:

the desired linkage to be applied if cl.method="hclust". Default is "average", see the parameter method in hclust for other options.

dist.method:

the desired distance measure to be applied if cl.method="hclust". Default is "euclidean". Other options include those supported by dist (under method), "sq.euclidean" (squared Euclidean distance) and "cor" (one minus the correlation).

cor.method:

the correlation measure to be used if dist.method="cor". Default is "pearson", see the parameter method in cor for other options.

nstart:

the number of initial center sets to be applied if cl.method="kmeans". Default is 10. See kmeans for details on this.
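
To make the two ref.gen options more concrete, the sketch below generates one reference data set in base R, following the description in Tibshirani et al. The helper name gen_reference is hypothetical and is not part of clusterGenomics; centring the data before the rotation and adding the column means back afterwards are assumptions of this sketch, and the package may differ in such details.

```r
# Sketch only: gen_reference() is a hypothetical helper, not a package function.
gen_reference <- function(X, ref.gen = c("PC", "range")) {
  ref.gen <- match.arg(ref.gen)
  X <- as.matrix(X)

  # Draw each column uniformly between that column's minimum and maximum
  unif_box <- function(M) {
    apply(M, 2, function(col) runif(nrow(M), min = min(col), max = max(col)))
  }

  if (ref.gen == "range") {
    return(unif_box(X))
  }

  # "PC": centre the data (assumption), rotate onto the principal axes,
  # sample uniformly over the box in that space, then rotate back
  ctr <- colMeans(X)
  Xc <- sweep(X, 2, ctr)
  V <- svd(Xc)$v
  Z <- unif_box(Xc %*% V)
  sweep(Z %*% t(V), 2, ctr, "+")
}

set.seed(1)
X0 <- matrix(rnorm(40), nrow = 20, ncol = 2)
ref_range <- gen_reference(X0, "range")
ref_pc <- gen_reference(X0, "PC")
```

Both calls return a matrix with the same dimensions as the input. With "range" every column stays within the observed range of the corresponding input column, whereas with "PC" the box is aligned with the principal axes, so the sampled points can fall slightly outside the per-column ranges.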

Details

The rows in X are partitioned into k = 1,..,Kmax clusters, and the Gap statistic is calculated for each partition. The best partition, and hence the best number of clusters, is selected using the Gap criterion (see the reference below).
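
As an illustration of the selection step, the Gap criterion in Tibshirani et al. picks the smallest k for which Gap(k) >= Gap(k+1) - s(k+1). A minimal base R sketch of that rule is given here. The helper name select_k is hypothetical, and returning Kmax when no k satisfies the inequality is an assumption of this sketch; how gap handles that case is not stated on this page.

```r
# Sketch only: select_k() is a hypothetical helper, not a package function.
# gap: vector of Gap statistics for k = 1..Kmax
# sk:  vector of the corresponding standard errors
select_k <- function(gap, sk) {
  Kmax <- length(gap)
  k <- seq_len(Kmax - 1)
  # TRUE where Gap(k) >= Gap(k+1) - s(k+1)
  ok <- gap[k] >= gap[k + 1] - sk[k + 1]
  # Smallest qualifying k; fall back to Kmax if none qualifies (assumption)
  if (any(ok)) which(ok)[1] else Kmax
}

gap_example <- c(0.1, 0.5, 0.9, 0.85, 0.8)
sk_example <- rep(0.02, 5)
select_k(gap_example, sk_example)
```

In this made-up example the rule first holds at k = 3, because 0.9 >= 0.85 - 0.02, while it fails at k = 1 and k = 2.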

Value

hatK

the best number of clusters according to the Gap criterion.

lab.hatK

a vector with one entry per row of X, assigning a group label to each case (row) in X based on the best partition as evaluated by Gap.

gap

a vector of length Kmax giving the Gap statistic for each evaluated partition.

sk

a vector of length Kmax giving the standard errors of the Gap statistics.

W

a vector of length Kmax giving the total within-cluster dispersion for each evaluated partition.
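
The quantities gap, sk and W can be related to each other with a short base R sketch based on the definitions in Tibshirani et al.: W is the sum over clusters of the pairwise distances within the cluster divided by twice the cluster size, the Gap statistic is the mean log reference dispersion minus the log observed dispersion, and the standard error is the standard deviation of the log reference dispersions scaled by sqrt(1 + 1/B). The helper names within_dispersion and gap_from_W are hypothetical, and the use of plain Euclidean distance is an assumption; the package may compute these values differently.

```r
# Sketch only: both helpers are hypothetical, not package functions.

# Total within-cluster dispersion for one partition (Euclidean distance assumed)
within_dispersion <- function(X, labels) {
  d <- as.matrix(dist(X))
  per_cluster <- sapply(split(seq_len(nrow(X)), labels), function(idx) {
    # d[idx, idx] counts every pair twice, hence the division by 2 * n_r
    sum(d[idx, idx]) / (2 * length(idx))
  })
  sum(per_cluster)
}

# Gap statistic and standard error from observed and reference dispersions
# W:     vector of length Kmax (observed data)
# W_ref: B x Kmax matrix (one row per reference data set)
gap_from_W <- function(W, W_ref) {
  B <- nrow(W_ref)
  log_ref <- log(W_ref)
  l_bar <- colMeans(log_ref)
  # Standard deviation with denominator B, as in the paper
  sd_k <- sqrt(colMeans(sweep(log_ref, 2, l_bar)^2))
  list(gap = l_bar - log(W), sk = sd_k * sqrt(1 + 1 / B))
}

# Three points, the first two in one cluster and the third on its own
pts <- rbind(c(0, 0), c(0, 2), c(10, 0))
within_dispersion(pts, c(1, 1, 2))
```

For the three example points the first cluster contributes 2 * 2 / (2 * 2) = 1 and the singleton cluster contributes 0, so the total dispersion is 1.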

Author(s)

Gro Nilsen

References

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411-423.

Examples

#Load the package and a simulated data set with 5 clusters
library(clusterGenomics)
data(exData1)
X = exData1$X
groups = exData1$groups

#Run gap (limit the number of reference data sets to decrease computing time):
res <- gap(X, B=10)

#Compare predicted groups to true groups:
cbind(res$lab.hatK, groups)

#Plot the total within-cluster dispersion and the gap-curve +/- standard errors:
par(mfrow=c(2,1))
plot(1:length(res$W), res$W, type="b")
plot(1:length(res$gap), res$gap, type="b", pch=19,
     ylim=c(min(res$gap-res$sk), max(res$gap+res$sk)))
points(1:length(res$sk), res$gap+res$sk, cex=0.7, pch=8)
points(1:length(res$sk), res$gap-res$sk, cex=0.7, pch=8)
segments(x0=1:length(res$sk), y0=res$gap-res$sk, y1=res$gap+res$sk)

clusterGenomics documentation built on May 2, 2019, 7:04 a.m.