Use the Gap statistic to determine the best number of clusters in a data set
X | a numeric data matrix whose rows are to be clustered using a specified clustering algorithm (default is hierarchical clustering with average linkage and Euclidean distance; see below for other options)
Kmax | the maximum number of clusters to be evaluated.
B | the number of reference data sets to be generated in the calculation of the gap statistic.
ref.gen | a text string specifying how the reference data should be generated. Options are "PC" (reference data are generated uniformly over a box aligned with the principal components of the data) and "range" (reference data are generated uniformly over the range of the data). See the referenced paper for more details.
cl.lab | optional list of length
... | other optional parameters including:
The rows in X are partitioned into k = 1, ..., Kmax clusters, and the Gap statistic is calculated for each partition. The best partition, and hence the best number of clusters, is selected using the Gap criterion (see the reference below).
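For orientation, the Gap criterion can be summarized as below. This sketch follows the definitions in Tibshirani et al. (2001) rather than this package's source code; the symbols $W^{*}_{kb}$, $\bar{l}_k$ and $\mathrm{sd}_k$ are notation from the paper, and the assumption that the returned gap and sk correspond to $\mathrm{Gap}(k)$ and $s_k$ is inferred from their names only.

```latex
% W_k: total within-cluster dispersion of the data partitioned into k clusters
% W^*_{kb}: the same quantity for the b-th reference data set, b = 1, ..., B
\begin{align*}
\mathrm{Gap}(k) &= \frac{1}{B}\sum_{b=1}^{B} \log W^{*}_{kb} \;-\; \log W_k, \\
\bar{l}_k &= \frac{1}{B}\sum_{b=1}^{B} \log W^{*}_{kb}, \qquad
\mathrm{sd}_k = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\bigl(\log W^{*}_{kb} - \bar{l}_k\bigr)^{2}}, \\
s_k &= \mathrm{sd}_k\,\sqrt{1 + \frac{1}{B}}, \\
\hat{k} &= \min\bigl\{\,k : \mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}\,\bigr\}.
\end{align*}
```

Under this rule the selected number of clusters is the smallest k whose gap value is within one standard error of the gap at k + 1, which is why a small B gives a noisier choice.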
hatK | the best number of clusters according to the Gap criterion.
lab.hatK | a vector of same length as the number of rows in
gap | a vector of length
sk | a vector of length
W | a vector of length
Gro Nilsen
Tibshirani et al., "Estimating the number of clusters in a data set via the gap statistic", Journal of the Royal Statistical Society B, 63, 2001
#Load a simulated data set with 5 clusters
data(exData1)
X = exData1$X
groups = exData1$groups
#Run gap (limit the number of reference data sets to decrease computing time):
res <- gap(X, B=10)
#Compare predicted groups to true groups:
cbind(res$lab.hatK, groups)
#Plot the total within-cluster dispersion and the gap-curve +/- standard errors:
par(mfrow=c(2,1))
plot(1:length(res$W), res$W, type="b")
plot(1:length(res$gap), res$gap, type="b", ylim=c(min(res$gap-res$sk),
max(res$gap+res$sk)), pch=19)
points(1:length(res$sk), res$gap+res$sk, cex=0.7, pch=8)
points(1:length(res$sk), res$gap-res$sk, cex=0.7, pch=8)
segments(x0=1:length(res$sk), y0=res$gap-res$sk, y1=res$gap+res$sk)