clusGap: Gap Statistic for Estimating the Number of Clusters
In pimentel/cluster: Cluster Analysis Extended Rousseeuw et al.

clusGap() calculates a goodness of clustering measure, the “gap” statistic. For each number of clusters k, it compares \log(W(k)) with E^*[\log(W(k))] where the latter is defined via bootstrapping, i.e. simulating from a reference distribution.
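Written out, the comparison above amounts to the difference below. The second relation shows how the expectation is typically estimated by averaging over the B simulated reference data sets; the symbol W^*_b (W computed on the b-th reference sample) is introduced here for illustration only.

```latex
\mathrm{Gap}(k) = E^*[\log(W(k))] - \log(W(k)),
\qquad
E^*[\log(W(k))] \approx \frac{1}{B} \sum_{b=1}^{B} \log(W^*_b(k))
```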

maxSE(f, SE.f) determines the location of the maximum of f, taking a “1-SE rule” into account for the *SE* methods. The default method "firstSEmax" looks for the smallest k such that its value f(k) is not more than 1 standard error away from the first local maximum. This is similar to, but not the same as, "Tibs2001SEmax", Tibshirani et al.'s recommendation for determining the number of clusters from the gap statistics and their standard deviations.
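The "firstSEmax" rule can be illustrated with a few lines of base R. This is a minimal sketch written from the verbal description above, not the source of maxSE(): the helper name `first_se_max_sketch` is hypothetical, and it assumes that the standard error used for the band is the one at the first local maximum.

```r
# Hypothetical helper illustrating the "firstSEmax" idea; NOT cluster::maxSE().
first_se_max_sketch <- function(f, SE.f, SE.factor = 1) {
  stopifnot(length(f) >= 1, length(f) == length(SE.f),
            all(SE.f >= 0), SE.factor >= 0)
  K <- length(f)
  # First local maximum: the first k with f(k+1) <= f(k), or K if f never decreases.
  decr <- diff(f) <= 0
  k_max <- if (any(decr)) which(decr)[1] else K
  # Assumption: the band uses the standard error at k_max.
  threshold <- f[k_max] - SE.factor * SE.f[k_max]
  # Smallest k up to k_max whose value lies within the band.
  which(f[seq_len(k_max)] >= threshold)[1]
}

fk <- c(2, 3, 5, 4, 7, 8, 5, 4)
sk <- c(1, 1, 2, 1, 1, 3, 1, 1) / 2
first_se_max_sketch(fk, sk)                 # band is [4, 5]: only k = 3 qualifies
first_se_max_sketch(fk, sk, SE.factor = 2)  # band is [3, 5]: k = 2 already qualifies
```

A larger `SE.factor` widens the band and therefore can only move the chosen k to the left, never to the right.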

 clusGap(x, FUNcluster, K.max, B = 100, verbose = interactive(),
    do_parallel = FALSE, ...)

maxSE(f, SE.f,
      method = c("firstSEmax", "Tibs2001SEmax", "globalSEmax",
                 "firstmax", "globalmax"),
      SE.factor = 1)
## S3 method for class 'clusGap'
print(x, method = "firstSEmax", SE.factor = 1, ...)

`x`	numeric matrix or `data.frame`.
`FUNcluster`	a `function` which accepts as first argument a (data) matrix like `x`, second argument, say k, k >= 2, the number of clusters desired, and returns a `list` with a component named (or shortened to) `cluster` which is a vector of length `n = nrow(x)` of integers in `1:k` determining the clustering or grouping of the `n` observations.
`K.max`	the maximum number of clusters to consider, must be at least two.
`B`	integer, number of Monte Carlo (“bootstrap”) samples.
`verbose`	integer or logical, determining if “progress” output should be printed. The default prints one bit per bootstrap sample.
`do_parallel`	logical. If `TRUE`, use `parallel::mclapply`; if `FALSE`, use `lapply`. To set the number of cores, call `options(mc.cores = n_cores)`, where `n_cores` is the number of cores.
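Any clustering routine can be adapted to the `FUNcluster` contract described above by wrapping it in a function of `(x, k)` that returns a list with a `cluster` component. The sketch below does this for hierarchical clustering from base R; the wrapper name `hclust_cut` and the demo data are illustrative assumptions, and the final call is commented out because it needs this fork of the package.

```r
# Hypothetical wrapper: average-linkage hierarchical clustering cut into k groups.
hclust_cut <- function(x, k) {
  tree <- hclust(dist(x), method = "average")
  # cutree() returns an integer vector of length nrow(x) with values in 1:k
  list(cluster = cutree(tree, k = k))
}

set.seed(1)
x_demo <- rbind(matrix(rnorm(40, sd = 0.2), ncol = 2),
                matrix(rnorm(40, mean = 2, sd = 0.2), ncol = 2))
res <- hclust_cut(x_demo, 2)
str(res)

# With this fork attached, the wrapper could be passed on as follows
# (not run here; mc.cores only matters when do_parallel = TRUE):
# options(mc.cores = 2)
# gs <- clusGap(x_demo, FUNcluster = hclust_cut, K.max = 4, B = 50,
#               do_parallel = TRUE)
```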

The main result <res>$Tab[,"gap"] of course is from bootstrapping aka Monte Carlo simulation and hence random, or equivalently, depending on the initial random seed (see set.seed()). On the other hand, in our experience, using B = 500 gives quite precise results such that the gap plot is basically unchanged after another run.
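The dependence on the random seed can be checked directly. The following is a small sketch, assuming the cluster package is installed and that the bootstrap runs sequentially (the default `do_parallel = FALSE` in this fork); the helper `run_gap` and the demo data are made up for illustration, and B is kept small only for speed.

```r
library(cluster)  # assumed to be installed; provides clusGap() and pam()

pam1 <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))

set.seed(42)
x_demo <- rbind(matrix(rnorm(60, sd = 0.1), ncol = 2),
                matrix(rnorm(60, mean = 1, sd = 0.1), ncol = 2))

# Hypothetical helper: fix the seed, then return only the gap column.
run_gap <- function(seed) {
  set.seed(seed)
  clusGap(x_demo, FUNcluster = pam1, K.max = 4, B = 20,
          verbose = FALSE)$Tab[, "gap"]
}

gap_a <- run_gap(123)
gap_b <- run_gap(123)  # same seed, hence the same reference samples
gap_c <- run_gap(456)  # different seed, different reference samples

identical(gap_a, gap_b)  # the two runs with equal seeds agree exactly
max(abs(gap_a - gap_c))  # size of the Monte Carlo variation at B = 20
```

Increasing B should shrink the difference between runs with different seeds, at the cost of proportionally more computing time.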

an object of S3 class "clusGap", basically a list with components

`Tab`	a matrix with `K.max` rows and 4 columns, named "logW", "E.logW", "gap", and "SE.sim", where `gap = E.logW - logW`, and `SE.sim` corresponds to the standard error of `gap`, `SE.sim[k] = s[k]`, where `s[k] := sqrt(1 + 1/B) sd^*(gap[])`, and `sd^*()` is the standard deviation of the simulated (“bootstrapped”) gap values.
`n`	number of observations, i.e., `nrow(x)`.
`B`	input `B`
`FUNcluster`	input function `FUNcluster`
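The components listed above can be inspected like those of any list. The sketch below assumes the cluster package is installed; the demo data and the small B are illustrative choices. It checks the documented relation between the columns of `Tab` and passes the gap values and their standard errors to maxSE() by hand.

```r
library(cluster)  # assumed to be installed

pam1 <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))

set.seed(7)
x_demo <- rbind(matrix(rnorm(60, sd = 0.1), ncol = 2),
                matrix(rnorm(60, mean = 1, sd = 0.1), ncol = 2))
gs <- clusGap(x_demo, FUNcluster = pam1, K.max = 5, B = 25, verbose = FALSE)

tab <- gs$Tab
colnames(tab)  # "logW" "E.logW" "gap" "SE.sim"

# The gap column is the difference of the first two columns.
all.equal(unname(tab[, "gap"]), unname(tab[, "E.logW"] - tab[, "logW"]))

# Estimated number of clusters under the default rule of maxSE().
k_hat <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "firstSEmax")
k_hat

c(n = gs$n, B = gs$B)  # number of observations and the input B
```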

This function is originally based on the functions gap of (Bioconductor) package SAGx by Per Broberg, gapStat() from former package SLmisc by Matthias Kohl and ideas from gap() and its methods of package lga by Justin Harrington.

The current implementation is by Martin Maechler.

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.

Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.

Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. http://home.swipnet.se/pibroberg/expression_hemsida1.html

silhouette for a much simpler, less sophisticated goodness of clustering measure.

cluster.stats() in package fpc for alternative measures.

### --- maxSE() methods -------------------------------------------
(mets <- eval(formals(maxSE)$method))
fk <- c(2,3,5,4,7,8,5,4)
sk <- c(1,1,2,1,1,3,1,1)/2
## use plot.clusGap():
plot(structure(class="clusGap", list(Tab = cbind(gap=fk, SE.sim=sk))))
## Note that 'firstmax' and 'globalmax' are always at 3 and 6 :
sapply(c(1/4, 1,2,4), function(SEf)
        sapply(mets, function(M) maxSE(fk, sk, method = M, SE.factor = SEf)))

### --- clusGap() -------------------------------------------------
## ridiculously nicely separated clusters in 3 D :
x <- rbind(matrix(rnorm(150,           sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3),
           matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3))

## Slightly faster way to use pam (see below)
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))

doExtras <- cluster:::doExtras()
## or set it explicitly to TRUE for the following
if(doExtras) {
## Note we use  B = 60 in the following examples to keep them "speedy".
## ---- for your own analysis rather keep the default B = 100 or go higher (e.g. B = 500)!

## note we can  pass 'nstart = 20' to kmeans() :
gskmn <- clusGap(x, FUN = kmeans, nstart = 20, K.max = 8, B = 60)
gskmn #-> its print() method
plot(gskmn, main = "clusGap(., FUN = kmeans, nstart = 20, B = 60)")
set.seed(12); system.time(
  gsPam0 <- clusGap(x, FUN = pam, K.max = 8, B = 60)
)
set.seed(12); system.time(
  gsPam1 <- clusGap(x, FUN = pam1, K.max = 8, B = 60)
)
## and show that it gives the same:
stopifnot(identical(gsPam1[-4], gsPam0[-4]))
gsPam1
print(gsPam1, method="globalSEmax")
print(gsPam1, method="globalmax")
}

gs.pam.RU <- clusGap(ruspini, FUN = pam1, K.max = 8, B = 60)
gs.pam.RU
plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data")
mtext("k = 4 is best .. and  k = 5  pretty close")

## This takes a minute..
## No clustering ==> k = 1 ("one cluster") should be optimal:
Z <- matrix(rnorm(256*3), 256,3)
gsP.Z <- clusGap(Z, FUN = pam1, K.max = 8, B = 200)
plot(gsP.Z)
gsP.Z