
Performs the gap analysis using lga to estimate the number of clusters.


| Argument | Description |
|---|---|
| `x` | a numeric matrix. |
| `K` | an integer giving the maximum number of clusters to consider. |
| `B` | an integer giving the number of bootstraps. |
| `criteria` | a character string indicating which criterion to use when evaluating the gap data. One of `"tibshirani"` (default), `"DandF"` or `"none"`. Can be abbreviated. |
| `nnode` | an integer giving the number of CPUs to use for parallel processing. Defaults to `NULL`, i.e. no parallel processing. |
| `scale` | logical. Should the data be scaled? |
| `...` | any other arguments passed from the generic function. |

This code performs the gap analysis using lga. The gap statistic is
defined as the difference between the log of the Residual Orthogonal
Sum of Squared Distances (denoted *log(W_k)*) and its expected
value derived using bootstrapping under the null hypothesis that there
is only one cluster. In this implementation, the reference
distribution used for the bootstrapping is a random uniform hypercube,
transformed by the principal components of the underlying data set.
For further details see Tibshirani et al. (2001).
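
As a rough illustration of the reference distribution described above, the following base-R sketch draws one reference data set: it rotates the centred data onto its principal axes, samples uniformly over the range of each rotated column, and rotates the sample back. The helper name `ref_sample` is hypothetical and is not part of lga; the package's own implementation may differ in detail, and the sketch assumes the data have at least as many rows as columns.

```r
# Hypothetical helper (not part of lga): draw one reference data set
# from a uniform box aligned with the principal axes of x.
ref_sample <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  centre <- colMeans(x)
  xc <- sweep(x, 2, centre)        # centre each column
  v <- svd(xc)$v                   # principal axes (right singular vectors)
  xr <- xc %*% v                   # data expressed in the principal-axis frame
  rng <- apply(xr, 2, range)       # 2 x p matrix: row 1 = min, row 2 = max
  # Uniform draws, column j spanning [rng[1, j], rng[2, j]]
  z <- matrix(runif(n * p,
                    min = rep(rng[1, ], each = n),
                    max = rep(rng[2, ], each = n)),
              nrow = n, ncol = p)
  sweep(z %*% t(v), 2, centre, "+")  # rotate back and restore the centre
}

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
z <- ref_sample(x)
dim(z)
```

In a gap analysis, a sample like `z` would be clustered in the same way as the observed data, once per bootstrap replicate.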

For different criteria, different rules apply. With `"tibshirani"` (ibid.) we calculate the gap statistic for *k = 1, …, K*, stopping when

*gap(k) >= gap(k+1) - s_(k+1)*

where *s_(k+1)* is a function of the standard deviation of the bootstrapped estimates.
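
The stopping rule can be sketched in a few lines of R. The helper `first_k_tibshirani` and its inputs `gap` and `s` (numeric vectors indexed by *k*) are hypothetical names used only for illustration; they are not part of the lga API, and the values below are made up.

```r
# Hypothetical helper: return the first k with gap(k) >= gap(k+1) - s(k+1),
# or NA if no k in 1, ..., K-1 satisfies the rule.
first_k_tibshirani <- function(gap, s) {
  stopifnot(length(gap) == length(s), length(gap) >= 1)
  K <- length(gap)
  for (k in seq_len(K - 1)) {
    if (gap[k] >= gap[k + 1] - s[k + 1]) {
      return(k)
    }
  }
  NA_integer_
}

# Made-up gap values and standard-error terms for k = 1, ..., 4
gap_vals <- c(0.10, 0.50, 0.45, 0.60)
s_vals   <- c(0.02, 0.02, 0.02, 0.02)
first_k_tibshirani(gap_vals, s_vals)
```

Because the search stops at the first qualifying *k*, later and larger gap values (such as the one at *k = 4* here) are never examined.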

With the `"DandF"` criterion from Dudoit and Fridlyand (2002), we calculate the gap statistic for all values of *k = 1, …, K*, selecting the number of clusters as

*khat = smallest k >= 1 such that gap(k) >= gap(kstar) - s_(kstar)*

where *kstar = argmax_(k >= 1) gap(k)*.
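
A minimal R sketch of this selection rule follows. The helper `k_dandf` and the vectors `gap` and `s` are hypothetical names introduced for illustration, not functions or objects from lga, and the numbers are made up.

```r
# Hypothetical helper: smallest k whose gap is within s(kstar) of the
# maximum gap, where kstar is the k that maximises gap(k).
k_dandf <- function(gap, s) {
  stopifnot(length(gap) == length(s), length(gap) >= 1)
  kstar <- which.max(gap)                 # first index of the largest gap
  which(gap >= gap[kstar] - s[kstar])[1]  # smallest qualifying k
}

# Made-up gap values and standard-error terms for k = 1, ..., 4
gap_vals <- c(0.10, 0.50, 0.45, 0.60)
s_vals   <- c(0.20, 0.20, 0.20, 0.20)
k_dandf(gap_vals, s_vals)
```

Unlike the sequential stopping rule, this one needs the gap statistic for every *k* up to *K* before it can choose, since *kstar* depends on all of the values.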

Finally, with the `"none"` criterion, no rule is applied and only the gap data are returned.

As lga is ostensibly unsupervised in this case, the parameter niter is set to 20 to ensure convergence.

This function is parallel computing aware via the `nnode` argument, and works with the package `snow`. In order to use parallel computing, one of MPI (e.g. lamboot) or PVM is necessary. For further details, see the documentation for `snow`.

An object of class `"gap"` with components

| Component | Description |
|---|---|
| `finished` | a logical. For the `"tibshirani"` criterion, indicates whether a solution was found. |
| `nclust` | an integer giving the estimated number of clusters. Returns `NA` if nothing conclusive is found. |
| `data` | the original data set, scaled if specified in the arguments. |
| `criteria` | the criterion used. |

Justin Harrington harringt@stat.ubc.ca

Tibshirani, R. and Walther, G. and Hastie, T. (2001)
‘Estimating the number of clusters in a data set via the gap
statistic’, *J. R. Statist. Soc. B* **63**, 411–423.

Dudoit, S. and Fridlyand, J. (2002) ‘A prediction-based
resampling method for estimating the number of clusters in a
dataset’, *Genome Biology* **3**.

Van Aelst, S. and Wang, X. and Zamar, R. and Zhu, R. (2006)
‘Linear Grouping Using Orthogonal Regression’,
*Computational Statistics & Data Analysis* **50**,
1287–1312.

```
## Synthetic example
## Make a dataset with 2 clusters in 2 dimensions
library(MASS)  # provides mvrnorm()
library(lga)   # provides gap() and lga()
set.seed(1234)
X <- rbind(mvrnorm(n=100, mu=c(1, -2), Sigma=diag(0.1, 2) + 0.9),
           mvrnorm(n=100, mu=c(1, 1), Sigma=diag(0.1, 2) + 0.9))
gap(X, K=4, B=20)
## to run this using parallel processing with 4 nodes, the equivalent
## code would be
## Not run: gap(X, K=4, B=20, nnode=4)
## Quakes data (from package:datasets)
## Including the first two dimensions versus three dimensions
## yields different results
set.seed(1234)
## Not run:
gap(quakes[,1:2], K=4, B=20)
gap(quakes[,1:3], K=4, B=20)
## End(Not run)
library(maps)
lgaout1 <- lga(quakes[,1:2], k=3)
plot(lgaout1)
lgaout2 <- lga(quakes[,1:3], k=2)
plot(lgaout2)
## Let's put this in context
par(mfrow=c(1,2))
map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout1$cluster, col=lgaout1$cluster)
map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout2$cluster, col=lgaout2$cluster)
```
