Performs the gap analysis using lga to estimate the number of clusters.
x: a numeric matrix.
K: an integer giving the maximum number of clusters to consider.
B: an integer giving the number of bootstraps.
criteria: a character string indicating which criterion to use when evaluating the gap data. One of “tibshirani” (default), “DandF” or “none”. Can be abbreviated.
nnode: an integer giving the number of CPUs to use for parallel processing. Defaults to NULL, i.e. no parallel processing.
scale: logical. Should the data be scaled?
...: any other arguments passed from the generic function.
This code performs the gap analysis using lga. The gap statistic is defined as the difference between the log of the Residual Orthogonal Sum of Squared Distances (denoted log(W_k)) and its expected value derived using bootstrapping under the null hypothesis that there is only one cluster. In this implementation, the reference distribution used for the bootstrapping is a random uniform hypercube, transformed by the principal components of the underlying data set. For further details see Tibshirani et al (2001).
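As a rough illustration of the reference distribution described above, the following base R sketch draws one reference data set from a uniform box aligned with the principal components of the data, in the spirit of Tibshirani et al (2001). The helper name ref_sample is hypothetical and is not part of lga; the package's own implementation may differ in its details.

```r
## Hypothetical helper (not part of lga): draw one reference data set
## from a uniform box aligned with the principal components of x.
ref_sample <- function(x) {
  x <- as.matrix(x)
  centre <- colMeans(x)
  xc <- sweep(x, 2, centre)          # centre the columns
  v <- svd(xc)$v                     # principal directions of the data
  scores <- xc %*% v                 # data expressed in PC coordinates
  rng <- apply(scores, 2, range)     # 2 x p matrix: min and max per PC
  ## uniform draws over the range of each principal component
  z <- sapply(seq_len(ncol(scores)), function(j)
    runif(nrow(scores), min = rng[1, j], max = rng[2, j]))
  ## rotate back to the original axes and restore the centre
  sweep(z %*% t(v), 2, centre, "+")
}

set.seed(1)
x <- cbind(rnorm(50), rnorm(50, sd = 3))
z <- ref_sample(x)
dim(z)
```

Repeating such a draw B times, and computing log(W_k) on each reference data set, would give the bootstrapped values from which the expected value and its standard deviation are estimated.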
For different criteria, different rules apply. With “tibshirani” (ibid) we calculate the gap statistic for k = 1, …, K, stopping when
gap(k) >= gap(k+1) - s_(k+1)
where s_(k+1) is a function of the standard deviation of the bootstrapped estimates.
With the “DandF” criterion from Dudoit et al (2002), we calculate the gap statistic for all values of k = 1, …, K, selecting the number of clusters as
khat = smallest k >= 1 such that gap(k) >= gap(kstar) - s_(kstar)
where kstar = argmax_(k >= 1) gap(k).
Finally, with the criterion “none”, no rule is applied and only the gap data are returned.
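The two selection rules above can be sketched in base R as follows. This is only an illustration: the function names choose_k_tibshirani and choose_k_dandf are hypothetical and not part of lga, the inputs are assumed to be numeric vectors of gap(k) and s_k indexed by k = 1, …, K, and returning NA when the first rule is never satisfied is an assumption made here for the sketch.

```r
## Hypothetical helpers (not part of lga) illustrating the two rules.
## `gap` and `s` are numeric vectors indexed by k = 1, ..., K.

## "tibshirani": stop at the first k with gap(k) >= gap(k+1) - s_(k+1)
choose_k_tibshirani <- function(gap, s) {
  K <- length(gap)
  for (k in seq_len(K - 1)) {
    if (gap[k] >= gap[k + 1] - s[k + 1]) return(k)
  }
  NA_integer_   # assumption: no k satisfied the rule
}

## "DandF": smallest k with gap(k) >= gap(kstar) - s_(kstar),
## where kstar maximises the gap statistic
choose_k_dandf <- function(gap, s) {
  kstar <- which.max(gap)
  which(gap >= gap[kstar] - s[kstar])[1]
}

## Made-up values for illustration only
gap_vals <- c(0.10, 0.45, 0.50, 0.40)
s_vals   <- c(0.05, 0.08, 0.06, 0.05)
choose_k_tibshirani(gap_vals, s_vals)
choose_k_dandf(gap_vals, s_vals)
```

For these made-up values both rules select k = 2. The first rule needs gap values only up to k + 1, whereas the second needs the gap statistic for every k up to K before it can choose.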
As lga is ostensibly unsupervised in this case, the parameter niter is set to 20 to ensure convergence.
This function is parallel computing aware via the nnode argument, and works with the package snow. In order to use parallel computing, one of MPI (e.g. lamboot) or PVM is necessary. For further details, see the documentation for snow.
An object of class “gap” with components:
finished: a logical. For the “tibshirani” criterion, indicates whether a solution was found.
nclust: an integer giving the estimated number of clusters. Returns NA if nothing conclusive is found.
data: the original data set, scaled if specified in the arguments.
criteria: the criteria used.
Justin Harrington harringt@stat.ubc.ca
Tibshirani, R. and Walther, G. and Hastie, T. (2001) ‘Estimating the number of clusters in a data set via the gap statistic’, J. R. Statist. Soc. B 63, 411–423.
Dudoit, S. and Fridlyand, J. (2002) ‘A prediction-based resampling method for estimating the number of clusters in a dataset’, Genome Biology 3.
Van Aelst, S. and Wang, X. and Zamar, R. and Zhu, R. (2006) ‘Linear Grouping Using Orthogonal Regression’, Computational Statistics & Data Analysis 50, 1287–1312.
## Synthetic example
## Make a dataset with 2 clusters in 2 dimensions
library(MASS)
set.seed(1234)
X <- rbind(mvrnorm(n=100, mu=c(1, -2), Sigma=diag(0.1, 2) + 0.9),
           mvrnorm(n=100, mu=c(1, 1), Sigma=diag(0.1, 2) + 0.9))
gap(X, K=4, B=20)
## to run this using parallel processing with 4 nodes, the equivalent
## code would be
## Not run: gap(X, K=4, B=20, nnode=4)
## Quakes data (from package:datasets)
## Including the first two dimensions versus three dimensions
## yields different results
set.seed(1234)
## Not run:
gap(quakes[,1:2], K=4, B=20)
gap(quakes[,1:3], K=4, B=20)
## End(Not run)
library(maps)
lgaout1 <- lga(quakes[,1:2], k=3)
plot(lgaout1)
lgaout2 <- lga(quakes[,1:3], k=2)
plot(lgaout2)
## Let's put this in context
par(mfrow=c(1,2))
map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout1$cluster, col=lgaout1$cluster)
map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout2$cluster, col=lgaout2$cluster)