find.cluster: cluster identification using successive Kmeans
Description
These functions implement the clustering procedure used in
Discriminant Analysis of Principal Components (DAPC, Jombart et
al. 2010). This procedure consists in running successive Kmeans with
an increasing number of clusters (k
), after transforming data
using a principal component analysis (PCA). For each model, a
statistical measure of goodness of fit (by default, BIC) is computed,
which allows to choose the optimal k
. See details
for a
description of how to select the optimal k
and
vignette("adegenetdapc")
for a tutorial.
Optionally, hierarchical clustering can be sought by providing a prior
clustering of individuals (argument clust
). In such case, clusters will
be sought within each prior group.
The Kmeans procedure used in find.clusters
is kmeans
function
from the stats
package. The PCA function is dudi.pca
from the
ade4
package, except for genlight objects which
use the glPca
procedure from adegenet.
find.clusters
is a generic function with methods for the
following types of objects:
 data.frame
(only numeric data)
 matrix
(only numeric data)
 genind
objects (genetic markers)
 genlight
objects (genomewide SNPs)
Usage
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27  ## S3 method for class 'data.frame'
find.clusters(x, clust=NULL, n.pca=NULL,
n.clust=NULL, stat=c("BIC","AIC", "WSS"),
choose.n.clust=TRUE,criterion=c("diffNgroup",
"min","goesup", "smoothNgoesup", "goodfit"),
max.n.clust=round(nrow(x)/10), n.iter=1e5, n.start=10,
center=TRUE, scale=TRUE, pca.select=c("nbEig","percVar"),
perc.pca=NULL, ..., dudi=NULL)
## S3 method for class 'matrix'
find.clusters(x, ...)
## S3 method for class 'genind'
find.clusters(x, clust=NULL, n.pca=NULL, n.clust=NULL,
stat=c("BIC","AIC", "WSS"), choose.n.clust=TRUE,
criterion=c("diffNgroup", "min","goesup", "smoothNgoesup",
"goodfit"), max.n.clust=round(nrow(x@tab)/10), n.iter=1e5,
n.start=10, scale=FALSE, truenames=TRUE, ...)
## S3 method for class 'genlight'
find.clusters(x, clust=NULL, n.pca=NULL,
n.clust=NULL, stat=c("BIC", "AIC",
"WSS"),choose.n.clust=TRUE, criterion=c("diffNgroup",
"min","goesup","smoothNgoesup", "goodfit"),
max.n.clust=round(nInd(x)/10), n.iter=1e5,n.start=10,
scale=FALSE, pca.select=c("nbEig","percVar"),
perc.pca=NULL,glPca=NULL, ...)

Arguments
x 

clust 
an optional 
n.pca 
an 
n.clust 
an optinal 
stat 
a 
choose.n.clust 
a 
criterion 
a 
max.n.clust 
an 
n.iter 
an 
n.start 
an 
center 
a 
scale 
a 
pca.select 
a 
perc.pca 
a 
truenames 
a 
... 
further arguments to be passed to other functions. For

dudi 
optionally, a multivariate analysis with the class

glPca 
an optional 
Details
=== ON THE SELECTION OF K ===
(where K is the 'optimal' number of clusters)
So far, the analysis of data simulated under various population genetics models (see reference) suggested an ad hoc rule for the selection of the optimal number of clusters. First important result is that BIC seems for efficient than AIC and WSS to select the appropriate number of clusters (see example). The rule of thumb consists in increasing K until it no longer leads to an appreciable improvement of fit (i.e., to a decrease of BIC). In the most simple models (island models), BIC decreases until it reaches the optimal K, and then increases. In these cases, our rule amounts to choosing the lowest K. In other models such as stepping stones, the decrease of BIC often continues after the optimal K, but is much less steep.
An alternative approach is the automatic selection based on a fixed
criterion. Note that, in any case, it is highly recommended to look at
the graph of the BIC for different numbers of clusters as displayed
during the interactive cluster selection.
To use automated selection, set choose.n.clust
to FALSE and specify
the criterion
you want to use, from the following values:
 "diffNgroup": differences between successive values of the summary
statistics (by default, BIC) are splitted into two groups using a
Ward's clustering method (see ?hclust
), to differentiate sharp
decrease from mild decreases or increases. The retained K is the one
before the first group switch. Appears to work well for
island/hierarchical models, and decently for isolation by distance
models, albeit with some unstability. Can be impacted by an initial,
very sharp decrease of the test statistics. IF UNSURE ABOUT THE
CRITERION TO USE, USE THIS ONE.
 "min": the model with the minimum summary statistics (as specified
by stat
argument, BIC by default) is retained. Is likely to
work for simple island model, using BIC. It is likely to fail in
models relating to stepping stones, where the BIC always decreases
(albeit by a small amount) as K increases. In general, this approach
tends to overestimate the number of clusters.
 "goesup": the selected model is the K after which increasing the number of clusters leads to increasing the summary statistics. Suffers from inaccuracy, since i) a steep decrease might follow a small 'bump' of increase of the statistics, and ii) increase might never happen, or happen after negligible decreases. Is likely to work only for clearcut island models.
 "smoothNgoesup": a variant of "goesup", in which the summary statistics is first smoothed using a lowess approach. Is meant to be more accurate than "goesup" as it is less prone to stopping to small 'bumps' in the decrease of the statistics.
 "goodfit": another criterion seeking a good fit with a minimum number of clusters. This approach does not rely on differences between successive statistics, but on absolute fit. It selects the model with the smallest K so that the overall fit is above a given threshold.
Value
The class find.clusters
is a list with the following
components:
Kstat 
a 
stat 
a 
grp 
a 
size 
an 
Author(s)
Thibaut Jombart t.jombart@imperial.ac.uk
References
Jombart T, Devillard S and Balloux F (2010) Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11:94. doi:10.1186/147121561194
See Also
 dapc
: implements the DAPC.
 scatter.dapc
: graphics for DAPC.
 dapcIllus
: dataset illustrating the DAPC and find.clusters
.
 eHGDP
: dataset illustrating the DAPC and find.clusters
.
 kmeans
: implementation of Kmeans in the stat package.
 dudi.pca
: implementation of PCA in the ade4 package.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56  ## Not run:
## THIS ONE TAKES A FEW MINUTES TO RUN ##
data(eHGDP)
## here, n.clust is specified, so that only on K value is used
grp < find.clusters(eHGDP, max.n=30, n.pca=200, scale=FALSE,
n.clust=4) # takes about 2 minutes
names(grp)
grp$Kstat
grp$stat
## to try different values of k (interactive)
grp < find.clusters(eHGDP, max.n=50, n.pca=200, scale=FALSE)
## and then, to plot BIC values:
plot(grp$Kstat, type="b", col="blue")
## ANOTHER SIMPLE EXAMPLE ##
data(sim2pop) # this actually contains 2 pop
## DETECTION WITH BIC (clear result)
foo.BIC < find.clusters(sim2pop, n.pca=100, choose=FALSE)
plot(foo.BIC$Kstat, type="o", xlab="number of clusters (K)", ylab="BIC",
col="blue", main="Detection based on BIC")
points(2, foo.BIC$Kstat[2], pch="x", cex=3)
mtext(3, tex="'X' indicates the actual number of clusters")
## DETECTION WITH AIC (less clearcut)
foo.AIC < find.clusters(sim2pop, n.pca=100, choose=FALSE, stat="AIC")
plot(foo.AIC$Kstat, type="o", xlab="number of clusters (K)",
ylab="AIC", col="purple", main="Detection based on AIC")
points(2, foo.AIC$Kstat[2], pch="x", cex=3)
mtext(3, tex="'X' indicates the actual number of clusters")
## DETECTION WITH WSS (less clearcut)
foo.WSS < find.clusters(sim2pop, n.pca=100, choose=FALSE, stat="WSS")
plot(foo.WSS$Kstat, type="o", xlab="number of clusters (K)", ylab="WSS
(residual variance)", col="red", main="Detection based on WSS")
points(2, foo.WSS$Kstat[2], pch="x", cex=3)
mtext(3, tex="'X' indicates the actual number of clusters")
## TOY EXAMPLE FOR GENLIGHT OBJECTS ##
x < glSim(100,500,500)
x
plot(x)
grp < find.clusters(x, n.pca=100, choose=FALSE, stat="BIC")
plot(grp$Kstat, type="o", xlab="number of clusters (K)",
ylab="BIC",main="find.clusters on a genlight object\n(two groups)")
## End(Not run)
