View source: R/PRE_FATE.speciesClustering_step1.R
PRE_FATE.speciesClustering_step1 | R Documentation |
This script is designed to create clusters of species based on a distance matrix between those species. Several metrics are computed to evaluate these clusters, and a graphic is produced to help the user choose the best number of clusters.
PRE_FATE.speciesClustering_step1(mat.species.DIST, opt.no_clust_max = 15)
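mat.species.DIST | a dist object, or a list of dist objects (one per data subset) |
opt.no_clust_max | (optional) default 15. Maximum number of clusters to evaluate |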
This function builds dendrograms from a dissimilarity (distance) matrix between species.
As with the PRE_FATE.speciesDistance method, clustering can be run on data subsets, provided that mat.species.DIST is given as a list of dist objects (instead of a single dist object).
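The two accepted input shapes can be sketched with base R alone. The trait table, the group labels and the object names below are hypothetical, and the subsets are far too small for a meaningful clustering; the point is only the structure of a single dist object versus a named list of dist objects.

```r
# Toy trait table (hypothetical values) for six species in two groups
set.seed(42)
traits <- matrix(runif(6 * 3), nrow = 6,
                 dimnames = list(paste0("sp", 1:6), c("t1", "t2", "t3")))
groups <- c(rep("Herbaceous", 3), rep("Chamaephyte", 3))

# Single dist object: all species would be clustered together
d.all <- dist(traits)

# Named list of dist objects: one element per data subset
d.list <- lapply(split(seq_len(nrow(traits)), groups),
                 function(idx) dist(traits[idx, , drop = FALSE]))

class(d.all)
sapply(d.list, class)
```

The names of the list are what identify each data subset, as in the example at the end of this page.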
The process is as follows :
Hierarchical clustering on the dissimilarity matrix is performed with the hclust function.
Several methods are available for the agglomeration : complete, ward.D, ward.D2, single, average (UPGMA), mcquitty (WPGMA), median (WPGMC) and centroid (UPGMC).
Mouchet et al. (2008) proposed a measure of dissimilarity between the input distance and the one obtained with the clustering, which must be minimized to help find the best clustering method :
1 - cor( \text{mat.species.DIST}, \text{clustering.DIST} ) ^ 2
For each agglomeration method, this measure is calculated. The
method that minimizes it is kept and used for further analyses (see
‘.pdf’ output file).
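As a rough illustration of this selection step, the sketch below computes the measure for each agglomeration method on random toy data. It assumes that clustering.DIST is the cophenetic distance of the dendrogram (stats::cophenetic); this page does not state how the function derives it, so treat that as an assumption rather than the package's implementation.

```r
set.seed(1)
x <- matrix(rnorm(20 * 4), nrow = 20)   # toy data, 20 observations
d <- dist(x)

methods <- c("complete", "ward.D", "ward.D2", "single",
             "average", "mcquitty", "median", "centroid")

# 1 - cor(input distance, cophenetic distance)^2 for each method
# (assumption: the clustering distance is the cophenetic distance)
score <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  1 - cor(as.vector(d), as.vector(cophenetic(hc)))^2
})

sort(score)
best.method <- names(which.min(score))
best.method
```

The method with the smallest score is the one whose dendrogram distorts the input distances the least.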
Once the hierarchical clustering is done, the number of clusters to keep must be chosen. To do so, several metrics are computed :
Dunn index (mdunn) : ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. Value between 0 and \infty, and should be maximized.
Meila's Variation of Information index (mVI) : measures the amount of information lost and gained in changing between two clusterings. Should be minimized.
Coefficient of determination (R2) : value between 0 and 1. Should be maximized.
Calinski and Harabasz index (ch) : the higher the value, the "better" the solution.
Corrected Rand index (Rand) : measures the similarity between two data clusterings. A value of 1 indicates that the two clusterings are exactly the same, while values near 0 indicate agreement no better than expected by chance (the corrected index can be slightly negative).
Average silhouette width (av.sil) : observations with a large silhouette width s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster. Should be maximized.
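To make two of these metrics concrete, here is a minimal sketch that evaluates the Dunn index and the average silhouette width for several numbers of clusters on toy data. The helper dunn_index is written by hand from the definition above and is hypothetical; the functions listed in the See Also section may compute these values differently. silhouette comes from the cluster package.

```r
library(cluster)  # provides silhouette()

set.seed(1)
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 4), ncol = 2))  # two toy groups of 15 points
d <- dist(x)
hc <- hclust(d, method = "average")

# Dunn index: smallest between-cluster distance / largest within-cluster distance
# (hypothetical helper, written from the definition)
dunn_index <- function(d, cl) {
  m <- as.matrix(d)
  same <- outer(cl, cl, "==")   # TRUE where both observations share a cluster
  pairs <- upper.tri(m)         # each pair of observations counted once
  min(m[pairs & !same]) / max(m[pairs & same])
}

ks <- 2:6
metrics <- do.call(rbind, lapply(ks, function(k) {
  cl <- cutree(hc, k = k)
  sil <- silhouette(cl, d)
  data.frame(no.clusters = k,
             mdunn = dunn_index(d, cl),
             av.sil = mean(sil[, "sil_width"]))
}))
metrics
```

Both columns are to be maximized, so a number of clusters that scores high on both is a good candidate.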
A graphic is produced, giving the values of these metrics as a function of the number of clusters used. Numbers of clusters are highlighted according to the values of the evaluation metrics to help the user make the optimal choice : the brighter (yellow-ish) the better (see ‘.pdf’ output file).
Mouchet M., Guilhaumon F., Villeger S., Mason N.W.H., Tomasini J.A. & Mouillot D., 2008. Towards a consensus for calculating dendrogram-based functional diversity indices. Oikos, 117, 794-800.
A list containing one list, one data.frame and two ggplot2 objects :
clust.dendrograms : a list with as many objects of class hclust as data subsets
clust.evaluation : a data.frame with the following columns :
  GROUP : name of data subset
  no.clusters : number of clusters used for the clustering
  variable : evaluation metrics' name
  value : value of evaluation metric
plot.clustMethod : ggplot2 object, representing the different values of metrics to choose the clustering method
plot.clustNo : ggplot2 object, representing the different values of metrics to choose the number of clusters
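As a sketch of how the evaluation table could be queried, the code below builds a small data.frame with the documented columns and keeps, for each group, the number of clusters with the highest average silhouette width. All values are hypothetical, and the assumption that the variable column holds the metric short names (such as av.sil) is not confirmed by this page.

```r
# Hypothetical values, laid out with the documented columns
clust.evaluation <- data.frame(
  GROUP = rep(c("Herbaceous", "Chamaephyte"), each = 6),
  no.clusters = rep(rep(2:4, each = 2), times = 2),
  variable = rep(c("av.sil", "mdunn"), times = 6),
  value = c(0.41, 0.20, 0.55, 0.31, 0.38, 0.25,
            0.62, 0.44, 0.47, 0.30, 0.35, 0.28)
)

# For one metric to maximize, keep the best number of clusters per group
sil <- subset(clust.evaluation, variable == "av.sil")
best <- do.call(rbind, lapply(split(sil, sil$GROUP), function(g) {
  g[which.max(g$value), c("GROUP", "no.clusters", "value")]
}))
best
```

In practice several metrics should be weighed together, which is what the graphic in the ‘.pdf’ output file is meant to support.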
One ‘PRE_FATE_CLUSTERING_STEP1_numberOfClusters.pdf’ file is created, containing two types of graphics :
one to account for the chosen clustering method
one for decision support, to help the user choose the adequate number of clusters to be given to the PRE_FATE.speciesClustering_step2 function
The function does not return ONE dendrogram (or as many as given dissimilarity structures) but a LIST with all tested numbers of clusters. One final dendrogram can then be obtained by passing this result as a parameter to the PRE_FATE.speciesClustering_step2 function.
Isabelle Boulangeat, Maya Guéguen
hclust, cutree, cluster.stats, dunn, PRE_FATE.speciesDistance, PRE_FATE.speciesClustering_step2
## Load example data
Champsaur_PFG = .loadData('Champsaur_PFG', 'RData')
## Species dissimilarity distances (niche overlap + traits distance)
tab.dist = list('Phanerophyte' = Champsaur_PFG$sp.DIST.P$mat.ALL
                , 'Chamaephyte' = Champsaur_PFG$sp.DIST.C$mat.ALL
                , 'Herbaceous' = Champsaur_PFG$sp.DIST.H$mat.ALL)
str(tab.dist)
as.matrix(tab.dist[[1]])[1:5, 1:5]
## Build dendrograms ---------------------------------------------------------
sp.CLUST = PRE_FATE.speciesClustering_step1(mat.species.DIST = tab.dist)
names(sp.CLUST)
str(sp.CLUST$clust.evaluation)
plot(sp.CLUST$plot.clustMethod)
plot(sp.CLUST$plot.clustNo)
## Not run:
require(foreach)
require(ggplot2)
require(ggdendro)
## One dendrogram plot per data subset
pp = foreach(x = names(sp.CLUST$clust.dendrograms)) %do%
{
  hc = sp.CLUST$clust.dendrograms[[x]]
  ## Add the group name to the title only when there are several groups
  p = ggdendrogram(hc, rotate = TRUE) +
    labs(title = paste0('Hierarchical clustering based on species distance '
                        , ifelse(length(names(sp.CLUST$clust.dendrograms)) > 1
                                 , paste0('(group ', x, ')')
                                 , '')))
  return(p)
}
plot(pp[[1]])
plot(pp[[2]])
plot(pp[[3]])
## End(Not run)