PRE_FATE.speciesClustering_step1: Create clusters based on dissimilarity matrix


View source: R/PRE_FATE.speciesClustering_step1.R

Description

This script is designed to create clusters of species based on a distance matrix between those species. Several metrics are computed to evaluate these clusters, and a graphic is produced to help the user choose the best number of clusters.

Usage

PRE_FATE.speciesClustering_step1(mat.species.DIST)

Arguments

mat.species.DIST

a dist object, or a list of dist objects (one for each GROUP value), corresponding to the dissimilarity distance between each pair of species.
Such an object can be obtained with the PRE_FATE.speciesDistance function.

Details

This function builds dendrograms from a dissimilarity distance matrix between species.

As with the PRE_FATE.speciesDistance method, clustering can be run on data subsets, provided that mat.species.DIST is given as a list of dist objects (instead of a single dist object).
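The two accepted input shapes can be illustrated with base R alone. The sketch below is only an illustration: the trait values, species names and group labels are made up, and plain Euclidean distance stands in for the dissimilarity that PRE_FATE.speciesDistance would normally compute.

```r
## Toy trait table (hypothetical values): 6 species, 2 traits
set.seed(1)
traits <- data.frame(
  height = runif(6),
  light  = runif(6),
  row.names = paste0("sp", 1:6)
)

## Hypothetical group label for each species
group <- c("A", "A", "A", "B", "B", "B")

## Shape 1: a single dist object, all species together
d.all <- dist(traits)

## Shape 2: a named list of dist objects, one per group
d.list <- lapply(split(traits, group), dist)

class(d.all)
sapply(d.list, class)
```

Either d.all or d.list has the structure expected for mat.species.DIST; whether the list must be named by group is an assumption here.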

The process is as follows :

1. Choice of the optimal clustering method

Hierarchical clustering on the dissimilarity matrix is performed with the hclust function.

  • Several methods are available for the agglomeration : complete, ward.D, ward.D2, single, average (UPGMA), mcquitty (WPGMA), median (WPGMC) and centroid (UPGMC).

  • Mouchet et al. (2008) proposed a measure of dissimilarity between the input distance and the one obtained with the clustering, which must be minimized to help find the best clustering method :

    1 - cor(mat.species.DIST, clustering.DIST)^2

For each agglomeration method, this measure is calculated. The method that minimizes it is kept and used for further analyses (see ‘.pdf’ output file).
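The selection criterion above can be sketched with base R as follows. This is a minimal illustration on random toy data, not the package's own code; in particular, using cophenetic() to obtain the distance implied by the dendrogram is an assumption about how clustering.DIST is derived.

```r
## Toy dissimilarity matrix (hypothetical data): 10 observations
set.seed(42)
d.input <- dist(matrix(rnorm(40), nrow = 10))

## Agglomeration methods accepted by hclust
methods <- c("complete", "ward.D", "ward.D2", "single",
             "average", "mcquitty", "median", "centroid")

## For each method, compute 1 - cor(input distance, dendrogram distance)^2
crit <- sapply(methods, function(m) {
  hc <- hclust(d.input, method = m)
  d.clust <- cophenetic(hc)  # distance implied by the dendrogram (assumption)
  1 - cor(as.vector(d.input), as.vector(d.clust))^2
})

## Lower is better: the method with the smallest value is kept
sort(crit)
best.method <- names(which.min(crit))
best.method
```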

2. Evaluation of the clustering

Once the hierarchical clustering is done, the number of clusters to keep must be chosen.
To do that, several metrics are computed :

  • Dunn index (mdunn) : ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. Value between 0 and ∞, and should be maximized.

  • Meila's Variation of Information index (mVI) : measures the amount of information lost and gained in changing between two clusterings. Should be minimized.

  • Coefficient of determination (R2) : value between 0 and 1. Should be maximized.

  • Calinski and Harabasz index (ch) : the higher the value, the "better" is the solution.

  • Corrected rand index (Rand) : measures the similarity between two data clusterings. Value between 0 and 1, with 0 indicating that the two data clusters do not agree on any pair of points and 1 indicating that the data clusters are exactly the same.

  • Average silhouette width (av.sil) : Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster. Should be maximized.

A graphic is produced, showing the values of these metrics as a function of the number of clusters. For each metric, the numbers of clusters whose values rank among the 3 best are highlighted to help the user make the best choice (see ‘.pdf’ output file).
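To make the evaluation step concrete, the sketch below computes two of the metrics listed above (Dunn index and average silhouette width) for a range of cluster numbers. It is only an illustration on made-up data: the Dunn index is written by hand from its definition, silhouette() comes from the cluster package, and the helper name dunn.index is hypothetical. The function itself may rely on other implementations (see cluster.stats and dunn in See Also).

```r
library(cluster)  # for silhouette(); a recommended package shipped with R

## Toy data (hypothetical): three loose groups of 2-D points
set.seed(123)
xy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
            matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
            matrix(rnorm(20, mean = 6, sd = 0.3), ncol = 2))
d  <- dist(xy)
hc <- hclust(d, method = "average")

## Dunn index: smallest between-cluster distance
## divided by the largest within-cluster distance
dunn.index <- function(d, cl) {
  m    <- as.matrix(d)
  same <- outer(cl, cl, "==")  # TRUE where both points share a cluster
  min(m[!same]) / max(m[same])
}

## Cut the tree at 2 to 8 clusters and compute both metrics each time
eval.tab <- do.call(rbind, lapply(2:8, function(k) {
  cl <- cutree(hc, k = k)
  data.frame(no.clusters = k,
             mdunn  = dunn.index(d, cl),
             av.sil = mean(silhouette(cl, d)[, "sil_width"]))
}))
eval.tab
```

Both metrics are to be maximized, so the rows with the largest mdunn and av.sil values point to candidate numbers of clusters.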



Mouchet M., Guilhaumon F., Villeger S., Mason N.W.H., Tomasini J.A. & Mouillot D., 2008. Towards a consensus for calculating dendrogram-based functional diversity indices. Oikos, 117, 794-800.

Value

A list containing one list, one data.frame with the following columns, and two ggplot2 objects :

clust.dendrograms

a list with as many objects of class hclust as data subsets

clust.evaluation

a data.frame with the following columns :

GROUP

name of data subset

no.clusters

number of clusters used for the clustering

variable

evaluation metrics' name

value

value of evaluation metric

plot.clustMethod

ggplot2 object, representing the different values of metrics to choose the clustering method

plot.clustNo

ggplot2 object, representing the different values of metrics to choose the number of clusters

One ‘PRE_FATE_CLUSTERING_STEP1_numberOfClusters.pdf’ file is created containing two types of graphics :

clusteringMethod

to account for the chosen clustering method

numberOfClusters

for decision support, to help the user choose an adequate number of clusters to pass to the PRE_FATE.speciesClustering_step2 function
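The clust.evaluation table is in long format (one row per group, number of clusters and metric), so it can be queried with base R. The sketch below uses a small hand-made table with the same column names; the group name and all values are invented for illustration and do not come from the function.

```r
## Hypothetical evaluation table in the long format described above
## (values are made up for illustration only)
clust.evaluation <- data.frame(
  GROUP       = "all",
  no.clusters = rep(2:5, times = 2),
  variable    = rep(c("mdunn", "av.sil"), each = 4),
  value       = c(0.20, 0.35, 0.30, 0.10,   # mdunn for 2 to 5 clusters
                  0.40, 0.55, 0.50, 0.25)   # av.sil for 2 to 5 clusters
)

## Both metrics here should be maximized:
## keep, for each metric, the row with the highest value
best <- do.call(rbind,
                lapply(split(clust.evaluation, clust.evaluation$variable),
                       function(x) x[which.max(x$value), ]))
best

## Wide view: one row per number of clusters, one column per metric
wide <- reshape(clust.evaluation,
                idvar = c("GROUP", "no.clusters"),
                timevar = "variable", direction = "wide")
wide
```

Metrics that should be minimized (such as mVI) would need which.min instead of which.max.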

Note

The function does not return ONE dendrogram (or as many as given dissimilarity structures) but a LIST with all tested numbers of clusters. One final dendrogram can then be obtained using this result as a parameter in the PRE_FATE.speciesClustering_step2 function.

Author(s)

Maya Guéguen

See Also

hclust, cutree, cluster.stats, dunn, PRE_FATE.speciesDistance, PRE_FATE.speciesClustering_step2

Examples

## Load example data
data(DATASET_Bauges_PFG)

## Species dissimilarity distance (niche overlap + traits distance)
tab.dist = DATASET_Bauges_PFG$dom.dist_total
str(tab.dist)
as.matrix(tab.dist[[1]])[1:5, 1:5]

## Build dendrograms -------------------------------------------------------------------------
sp.CLUST = PRE_FATE.speciesClustering_step1(mat.species.DIST = tab.dist)
names(sp.CLUST)

## Not run: 
require(foreach)
require(ggplot2)
require(ggdendro)
pp = foreach(x = names(sp.CLUST$clust.dendrograms)) %do%
{
  hc = sp.CLUST$clust.dendrograms[[x]]
  pp = ggdendrogram(hc, rotate = TRUE) +
    labs(title = paste0("Hierarchical clustering based on species distance "
                        , ifelse(length(names(sp.CLUST$clust.dendrograms)) > 1
                                 , paste0("(group ", x, ")")
                                 , "")))
  return(pp)
}
plot(pp[[1]])
plot(pp[[2]])
plot(pp[[3]])

## End(Not run)

str(sp.CLUST$clust.evaluation)

plot(sp.CLUST$plot.clustMethod)
plot(sp.CLUST$plot.clustNo)

MayaGueguen/RFate documentation built on Oct. 17, 2020, 8:06 a.m.