optimal_clust: Cluster validation for cluster identification


View source: R/diahclust.R

Description

This function validates the clustering produced by Variability-based Neighbor Clustering (see vnc) by calculating silhouette coefficients. It returns the most appropriate clustering of the data as well as potential outliers.

Usage

optimal_clust(x, y)

Arguments

x

an object of class hclust as generated by the vnc function.

y

the distance matrix which was also used as input for vnc (an object of class dist).

Details

The optimal_clust function identifies the optimal number of clusters, i.e., time stages, after Variability-based Neighbor Clustering (VNC, Gries and Hilpert 2008, 2012) has been conducted via the function vnc and also identifies potential outliers. The optimal clustering is identified on the basis of the calculation of silhouette values (Rousseeuw 1987) using R's silhouette function.

Silhouette values provide information about the consistency of clusters by measuring the dissimilarity of an object to the cluster that it is in, compared to its dissimilarity to other clusters. A large silhouette value, i.e., a value close to 1, indicates that the object is clustered well as it is, whereas a negative value indicates that the object has probably been assigned to the wrong cluster. The silhouette coefficient of a cluster is defined as the average of the silhouette values of its members.
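As a minimal sketch of this definition (not the package's own code): following Rousseeuw (1987), the silhouette of an object i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean dissimilarity of i to the other members of its own cluster and b(i) is the smallest mean dissimilarity of i to the members of any other cluster. The toy data and the helper sil_by_hand below are made up for illustration; only silhouette from the cluster package is an existing function.

```r
library(cluster)  # provides silhouette()

# Toy data (hypothetical): six points on a line forming two obvious groups
pts <- c(1, 2, 3, 10, 11, 12)
d   <- dist(pts)
cl  <- c(1L, 1L, 1L, 2L, 2L, 2L)  # cluster memberships

# Hypothetical helper: silhouette of object i computed from the definition
sil_by_hand <- function(i, d, cl) {
  dm  <- as.matrix(d)
  own <- setdiff(which(cl == cl[i]), i)   # other members of i's cluster
  if (length(own) == 0) return(0)         # singleton clusters get s(i) = 0
  a <- mean(dm[i, own])                   # mean dissimilarity within own cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(dm[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

by_hand <- sapply(seq_along(cl), sil_by_hand, d = d, cl = cl)
by_pkg  <- silhouette(cl, d)[, "sil_width"]  # same values from the cluster package
```

Both vectors contain one value per object; averaging them within a cluster gives that cluster's silhouette coefficient.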

The function optimal_clust takes as input an object of class hclust as generated by vnc and the corresponding distance matrix as calculated via distvnc. optimal_clust iterates through all clusterings that are possible given the number of clusters (i.e., merges) arising throughout the clustering process, which are accessed via cutree, and uses the silhouette function to calculate, for each clustering, the average of the silhouette coefficients of all its clusters. Finally, optimal_clust identifies the clustering with the highest average silhouette coefficient as the best candidate and returns information about the cluster memberships of the data points with respect to this optimal clustering.
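The procedure described above can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in DiaHClust: the helper best_cut and the toy data are hypothetical, hclust stands in for vnc, and averaging the per-cluster coefficients with equal weight is one reading of the description. Only cutree, hclust, dist and silhouette (from the cluster package) are existing functions.

```r
library(cluster)  # provides silhouette()

# Hypothetical helper: try every cut of the tree and keep the best one
best_cut <- function(hc, d) {
  n  <- length(hc$order)
  ks <- 2:(n - 1)  # silhouettes are only defined for 2 to n - 1 clusters
  avg_sil <- sapply(ks, function(k) {
    sil <- silhouette(cutree(hc, k = k), d)
    # silhouette coefficient per cluster, then the average over clusters
    mean(tapply(sil[, "sil_width"], sil[, "cluster"], mean))
  })
  best <- which.max(avg_sil)
  list(k          = ks[best],
       membership = cutree(hc, k = ks[best]),
       avg_sil    = avg_sil[best])
}

# Toy data (hypothetical): three well-separated groups on a line
pts <- c(1, 2, 3, 20, 21, 22, 50, 51, 52)
d   <- dist(pts)
hc  <- hclust(d, method = "average")  # stand-in for the output of vnc
res <- best_cut(hc, d)
res$k           # number of clusters with the highest average coefficient
res$membership  # cluster membership of each data point
```

The three elements of the returned list loosely mirror opt_clust, final_clust and silcoef in the Value section.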

optimal_clust is also part of the implementation of the DiaHClust methodology (Schätzle and Booth 2019) in the diahclust function, which provides an iterative approach to VNC in order to arrive at a suitable number of clusters for the identification of stages in language change.

Value

An object of class optimal_clust, including a list with the following values:

opt_clust

the number of clusters in the clustering with the highest average silhouette coefficient.

final_clust

the clustering with the highest average silhouette coefficient; provides information about cluster memberships.

silcoef

the average silhouette coefficient of the clustering with the highest average silhouette coefficient.

Author(s)

Christin Schätzle

References

Stefan Th. Gries and Martin Hilpert. 2008. The identification of stages in diachronic data: variability-based neighbour clustering. Corpora, 3(1):59–81.

Stefan Th. Gries and Martin Hilpert. 2012. Variability-based neighbor clustering: A bottom-up approach to periodization in historical linguistics. In Terttu Nevalainen and Elizabeth Closs Traugott, editors, The Oxford Handbook of the History of English, pages 134–144. Oxford University Press, Oxford.

Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Christin Schätzle and Hannah Booth. 2019. DiaHClust: an iterative hierarchical clustering approach for identifying stages in language change. To appear.

See Also

silhouette, diahclust, distvnc, vnc

Examples

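library(DiaHClust)
data(icelandic)  #loads the data set 'icelandic'; assigning the result of data() would overwrite it with its name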
icelandic.cor=cor(icelandic[,-1])  #[,-1] because rows are labeled
icelandic.dist=dist(icelandic.cor)

icelandic.vnc=vnc(icelandic.dist, method="average")
plot(icelandic.vnc, hang=-1) #plotting the resulting dendrogram

#cluster validation for identification of time stages
optimal=optimal_clust(icelandic.vnc, icelandic.dist) 

christinschaetzle/DiaHClust documentation built on May 15, 2020, 11:20 p.m.