This function validates the clustering resulting from Variability-based Neighbor Clustering as per vnc by calculating silhouette coefficients, and returns the most appropriate clustering of the data as well as potential outliers.
optimal_clust(x, y)
x: an object of class hclust as generated by vnc.
y: the distance matrix which was also used as input for vnc.
The optimal_clust function identifies the optimal number of clusters, i.e., time stages, after Variability-based Neighbor Clustering (VNC, Gries and Hilpert 2008, 2012) has been conducted via the function vnc, and also identifies potential outliers. The optimal clustering is identified on the basis of silhouette values (Rousseeuw 1987), calculated using R's silhouette function.
Silhouette values provide information about the consistency of clusters by measuring the dissimilarity of an object to the cluster that it is in, compared to its dissimilarity to other clusters. A large silhouette value, i.e., a value close to 1, indicates that the object is clustered well as it is, and a negative value indicates that the object has been assigned to the wrong cluster. The silhouette coefficient of a cluster is moreover defined as the average of silhouettes in a cluster.
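To make the interpretation of silhouette values concrete, here is a minimal sketch on hypothetical toy data (the vector toy and both membership vectors are invented for illustration and are not part of the package). It uses silhouette from the cluster package and shows a well-separated clustering next to one in which a single point has deliberately been put into the wrong cluster.

```r
library(cluster)  # provides silhouette()

# Hypothetical toy data: two well-separated groups on a line
toy <- c(1, 2, 3, 10, 11, 12)
toy_dist <- dist(toy)

# A sensible assignment: the first three points vs. the last three
membership <- c(1L, 1L, 1L, 2L, 2L, 2L)
sil <- silhouette(membership, toy_dist)
sil_widths <- sil[, "sil_width"]                      # one silhouette value per object
cluster_coef <- tapply(sil_widths, membership, mean)  # silhouette coefficient per cluster

# A poor assignment: the point at 3 is grouped with the distant points
bad_membership <- c(1L, 1L, 2L, 2L, 2L, 2L)
bad_sil <- silhouette(bad_membership, toy_dist)[, "sil_width"]

print(round(sil_widths, 2))
print(round(bad_sil, 2))
```

In the first assignment every silhouette value is positive, whereas in the second the misplaced third point receives a negative value, matching the interpretation given above.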
The function optimal_clust takes an object of class hclust as generated by vnc and the corresponding distance matrix as calculated via distvnc as input. optimal_clust iterates through all clustering possibilities according to the possible number of clusters (i.e., merges) throughout the clustering process, which are accessed via cutree, and calculates the average of silhouette coefficients of all clusters in a clustering via the silhouette function. Eventually, optimal_clust identifies the clustering with the highest average silhouette coefficient as the best candidate, and returns information about the cluster memberships of data points with respect to the optimal clustering.
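The procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual implementation: hclust stands in for vnc so that the snippet runs without the package, the data in toy are hypothetical, and the names candidate_k, avg_coef, best_k and best_membership are invented here. The range of candidate cluster counts is restricted to 2 through n - 1 because silhouette is not defined for a single cluster or for one cluster per object.

```r
library(cluster)  # provides silhouette()

# Hypothetical toy data with three visibly separate groups
toy <- c(1, 2, 3, 10, 11, 12, 30, 31, 32)
toy_dist <- dist(toy)

# Assumption: hclust() is used here as a stand-in for vnc()
tree <- hclust(toy_dist, method = "average")

n <- length(toy)
candidate_k <- 2:(n - 1)  # silhouette() returns NA for k = 1 and k = n

# For each number of clusters, cut the tree and average the
# per-cluster silhouette coefficients
avg_coef <- sapply(candidate_k, function(k) {
  membership <- cutree(tree, k = k)
  sil <- silhouette(membership, toy_dist)
  mean(summary(sil)$clus.avg.widths)
})

# Pick the clustering with the highest average silhouette coefficient
best_k <- candidate_k[which.max(avg_coef)]
best_membership <- cutree(tree, k = best_k)

print(best_k)
print(best_membership)
```

Whether the real function averages per-cluster coefficients (as here, following the description above) or uses the overall average silhouette width is not stated in this page, so treat that choice as an assumption of the sketch.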
optimal_clust is moreover part of the implementation of the DiaHClust methodology (Schätzle and Booth 2019) in the form of the diahclust function, which provides an iterative approach to VNC in order to arrive at a suitable number of clusters for the identification of stages in language change.
An object of class optimal_clust, comprising a list with the following values:
opt_clust: the number of clusters in the clustering with the highest average silhouette coefficient.
final_clust: the clustering with the highest average silhouette coefficient; provides information about cluster memberships.
silcoef: the average silhouette coefficient of the optimal clustering, i.e., the highest average silhouette coefficient found.
Christin Schätzle
Stefan Th. Gries and Martin Hilpert. 2008. The identification of stages in diachronic data: variability-based neighbour clustering. Corpora, 3(1):59–81.
Stefan Th. Gries and Martin Hilpert. 2012. Variability-based neighbor clustering: A bottom-up approach to periodization in historical linguistics. In Terttu Nevalainen and Elizabeth Closs Traugott, editors, The Oxford Handbook of the History of English, pages 134–144. Oxford University Press, Oxford.
Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.
Christin Schätzle and Hannah Booth. 2019. DiaHClust: an iterative hierarchical clustering approach for identifying stages in language change. To appear.
silhouette
diahclust
distvnc
vnc
# assumes the package providing vnc(), optimal_clust() and the icelandic data is attached
data(icelandic)  # loads the data set; assigning the result of data() would overwrite it with its name
icelandic.cor <- cor(icelandic[, -1])  # [, -1] because rows are labeled
icelandic.dist <- dist(icelandic.cor)
icelandic.vnc <- vnc(icelandic.dist, method = "average")
plot(icelandic.vnc, hang = -1)  # plotting the resulting dendrogram
# cluster validation for identification of time stages
optimal <- optimal_clust(icelandic.vnc, icelandic.dist)