diahclust: DiaHClust: an iterative hierarchical clustering approach for...
In christinschaetzle/DiaHClust: Hierarchical clustering method for identifying stages in language change

This function implements the DiaHClust methodology (Schätzle and Booth 2019) for the identification of stages in historical linguistic change. DiaHClust is based on the iterative application of Variability-based Neighbor Clustering (VNC, Gries and Hilpert 2008, 2012) in combination with a cluster validation process using silhouette values in order to provide a multi-layered perspective on language change, from text-level to broader time stages, while also respecting outliers and genre effects.

diahclust(x, y, method = c("single", "complete", "average", "median", "ward.D", "ward.D2", "mcquitty", "centroid"))

`x`	an object of class optimal_clust.
`y`	a data matrix containing vectors of syntactic change (or other numerical data) as rows. The names of the vectors have to consist of a year date followed by a dot and a string of characters (e.g., the text name).
`method`	the agglomeration method to be used. All methods from `hclust` are available for `diahclust`, i.e., "single", "complete", "average", "median", "ward.D", "ward.D2", "mcquitty", and "centroid".
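
The naming convention required for `y` can be checked before clustering. The following is a minimal sketch in base R; the vector names are hypothetical and only illustrate the "year, dot, text name" pattern described above.

```r
# Hypothetical vector names following the "<year>.<text name>" convention.
vector_names <- c("1150.FirstGrammar", "1210.Jartein", "1350.Finnbogi")

# Everything before the first dot is read as the year ...
years <- as.integer(sub("\\..*$", "", vector_names))
# ... and everything after the first dot as the text name.
texts <- sub("^[^.]*\\.", "", vector_names)

# TRUE for every name that consists of digits, a dot, and at least one more character.
well_formed <- grepl("^[0-9]+\\..+$", vector_names)

print(data.frame(name = vector_names, year = years, text = texts, ok = well_formed))
```

Names that fail the check can be corrected before the matrix is passed to `diahclust`.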

When the results of the optimal_clust function indicate that an initial clustering via vnc yields 10 or more clusters, the clustering process can be continued via the diahclust function. diahclust takes two inputs: the output created by the optimal_clust function after the first clustering, and the original data matrix, which is aggregated after each iteration. Thus, before applying diahclust, clustering via vnc has to be performed and validated by optimal_clust.
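
The validation step can be pictured with a small stand-in example. The sketch below does not use vnc or optimal_clust: it assumes base R's `hclust` as a substitute for VNC and `silhouette()` from the cluster package, and the toy data are made up. It selects the number of clusters with the highest average silhouette width, which is one common way to use silhouette values; the exact criterion inside optimal_clust may differ.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Made-up data: three well-separated groups of ten two-dimensional points.
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)
d <- dist(toy)

# Stand-in for VNC: ordinary agglomerative clustering (no adjacency constraint).
tree <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters.
candidate_k <- 2:8
avg_width <- sapply(candidate_k, function(k) {
  sil <- silhouette(cutree(tree, k = k), d)
  mean(sil[, "sil_width"])
})

# The candidate with the highest average width is taken as the best cut.
best_k <- candidate_k[which.max(avg_width)]
print(data.frame(k = candidate_k, avg_width = round(avg_width, 3)))
print(best_k)
```

In the DiaHClust workflow, a result of 10 or more clusters at this point is the signal to continue with diahclust.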

In diahclust, data points which belong to a single cluster are aggregated by averaging the corresponding syntactic vectors in the underlying dataset. This is done via aggregate_data. To keep track of the texts and time stages which form clusters across the iterations, the names of the new vectors consist of the sequence of the names of the aggregated vectors. The previously applied process of VNC with respect to the new dataset is repeated by calling cor, dist, and vnc in diahclust. Moreover, an agglomeration method has to be specified. When the agglomeration method chosen for VNC and DiaHClust is not “average”, a different aggregation method, e.g., the minimum with single linkage clustering, should be applied.
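
The aggregation step can be sketched in base R as follows. This is not the package's aggregate_data function: the helper `aggregate_by_cluster`, the toy values, the layout with one column per text, and the underscore used to join names are all assumptions made for illustration.

```r
# Made-up data: 3 features (rows) for 5 texts (columns).
toy <- matrix(
  c(0.10, 0.20, 0.70,
    0.12, 0.22, 0.66,
    0.30, 0.30, 0.40,
    0.32, 0.28, 0.40,
    0.50, 0.40, 0.10),
  nrow = 3,
  dimnames = list(c("f1", "f2", "f3"),
                  c("1150.A", "1210.B", "1350.C", "1400.D", "1650.E"))
)

# Hypothetical cluster membership of the five texts.
membership <- c(1, 1, 2, 2, 3)

# Hypothetical helper: combine the columns of each cluster with `fun`
# and name the result after the sequence of aggregated columns.
aggregate_by_cluster <- function(m, membership, fun = mean, sep = "_") {
  groups <- split(seq_len(ncol(m)), membership)
  out <- sapply(groups, function(idx) apply(m[, idx, drop = FALSE], 1, fun))
  colnames(out) <- vapply(
    groups,
    function(idx) paste(colnames(m)[idx], collapse = sep),
    character(1),
    USE.NAMES = FALSE
  )
  out
}

agg_mean <- aggregate_by_cluster(toy, membership)             # for "average"
agg_min  <- aggregate_by_cluster(toy, membership, fun = min)  # e.g. for "single"
print(agg_mean)
```

Passing `fun = min` shows how a different aggregation could be swapped in when the agglomeration method is not "average".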

diahclust automatically plots the clustering as a dendrogram. The labels on the dendrogram are abbreviated for better visibility, representing the range of previously aggregated vectors, with the oldest and the youngest text in the range connected via a hyphen. The resulting clustering is again evaluated using the optimal_clust function, which returns the cluster memberships listing the full range of texts in the clusters. The application of this process is repeated until the final evaluation arrives at an optimal number of clusters less than 10. In this iterative process, the clusters, i.e., time stages, can be inspected at each step of the iteration, allowing one to track the composition of the clusters with respect to the individual texts from the first iteration onwards.
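
The label abbreviation described above can be illustrated with a short base R sketch. The helper `abbreviate_range` is hypothetical, and it assumes that the names of aggregated vectors are joined with an underscore; the separator actually used by the package may differ.

```r
# Hypothetical helper: shorten an aggregated name to "oldest-youngest".
abbreviate_range <- function(aggregated_name, sep = "_") {
  parts <- strsplit(aggregated_name, sep, fixed = TRUE)[[1]]
  # A single text needs no abbreviation.
  if (length(parts) == 1) return(parts)
  # The year is the part of each name before the first dot.
  years <- as.integer(sub("\\..*$", "", parts))
  paste(parts[which.min(years)], parts[which.max(years)], sep = "-")
}

short_label  <- abbreviate_range("1150.A_1210.B_1250.C")
single_label <- abbreviate_range("1650.E")
print(c(short_label, single_label))
```

The full sequence of names stays available in the cluster memberships, so nothing is lost by shortening the dendrogram labels.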

After the last iteration, diahclust returns an object of class hclust, describing the tree produced by the final clustering process. The object is a list containing the following elements (information mostly taken from the hclust help page, see also vnc):

`merge`	an n-1 by 2 matrix. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.
`height`	a set of n-1 real values (non-decreasing for ultrametric trees). The clustering height: that is, the value of the criterion associated with the clustering method for the particular agglomeration.
`order`	a vector giving the permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches. The permutations of the observations are adjusted with `vnc` in order to maintain the diachronic ordering.
`labels`	labels for each of the objects being clustered.
`call`	the call which produced the result.
`method`	the cluster method that has been used.
`dist.method`	the distance that has been used to create d (only returned if the distance object has a "method" attribute).
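
The components listed above can be inspected on any object of class hclust. The sketch below uses base R's `hclust` on three made-up values rather than the output of diahclust, so it only illustrates how to read the elements.

```r
# Three made-up observations, named in the "<year>.<text name>" style.
toy <- c("1150.A" = 1, "1200.B" = 2, "1600.C" = 10)
tree <- hclust(dist(toy), method = "average")

# Each row is one merge step: negative entries are single observations,
# positive entries refer to the cluster formed at that earlier step.
print(tree$merge)

# Height at which each merge happened.
print(tree$height)

# Labels in plotting order.
print(tree$labels[tree$order])

# Cluster membership when the tree is cut into two groups.
groups <- cutree(tree, k = 2)
print(groups)
```

Here the first merge joins the two close observations, and the second merge joins the remaining observation with the cluster from step 1.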

Christin Schätzle

Stefan Th. Gries and Martin Hilpert. 2008. The identification of stages in diachronic data: variability-based neighbour clustering. Corpora, 3(1):59–81.

Stefan Th. Gries and Martin Hilpert. 2012. Variability-based neighbor clustering: A bottom-up approach to periodization in historical linguistics. In Terttu Nevalainen and Elizabeth Closs Traugott, editors, The Oxford Handbook of the History of English, pages 134–144. Oxford University Press, Oxford.

Christin Schätzle and Hannah Booth. 2019. DiaHClust: an iterative hierarchical clustering approach for identifying stages in language change. To appear.

hclust, diahclust, dist, vnc, optimal_clust

library(DiaHClust)
data(icelandic)  #loads the dataset into the workspace; data() itself only returns the dataset name
icelandic.cor=cor(icelandic[,-1])  #[,-1] because rows are labeled
icelandic.dist=dist(icelandic.cor)

icelandic.vnc=vnc(icelandic.dist, method="average")
plot(icelandic.vnc, hang=-1) #plotting the resulting dendrogram

#cluster validation for identification of time stages
optimal=optimal_clust(icelandic.vnc, icelandic.dist) 

#Iterative clustering with DiaHClust methodology
icelandic.diahclust=diahclust(optimal, icelandic[,-1], method="average")