vnc: Variability-based Neighbor Clustering
In christinschaetzle/DiaHClust: Hierarchical clustering method for identifying stages in language change



Implementation of the Variability-based Neighbor Clustering (VNC) algorithm developed by Gries and Hilpert (2008, 2012) for identifying stages in language change. The implementation is based on hclust and forms part of the DiaHClust methodology (Schätzle and Booth 2019).

vnc(d, method = c("single", "complete", "average", "median", "ward.D", "ward.D2", "mcquitty", "centroid"))

`d`	a distance matrix as produced by `dist`.
`method`	the agglomeration method to be used. All methods from `hclust` are available for `vnc`, i.e., "single", "complete", "average", "median", "ward.D", "ward.D2", "mcquitty", and "centroid".

The VNC approach is implemented by manipulating individual steps in the workflow behind R's standard agglomerative hierarchical clustering function hclust. In the vector-based approach to VNC by Gries and Hilpert (2008, 2012), a correlation statistic is calculated before clustering the data. This is also often done when clustering vectorial data with hclust (see, e.g., Baayen 2008). Thus, a correlation matrix should first be calculated via cor and then turned into a distance matrix via dist, which serves as input for vnc. For the analysis of syntactic change, the correlation matrix should be calculated from a data matrix in which each column represents a vector containing the changing syntactic features extracted from one text; see, e.g., the icelandic data set in the examples below. In the data matrix, the vectors have to be ordered from left to right according to the time stamp of the text. The time stamp should be encoded in the vector name, i.e., the name of the corresponding column in the data matrix. For the application of the full DiaHClust methodology, the vector name has to begin with a four-digit year followed by a dot and the text name, e.g., "1250.STURLUNGA", which makes it easy to identify individual texts in the clustering (this roughly corresponds to token IDs in Penn-style treebanks).
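The input preparation described above can be sketched as follows. This is a minimal illustration on made-up data, not the icelandic data set: the matrix `toy`, its feature names, and its text names are hypothetical, and only base R functions (`cor`, `dist`) are used.

```r
# Hypothetical toy data: rows are syntactic features, columns are texts.
# Column names follow the "<four-digit year>.<text name>" convention.
set.seed(1)
toy <- matrix(runif(20), nrow = 5,
              dimnames = list(paste0("feature", 1:5),
                              c("1150.TEXTA", "1250.TEXTB",
                                "1350.TEXTC", "1450.TEXTD")))

# Check the naming convention and the left-to-right temporal ordering.
stopifnot(all(grepl("^[0-9]{4}\\.", colnames(toy))))
years <- as.integer(substr(colnames(toy), 1, 4))
stopifnot(!is.unsorted(years))

# Correlate the text vectors (columns), then turn the correlation
# matrix into a distance object suitable as input for clustering.
toy.cor  <- cor(toy)
toy.dist <- dist(toy.cor)
```

The resulting `toy.dist` carries the column names as labels, so the time stamps remain visible in any later dendrogram.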

hclust usually begins by clustering together the two most similar vectors, i.e., the data points with the smallest distance to one another, merging these two data points. This process continues until all data points have been clustered. The method chosen for clustering with hclust represents the method of agglomeration. For example, when method="average" is chosen for agglomeration, the similarity between two clusters is assessed based on the average of the data points in the clusters. Moreover, the two data points with the smallest distance are merged into a new data point by averaging the corresponding values after each iteration. This corresponds to the idea behind the amalgamation method in VNC (see Gries and Hilpert 2008, 2012). In general, all agglomeration methods available with hclust are available with vnc. Following Gries and Hilpert (2008, 2012), it is recommended to use averages when applying vnc, since (co-)occurrence frequencies in quantitative corpus linguistics are usually assessed by averaging frequencies over texts or time periods.
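As a small illustration of average agglomeration in plain hclust (not vnc), the sketch below clusters four hypothetical one-dimensional points. The data and object names are made up for this example.

```r
# Four hypothetical observations on a single dimension.
pts <- matrix(c(0, 1, 5, 9), ncol = 1,
              dimnames = list(c("a", "b", "c", "d"), NULL))
d  <- dist(pts)
hc <- hclust(d, method = "average")

# The first row of merge holds the two closest observations (a and b).
hc$merge[1, ]

# Heights: a-b merge at 1, c-d merge at 4, and the final merge at the
# mean of all pairwise distances between {a, b} and {c, d}.
hc$height
mean(c(5, 9, 4, 8))  # distances a-c, a-d, b-c, b-d
```

With method="average", the height of the last merge equals the mean of the pairwise distances between the members of the two clusters being joined.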

The vnc function takes the same input as hclust, but first applies the distvnc function to manipulate the distance matrix so that only temporally adjacent data points (i.e., texts) are allowed to cluster with one another. distvnc sets all distances between temporally non-adjacent data points to the maximum value of the distance matrix. As similarity is measured in terms of the minimum distance, it is highly unlikely that two data points with these maximized distances will be merged in the clustering process. This in turn makes it possible to use hclust, which is called inside vnc after the manipulation of the distance matrix, for clustering according to the ideas of VNC, instead of having to implement a separate clustering algorithm. Thus, vnc returns an object of class hclust. Moreover, vnc adjusts the permutations of the data points which arise during the merging process in order to guarantee the diachronic ordering of data points for plotting via plot. For more details, see also Schätzle and Booth (2019).
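The idea of maximizing distances between non-adjacent data points can be sketched in base R as shown below. This is not the package's distvnc code: the helper `mask_nonadjacent` is a hypothetical name introduced here, and it assumes that the observations in the distance object are already in temporal order.

```r
# Hypothetical helper (not part of DiaHClust): set every distance between
# observations that are not direct temporal neighbors to the overall maximum.
mask_nonadjacent <- function(d) {
  m <- as.matrix(d)
  nonadjacent <- abs(row(m) - col(m)) > 1  # more than one position apart
  m[nonadjacent] <- max(m)
  as.dist(m)
}

# Four hypothetical texts in temporal order; t1 and t3 are the most
# similar pair but are not temporal neighbors.
pts <- matrix(c(0, 5, 1, 9), ncol = 1,
              dimnames = list(c("t1", "t2", "t3", "t4"), NULL))
d <- dist(pts)

plain  <- hclust(d, method = "average")
masked <- hclust(mask_nonadjacent(d), method = "average")

plain$merge[1, ]   # first merge joins t1 and t3
masked$merge[1, ]  # first merge joins the neighbors t2 and t3
```

In this toy case the masking redirects the first merge from the non-adjacent pair to the closest pair of temporal neighbors; the sketch does not reproduce the reordering of the dendrogram that vnc performs.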

An object of class hclust, describing the tree produced by the clustering process. The object is a list containing the following elements (information mostly taken from the hclust help page):

`merge`	an n-1 by 2 matrix. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.
`height`	a set of n-1 real values (non-decreasing for ultrametric trees). The clustering height: that is, the value of the criterion associated with the clustering method for the particular agglomeration.
`order`	a vector giving the permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches. The permutations of the observations are adjusted with `vnc` in order to maintain the diachronic ordering.
`labels`	labels for each of the objects being clustered.
`call`	the call which produced the result.
`method`	the cluster method that has been used.
`dist.method`	the distance that has been used to create d (only returned if the distance object has a "method" attribute).
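The components listed above can be inspected like those of any hclust object. The sketch below uses plain hclust on hypothetical data, so the `order` component is the one computed by hclust, without the adjustment made by vnc.

```r
# Hypothetical data: four observations on a single dimension.
pts <- matrix(c(0, 1, 5, 9), ncol = 1,
              dimnames = list(c("a", "b", "c", "d"), NULL))
hc <- hclust(dist(pts), method = "average")

dim(hc$merge)    # n-1 rows and 2 columns
hc$height        # n-1 merge heights
hc$order         # permutation of the observations used for plotting
hc$labels        # labels taken from the distance object
hc$method        # agglomeration method
hc$dist.method   # present because dist() sets a "method" attribute

# Cutting the tree into two groups, e.g. as candidate stages.
groups <- cutree(hc, k = 2)
```

Since vnc returns an object of the same class, functions such as cutree and plot that accept hclust objects should work on its result as well.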

Christin Schätzle

R. Harald Baayen. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge University Press, Cambridge.

Stefan Th. Gries and Martin Hilpert. 2008. The identification of stages in diachronic data: variability-based neighbour clustering. Corpora, 3(1):59–81.

Stefan Th. Gries and Martin Hilpert. 2012. Variability-based neighbor clustering: A bottom-up approach to periodization in historical linguistics. In Terttu Nevalainen and Elizabeth Closs Traugott, editors, The Oxford Handbook of the History of English, pages 134–144. Oxford University Press, Oxford.

Christin Schätzle and Hannah Booth. 2019. DiaHClust: an iterative hierarchical clustering approach for identifying stages in language change. To appear.

hclust diahclust distvnc

library(DiaHClust)
data(icelandic)  #loads the data set; do not assign the result of data()
icelandic.cor=cor(icelandic[,-1])  #[,-1] because rows are labeled
icelandic.dist=dist(icelandic.cor)

icelandic.vnc=vnc(icelandic.dist, method="average")
plot(icelandic.vnc, hang=-1) #plotting the resulting dendrogram