linkage: Extended Agglomerative Hierarchical Clustering

View source: R/linkage.R

linkageR Documentation

Extended Agglomerative Hierarchical Clustering

Description

Agglomerative hierarchical clustering on a dataset of distances or similarities, returning a multifurcated dendrogram or multidendrogram. Descriptive measures to analyze the resulting dendrogram are additionally provided.

Usage

linkage(prox, type.prox = "distance", digits = NULL,
        method = "arithmetic", par.method = 0, weighted = FALSE,
        group = "variable")

descval(prox, type.prox = "distance", digits = NULL,
        method = "versatile", par.method = c(-1,0,+1), weighted = FALSE,
        group = "variable", measure = "cor")

descplot(prox, ..., type.prox = "distance", digits = NULL,
         method = "versatile", par.method = c(-1,0,+1), weighted = FALSE,
         group = "variable", measure = "cor", slope = 10)

Arguments

prox

A structure of class "dist" containing non-negative proximity data (distances or similarities). All the linkage methods are used with non-squared proximity data as input, except for method "centroid", which is meant to be used with squared Euclidean distances.

type.prox

A character string to indicate whether the proximity data represent "distance" (default) or "similarity" between objects. Methods "ward" and "centroid" cannot be used with similarity data as input, while the rest of the linkage methods can be used with both distances and similarities.

digits

An integer value specifying the precision, i.e. the number of significant decimal digits to be used for the comparisons between proximity data. This is an important parameter, since equal proximity data at a certain precision may become different by increasing its value. Thus, it may be responsible of the existence of tied proximity data. If the value of this parameter is negative or NULL (default), then the precision is automatically set to that of the input proximity value with the largest number of significant decimal digits.

method

A character string specifying the linkage method to be used. For linkage(), this should be one of: "single", "complete", "arithmetic", "geometric", "harmonic", "versatile", "ward", "centroid" or "flexible". Methods "versatile" and "flexible" are the only two methods that can be used in descval() and descplot(). See the Details section.

par.method

A real value, in the case of linkage(), or a vector of real values, in the case of descval() and descplot(), required as parameter for the methods "versatile" and "flexible". The range of possible values is [-Inf, +Inf] for "versatile", and [-1, +1] for "flexible". See the Details section.

weighted

A logical value to choose between the weighted and the unweighted (default) versions of some linkage methods. Weighted linkage gives merging branches in a dendrogram equal weight regardless of the number of objects carried on each branch. Such a procedure weights objects unequally, contrasting with unweighted linkage that gives equal weight to each object in the clusters. This parameter has no effect on the "single" and "complete" linkages.

group

A character string to choose a grouping criterion between the "variable"-group approach (default) that returns a multifurcated dendrogram (m-ary tree), and the "pair"-group approach that returns a bifurcated dendrogram (binary tree). See the Details section.

measure

A character string specifying the descriptive measure to be plotted. This should be one of: "cor", for cophenetic correlation coefficient; "sdr", for space distortion ratio; "ac", for agglomerative coefficient; "cc", for chaining coefficient; or "tb", for tree balance.

slope

A real value representing the slope of a sigmoid function to map the "versatile" linkage unbounded interval (-Inf, +Inf) onto the bounded interval (-1, +1). It can be used to improve the distribution of points along the x axis.

...

Graphical parameters (see par) may also be supplied and are passed to plot.default.

Details

Starting from a matrix of proximity data (distances or similarities), linkage() calculates its dendrogram with the most commonly used agglomerative hierarchical clustering methods, i.e. single linkage, complete linkage, arithmetic linkage (also known as average linkage) and Ward's method. Importantly, it contains a new parameterized method named versatile linkage (Fernandez and Gomez, 2020), which includes single linkage, complete linkage and average linkage as particular cases, and which naturally defines two new methods, geometric linkage and harmonic linkage.

The difference between the available hierarchical clustering methods rests in the way the proximity between two clusters is defined from the proximity between their constituent objects:

  • "single": the proximity between clusters equals the minimum distance or the maximum similarity between objects.

  • "complete": the proximity between clusters equals the maximum distance or the minimum similarity between objects.

  • "arithmetic": the proximity between clusters equals the arithmetic mean proximity between objects. Also known as average linkage, WPGMA (weighted version) or UPGMA (unweighted version).

  • "geometric": the proximity between clusters equals the geometric mean proximity between objects.

  • "harmonic": the proximity between clusters equals the harmonic mean proximity between objects.

  • "versatile": the proximity between clusters equals the generalized power mean proximity between objects. It depends on the value of par.method, with the following linkage methods as particular cases: "complete" (par.method=+Inf), "arithmetic" (par.method=+1), "geometric" (par.method=0), "harmonic" (par.method=-1) and "single" (par.method=-Inf).

  • "ward": the distance between clusters is a weighted squared Euclidean distance between the centroids of each cluster. This method is available only for distance data.

  • "centroid": the distance between clusters equals the square of the Euclidean distance between the centroids of each cluster. Also known as WPGMC (weighted version) or UPGMC (unweighted version). This method is available only for distance data, which are meant to be squared Euclidean distances. Note that both centroid versions, weighted and unweighted, may yield inversions that make dendrograms difficult to interpret.

  • "flexible": the proximity between clusters is a weighted sum of the proximity between clusters in the previous iteration. It depends on the value of par.method, in the range [-1, +1], and it is equivalent to "arithmetic" linkage when par.method=0.

With the argument group, users can choose between a variable-group approach (default) that returns a multifurcated dendrogram or multidendrogram, and a pair-group approach that returns a bifurcated dendrogram. Multidendrograms were introduced (Fernandez and Gomez, 2008) to solve the non-uniqueness problem that arises when two or more minimum proximity values between different clusters are equal during the agglomerative process. Multidendrograms group more than two clusters when tied proximity values occur, what produces a uniquely determined solution that does not depend on the order of the input data. When there are no tied proximity values, the variable-group approach gives the same result as the pair-group one.

descval() and descplot() can be used with methods "versatile" and "flexible" to analyze the variation of any descriptive measure as a function of the corresponding method parameter. Both functions return a vector with the numerical values of the descriptive measure evaluated at the points contained in the parameter par.method. Function descplot(), in addition, draws the corresponding plot.

Value

An object of class "linkage" that describes the multifurcated dendrogram obtained. The object is a list with the following components:

call

The call that produced the result.

digits

Number of significant decimal digits used as precision. It is given by the user or automatically set to that of the input proximity value with the largest number of significant decimal digits.

merger

A list of vectors of integer that describes the merging of clusters at each step of the clustering. If a number j in a vector is negative, then singleton cluster -j was merged at this stage. If j is positive, then the merge was with the cluster formed at stage j of the algorithm.

height

A vector with the proximity values between merging clusters (for the particular agglomeration) at the successive stages.

range

A vector with the range (the maximum minus the minimum) of proximity values between merging clusters. It is equal to 0 for binary clusters.

order

A vector giving a permutation of the original observations to allow for plotting, in the sense that the branches of a clustering tree will not cross.

coph

Object of class "dist" containing the cophenetic (or ultrametric) proximity data in the output dendrogram, sorted in the same order as the input proximity data in prox.

binary

A logical value indicating whether the output dendrogram is a binary tree or, on the contrary, it contains an agglomeration of more than two clusters due to the existence of tied proximity data. Its value is always TRUE when the "pair" grouping criterion is used.

cor

Cophenetic correlation coefficient (Sokal and Rohlf, 1962), defined as the Pearson correlation coefficient between the output cophenetic proximity data and the input proximity data. It is a measure of how faithfully the dendrogram preserves the pairwise proximity between objects.

sdr

Space distortion ratio (Fernandez and Gomez, 2020), calculated as the difference between the maximum and minimum cophenetic proximity data, divided by the difference between the maximum and minimum initial proximity data. Space dilation occurs when the space distortion ratio is greater than 1.

ac

Agglomerative coefficient (Rousseeuw, 1986), a number between 0 and 1 measuring the strength of the clustering structure obtained.

cc

Chaining coefficient (Williams et al., 1966), a number between 0 and 1 measuring the tendency for clusters to grow by the addition of clusters much smaller rather than by fusion with other clusters of comparable size.

tb

Tree balance (Fernandez and Gomez, 2020), a number between 0 and 1 measuring the equality in the number of leaves in the branches concerned at each fusion in the hierarchical tree.

Class "linkage" has methods for the following generic functions: print, summary, plot (see plot.linkage), as.dendrogram, as.hclust and cophenetic.

Note

Except for the cases containing tied proximity data, the following equivalences hold between function linkage() in package mdendro, function hclust() in package stats, and function agnes() in package cluster. When relevant, weighted (W) or unweighted (U) versions of the linkage methods and the value for par.method (\beta) are indicated:

linkage() hclust() agnes()
================== ============ ===================
"single" "single" "single"
"complete" "complete" "complete"
"arithmetic", U "average" "average"
"arithmetic", W "mcquitty" "weighted"
"ward" "ward.D2" "ward"
"centroid", U "centroid" --------
"centroid", W "median" --------
"flexible", U, \beta -------- "gaverage", \beta
"flexible", W, \beta -------- "flexible", (1-\beta)/2

Author(s)

Alberto Fernandez alberto.fernandez@urv.cat and Sergio Gomez sergio.gomez@urv.cat.

References

Fernandez, A.; Gomez, S. (2008). Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms. Journal of Classification, 25, 43–65.

Fernandez, A.; Gomez, S. (2020). Versatile linkage: a family of space-conserving strategies for agglomerative hierarchical clustering. Journal of Classification, 37, 584–597.

Rousseeuw, P.J. (1986). A visual display for hierarchical classification. In E. Diday et al. (eds.) Data Analysis and Informatics 4, pp. 743–748. Amsterdam: North-Holland.

Sokal, R.R.; Rohlf, F.J. (1962). The comparison of dendrograms by objective methods. Taxon, 11, 33–40.

Williams, W.T.; Lambert, J.M.; Lance, G.N. (1966). Multivariate methods in plant ecology: V. Similarity analyses and information-analysis. Journal of Ecology, 54, 427–445.

See Also

plot.linkage, dist, dendrogram, hclust, agnes.

Examples

## Plot and summary of unweighted arithmetic linkage (UPGMA) dendrogram
lnk1 <- linkage(UScitiesD)
plot(lnk1)
summary(lnk1)

## Linkage of similarity data (non-negative correlations)
sim <- as.dist(cor(EuStockMarkets))
lnk2 <- linkage(sim, type.prox = "similarity")
plot(lnk2)

## Use function as.dendrogram to plot with package dendextend
d <- dist(scale(mtcars))  # distances of standardized data
lnk <- linkage(d, digits = 1, method = "complete")
lnk.dend <- as.dendrogram(lnk)
plot(dendextend::set(lnk.dend, "branches_k_color", k = 4),
     nodePar = list(cex = 0.4, lab.cex = 0.5))

## Plot heatmap containing multidendrograms
heatmap(scale(mtcars), hclustfun = linkage)

## Plot of different versatile linkages as we increase the method parameter
d = as.dist(matrix(c( 0,  7, 16, 12,
                      7,  0,  9, 19,
                     16,  9,  0, 12,
                     12, 19, 12, 0), nrow = 4))
par(mfrow = c(2, 3))
vals <- c(-Inf, -1, 0, +1, +Inf)
names <- c("single", "harmonic", "geometric", "arithmetic", "complete")
for (i in 1:length(vals)) {
  lnk <- linkage(d, digits = 1, method = "versatile", par.method = vals[i])
  plot(lnk, main = paste0("versatile (", vals[i], ") = ", names[i]),
       ylim = c(0, 20), cex = 0.6)
}

## Analyze how descriptive measures depend on versatile linkage parameter
par(mfrow = c(2, 3))
measures <- c("cor", "sdr", "ac", "cc", "tb")
vals <- c(-Inf, (-20:+20), +Inf)
for (measure in measures) {
  descplot(UScitiesD, method = "versatile", par.method = vals,
           measure = measure,  main = measure, type = "o", col = "blue")
}

mdendro documentation built on Oct. 18, 2023, 5:08 p.m.