cl_consensus | R Documentation |

Compute the consensus clustering of an ensemble of partitions or hierarchies.

cl_consensus(x, method = NULL, weights = 1, control = list())

`x` |
an ensemble of partitions or hierarchies, or something
coercible to that (see |

`method` |
a character string specifying one of the built-in
methods for computing consensus clusterings, or a function to be
taken as a user-defined method, or |

`weights` |
a numeric vector with non-negative case weights.
Recycled to the number of elements in the ensemble given by |

`control` |
a list of control parameters. See |

Consensus clusterings “synthesize” the information in the elements of a cluster ensemble into a single clustering, often by minimizing a criterion function measuring how dissimilar consensus candidates are from the (elements of) the ensemble (the so-called “optimization approach” to consensus clustering).

The most popular criterion functions are of the form *L(x) = ∑
w_b d(x_b, x)^p*, where *d* is a suitable dissimilarity measure
(see `cl_dissimilarity`

), *w_b* is the case weight
given to element *x_b* of the ensemble, and *p ≥ 1*. If
*p = 1* and minimization is over all possible base clusterings, a
consensus solution is called a *median* of the ensemble; if
minimization is restricted to the elements of the ensemble, a
consensus solution is called a *medoid* (see
`cl_medoid`

). For *p = 2*, we obtain *least
squares* consensus partitions and hierarchies (generalized means).
See also Gordon (1999) for more information.

If all elements of the ensemble are partitions, the built-in consensus
methods compute consensus partitions by minimizing a criterion of the
form *L(x) = ∑ w_b d(x_b, x)^p* over all hard or soft
partitions *x* with a given (maximal) number *k* of classes.
Available built-in methods are as follows.

`"SE"`

a fixed-point algorithm for obtaining

*soft*least squares Euclidean consensus partitions (i.e., for minimizing*L*with Euclidean dissimilarity*d*and*p = 2*over all soft partitions with a given maximal number of classes).This iterates between individually matching all partitions to the current approximation to the consensus partition, and computing the next approximation as the membership matrix closest to a suitable weighted average of the memberships of all partitions after permuting their columns for the optimal matchings of class ids.

The following control parameters are available for this method.

`k`

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

`maxiter`

an integer giving the maximal number of iterations to be performed. Defaults to 100.

`nruns`

an integer giving the number of runs to be performed. Defaults to 1.

`reltol`

the relative convergence tolerance. Defaults to

`sqrt(.Machine$double.eps)`

.`start`

a matrix with number of rows equal to the number of objects of the cluster ensemble, and

*k*columns, to be used as a starting value, or a list of such matrices. By default, suitable random membership matrices are used.`verbose`

a logical indicating whether to provide some output on minimization progress. Defaults to

`getOption("verbose")`

.

In the case of multiple runs, the first optimum found is returned.

This method can also be referred to as

`"soft/euclidean"`

.`"GV1"`

the fixed-point algorithm for the “first model” in Gordon and Vichi (2001) for minimizing

*L*with*d*being GV1 dissimilarity and*p = 2*over all soft partitions with a given maximal number of classes.This is similar to

`"SE"`

, but uses GV1 rather than Euclidean dissimilarity.Available control parameters are the same as for

`"SE"`

.`"DWH"`

an extension of the greedy algorithm in Dimitriadou, Weingessel and Hornik (2002) for (approximately) obtaining soft least squares Euclidean consensus partitions. The reference provides some structure theory relating finding the consensus partition to an instance of the multiple assignment problem, which is known to be NP-hard, and suggests a simple heuristic based on successively matching an individual partition

*x_b*to the current approximation to the consensus partition, and compute the memberships of the next approximation as a weighted average of those of the current one and of*x_b*after permuting its columns for the optimal matching of class ids.The following control parameters are available for this method.

`k`

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

`order`

a permutation of the integers from 1 to the size of the ensemble, specifying the order in which the partitions in the ensemble should be aggregated. Defaults to using a random permutation (unlike the reference, which does not permute at all).

`"HE"`

a fixed-point algorithm for obtaining

*hard*least squares Euclidean consensus partitions (i.e., for minimizing*L*with Euclidean dissimilarity*d*and*p = 2*over all hard partitions with a given maximal number of classes.)Available control parameters are the same as for

`"SE"`

.This method can also be referred to as

`"hard/euclidean"`

.`"SM"`

a fixed-point algorithm for obtaining

*soft*median Manhattan consensus partitions (i.e., for minimizing*L*with Manhattan dissimilarity*d*and*p = 1*over all soft partitions with a given maximal number of classes).Available control parameters are the same as for

`"SE"`

.This method can also be referred to as

`"soft/manhattan"`

.`"SM"`

a fixed-point algorithm for obtaining

*hard*median Manhattan consensus partitions (i.e., for minimizing*L*with Manhattan dissimilarity*d*and*p = 1*over all hard partitions with a given maximal number of classes).Available control parameters are the same as for

`"SE"`

.This method can also be referred to as

`"hard/manhattan"`

.`"GV3"`

a SUMT algorithm for the “third model” in Gordon and Vichi (2001) for minimizing

*L*with*d*being co-membership dissimilarity and*p = 2*. (See`sumt`

for more information on the SUMT approach.) This optimization problem is equivalent to finding the membership matrix*m*for which the sum of the squared differences between*C(m) = m m'*and the weighted average co-membership matrix*∑_b w_b C(m_b)*of the partitions is minimal.Available control parameters are

`method`

,`control`

,`eps`

,`q`

, and`verbose`

, which have the same roles as for`sumt`

, and the following.`k`

an integer giving the number of classes to be used for the least squares consensus partition. By default, the maximal number of classes in the ensemble is used.

`nruns`

an integer giving the number of runs to be performed. Defaults to 1.

`start`

a matrix with number of rows equal to the size of the cluster ensemble, and

*k*columns, to be used as a starting value, or a list of such matrices. By default, a membership based on a rank*k*approximation to the weighted average co-membership matrix is used.

In the case of multiple runs, the first optimum found is returned.

`"soft/symdiff"`

a SUMT approach for minimizing

*L = ∑ w_b d(x_b, x)*over all soft partitions with a given maximal number of classes, where*d*is the Manhattan dissimilarity of the co-membership matrices (coinciding with symdiff partition dissimilarity in the case of hard partitions).Available control parameters are the same as for

`"GV3"`

.`"hard/symdiff"`

an exact solver for minimizing

*L = ∑ w_b d(x_b, x)*over all hard partitions (possibly with a given maximal number of classes as specified by the control parameter`k`

), where*d*is symdiff partition dissimilarity (so that soft partitions in the ensemble are replaced by their closest hard partitions), or equivalently, Rand distance or pair-bonds (Boorman-Arabie*D*) distance. The consensus solution is found via mixed linear or quadratic programming.

By default, method `"SE"`

is used for ensembles of partitions.

If all elements of the ensemble are hierarchies, the following built-in methods for computing consensus hierarchies are available.

`"euclidean"`

an algorithm for minimizing

*L(x) = ∑ w_b d(x_b, x) ^ 2*over all dendrograms, where*d*is Euclidean dissimilarity. This is equivalent to finding the best least squares ultrametric approximation of the weighted average*d = ∑ w_b u_b*of the ultrametrics*u_b*of the hierarchies*x_b*, which is attempted by calling`ls_fit_ultrametric`

on*d*with appropriate control parameters.This method can also be referred to as

`"cophenetic"`

.`"manhattan"`

a SUMT for minimizing

*L = ∑ w_b d(x_b, x)*over all dendrograms, where*d*is Manhattan dissimilarity.Available control parameters are the same as for

`"euclidean"`

.`"majority"`

a hierarchy obtained from an extension of the majority consensus tree of Margush and McMorris (1981), which minimizes

*L(x) = ∑ w_b d(x_b, x)*over all dendrograms, where*d*is the symmetric difference dissimilarity. The unweighted*p*-majority tree is the*n*-tree (hierarchy in the strict sense) consisting of all subsets of objects contained in more than*100 p*percent of the*n*-trees*T_b*induced by the dendrograms, where*1/2 ≤ p < 1*and*p = 1/2*(default) corresponds to the standard majority tree. In the weighted case, it consists of all subsets*A*for which*∑_{b: A \in T_b} w_b > W p*, where*W = ∑_b w_b*. We also allow for*p = 1*, which gives the*strict consensus tree*consisting of all subsets contained in each of the*n*-trees. The majority dendrogram returned is a representation of the majority tree where all splits have height one.The fraction

*p*can be specified via the control parameter`p`

.

By default, method `"euclidean"`

is used for ensembles of
hierarchies.

If a user-defined consensus method is to be employed, it must be a
function taking the cluster ensemble, the case weights, and a list of
control parameters as its arguments, with formals named `x`

,
`weights`

, and `control`

, respectively.

Most built-in methods use heuristics for solving hard optimization problems, and cannot be guaranteed to find a global minimum. Standard practice would recommend to use the best solution found in “sufficiently many” replications of the methods.

The consensus partition or hierarchy.

E. Dimitriadou, A. Weingessel and K. Hornik (2002).
A combination scheme for fuzzy clustering.
*International Journal of Pattern Recognition and Artificial
Intelligence*, **16**, 901–912.

doi: 10.1142/S0218001402002052.

A. D. Gordon and M. Vichi (2001).
Fuzzy partition models for fitting a set of partitions.
*Psychometrika*, **66**, 229–248.
doi: 10.1007/BF02294837.

A. D. Gordon (1999).
*Classification* (2nd edition).
Boca Raton, FL: Chapman & Hall/CRC.

T. Margush and F. R. McMorris (1981).
Consensus *n*-trees.
*Bulletin of Mathematical Biology*, **43**, 239–244.
doi: 10.1007/BF02459446.

`cl_medoid`

,
`consensus`

## Consensus partition for the Rosenberg-Kim kinship terms partition ## data based on co-membership dissimilarities. data("Kinship82") m1 <- cl_consensus(Kinship82, method = "GV3", control = list(k = 3, verbose = TRUE)) ## (Note that one should really use several replicates of this.) ## Value for criterion function to be minimized: sum(cl_dissimilarity(Kinship82, m1, "comem") ^ 2) ## Compare to the consensus solution given in Gordon & Vichi (2001). data("Kinship82_Consensus") m2 <- Kinship82_Consensus[["JMF"]] sum(cl_dissimilarity(Kinship82, m2, "comem") ^ 2) ## Seems we get a better solution ... ## How dissimilar are these solutions? cl_dissimilarity(m1, m2, "comem") ## How "fuzzy" are they? cl_fuzziness(cl_ensemble(m1, m2)) ## Do the "nearest" hard partitions fully agree? cl_dissimilarity(as.cl_hard_partition(m1), as.cl_hard_partition(m2)) ## Consensus partition for the Gordon and Vichi (2001) macroeconomic ## partition data based on Euclidean dissimilarities. data("GVME") set.seed(1) ## First, using k = 2 classes. m1 <- cl_consensus(GVME, method = "GV1", control = list(k = 2, verbose = TRUE)) ## (Note that one should really use several replicates of this.) ## Value of criterion function to be minimized: sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2) ## Compare to the consensus solution given in Gordon & Vichi (2001). data("GVME_Consensus") m2 <- GVME_Consensus[["MF1/2"]] sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2) ## Seems we get a slightly better solution ... ## But note that cl_dissimilarity(m1, m2, "GV1") ## and that the maximal deviation of the memberships is max(abs(cl_membership(m1) - cl_membership(m2))) ## so the differences seem to be due to rounding. ## Do the "nearest" hard partitions fully agree? table(cl_class_ids(m1), cl_class_ids(m2)) ## And now for k = 3 classes. m1 <- cl_consensus(GVME, method = "GV1", control = list(k = 3, verbose = TRUE)) sum(cl_dissimilarity(GVME, m1, "GV1") ^ 2) ## Compare to the consensus solution given in Gordon & Vichi (2001). m2 <- GVME_Consensus[["MF1/3"]] sum(cl_dissimilarity(GVME, m2, "GV1") ^ 2) ## This time we look much better ... ## How dissimilar are these solutions? cl_dissimilarity(m1, m2, "GV1") ## Do the "nearest" hard partitions fully agree? table(cl_class_ids(m1), cl_class_ids(m2))

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.