optimal_anticlustering: Optimal ("exact") algorithms for anticlustering

View source: R/optimal_anticlustering.R


Optimal ("exact") algorithms for anticlustering

Description

Wrapper function that gives access to all optimal algorithms for anticlustering that are available in anticlust.

Usage

optimal_anticlustering(x, K, objective, solver = NULL, time_limit = NULL)

Arguments

x

The data input. Can be one of two structures: (1) A feature matrix where rows correspond to elements and columns correspond to variables (a single numeric variable can be passed as a vector). (2) An N x N dissimilarity matrix; can be an object of class dist (e.g., returned by dist or as.dist), or a matrix where the entries of the upper and lower triangular parts represent the pairwise dissimilarities.

K

The number of anticlusters to be created, or alternatively a vector describing the size of each group (the latter currently only works for objective = "dispersion").

objective

The anticlustering objective, can be "diversity", "variance", "kplus" or "dispersion".

solver

Optional. The solver used to obtain the optimal solution. Currently supports "glpk", "symphony", "lpSolve" and "gurobi". See Details.

time_limit

Time limit in seconds, given to the solver. By default, there is no time limit.

Details

This is a wrapper for all optimal methods supported in anticlust (currently and in the future). Compared to anticlustering, it allows specifying the solver used to obtain an optimal solution as well as a time limit, and it can be used to obtain optimal solutions for all supported anticlustering objectives (variance, diversity, k-plus, dispersion). For the objectives "variance", "diversity" and "kplus", the optimal ILP method of Papenberg and Klau (2021) is used, which maximizes the sum of all pairwise intra-cluster distances (for a user-specified number of equal-sized clusters). For k-means anticlustering (i.e., objective = "variance"), the squared Euclidean distance is used. For k-plus anticlustering, the squared Euclidean distance based on the extended k-plus data matrix is used (see kplus_moment_variables). For the diversity (and the dispersion), the Euclidean distance is used by default, but any user-defined dissimilarity matrix is possible. The equivalence of the k-means/k-plus objectives and the diversity, which makes it possible to use the ILP approach by Papenberg and Klau for the k-means/k-plus objectives, was discussed in Papenberg, Breuer, et al. (2025). Also see the examples.

The dispersion is solved optimally using the algorithm OptDispF presented in Papenberg, Breuer, et al. (2025). Also see optimal_dispersion.
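
For illustration, a minimal sketch of an optimal dispersion call with unequal group sizes (the only objective for which a vector K is currently supported); the data here are made up for the example:

x <- matrix(rnorm(24), ncol = 2)
# Unequal group sizes (4 + 8 elements) only work for the dispersion objective
optimal_anticlustering(x, K = c(4, 8), objective = "dispersion")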

The optimal methods make use of "solvers" that actually implement the algorithm for finding optimal solutions. The package anticlust supports four solvers:

  • The default solver lpSolve (<https://sourceforge.net/projects/lpsolve/>).

  • GNU linear programming kit (<http://www.gnu.org/software/glpk/>), available from the package Rglpk and requested using solver = "glpk". The R package Rglpk has to be installed manually if this solver is to be used.

  • The Symphony solver (<https://github.com/coin-or/SYMPHONY>), available from the package Rsymphony and requested using solver = "symphony". The R package Rsymphony has to be installed manually if this solver is to be used.

  • The commercial gurobi solver (<https://www.gurobi.com/downloads/>), requested using solver = "gurobi".

In general, the commercial gurobi solver is best at solving exact anticlustering problems. If you have a license to use it, use it! Among the open source alternatives, the Symphony solver seems to be fastest for the maximum dispersion problem, while the lpSolve solver seems to work well for the maximum diversity. However, note that in general the dispersion can be solved optimally for much larger data sets than the diversity.
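
For example, a solver can be requested explicitly as sketched below (the Rglpk and Rsymphony packages have to be installed manually for the "glpk" and "symphony" calls; the data are made up for illustration):

x <- matrix(rnorm(24), ncol = 2)
# Default open source solver
optimal_anticlustering(x, K = 2, objective = "diversity", solver = "lpSolve")
# Requires the Rglpk package
optimal_anticlustering(x, K = 2, objective = "diversity", solver = "glpk")
# Requires the Rsymphony package; Symphony tends to do well for the dispersion
optimal_anticlustering(x, K = 2, objective = "dispersion", solver = "symphony")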

If a time_limit is set and the function cannot find the optimal objective value within the given time, it throws an error.
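
A sketch of handling this error for a given time limit (whether the limit is actually hit depends on the data and the solver):

x <- matrix(rnorm(24), ncol = 2)
groups <- tryCatch(
  optimal_anticlustering(x, K = 2, objective = "diversity", time_limit = 1),
  error = function(e) NULL  # NULL if no optimal solution was found within 1 second
)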

Value

A vector of length N that assigns a group (i.e., a number between 1 and K) to each input element.

Author(s)

Martin Papenberg martin.papenberg@hhu.de

References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2), 161–174. https://doi.org/10.1037/met0000301.

Papenberg, M., Breuer, M., Diekhoff, M., Tran, N. K., & Klau, G. W. (2025). Extending the Bicriterion Approach for Anticlustering: Exact and Hybrid Approaches. Psychometrika. Advance online publication. https://doi.org/10.1017/psy.2025.10052

Examples


data <- matrix(rnorm(24), ncol = 2)

# These calls are equivalent for k-means anticlustering:
optimal_anticlustering(data, K = 2, objective = "variance")
optimal_anticlustering(dist(data)^2, K = 2, objective = "diversity")

# These calls are equivalent for k-plus anticlustering:
optimal_anticlustering(data, K = 2, objective = "kplus")
optimal_anticlustering(dist(kplus_moment_variables(data, 2))^2, K = 2, objective = "diversity")
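
# A sketch for the dispersion objective (Euclidean distances are used by default);
# a dissimilarity matrix can also be passed directly:
optimal_anticlustering(data, K = 2, objective = "dispersion")
optimal_anticlustering(dist(data), K = 2, objective = "dispersion")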

