View source: R/optimal_anticlustering.R
optimal_anticlustering | R Documentation |
Wrapper function that gives access to all optimal algorithms for anticlustering that are available in anticlust.
optimal_anticlustering(x, K, objective, solver = NULL, time_limit = NULL)
x |
The data input. Can be one of two structures: (1) A feature
matrix where rows correspond to elements and columns correspond
to variables (a single numeric variable can be passed as a
vector). (2) An N x N matrix dissimilarity matrix; can be an
object of class |
K |
How many anticlusters should be created or alternatively:
(a) A vector describing the size of each group (the latter
currently only works for |
objective |
The anticlustering objective, can be "diversity", "variance", "kplus" or "dispersion". |
solver |
Optional. The solver used to obtain the optimal method. Currently supports "glpk", "symphony", "lpSolve" and "gurobi". See details. |
time_limit |
Time limit in seconds, given to the solver. Default is there is no time limit. |
This is a wrapper for all optimal methods supported in anticlust
(currently and in the future). As compared to
anticlustering
, it allows to specify the solver to
obtain an optimal solution and it can be used to obtain optimal
solutions for all supported anticlustering objectives (variance,
diversity, k-plus, dispersion). For the objectives "variance",
"diversity" and "kplus", the optimal ILP method in Papenberg and
Klau (2021) is used, which maximizes the sum of all pairwise
intra-cluster distances (given user specified number of clusters,
for equal-sized clusters). To employ k-means anticlustering
(i.e. set objective = "variance"
), the squared Euclidean
distance is used. For k-plus anticlustering, the squared Euclidean
distance based on the extended k-plus data matrix is used (see
kplus_moment_variables
). For the diversity (and the
dispersion), the Euclidean distance is used by default, but any
user-defined dissimilarity matrix is possible.
The dispersion is solved optimal using the approach described in
optimal_dispersion
.
The optimal methods make use of "solvers" that actually implement the algorithm for finding optimal solutions. The package anticlust supports three solvers:
The default solver lpSolve (<https://sourceforge.net/projects/lpsolve/>).
GNU linear programming kit (<http://www.gnu.org/software/glpk/>),
available from the package Rglpk and requested using solver = "glpk"
.
The R package Rglpk has to be installed manually if this solver should be used.
The Symphony solver (<https://github.com/coin-or/SYMPHONY>),
available from the package Rsymphony and requested using solver = "symphony"
.
(The package Rsymphony has to be installed manually if this solver should be used).
The commercial gurobi solver, see https://www.gurobi.com/downloads/.
For the maximum dispersion problem, it seems that the Symphony solver is fastest, while the lpSolve solver seems to be good for maximum diversity. However, note that in general the dispersion can be solved optimally for much larger data sets than the diversity.
If a time_limit
is set and the function cannot find in the optimal
objective in the given time, it will throw an error.
A vector of length N that assigns a group (i.e, a number
between 1 and K
) to each input element.
Martin Papenberg martin.papenberg@hhu.de
data <- matrix(rnorm(24), ncol = 2)
# These calls are equivalent for k-means anticlustering:
optimal_anticlustering(data, K = 2, objective = "variance")
optimal_anticlustering(dist(data)^2, K = 2, objective = "diversity")
# These calls are equivalent for k-plus anticlustering:
optimal_anticlustering(data, K = 2, objective = "kplus")
optimal_anticlustering(dist(kplus_moment_variables(data, 2))^2, K = 2, objective = "diversity")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.