#' Create balanced clusters of equal size
#'
#' @param x The data input. Can be one of two structures: (1) A feature
#' matrix where rows correspond to elements and columns correspond
#' to variables (a single numeric variable can be passed as a
#' vector). (2) An N x N matrix dissimilarity matrix; can be an
#' object of class \code{dist} (e.g., returned by
#' \code{\link{dist}} or \code{\link{as.dist}}) or a \code{matrix}
#' where the entries of the upper and lower triangular matrix
#' represent pairwise dissimilarities.
#' @param K How many clusters should be created.
#' @param method One of "centroid" or "ilp". See Details.
#' @param solver Optional. The solver used to obtain the optimal method
#' if \code{method = "ilp"}. Currently supports "glpk" and "symphony".
#' Is ignored for \code{method = "centroid"}.
#'
#' @return An integer vector representing the cluster affiliation of
#' each data point
#'
#' @details
#'
#' This function partitions a set of elements into \code{K}
#' equal-sized clusters. The function offers two methods: a heuristic
#' and an exact method. The heuristic (\code{method = "centroid"})
#' first computes the centroid of all data points. If the input is a
#' feature matrix, the centroid is defined as the mean vector of all
#' columns. If the input is a dissimilarity matrix, the most central
#' element acts as the centroid; the most central element is defined
#' as the element having the minimum maximal distance to all other
#' elements. After identifying the centroid, the algorithm proceeds as
#' follows: The element having the highest distance from the centroid
#' is clustered with its \code{(N/K) - 1} nearest neighbours
#' (neighbourhood is defined according to the Euclidean distance if
#' the data input is a feature matrix). From the remaining elements,
#' again the element farthest to the centroid is selected and
#' clustered with its \code{(N/K) - 1} neighbours; the procedure is
#' repeated until all elements are part of a cluster.
#'
#' An exact method (\code{method = "ilp"}) can be used to solve
#' equal-sized weighted cluster editing optimally (implements the
#' integer linear program described in Papenberg and Klau, 2020;
#' (8) - (10), (12) - (13)). The cluster editing objective is the
#' sum of pairwise distances
#' within clusters; clustering is accomplished by minimizing this
#' objective. If the argument \code{x} is a features matrix, the
#' Euclidean distance is computed as the basic unit of the cluster
#' editing objective. If another distance measure is preferred, users
#' may pass a self-computed dissimiliarity matrix via the argument
#' \code{x}.
#'
#' The optimal \code{method = "ilp"} uses a "solver" to optimize
#' the clustering objective. See \code{\link{optimal_anticlustering}}
#' for an overview of the solvers that are available.
#'
#' @source
#'
#' The centroid method was originally developed and contributed by
#' Meik Michalke. It was later rewritten by Martin Papenberg, who
#' also implemented the integer linear programming method.
#'
#' @export
#'
#' @author
#' Martin Papenberg \email{martin.papenberg@@hhu.de}
#'
#' Meik Michalke \email{meik.michalke@@hhu.de}
#'
#' @examples
#'
#' # Cluster a data set and visualize results
#' N <- 1000
#' lds <- data.frame(f1 = rnorm(N), f2 = rnorm(N))
#' cl <- balanced_clustering(lds, K = 10)
#' plot_clusters(lds, clusters = cl)
#'
#' # Repeat using a distance matrix as input
#' cl2 <- balanced_clustering(dist(lds), K = 10)
#' plot_clusters(lds, clusters = cl2)
#'
#' @references
#'
#' Grötschel, M., & Wakabayashi, Y. (1989). A cutting plane algorithm
#' for a clustering problem. Mathematical Programming, 45, 59–96.
#'
#' Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition
#' data sets into equivalent parts. Psychological Methods, 26(2),
#' 161–174. https://doi.org/10.1037/met0000301.
#'
balanced_clustering <- function(x, K, method = "centroid", solver = NULL) {
input_validation_anticlustering(x, K, "distance", method, TRUE, NULL, NULL)
validate_input_optimal_anticlustering(x, K, "diversity", solver, NULL)
data <- to_matrix(x)
if (method == "ilp") {
solver <- ifelse(argument_exists(solver), solver, find_ilp_solver())
return(balanced_cluster_editing(data, K, solver))
}
nn_centroid_clustering(data, NROW(data) / K)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.