hierarchical_clustering: Hierarchical size-constrained clustering In scclust: Size-Constrained Clustering

Description

`hierarchical_clustering` serves two purposes. Its primary use is to refine existing clusters subject to clustering constraints. That is, given a (non-optimal) clustering satisfying some constraints, the function splits clusters so to decrease within-cluster distances without violating the constraints. The function can also be used to derive size-constrained clusterings from scratch. In both cases, it uses a hierarchical clustering algorithm.

Usage

 ```1 2``` ```hierarchical_clustering(distances, size_constraint, batch_assign = TRUE, existing_clustering = NULL) ```

Arguments

 `distances` a `distances` object with distances between the data points. `size_constraint` an integer with the required minimum cluster size. `batch_assign` a logical indicating whether data points should be assigned in batches when spliting clusters (see below for details). `existing_clustering` a `scclust` object containing a non-empty clustering to refine. If `NULL`, the function derives a clustering from scratch.

Details

While `hierarchical_clustering` can be used to derive size-constrained clusters from scratch, its main purpose is to be used together with `sc_clustering`. The clusters produced by the main clustering function are guaranteed to be close to optimal (in particular, within a constant factor of the optimal solution). However, it is occasionally possible to refine the clustering. In particular, `sc_clustering` tends produce large clusters in regions of the metric space with many data points. In some cases, it is beneficial to divide these clusters into smaller groups. `hierarchical_clustering` splits these larger clusters so that all within-cluster distances weakly decrease while respecting the overall the size constraint.

`hierarchical_clustering` implements a divisive hierarchical clustering algorithm that respect size constraints. Starting from any clustering satisfying the size constraints (which may be a single cluster containing all data points), the function searches for clusters that can be broken into two or more new clusters without violating the constraints. When such a cluster is found, it breaks the cluster into two new clusters. It continues in this fashion until all remaining clusters are unbreakable.

Breakable clusters are broken in three stages. First, it tries to find two data points as far as possible from each other. The two points are called centers, and they are the starting points for the new clusters. The remaining data points in the old cluster will be assigned to one of the centers. In the second stage, each center picks the closest data points so that the new two clusters exactly satisfy the size constraint. In the last stage, data points that still are in the old cluster are assigned to the cluster containing the closest center. The final stage is done either for each point one-by-one or in batches (see below). When all data points are assigned to a cluster, the old cluster is removed and the two new are added to the clustering. If one or both of the new clusters are breakable, they will go through the same procedure again.

In some applications, it is desirable to avoid clusters that contain a number of data points that are not multiples of `size_constraint`. After the second stage described in the previous paragraph, both partial clusters are exact multiples of the size constraint. By assigning remaining data points in the third stage in batches of `size_constraint`, the function ensures, to the greatest extent possible, that the size of the final clusters are multiples of the size constraint.

Value

Returns a `scclust` object with the derived clustering.

`sc_clustering` is the main clustering function in the package.
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18``` ```# Make example data my_data <- data.frame(x1 = rnorm(10000), x2 = rnorm(10000), x3 = rnorm(10000)) # Construct distance metric my_dist <- distances(my_data) # Make clustering with `sc_clustering` my_clustering <- sc_clustering(my_dist, 3) # Refine clustering with `hierarchical_clustering` my_refined_clustering <- hierarchical_clustering(my_dist, size_constraint = 3, existing_clustering = my_clustering) # Make clustering from scratch with `hierarchical_clustering` my_other_clustering <- hierarchical_clustering(my_dist, 3) ```