hierarchical_clustering serves two purposes. Its primary use is to
refine existing clusters subject to clustering constraints. That is, given
a (non-optimal) clustering satisfying some constraints, the function splits
clusters so to decrease within-cluster distances without violating the
constraints. The function can also be used to derive size-constrained
clusterings from scratch. In both cases, it uses a hierarchical clustering
an integer with the required minimum cluster size.
a logical indicating whether data points should be assigned in batches when spliting clusters (see below for details).
hierarchical_clustering can be used to derive size-constrained
clusters from scratch, its main purpose is to be used together with
sc_clustering. The clusters produced by the main clustering
function are guaranteed to be close to optimal (in particular, within a
constant factor of the optimal solution). However, it is occasionally
possible to refine the clustering. In particular,
sc_clustering tends produce large clusters in regions of the
metric space with many data points. In some cases, it is beneficial to divide
these clusters into smaller groups.
these larger clusters so that all within-cluster distances weakly decrease
while respecting the overall the size constraint.
hierarchical_clustering implements a divisive hierarchical clustering
algorithm that respect size constraints. Starting from any clustering
satisfying the size constraints (which may be a single cluster containing
all data points), the function searches for clusters that can be broken into
two or more new clusters without violating the constraints. When such a
cluster is found, it breaks the cluster into two new clusters. It continues
in this fashion until all remaining clusters are unbreakable.
Breakable clusters are broken in three stages. First, it tries to find two data points as far as possible from each other. The two points are called centers, and they are the starting points for the new clusters. The remaining data points in the old cluster will be assigned to one of the centers. In the second stage, each center picks the closest data points so that the new two clusters exactly satisfy the size constraint. In the last stage, data points that still are in the old cluster are assigned to the cluster containing the closest center. The final stage is done either for each point one-by-one or in batches (see below). When all data points are assigned to a cluster, the old cluster is removed and the two new are added to the clustering. If one or both of the new clusters are breakable, they will go through the same procedure again.
In some applications, it is desirable to avoid clusters that contain a number
of data points that are not multiples of
size_constraint. After the
second stage described in the previous paragraph, both partial clusters are
exact multiples of the size constraint. By assigning remaining data points
in the third stage in batches of
size_constraint, the function ensures,
to the greatest extent possible, that the size of the final clusters are
multiples of the size constraint.
scclust object with the derived clustering.
sc_clustering is the main clustering function in the
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# Make example data my_data <- data.frame(x1 = rnorm(10000), x2 = rnorm(10000), x3 = rnorm(10000)) # Construct distance metric my_dist <- distances(my_data) # Make clustering with `sc_clustering` my_clustering <- sc_clustering(my_dist, 3) # Refine clustering with `hierarchical_clustering` my_refined_clustering <- hierarchical_clustering(my_dist, size_constraint = 3, existing_clustering = my_clustering) # Make clustering from scratch with `hierarchical_clustering` my_other_clustering <- hierarchical_clustering(my_dist, 3)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.