knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(workflows)
library(parsnip)
Load libraries:
library(tidyclust)
library(tidyverse)
library(tidymodels)
set.seed(838383)
Load and clean a dataset:
data("penguins", package = "modeldata") penguins <- penguins %>% select(bill_length_mm, bill_depth_mm) %>% drop_na() # shuffle rows penguins <- penguins %>% sample_n(nrow(penguins))
If you have not yet read the `k_means` vignette, we recommend reading that first; functions that are used in this vignette are explained in more detail there.
Hierarchical Clustering, sometimes called Agglomerative Clustering, is a method of unsupervised learning that produces a dendrogram, which can be used to partition observations into clusters.
The hierarchical clustering process begins with each observation in its own cluster; i.e., $n$ clusters for $n$ observations.
#| fig-alt: "scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e." fake_dat <- tibble( x = sort(runif(5)), y = runif(5), lab = letters[1:5] ) fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point(shape = 1, size = 7, stroke = 1, color = "dark grey") + theme_minimal() + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1))
The closest two observations are then joined together into a single cluster.
#| fig-alt: "scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e. One of the circles are replacing 2 of the previous circles." fake_dat_2 <- bind_rows( fake_dat[-c(1:2), -3], summarize_all(fake_dat[1:2, -3], mean) ) %>% mutate( size = c(rep(1, 3), suppressWarnings(dist(fake_dat)[1])) ) fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point( data = fake_dat_2, aes(x = x, y = y), shape = 1, size = 7 / fake_dat_2$size, stroke = 1, color = "dark grey" ) + theme_minimal() + theme(legend.position = "none") + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1))
This process continues, with the closest two clusters being joined (or "agglomerated") at each step.
#| fig-alt: "scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e. One of the circles are replacing 2 of the previous circles." fake_dat_3 <- bind_rows( fake_dat[-c(1:3), -3], summarize_all(fake_dat[1:3, -3], mean) ) %>% mutate( size = c(rep(1, 2), 0.09) ) p1 <- fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point( data = fake_dat_3, aes(x = x, y = y), shape = 1, size = 7 / fake_dat_3$size, stroke = 1, color = "dark grey" ) + theme_minimal() + theme(legend.position = "none") + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1)) fake_dat_4 <- bind_rows( summarize_all(fake_dat[1:3, -3], mean), summarize_all(fake_dat[4:5, -3], mean), ) %>% mutate( size = c(0.09, 0.09) ) p2 <- fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point( data = fake_dat_4, aes(x = x, y = y), shape = 1, size = 7 / fake_dat_4$size, stroke = 1, color = "dark grey" ) + theme_minimal() + theme(legend.position = "none") + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1)) library(patchwork) p1 + p2
The result of the process is a dendrogram, which shows the joining of clusters in tree form:
#| fig-alt: "Dendrogram chart. With 5 observations." hc <- hclust(dist(fake_dat)) plot(hc, labels = fake_dat$lab)
To produce a partition-style cluster assignment from the dendrogram, one must "cut" the tree at a chosen height:
#| fig-alt: "Dendrogram chart. With 5 observations. A horizontal like at 0.6 cuts the dendrogram into 3 clusters." plot(hc, labels = fake_dat$lab) abline(h = 0.6, lty = 2, col = "dark grey")
The observations that remain joined in the dendrogram below the cut height are considered to be in a cluster together:
tibble(
  observation = fake_dat$lab,
  cluster_assignment = cutree(hc, h = 0.6)
)
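As an alternative (a small sketch using base R's `cutree()` directly), the tree can also be cut by requesting a fixed number of clusters rather than a height:

# cut the dendrogram into exactly 3 clusters instead of cutting at a height
tibble(
  observation = fake_dat$lab,
  cluster_assignment = cutree(hc, k = 3)
)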
At every step of the agglomeration, we measure distances between current clusters. With each cluster containing (possibly) multiple points, what does it mean to measure distance?
There are five common approaches to cluster-to-cluster distance, a.k.a. "linkage":
- single linkage: The distance between two clusters is the distance between the two closest observations.
- average linkage: The distance between two clusters is the average of all distances between observations in one cluster and observations in the other.
- complete linkage: The distance between two clusters is the distance between the two furthest observations.
- centroid method: The distance between two clusters is the distance between their centroids (geometric mean or median).
- Ward's method: The distance between two clusters is proportional to the increase in the error sum of squares (ESS) that would result from joining them. The ESS is computed as the sum of squared distances between the observations in a cluster and the centroid of the cluster.
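To make the ESS in Ward's method concrete: for a cluster $C$ with centroid $\bar{x}_C$, it is

$$\mathrm{ESS}(C) = \sum_{i \in C} \lVert x_i - \bar{x}_C \rVert^2,$$

and Ward's method joins the pair of clusters whose merge increases the total ESS the least.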
It is also worth mentioning the McQuitty method, which retains information about previously joined clusters to measure future linkage distance. This method is currently supported for model fitting, but not for prediction, in tidyclust.
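To build some intuition for how the linkage choice changes the result, here is a small sketch (not part of the original vignette) that fits the toy data with three different linkage methods using `stats::hclust()` directly; the object name `d` and the side-by-side layout are just for illustration:

# compare dendrograms under different linkage methods on the toy data;
# drop the label column before computing distances
d <- dist(fake_dat[, c("x", "y")])

par(mfrow = c(1, 3))
plot(hclust(d, method = "single"), labels = fake_dat$lab, main = "single")
plot(hclust(d, method = "average"), labels = fake_dat$lab, main = "average")
plot(hclust(d, method = "complete"), labels = fake_dat$lab, main = "complete")
par(mfrow = c(1, 1))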
## `hier_clust` specification in {tidyclust}

To specify a hierarchical clustering model in tidyclust, simply choose a value of `num_clusters` and (optionally) a linkage method:
hc_spec <- hier_clust(
  num_clusters = 3,
  linkage_method = "average"
)

hc_spec
Currently, the only supported engine is `stats::hclust()`. The default linkage method is complete linkage.
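Other linkage methods are requested the same way. For example, here is a sketch of a specification using Ward's method, assuming the engine accepts the `stats::hclust()` method name `"ward.D2"` (the object name `hc_spec_ward` is just for illustration):

# a hypothetical alternative specification using Ward's linkage
hc_spec_ward <- hier_clust(
  num_clusters = 3,
  linkage_method = "ward.D2"
)

hc_spec_ward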
## Fitting `hier_clust` models

We fit the model to data in the usual way:
hc_fit <- hc_spec %>%
  fit(
    ~ bill_length_mm + bill_depth_mm,
    data = penguins
  )

hc_fit %>% summary()
To produce a dendrogram plot, access the engine fit. (Although, as we see below, dendrograms are often not very informative for moderate to large datasets.)
#| fig-alt: "Dendrogram chart. With too many observations to be able to clearly see anything" hc_fit$fit %>% plot()
We can also extract the standard tidyclust summary list:
hc_summary <- hc_fit %>% extract_fit_summary()

hc_summary %>% str()
Note that, although the hierarchical clustering algorithm is not focused on cluster centroids in the same way $k$-means is, we are still able to compute the geometric mean over the predictors for each cluster:
hc_fit %>% extract_centroids()
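As a visual sketch (not in the original vignette), we can overlay these centroids on the training data, colored by the cluster assignments extracted from the fit; the object name `centroids` is just for illustration:

# plot the training observations colored by cluster, with centroids overlaid
centroids <- hc_fit %>% extract_centroids()

penguins %>%
  bind_cols(extract_cluster_assignment(hc_fit)) %>%
  ggplot(aes(bill_length_mm, bill_depth_mm, color = .cluster)) +
  geom_point(alpha = 0.6) +
  geom_point(data = centroids, size = 5, shape = 8) +
  theme_minimal()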
To predict the cluster assignment for a new observation, we find the closest cluster. How we measure "closeness" depends on the specified type of linkage in the model:
- single linkage: The new observation is assigned to the same cluster as its nearest observation from the training data.
- complete linkage: The new observation is assigned to the cluster with the smallest maximum distance between training observations and the new observation.
- average linkage: The new observation is assigned to the cluster with the smallest average distance between training observations and the new observation.
- centroid method: The new observation is assigned to the cluster with the closest centroid, as in prediction for `k_means`.
- Ward's method: The new observation is assigned to the cluster with the smallest increase in the error sum of squares (ESS) due to the new addition. The ESS is computed as the sum of squared distances between the observations in a cluster and the centroid of the cluster.
hc_preds <- hc_fit %>% predict(penguins)

hc_preds
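As a quick check (a small aside), we can count how many observations landed in each predicted cluster:

# number of observations predicted into each cluster
hc_preds %>% count(.pred_cluster)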
It's important to note that there is no guarantee that `predict()` on the training data will produce the same results as `extract_cluster_assignment()`. The process by which clusters are created during the agglomerations results in a particular partition; but if a training observation is treated as new data, it is predicted in the same manner as truly new information.
bind_cols(
  hc_preds,
  extract_cluster_assignment(hc_fit)
)
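To quantify the (dis)agreement, here is a small sketch that cross-tabulates the predicted labels against the labels assigned during fitting:

# cross-tabulate predicted clusters against fitted cluster assignments
bind_cols(
  hc_preds,
  extract_cluster_assignment(hc_fit)
) %>%
  count(.pred_cluster, .cluster)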
Suppose we have produced cluster assignments from two models: a hierarchical clustering model with three clusters (as above) and a $k$-means clustering model with five clusters (below). How can we combine these assignments?
km_spec <- k_means(num_clusters = 5)

km_fit <- km_spec %>% fit(~., data = penguins)

km_preds <- predict(km_fit, penguins, prefix = "KM_")
hc_preds <- predict(hc_fit, penguins, prefix = "HC_")
We notice that the three-cluster assignments from `hier_clust` do not line up perfectly with the five-cluster assignments from `k_means`.
tibble(
  hc = hc_preds$.pred_cluster,
  km = km_preds$.pred_cluster
) %>%
  count(hc, km)
However, they are not fully unrelated assignments. For example, all of `KM_2` in the $k$-means assignment fell inside `HC_1` for the hierarchical assignments.
Our goal is to relabel the five $k$-means clusters to match the three cluster names in the hierarchical output. This can be accomplished with `reconcile_clusterings_mapping()`.

This function expects two vectors of cluster labels as input. The first is the label that will be matched, and the second is the label that will be recoded to match the first. If we are not simply matching names across two same-size clusterings, the option `one_to_one` must be set to `FALSE`.
reconcile_clusterings_mapping(
  primary = hc_preds$.pred_cluster,
  alternative = km_preds$.pred_cluster,
  one_to_one = FALSE
)
In this example, we can see that `KM_1`, `KM_2`, and `KM_5` have been matched to `HC_1`; and `KM_3` and `KM_4` have been matched to `HC_2`. Notice that no clusters from the `KM` set were matched to `HC_3`; evidently, this is a small cluster that did not manifest clearly in the $k$-means clustering.