knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(workflows)
library(parsnip)
Load libraries:
library(tidyclust)
library(tidyverse)
library(tidymodels)
set.seed(838383)
Load and clean a dataset:
data("penguins", package = "modeldata") penguins <- penguins %>% select(bill_length_mm, bill_depth_mm) %>% drop_na() # shuffle rows penguins <- penguins %>% sample_n(nrow(penguins))
If you have not yet read the `k_means` vignette, we recommend reading that first; functions that are used in this vignette are explained in more detail there.
Hierarchical Clustering, sometimes called Agglomerative Clustering, is a method of unsupervised learning that produces a dendrogram, which can be used to partition observations into clusters.
The hierarchical clustering process begins with each observation in its own cluster; i.e., $n$ clusters for $n$ observations.
#| fig-alt: "scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e." fake_dat <- tibble( x = sort(runif(5)), y = runif(5), lab = letters[1:5] ) fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point(shape = 1, size = 7, stroke = 1, color = "dark grey") + theme_minimal() + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1))
The closest two observations are then joined together into a single cluster.
#| fig-alt: "scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e. One of the circles are replacing 2 of the previous circles." fake_dat_2 <- bind_rows( fake_dat[-c(1:2), -3], summarize_all(fake_dat[1:2, -3], mean) ) %>% mutate( size = c(rep(1, 3), suppressWarnings(dist(fake_dat)[1])) ) fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point( data = fake_dat_2, aes(x = x, y = y), shape = 1, size = 7 / fake_dat_2$size, stroke = 1, color = "dark grey" ) + theme_minimal() + theme(legend.position = "none") + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1))
This process continues, with the closest two clusters being joined (or "agglomerated") at each step.
#| fig-alt: "scatter chart. 5 circles are randomly located, and labeled a, b, c, d, and e. One of the circles are replacing 2 of the previous circles." fake_dat_3 <- bind_rows( fake_dat[-c(1:3), -3], summarize_all(fake_dat[1:3, -3], mean) ) %>% mutate( size = c(rep(1, 2), 0.09) ) p1 <- fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point( data = fake_dat_3, aes(x = x, y = y), shape = 1, size = 7 / fake_dat_3$size, stroke = 1, color = "dark grey" ) + theme_minimal() + theme(legend.position = "none") + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1)) fake_dat_4 <- bind_rows( summarize_all(fake_dat[1:3, -3], mean), summarize_all(fake_dat[4:5, -3], mean), ) %>% mutate( size = c(0.09, 0.09) ) p2 <- fake_dat %>% ggplot(aes(x, y)) + geom_point(shape = fake_dat$lab, size = 4) + geom_point( data = fake_dat_4, aes(x = x, y = y), shape = 1, size = 7 / fake_dat_4$size, stroke = 1, color = "dark grey" ) + theme_minimal() + theme(legend.position = "none") + ylim(c(-0.1, 1.1)) + xlim(c(-0.1, 1.1)) library(patchwork) p1 + p2
The result of the process is a dendrogram, which shows the joining of clusters in tree form:
#| fig-alt: "Dendrogram chart. With 5 observations." hc <- hclust(dist(fake_dat)) plot(hc, labels = fake_dat$lab)
To produce a partition-style cluster assignment from the dendrogram, one must "cut" the tree at a chosen height:
#| fig-alt: "Dendrogram chart. With 5 observations. A horizontal like at 0.6 cuts the dendrogram into 3 clusters." plot(hc, labels = fake_dat$lab) abline(h = 0.6, lty = 2, col = "dark grey")
The observations that remain joined in the dendrogram below the cut height are considered to be in a cluster together:
tibble(
  observation = fake_dat$lab,
  cluster_assignment = cutree(hc, h = 0.6)
)
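As an alternative (a small sketch using base R's `cutree()` directly), the tree can also be cut by requesting a fixed number of clusters rather than a height:

# cut the dendrogram into exactly 3 clusters instead of cutting at a height
tibble(
  observation = fake_dat$lab,
  cluster_assignment = cutree(hc, k = 3)
)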
At every step of the agglomeration, we measure distances between current clusters. With each cluster containing (possibly) multiple points, what does it mean to measure distance?
There are five common approaches to cluster-to-cluster distance, a.k.a. "linkage":
- single linkage: The distance between two clusters is the distance between the two closest observations.
- average linkage: The distance between two clusters is the average of all distances between observations in one cluster and observations in the other.
- complete linkage: The distance between two clusters is the distance between the two furthest observations.
- centroid method: The distance between two clusters is the distance between their centroids (geometric mean or median).
- Ward's method: The distance between two clusters is proportional to the increase in the error sum of squares (ESS) that would result from joining them. The ESS is computed as the sum of squared distances between the observations in a cluster and the centroid of the cluster.
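To make the ESS in Ward's method concrete: for a cluster $C$ with centroid $\bar{x}_C$, it is

$$\mathrm{ESS}(C) = \sum_{i \in C} \lVert x_i - \bar{x}_C \rVert^2,$$

and Ward's method joins the pair of clusters whose merge increases the total ESS the least.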
It is also worth mentioning the McQuitty method, which retains information about previously joined clusters to measure future linkage distance. This method is currently supported for model fitting, but not for prediction, in tidyclust.
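To build some intuition for how the linkage choice changes the result, here is a small sketch (not part of the original vignette) that fits the toy data with three different linkage methods using `stats::hclust()` directly; the object name `d` and the side-by-side layout are just for illustration:

# compare dendrograms under different linkage methods on the toy data;
# drop the label column before computing distances
d <- dist(fake_dat[, c("x", "y")])

par(mfrow = c(1, 3))
plot(hclust(d, method = "single"), labels = fake_dat$lab, main = "single")
plot(hclust(d, method = "average"), labels = fake_dat$lab, main = "average")
plot(hclust(d, method = "complete"), labels = fake_dat$lab, main = "complete")
par(mfrow = c(1, 1))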
## `hier_clust` specification in {tidyclust}

To specify a hierarchical clustering model in tidyclust, simply choose a value of `num_clusters` and (optionally) a linkage method:
hc_spec <- hier_clust(
  num_clusters = 3,
  linkage_method = "average"
)

hc_spec
Currently, the only supported engine is `stats::hclust()`. The default linkage method is complete linkage.
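Other linkage methods are requested the same way. For example, here is a sketch of a specification using Ward's method, assuming the engine accepts the `stats::hclust()` method name `"ward.D2"` (the object name `hc_spec_ward` is just for illustration):

# a hypothetical alternative specification using Ward's linkage
hc_spec_ward <- hier_clust(
  num_clusters = 3,
  linkage_method = "ward.D2"
)

hc_spec_ward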
## Fitting `hier_clust` models

We fit the model to data in the usual way:
hc_fit <- hc_spec %>%
  fit(
    ~ bill_length_mm + bill_depth_mm,
    data = penguins
  )

hc_fit %>% summary()
To produce a dendrogram plot, access the engine fit. (Although, as we see below, dendrograms are often not very informative for moderate to large datasets.)
#| fig-alt: "Dendrogram chart. With too many observations to be able to clearly see anything" hc_fit$fit %>% plot()
We can also extract the standard tidyclust summary list:
hc_summary <- hc_fit %>% extract_fit_summary()

hc_summary %>% str()
Note that, although the hierarchical clustering algorithm is not focused on cluster centroids in the same way $k$-means is, we are still able to compute the geometric mean over the predictors for each cluster:
hc_fit %>% extract_centroids()
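As a visual sketch (not in the original vignette), we can overlay these centroids on the training data, colored by the cluster assignments extracted from the fit; the object name `centroids` is just for illustration:

# plot the training observations colored by cluster, with centroids overlaid
centroids <- hc_fit %>% extract_centroids()

penguins %>%
  bind_cols(extract_cluster_assignment(hc_fit)) %>%
  ggplot(aes(bill_length_mm, bill_depth_mm, color = .cluster)) +
  geom_point(alpha = 0.6) +
  geom_point(data = centroids, size = 5, shape = 8) +
  theme_minimal()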
To predict the cluster assignment for a new observation, we find the closest cluster. How we measure "closeness" depends on the specified type of linkage in the model:
- single linkage: The new observation is assigned to the same cluster as its nearest observation from the training data.
- complete linkage: The new observation is assigned to the cluster with the smallest maximum distance between training observations and the new observation.
- average linkage: The new observation is assigned to the cluster with the smallest average distance between training observations and the new observation.
- centroid method: The new observation is assigned to the cluster with the closest centroid, as in prediction for `k_means`.
- Ward's method: The new observation is assigned to the cluster with the smallest increase in the error sum of squares (ESS) due to the new addition. The ESS is computed as the sum of squared distances between the observations in a cluster and the centroid of the cluster.
hc_preds <- hc_fit %>% predict(penguins)

hc_preds
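As a quick check (a small aside), we can count how many observations landed in each predicted cluster:

# number of observations predicted into each cluster
hc_preds %>% count(.pred_cluster)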
It's important to note that there is no guarantee that `predict()` on the training data will produce the same results as `extract_cluster_assignment()`. The process by which clusters are created during the agglomerations results in a particular partition; but if a training observation is treated as new data, it is predicted in the same manner as truly new information.
bind_cols(
  hc_preds,
  extract_cluster_assignment(hc_fit)
)
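To quantify the (dis)agreement, here is a small sketch that cross-tabulates the predicted labels against the labels assigned during fitting:

# cross-tabulate predicted clusters against fitted cluster assignments
bind_cols(
  hc_preds,
  extract_cluster_assignment(hc_fit)
) %>%
  count(.pred_cluster, .cluster)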
Suppose we have produced cluster assignments from two models: a hierarchical clustering model with three clusters (as above) and a $k$-means clustering model with five clusters (below). How can we combine these assignments?
km_spec <- k_means(num_clusters = 5)

km_fit <- km_spec %>% fit(~., data = penguins)

km_preds <- predict(km_fit, penguins, prefix = "KM_")
hc_preds <- predict(hc_fit, penguins, prefix = "HC_")
We notice that the three-cluster assignments from `hier_clust` do not line up perfectly with the five-cluster assignments from `k_means`.
tibble(
  hc = hc_preds$.pred_cluster,
  km = km_preds$.pred_cluster
) %>%
  count(hc, km)
However, they are not fully unrelated assignments. For example, all of `KM_2` in the $k$-means assignment fell inside `HC_1` for the hierarchical assignments.
Our goal is to relabel the five $k$-means clusters to match the three cluster names in the hierarchical output. This can be accomplished with `reconcile_clusterings_mapping()`.

This function expects two vectors of cluster labels as input. The first is the label that will be matched, and the second is the label that will be recoded to match the first. If we are not simply matching names across two same-size clusterings, the option `one_to_one` must be set to `FALSE`.
reconcile_clusterings_mapping(
  primary = hc_preds$.pred_cluster,
  alternative = km_preds$.pred_cluster,
  one_to_one = FALSE
)
In this example, we can see that `KM_1`, `KM_2`, and `KM_5` have been matched to `HC_1`; and `KM_3` and `KM_4` have been matched to `HC_2`. Notice that no clusters from the `KM` set were matched to `HC_3`; evidently, this is a small cluster that did not manifest clearly in the $k$-means clustering.