In kbodwin/celery: What the Package Does (One Line, Title Case)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(celery)
library(tidyverse)
library(palmerpenguins)
library(tidymodels)
set.seed(838383)

A brief introduction to the k-means algorithm

k-means is a method of unsupervised learning that partitions observations into k distinct, non-overlapping clusters. The goal of k-means is to minimize the sum of squared Euclidean distances between the observations in a cluster and the centroid of that cluster, where the centroid is the arithmetic mean of the cluster's observations in each variable.
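Written as a formula, the objective described above can be sketched as follows. The symbols are notation introduced here for illustration and are not taken from the package: $x_i$ is an observation, $C_j$ is the set of observations assigned to cluster $j$, and $\mu_j$ is the centroid of that cluster.

```latex
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```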

In k-means clustering, observed variables (columns) are considered to be locations on orthogonal axes in multidimensional space. For example, in the plot below, each point represents an observation of one penguin, and the location in 2-dimensional space is determined by the bill length and bill depth of that penguin.

penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + 
  geom_point()

A k-means cluster assignment is achieved by iterating to convergence from random initial conditions. The algorithm proceeds as follows:

Choose k random observations in the dataset. These locations in space are declared to be the initial centroids.

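# Drop rows with missing bill measurements, then pick 3 observations
# at random and label them as the initial centroids
init <- penguins %>%
  drop_na(bill_length_mm, bill_depth_mm) %>%
  sample_n(3) %>%
  mutate(centroid = factor(1:3))

# All observations in black, initial centroids as large colored points
ggplot() +
  geom_point(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(data = init, aes(x = bill_length_mm, y = bill_depth_mm, color = centroid), size = 5)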

Assign each observation to the nearest centroid.

[pic - we should have automatic functions for this. if I could predict with pre-supplied centroids that would be cool.]

Compute the new centroids of each cluster.
Repeat steps 2 and 3 until the centroids do not change.

[gif?]
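The four steps above can be sketched in base R. This is a minimal illustration on simulated data, not the implementation used by celery or by stats::kmeans; the function name `simple_kmeans`, its arguments, and the toy data are all hypothetical.

```r
# Hypothetical helper illustrating the steps above; not part of celery.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: choose k random observations as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from each observation to each centroid,
    # then assign each observation to the nearest centroid
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    new_assignment <- max.col(-d2, ties.method = "first")
    # Step 4: stop once the assignments (and hence the centroids) no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: recompute each centroid as the mean of its assigned observations
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(centroids = centroids, cluster = assignment)
}

# Simulated data: two groups of 20 points each (illustrative only)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
result <- simple_kmeans(toy, k = 2)
table(result$cluster)
```

The sketch stops when the assignments stop changing, which is equivalent to the centroids no longer moving, and it leaves a centroid in place if its cluster ever becomes empty.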

Things to note

Because k-means relies on random initial conditions, the procedure may not result in identical cluster assignments on subsequent runs.
Because k-means is a greedy algorithm, it is not guaranteed to achieve a globally optimal solution.
k-means produces a partition: each observation is assigned to exactly one cluster.
k-means is not a model-based approach: although the initial centroids are chosen at random, no probability model is assumed regarding the selection of the observations or the values of the variables.
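The first two notes can be seen by running stats::kmeans several times from different random starts and comparing the total within-cluster sum of squares. The sketch below uses simulated data with no real cluster structure, chosen only for illustration; the `nstart` argument of stats::kmeans is one common way to reduce the dependence on a single start.

```r
# Simulated data (illustrative only): 100 points in 2 dimensions
set.seed(42)
toy <- matrix(rnorm(200), ncol = 2)

# Fit the same specification from five different random starts
fits <- lapply(1:5, function(seed) {
  set.seed(seed)
  kmeans(toy, centers = 4, iter.max = 50)
})

# Total within-cluster sum of squares for each run; the values may differ
wss <- sapply(fits, function(f) f$tot.withinss)
wss

# nstart = 25 tries 25 random starts and keeps the best one found
best <- kmeans(toy, centers = 4, nstart = 25, iter.max = 50)
best$tot.withinss

# Every observation receives exactly one cluster label
length(best$cluster) == nrow(toy)
```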

k-means specification in {celery}

To specify a k-means model in celery, simply choose a value of k and an engine:

kmeans_spec <- k_means(k = 3) %>%
  set_engine_celery("stats") 

kmeans_spec

Once specified, a model may be "fit" to a dataset by providing a formula and data frame. Note that unlike in supervised modeling, the formula should not include a response variable.

kmeans_spec_fit <- kmeans_spec %>%
  fit(~ bill_length_mm + bill_depth_mm, data = penguins)

kmeans_spec_fit

To access only the results produced by the engine (in this case, stats::kmeans), simply extract the fit from the fitted model object:

kmeans_spec_fit$fit

Cluster assignments and predictions

Of the information provided by the model fit, the cluster assignment of each observation is typically of primary interest. The assignments can be accessed via the extract_cluster_assignment() function:

kmeans_spec_fit %>%
  extract_cluster_assignment()

Note that this function renames clusters in accordance with the standard celery naming convention and ordering: clusters are named "Cluster_1", "Cluster_2", etc. and are numbered in the order in which they first appear in the rows of the training dataset.
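As an illustration of this ordering convention, the base R sketch below relabels a vector of raw engine labels by order of first appearance. It only demonstrates the convention described above and is not celery's own code; the `raw` labels are made up.

```r
# Made-up raw labels as an engine might return them
raw <- c(3, 3, 1, 2, 1, 3)

# unique() keeps labels in order of first appearance (3, 1, 2),
# so match() maps 3 -> 1, 1 -> 2, 2 -> 3
relabeled <- paste0("Cluster_", match(raw, unique(raw)))
relabeled
# "Cluster_1" "Cluster_1" "Cluster_2" "Cluster_3" "Cluster_2" "Cluster_1"
```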

Similarly, you can "predict" the cluster membership of new data using the predict_cluster() function:

new_penguin <- tibble(
  bill_length_mm = 40,
  bill_depth_mm = 15
)

kmeans_spec_fit %>%
  predict_cluster(new_penguin)

In the case of k-means, the cluster assignment is predicted by finding the final centroid closest to the new observation.
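The nearest-centroid rule can be sketched in base R as follows. The helper `nearest_centroid` is hypothetical and is not how predict_cluster() is necessarily implemented, and the centroid coordinates are made-up values rather than output from the fit above.

```r
# Hypothetical helper: index of the closest centroid for each row of new_x
nearest_centroid <- function(new_x, centroids) {
  new_x <- as.matrix(new_x)
  # Squared Euclidean distance from each new observation to each centroid
  d2 <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(new_x, 2, centroids[j, ])^2)
  })
  # Keep an observations-by-centroids matrix even when there is one new row
  d2 <- matrix(d2, nrow = nrow(new_x))
  max.col(-d2, ties.method = "first")
}

# Made-up centroids (bill length, bill depth), for illustration only
centroids <- rbind(c(40, 18), c(48, 15))

new_penguin <- data.frame(bill_length_mm = 40, bill_depth_mm = 15)

# Squared distances are 9 to the first centroid and 64 to the second
nearest_centroid(new_penguin, centroids)
# 1
```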

Augmenting datasets

To attach cluster assignments or predictions to a dataset, use augment_cluster():

### add this

Cluster centroids

A cluster is typically characterized by the location of its final centroid. The centroids can be accessed with extract_clusters():

kmeans_spec_fit %>%
  extract_clusters()

[interpretation]
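One way to read a centroid is as the variable-by-variable mean of the observations assigned to its cluster. The sketch below checks this directly against stats::kmeans on simulated data; it does not use celery, and the data are illustrative only.

```r
# Simulated data: two groups of 20 points each (illustrative only)
set.seed(7)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)

fit <- kmeans(toy, centers = 2)

# Per-cluster column sums divided by cluster sizes give per-cluster means
manual_centroids <- rowsum(toy, fit$cluster) / as.vector(table(fit$cluster))

# Compare with the centroids reported by the engine, ignoring dimnames
all.equal(manual_centroids, fit$centers, check.attributes = FALSE)
```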

Workflows

penguins_recipe_1 <- recipe(~ bill_length_mm + bill_depth_mm,
                            data = penguins)

penguins_recipe_2 <- recipe(species ~ bill_length_mm + bill_depth_mm,
                            data = penguins)

# wflow_1 <- workflow() %>%
#   add_celery_model(kmeans_spec) %>%
#   add_recipe(penguins_recipe_1)