A brief introduction to the k-means algorithm

k-means is a method of unsupervised learning that partitions observations into k distinct clusters. The goal of k-means is to minimize the sum of squared Euclidean distances between the observations in a cluster and the centroid of that cluster, where the centroid is the coordinate-wise arithmetic mean of the cluster's observations.
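Written out, the objective can be stated as follows. The notation is introduced here for illustration and is not taken from the package: $C_j$ is the set of observations assigned to cluster $j$, $x_i$ is an observation, and $\mu_j$ is the centroid of cluster $j$.

$$
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$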

In k-means clustering, observed variables (columns) are considered to be locations on orthogonal axes in multidimensional space. For example, in the plot below, each point represents an observation of one penguin, and the location in 2-dimensional space is determined by the bill length and bill depth of that penguin.

penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

A k-means cluster assignment is achieved by iterating to convergence from random initial conditions. The algorithm proceeds as follows:

  1. Choose k random observations in the dataset. These locations in space are declared to be the initial centroids.
# Draw k = 3 random observations to serve as the initial centroids
init <- penguins %>%
  slice_sample(n = 3)

clusters <- factor(1:3)

ggplot() +
  geom_point(aes(x = penguins$bill_length_mm, y = penguins$bill_depth_mm)) +
  geom_point(aes(x = init$bill_length_mm, y = init$bill_depth_mm, color = clusters), size = 5)
  2. Assign each observation to the nearest centroid.

  3. Compute the new centroids of each cluster.

  4. Repeat steps 2 and 3 until the centroids do not change.
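
The iteration described in the list above can be sketched in base R. This is a minimal illustration under stated assumptions, not the code that celery or `stats::kmeans` runs: `simple_kmeans` is a hypothetical helper, the toy data are simulated, and an empty cluster simply keeps its previous centroid.

```r
# Minimal k-means iteration in base R, for illustration only.
# `simple_kmeans` is a hypothetical helper, not part of celery or stats.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)

  # Step 1: pick k random observations as the initial centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every observation to every
    # centroid (an n x k matrix), then assign each row to the nearest one.
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    assignment <- max.col(-d2, ties.method = "first")

    # Step 3: recompute each centroid as the mean of its assigned points.
    # A cluster with no members keeps its previous centroid.
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- assignment == j
      if (any(members)) {
        new_centroids[j, ] <- colMeans(x[members, , drop = FALSE])
      }
    }

    # Step 4: stop once the centroids no longer change.
    if (all(new_centroids == centroids)) break
    centroids <- new_centroids
  }

  list(cluster = assignment, centers = centroids)
}

# Toy data: two well-separated groups of 20 points each.
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

res <- simple_kmeans(toy, k = 2)
table(res$cluster)
res$centers
```

In practice `stats::kmeans()` is the better choice. Its default algorithm is Hartigan-Wong rather than the plain iteration sketched here, and it supports multiple random starts through its `nstart` argument.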


Things to note

k-means specification in {celery}

To specify a k-means model in celery, simply choose a value of $k$ and an engine:

kmeans_spec <- k_means(k = 3) %>%


Once specified, a model may be "fit" to a dataset by providing a formula and data frame. Note that unlike in supervised modeling, the formula should not include a response variable.

kmeans_spec_fit <- kmeans_spec %>%
  fit(~ bill_length_mm + bill_depth_mm, data = penguins)


To access only the results produced by the engine (in this case, stats::kmeans), simply extract the fit from the fitted model object:
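
As a point of reference, the engine function can also be called directly in base R. The sketch below uses simulated toy data rather than the penguins model, and it shows only what `stats::kmeans` itself returns, which is the kind of object that sits inside the fitted model. The names `toy` and `raw_fit` are made up for this example.

```r
# Base R only: call the engine function directly on a numeric matrix.
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)
colnames(toy) <- c("x1", "x2")

raw_fit <- stats::kmeans(toy, centers = 2)

class(raw_fit)         # the engine returns an object of class "kmeans"
raw_fit$cluster        # integer cluster label for each row
raw_fit$centers        # one row per final centroid
raw_fit$tot.withinss   # total within-cluster sum of squares
```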


Cluster assignments and predictions

Of the information provided by the model fit, the cluster assignment of each observation is typically of primary interest. The assignments can be accessed via the extract_cluster_assignment() function:

kmeans_spec_fit %>%
  extract_cluster_assignment()

Note that this function renames clusters in accordance with the standard celery naming convention and ordering: clusters are named "Cluster_1", "Cluster_2", etc. and are numbered in the order in which they first appear in the rows of the training dataset.
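
The ordering convention can be illustrated with a small base R sketch. `relabel_by_appearance` is a hypothetical helper written for this example, not celery's own implementation, and the raw labels are made up.

```r
# Relabel raw engine labels so clusters are numbered by first appearance.
# `relabel_by_appearance` is a hypothetical helper, not celery's own code.
relabel_by_appearance <- function(raw_labels) {
  # unique() keeps labels in order of first appearance; match() then gives
  # each observation the position of its label in that ordering.
  first_seen <- unique(raw_labels)
  new_index <- match(raw_labels, first_seen)
  factor(
    paste0("Cluster_", new_index),
    levels = paste0("Cluster_", seq_along(first_seen))
  )
}

# The engine might label the first rows as cluster 3; after relabelling,
# whatever cluster the first row belongs to becomes "Cluster_1".
raw <- c(3, 3, 1, 2, 1, 3)
relabel_by_appearance(raw)
```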

Similarly, you can "predict" the cluster membership of new data using the predict_cluster() function:

new_penguin <- tibble(
  bill_length_mm = 40,
  bill_depth_mm = 15
)

kmeans_spec_fit %>%
  predict_cluster(new_penguin)

In the case of k-means, the cluster assignment is predicted by finding the closest final centroid to the new observation.
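
The nearest-centroid rule can be sketched in base R as follows. `nearest_centroid` is a hypothetical helper, and the centroid coordinates are made-up values for illustration, not the centroids of the fitted penguins model.

```r
# Assign a new observation to the nearest centroid (base R sketch).
# `nearest_centroid` is a hypothetical helper; the centroid values are made up.
nearest_centroid <- function(new_obs, centroids) {
  # Squared Euclidean distance from the new point to each centroid row.
  d2 <- rowSums(sweep(centroids, 2, new_obs)^2)
  unname(which.min(d2))
}

centroids <- rbind(
  c(bill_length_mm = 38, bill_depth_mm = 18),
  c(bill_length_mm = 47, bill_depth_mm = 15),
  c(bill_length_mm = 50, bill_depth_mm = 19)
)
new_obs <- c(bill_length_mm = 40, bill_depth_mm = 15)

# Returns the row index of the closest centroid.
nearest_centroid(new_obs, centroids)
```

Squared distances are enough here because taking the square root does not change which centroid is closest.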

Augmenting datasets

To attach cluster assignments or predictions to a dataset, use augment_cluster():

Cluster centroids

Each cluster is typically characterized by the location of its final centroid. The centroids can be accessed by:

kmeans_spec_fit %>%



penguins_recipe_1 <- recipe(~ bill_length_mm + bill_depth_mm,
                            data = penguins)

penguins_recipe_2 <- recipe(species ~ bill_length_mm + bill_depth_mm,
                            data = penguins)

