```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r}
library(celery)
library(tidyverse)
library(palmerpenguins)
library(tidymodels)

set.seed(838383)
```
k-means is a method of unsupervised learning that produces a partitioning of observations into $k$ unique clusters. The goal of k-means is to minimize the sum of squared Euclidean distances between the observations in a cluster and that cluster's centroid (its coordinate-wise mean).
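In symbols (this is the standard statement of the k-means objective, included here for reference rather than taken from the celery documentation): with clusters $C_1, \dots, C_k$ and centroids $\mu_1, \dots, \mu_k$, k-means seeks

$$
\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
$$

where $\mu_j$ is the coordinate-wise mean of the observations assigned to cluster $C_j$.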
In k-means clustering, observed variables (columns) are considered to be locations on orthogonal axes in multidimensional space. For example, in the plot below, each point represents an observation of one penguin, and the location in 2-dimensional space is determined by the bill length and bill depth of that penguin.
```{r}
penguins %>%
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
```
A k-means cluster assignment is achieved by iterating to convergence from random initial conditions. The algorithm proceeds as follows:

1. Randomly choose $k$ observations from the dataset; their locations in variable space serve as the initial centroids.
```{r}
# Sample three observations to act as initial centroids
init <- penguins %>%
  sample_n(3)

centroids <- factor(1:3)

ggplot() +
  geom_point(aes(x = penguins$bill_length_mm, y = penguins$bill_depth_mm)) +
  geom_point(
    aes(x = init$bill_length_mm, y = init$bill_depth_mm, color = centroids),
    size = 5
  )
```
[pic - we should have automatic functions for this. if I could predict with pre-supplied centroids that would be cool.]
2. Assign each observation to the cluster whose centroid is closest, as measured by Euclidean distance.

3. Compute the new centroids of each cluster.

4. Repeat steps 2 and 3 until the centroids do not change. (A short code sketch of steps 2 and 3 follows below.)
[gif?]
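To make steps 2 and 3 concrete, here is a minimal sketch of a single iteration on the two bill measurements. This is purely illustrative, not how celery computes the fit; restricting to complete cases and re-sampling the initial centroids are assumptions made so the chunk is self-contained.

```{r}
# Keep only complete cases of the two predictor columns
# (an assumption made for this illustration).
dat <- penguins %>%
  drop_na(bill_length_mm, bill_depth_mm) %>%
  select(bill_length_mm, bill_depth_mm)

# Step 1: three randomly chosen observations serve as the initial centroids.
cents <- dat %>% sample_n(3)

# Step 2: assign each observation to its nearest centroid
# (squared Euclidean distance).
dist_to_cents <- sapply(seq_len(nrow(cents)), function(j) {
  (dat$bill_length_mm - cents$bill_length_mm[j])^2 +
    (dat$bill_depth_mm - cents$bill_depth_mm[j])^2
})
assignment <- apply(dist_to_cents, 1, which.min)

# Step 3: recompute each centroid as the mean of its assigned observations.
cents <- dat %>%
  mutate(cluster = assignment) %>%
  group_by(cluster) %>%
  summarize(across(everything(), mean), .groups = "drop") %>%
  select(-cluster)

# In the full algorithm, steps 2 and 3 repeat until `assignment` stops changing.
cents
```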
A few properties of the method are worth noting:

- Because k-means relies on random initial conditions, the procedure may not result in identical cluster assignments on subsequent runs (demonstrated briefly after this list).
- Because k-means is a greedy algorithm, it is not guaranteed to achieve a globally optimal solution.
- k-means produces a partition: each observation is assigned to exactly one cluster.
- k-means is not a probabilistic method: no probability model is assumed for the selection of the observations or the values of the variables.
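The first of these points can be seen by calling the underlying stats::kmeans() engine directly with single random starts. This is a hedged illustration rather than part of the celery interface; the filtering to complete cases is an assumption made so that stats::kmeans() can run.

```{r}
# Complete cases of the two predictors (an assumption for this illustration).
bills <- penguins %>%
  drop_na(bill_length_mm, bill_depth_mm) %>%
  select(bill_length_mm, bill_depth_mm)

# Two single-start runs from different random initial conditions may stop at
# different local optima, i.e. different total within-cluster sums of squares.
set.seed(1)
fit_a <- stats::kmeans(bills, centers = 3, nstart = 1)
set.seed(2)
fit_b <- stats::kmeans(bills, centers = 3, nstart = 1)
c(fit_a$tot.withinss, fit_b$tot.withinss)

# A common mitigation is nstart > 1: run several initializations and keep the
# result with the lowest total within-cluster sum of squares.
fit_best <- stats::kmeans(bills, centers = 3, nstart = 20)
fit_best$tot.withinss
```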
To specify a k-means model in celery, simply choose a value of $k$ and an engine:
```{r}
kmeans_spec <- k_means(k = 3) %>%
  set_engine_celery("stats")

kmeans_spec
```
Once specified, a model may be "fit" to a dataset by providing a formula and data frame. Note that unlike in supervised modeling, the formula should not include a response variable.
```{r}
kmeans_spec_fit <- kmeans_spec %>%
  fit(~ bill_length_mm + bill_depth_mm, data = penguins)

kmeans_spec_fit
```
To access only the results produced by the engine (in this case, stats::kmeans), simply extract the fit from the fitted model object:
```{r}
kmeans_spec_fit$fit
```
Of the information provided by the model fit, the primary quantity of interest is typically the cluster assignment of each observation. These assignments can be accessed via the extract_cluster_assignment() function:
```{r}
kmeans_spec_fit %>% extract_cluster_assignment()
```
Note that this function renames clusters in accordance with the standard celery naming convention and ordering: clusters are named "Cluster_1", "Cluster_2", etc., and are numbered by the order in which they appear in the rows of the training dataset.
Similarly, you can "predict" the cluster membership of new data using the
predict_cluster() function:
```{r}
new_penguin <- tibble(
  bill_length_mm = 40,
  bill_depth_mm = 15
)

kmeans_spec_fit %>%
  predict_cluster(new_penguin)
```
In the case of k-means, the cluster assignment is predicted by finding the final centroid closest to the new observation.
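This rule can be checked by hand. The sketch below assumes the underlying stats::kmeans object is stored in the $fit component, as shown earlier; note that the engine's own cluster numbering may differ from celery's "Cluster_1", "Cluster_2", ... labels.

```{r}
# Final centroids from the underlying stats::kmeans fit
# (rows are clusters, columns are the predictors).
centers <- kmeans_spec_fit$fit$centers

# Squared Euclidean distance from the new observation to each final centroid,
# aligning the predictor order with the columns of `centers`.
new_point <- unlist(new_penguin)[colnames(centers)]
dist_to_centers <- apply(centers, 1, function(centroid) sum((new_point - centroid)^2))

# The predicted cluster (in the engine's numbering) is the nearest centroid.
which.min(dist_to_centers)
```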
To attach cluster assignments or predictions to a dataset, use augment_cluster():
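As a sketch of the presumed interface (assuming augment_cluster() accepts a fitted celery model and a data frame, in the same way as predict_cluster() above):

```{r, eval = FALSE}
# Hypothetical usage: append a cluster-assignment column to the original data.
# The exact argument names and output format are assumptions, not confirmed API.
kmeans_spec_fit %>%
  augment_cluster(penguins)
```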
A cluster is typically characterized by the location of its final centroid. The centroid locations can be accessed with the extract_clusters() function:
```{r}
kmeans_spec_fit %>% extract_clusters()
```
[interpretation]
```{r}
penguins_recipe_1 <- recipe(~ bill_length_mm + bill_depth_mm, data = penguins)

penguins_recipe_2 <- recipe(species ~ bill_length_mm + bill_depth_mm, data = penguins)

# wflow_1 <- workflow() %>%
#   add_celery_model(kmeans_spec) %>%
#   add_recipe(penguins_recipe_1)
```