step_kmeans: K-Means Clustering Variable Reduction
In MachineShop: Machine Learning Models and Tools

step_kmeans

R Documentation

K-Means Clustering Variable Reduction

Description

Creates a specification of a recipe step that will convert numeric variables into one or more by averaging within k-means clusters.

Usage

step_kmeans(
  recipe,
  ...,
  k = 5,
  center = TRUE,
  scale = TRUE,
  algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
  max_iter = 10,
  num_start = 1,
  replace = TRUE,
  prefix = "KMeans",
  role = "predictor",
  skip = FALSE,
  id = recipes::rand_id("kmeans")
)

## S3 method for class 'step_kmeans'
tidy(x, ...)

## S3 method for class 'step_kmeans'
tunable(x, ...)

Arguments

`recipe`	recipe object to which the step will be added.
`...`	one or more selector functions to choose which variables will be used to compute the components. See `selections` for more details. These are not currently used by the `tidy` method.
`k`	number of k-means clusterings of the variables. The value of `k` is constrained to be between 1 and one less than the number of original variables.
`center`, `scale`	logicals indicating whether to mean center and standard deviation scale the original variables prior to deriving components, or functions or names of functions for the centering and scaling.
`algorithm`	character string specifying the clustering algorithm to use.
`max_iter`	maximum number of algorithm iterations allowed.
`num_start`	number of random cluster centers generated for starting the Hartigan-Wong algorithm.
`replace`	logical indicating whether to replace the original variables.
`prefix`	character string prefix added to a sequence of zero-padded integers to generate names for the resulting new variables.
`role`	analysis role that added step variables should be assigned. By default, they are designated as model predictors.
`skip`	logical indicating whether to skip the step when the recipe is baked. While all operations are baked when `prep` is run, some operations may not be applicable to new data (e.g. processing outcome variables). Care should be taken when using `skip = TRUE` as it may affect the computations for subsequent operations.
`id`	unique character string to identify the step.
`x`	`step_kmeans` object.

Details

K-means clustering partitions variables into k groups such that the sum of squares between the variables and their assigned cluster means is minimized. Variables within each cluster are then averaged to derive a new set of k variables.

Value

Function step_kmeans creates a new step whose class is of the same name and inherits from step_lincomp, adds it to the sequence of existing steps (if any) in the recipe, and returns the updated recipe. For the tidy method, a tibble with columns terms (selectors or variables selected), cluster assignments, sqdist (squared distance from cluster centers), and name of the new variable names.

References

Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21, 768-769.

Hartigan, J. A., & Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100-108.

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability (vol. 1, pp. 281-297). University of California Press.

Examples

library(recipes)

rec <- recipe(rating ~ ., data = attitude)
kmeans_rec <- rec %>%
  step_kmeans(all_predictors(), k = 3)
kmeans_prep <- prep(kmeans_rec, training = attitude)
kmeans_data <- bake(kmeans_prep, attitude)

pairs(kmeans_data, lower.panel = NULL)

tidy(kmeans_rec, number = 1)
tidy(kmeans_prep, number = 1)

MachineShop documentation built on June 10, 2025, 1:08 a.m.