clustering_cv: Cluster Cross-Validation

View source: R/clustering.R

clustering_cv    R Documentation

Cluster Cross-Validation


Cluster cross-validation splits the data into V disjoint groups by clustering on a chosen set of variables (k-means by default). Each resample uses V-1 of the folds/clusters as the analysis set and the remaining fold/cluster as the assessment set. In basic cross-validation (i.e., no repeats), the number of resamples is equal to V.


clustering_cv(
  data,
  vars,
  v = 10,
  repeats = 1,
  distance_function = "dist",
  cluster_function = c("kmeans", "hclust"),
  ...
)



data: A data frame.


vars: A vector of bare variable names to use to cluster the data.


v: The number of partitions of the data set.


repeats: The number of times to repeat the clustered partitioning.


distance_function: Which function should be used for distance calculations? Defaults to stats::dist(). You can also provide your own function; see Details.


cluster_function: Which function should be used for clustering? Options are either "kmeans" (to use stats::kmeans()) or "hclust" (to use stats::hclust()). You can also provide your own function; see Details.


...: Extra arguments passed on to cluster_function.


The variables in the vars argument are used to cluster the data into disjoint sets, either by k-means or by hierarchical clustering, depending on cluster_function. These clusters are used as the folds for cross-validation. Because fold membership follows the cluster structure of the data, the folds may not contain an equal number of points.
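To illustrate how clusters can act as folds, here is a minimal base R sketch. It is a conceptual simplification, not the package's internal code: it runs stats::kmeans() directly on two made-up columns (x and y in a hypothetical toy data frame) and treats each cluster as the assessment set once.

```r
# Conceptual sketch only; `toy`, `x`, and `y` are made-up for illustration.
set.seed(123)
toy <- data.frame(
  x = c(rnorm(30, mean = 0), rnorm(10, mean = 6)),
  y = c(rnorm(30, mean = 0), rnorm(10, mean = 6))
)
v <- 2

# Assign every row to one of v clusters; the cluster id doubles as the fold id.
fold_id <- stats::kmeans(toy[c("x", "y")], centers = v, nstart = 10)$cluster

# Each cluster is held out once; all other rows form the analysis set.
for (fold in seq_len(v)) {
  assessment_rows <- which(fold_id == fold)
  analysis_rows <- which(fold_id != fold)
  cat(
    "Fold", fold,
    ": analysis n =", length(analysis_rows),
    ", assessment n =", length(assessment_rows), "\n"
  )
}

# Fold sizes follow the cluster structure, so they need not be equal.
fold_sizes <- table(fold_id)
```

Because the toy data are deliberately unbalanced, the printed fold sizes will generally differ, which is the behavior described above.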

You can optionally provide a custom function to distance_function. The function should take a data frame (as created via data[vars]) and return a stats::dist() object with distances between data points.
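As a rough sketch of such a function, the example below standardizes the columns before computing Manhattan distances. The name scaled_manhattan and the toy data frame are hypothetical; only stats::dist() and base::scale() are real functions here.

```r
# Hypothetical helper: Manhattan distance on centered and scaled columns.
# Takes a data frame of numeric columns and returns an object of class "dist".
scaled_manhattan <- function(data) {
  stats::dist(scale(data), method = "manhattan")
}

# Made-up data to check the return type.
toy <- data.frame(
  price = c(100, 250, 400, 120),
  area = c(50, 80, 120, 55)
)
d <- scaled_manhattan(toy[c("price", "area")])
inherits(d, "dist")
```

Under these assumptions, the helper could be supplied as distance_function = scaled_manhattan in a call to clustering_cv(). Scaling is one reasonable choice when the clustering variables are measured in very different units.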

You can optionally provide a custom function to cluster_function. The function must take three arguments:

  • dists, a stats::dist() object with distances between data points

  • v, a length-1 numeric for the number of folds to create

  • ..., to pass any additional named arguments to your function

The function should return a vector of cluster assignments of length nrow(data), with each element of the vector corresponding to the matching row of the data frame.
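A custom clustering function following this contract might look like the sketch below. The name hclust_folds and the toy data are hypothetical; the function wraps stats::hclust() and stats::cutree() so that the linkage method can be chosen through the extra arguments.

```r
# Hypothetical helper matching the three-argument contract described above.
# dists: a "dist" object; v: number of folds; ...: forwarded to stats::hclust().
hclust_folds <- function(dists, v, ...) {
  tree <- stats::hclust(dists, ...)
  # Cut the tree into v groups; returns one cluster id per observation.
  stats::cutree(tree, k = v)
}

# Made-up data with three well-separated pairs of points.
toy <- data.frame(
  x = c(1, 1.2, 5, 5.1, 9, 9.3),
  y = c(2, 2.1, 6, 6.2, 1, 1.1)
)
dists <- stats::dist(toy)
assignments <- hclust_folds(dists, v = 3, method = "average")
length(assignments) == nrow(toy)
```

Assuming the interface described above, this could be passed as cluster_function = hclust_folds, with method = "average" supplied as an extra named argument to clustering_cv().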


A tibble with classes rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id.


data(ames, package = "modeldata")
clustering_cv(ames, vars = c(Sale_Price, First_Flr_SF, Second_Flr_SF), v = 2)

tidymodels/rsample documentation built on Sept. 29, 2024, 10:48 p.m.