| cdist | R Documentation |
Computes a distance matrix for categorical variables, with support for validation data, multiple distance metrics, and variable weighting. The function implements the distance calculation approaches described in van de Velden et al. (2024), including commensurable distances and supervised options when a response variable is provided.
cdist(x, response = NULL, validate_x = NULL, method = "tot_var_dist",
      commensurable = FALSE, weights = 1)
x |
A data frame or matrix of categorical variables (factors). |
response |
Optional response variable for supervised distance calculations. Default is NULL. |
validate_x |
Optional validation data frame or matrix. If provided, distances are computed between observations in |
method |
Character string or vector specifying the distance metric(s). Options include:
Can be a single string or vector for different methods per variable. |
commensurable |
Logical. If |
weights |
Numeric vector or matrix of weights. If vector, must have length equal to number of variables. If matrix, must match the dimension of level-wise distances. Default is 1 (equal weighting). |
The cdist function provides a comprehensive framework for categorical distance calculations:
Supports multiple distance calculation methods that can be specified globally or per variable
Handles validation data through the validate_x parameter
Implements supervised distances when a response variable is provided
Supports commensurable distances for better comparability across variables
Provides flexible weighting schemes at variable and level granularity
Important notes:
Input variables are automatically converted to factors with dropped unused levels
Specifying a different method per variable is not supported for "none", "st_dev", "HL", "cat_dis", "HLeucl", "mca"
Weight vector length must match the number of variables when specified as a vector
Variables should be factors; numeric variables will cause errors
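The notes above can be checked before calling cdist. The following is a minimal sketch in base R, not part of the package: prepare_cat_input is a hypothetical helper name, and the toy data frame is made up for illustration. It rejects numeric columns, converts the remaining columns to factors with unused levels dropped, and checks that a weight vector matches the number of variables. Matrix-valued weights are not checked here.

```r
# Hypothetical helper (not provided by the package): validate and clean
# categorical input along the lines of the notes above.
prepare_cat_input <- function(x, weights = 1) {
  x <- as.data.frame(x, stringsAsFactors = FALSE)

  # Numeric columns are rejected rather than silently converted.
  is_num <- vapply(x, is.numeric, logical(1))
  if (any(is_num)) {
    stop("numeric columns found: ", paste(names(x)[is_num], collapse = ", "))
  }

  # Convert every column to a factor and drop unused levels.
  x[] <- lapply(x, function(col) droplevels(as.factor(col)))

  # A weight vector must have one entry per variable (a scalar is allowed).
  if (!is.matrix(weights) && length(weights) > 1 && length(weights) != ncol(x)) {
    stop("weights must have length 1 or ncol(x)")
  }
  x
}

# Toy data: 'colour' carries an unused level, 'size' is a character column.
toy <- data.frame(
  colour = factor(c("red", "blue", "red"), levels = c("red", "blue", "green")),
  size   = c("S", "M", "S")
)
clean <- prepare_cat_input(toy, weights = c(2, 1))
```

After this call, clean could be passed as x, with the same weights, assuming no further preprocessing is needed.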
A list containing:
distance_mat |
Matrix of pairwise distances. If |
delta |
Matrix or list of matrices containing level-wise distances for each variable. |
delta_names |
Vector of level names used in the delta matrices. |
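To make the relationship between level-wise distances and observation distances concrete, here is a small base-R sketch. It is an illustration under stated assumptions, not the package's implementation: it uses simple matching (0 for equal levels, 1 otherwise) as the level-wise distance, assumes that observation distances are the weighted sum over variables of the level-wise distances between the observed categories, and uses made-up data and weights. The names delta and distance_mat only mirror the return value for readability.

```r
# Toy data with two categorical variables (made up for illustration).
x <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Adelie")),
  sex     = factor(c("male", "female", "female"))
)
w <- c(2, 1)  # hypothetical per-variable weights

# One level-by-level matrix per variable: 0 on the diagonal, 1 elsewhere.
delta <- lapply(x, function(f) {
  m <- 1 - diag(nlevels(f))
  dimnames(m) <- list(levels(f), levels(f))
  m
})

# Assumed aggregation: look up the level-wise distance for each pair of
# observations and sum the weighted contributions over variables.
n <- nrow(x)
distance_mat <- matrix(0, n, n)
for (j in seq_along(x)) {
  idx <- as.integer(x[[j]])
  distance_mat <- distance_mat + w[j] * delta[[j]][idx, idx]
}
```

In this toy case, observations 1 and 2 differ on both variables, so their distance is 2 + 1 = 3, while observations 1 and 3 differ only on sex, giving 1.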
van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.
mdist for mixed-type data distances, ndist for continuous data distances
library(palmerpenguins)
library(rsample)
# Prepare data with complete cases for both categorical variables and response
complete_vars <- c("species", "island", "sex", "body_mass_g")
penguins_complete <- penguins[complete.cases(penguins[, complete_vars]), ]
penguins_cat <- penguins_complete[, c("species", "island", "sex")]
response <- penguins_complete$body_mass_g
# Create training-test split
set.seed(123)
penguins_split <- initial_split(penguins_cat, prop = 0.8)
tr_penguins <- training(penguins_split)
ts_penguins <- testing(penguins_split)
response_tr <- response[penguins_split$in_id]
response_ts <- response[-penguins_split$in_id]
# Basic usage
result <- cdist(tr_penguins)
# With validation data
val_result <- cdist(x = tr_penguins,
                    validate_x = ts_penguins,
                    method = "tot_var_dist")
# ...and commensurability
val_result_COMM <- cdist(x = tr_penguins,
                         validate_x = ts_penguins,
                         method = "tot_var_dist",
                         commensurable = TRUE)
# Supervised distance with response variable
sup_result <- cdist(x = tr_penguins,
                    response = response_tr,
                    method = "supervised")
# Supervised with validation data
sup_val_result <- cdist(x = tr_penguins,
                        validate_x = ts_penguins,
                        response = response_tr,
                        method = "supervised")
# Commensurable distances with custom weights
comm_result <- cdist(tr_penguins,
                     commensurable = TRUE,
                     weights = c(2, 1, 1))
# Different methods per variable
multi_method <- cdist(tr_penguins,
                      method = c("matching", "goodall_3", "tot_var_dist"))