dist_categorical: Compute pairwise distances for categorical data
In dbrobust: Robust Distance-Based Visualization and Analysis of Mixed-Type Data

dist_categorical

R Documentation

Compute pairwise distances for categorical data

Description

Internal helper function that computes pairwise distances between observations from the matching coefficient, which measures the proportion of attributes on which two categorical vectors agree. This approach is particularly useful for categorical variables with more than two categories.

Usage

dist_categorical(x, method = "matching_coefficient")

Arguments

`x`	A data frame or matrix containing only categorical variables (factor or character).
`method`	Currently only `"matching_coefficient"` is supported.

Details

The distance between two observations i and j is defined as:

d(i, j) = 1 - \frac{\alpha}{p^\prime}

where \alpha is the number of matching attributes (agreements) and p' is the number of non-missing comparisons between the two observations.
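
To make the formula concrete, the following base R sketch computes d(i, j) for a single pair of observations. The helper name `matching_dist_pair` is hypothetical and is not part of dbrobust; it only illustrates the definition above and is not a copy of the package's internal code.

```r
# Hypothetical helper (not part of dbrobust): matching distance for one pair.
matching_dist_pair <- function(xi, xj) {
  xi <- as.character(xi)
  xj <- as.character(xj)
  stopifnot(length(xi) == length(xj))
  # p': attributes that are observed in both observations
  valid <- !is.na(xi) & !is.na(xj)
  p_prime <- sum(valid)
  if (p_prime == 0) return(NA_real_)  # no valid comparison for this pair
  # alpha: number of agreements among the valid comparisons
  alpha <- sum(xi[valid] == xj[valid])
  1 - alpha / p_prime
}

obs_i <- c("red", "circle", "small")
obs_j <- c("red", "square", NA)
# alpha = 1 (only "red" matches), p' = 2 (third attribute is skipped)
matching_dist_pair(obs_i, obs_j)  # 1 - 1/2 = 0.5
```

The third attribute is dropped from both the numerator and the denominator because it is missing in `obs_j`, which is what "non-missing comparisons" means in the definition.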

Only categorical columns (factor or character) are supported; numeric columns must be converted prior to using this function.
Missing values (NA) are ignored pairwise. If all attributes are missing for a given pair, the distance is returned as NA.
This distance is equivalent to the normalized Hamming distance when applied to binary variables.
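In the absence of missing values, the resulting distance satisfies the metric properties (with pairwise deletion of NA values the triangle inequality is no longer guaranteed). It can be used as a building block for mixed-type distances (e.g., combined with quantitative distances via Gower's similarity).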
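
The pairwise treatment of missing values described in the notes above can be sketched in base R as follows. The function `matching_dist_matrix` is a hypothetical reference implementation written for this illustration under the stated definition; it assumes a data frame input and is not the package's own code.

```r
# Hypothetical reference implementation (not dbrobust's internal code).
matching_dist_matrix <- function(x) {
  # Compare values as character so factors and character columns behave alike
  m <- do.call(cbind, lapply(x, as.character))
  n <- nrow(m)
  d <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Keep only attributes observed in both rows (pairwise deletion)
      valid <- !is.na(m[i, ]) & !is.na(m[j, ])
      # Proportion of mismatches equals 1 - alpha / p'
      if (any(valid)) d[i, j] <- mean(m[i, valid] != m[j, valid])
    }
  }
  d
}

df_na <- data.frame(
  A = c("red", "red", "red", NA),
  B = c("circle", NA, "square", NA),
  stringsAsFactors = TRUE
)
d <- matching_dist_matrix(df_na)
d
```

In this sketch, row 4 has no observed attributes, so every entry in its row and column is NA. Rows 1 and 2 are compared on `A` only and get distance 0, as do rows 2 and 3, while rows 1 and 3 are compared on both attributes and get 0.5. The example therefore also shows that, once NA values are ignored pairwise, d(1, 3) can exceed d(1, 2) + d(2, 3).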

Value

A symmetric numeric matrix of pairwise distances. Each distance lies in the range [0, 1], where 0 indicates agreement on every compared attribute and 1 indicates disagreement on every compared attribute. NA is returned for pairs with no valid comparisons (every attribute is missing in at least one of the two observations).

Examples

# Small categorical dataset
df <- data.frame(
  A = factor(c("red", "blue", "red")),
  B = factor(c("circle", "circle", "square"))
)
# Compute the matching-coefficient distance matrix; by the definition in Details,
# d(1,2) = 0.5, d(1,3) = 0.5 and d(2,3) = 1
dbrobust::dist_categorical(df)
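
Since only factor or character columns are accepted, data that store categories as numeric codes need converting before the call. A minimal base R sketch of that preparation step follows; the data frame `raw` and its columns are hypothetical, and whether a numeric column should be treated as categorical at all depends on the data.

```r
# Hypothetical data: 'grade' holds numeric codes that are categorical in meaning
raw <- data.frame(
  colour = c("red", "blue", "red"),
  grade  = c(1, 2, 1)
)
# Convert every numeric column to a factor; other columns are left untouched
is_num <- vapply(raw, is.numeric, logical(1))
raw[is_num] <- lapply(raw[is_num], factor)
str(raw)
# The converted data frame could then be passed on, for example:
# dbrobust::dist_categorical(raw)
```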

dbrobust documentation built on Nov. 5, 2025, 6:24 p.m.