View source: R/dist_categorical.R
| dist_categorical | R Documentation |
Internal helper function that computes pairwise distances between observations using the matching coefficient, i.e. the proportion of attributes on which two categorical vectors agree. It is particularly useful for categorical variables with more than two levels.
dist_categorical(x, method = "matching_coefficient")
| x | A data frame or matrix containing only categorical variables (factor or character). |
| method | Currently only "matching_coefficient" is supported. |
The distance between two observations i and j is defined as:
d(i, j) = 1 - \frac{\alpha}{p'}
where \alpha is the number of matching attributes (agreements) and p' is the number of non-missing comparisons between the two observations.
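The formula can be sketched for a single pair of observations in base R. This is only an illustration of the definition, not the dbrobust source; the helper name `matching_distance` and the sample vectors are assumptions made for this example.

```r
# Hypothetical helper (not part of dbrobust): matching-coefficient distance
# between two categorical vectors of equal length.
matching_distance <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  stopifnot(length(a) == length(b))
  valid <- !is.na(a) & !is.na(b)       # comparisons where both values are observed
  p_prime <- sum(valid)                # p': number of non-missing comparisons
  if (p_prime == 0) return(NA_real_)   # no valid comparison at all
  alpha <- sum(a[valid] == b[valid])   # alpha: number of agreements
  1 - alpha / p_prime
}

# One agreement ("circle") out of two comparisons gives 1 - 1/2 = 0.5
matching_distance(c("red", "circle"), c("blue", "circle"))
```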
Only categorical columns (factor or character) are supported; numeric columns must be converted prior to using this function.
Missing values (NA) are ignored pairwise. If all attributes are missing for a given pair, the distance is returned as NA.
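The pairwise treatment of missing values can be illustrated with a small base R sketch. The helper `pair_dist` and the vectors are made up for this example and only restate the definition above; the package's internal implementation may differ.

```r
# Illustrative vectors (hypothetical data); NA marks a missing attribute.
a <- c("red", NA, "small")
b <- c("red", "circle", NA)
c_all_na <- c(NA_character_, NA_character_, NA_character_)

# Hypothetical helper restating the definition for one pair of observations.
pair_dist <- function(u, v) {
  ok <- !is.na(u) & !is.na(v)        # keep attributes observed in both vectors
  if (!any(ok)) return(NA_real_)     # no valid comparison: distance is NA
  1 - sum(u[ok] == v[ok]) / sum(ok)
}

pair_dist(a, b)         # only attribute 1 is comparable and it matches: 0
pair_dist(a, c_all_na)  # nothing is comparable: NA
```

Note that p' changes from pair to pair: here `a` and `b` are compared on a single attribute even though each vector has three entries.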
This distance is equivalent to the normalized Hamming distance when applied to binary variables.
The distance derived from the matching coefficient satisfies the metric properties when no values are missing, and it can be used as a building block for mixed-type distances (e.g., combined with quantitative distances via Gower's similarity).
A symmetric numeric matrix of pairwise distances. Each distance lies in the range [0, 1], where 0 indicates complete agreement and 1 indicates complete disagreement. NA is returned for pairs with no valid comparisons (all NA entries).
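To make the shape of the return value concrete, the following base R sketch builds a full pairwise matrix from the definition. It is a reference illustration under the assumptions stated in the comments, not the dbrobust implementation; the name `matching_dist_matrix` is hypothetical.

```r
# Hypothetical reference implementation (not the dbrobust source).
# Assumes x holds only factor or character columns.
matching_dist_matrix <- function(x) {
  m <- as.matrix(as.data.frame(x))   # factors become character strings
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ok <- !is.na(m[i, ]) & !is.na(m[j, ])   # attributes observed in both rows
      d[i, j] <- if (any(ok)) 1 - mean(m[i, ok] == m[j, ok]) else NA_real_
    }
  }
  d
}

df <- data.frame(
  A = factor(c("red", "blue", "red")),
  B = factor(c("circle", "circle", "square"))
)
d <- matching_dist_matrix(df)
d
```

Under this sketch, rows 1 and 2 agree on one of two attributes (distance 0.5), rows 1 and 3 likewise (0.5), and rows 2 and 3 agree on none (distance 1). The double loop is quadratic in the number of rows, which is fine for illustration but slow for large inputs.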
# Small categorical dataset
df <- data.frame(
A = factor(c("red", "blue", "red")),
B = factor(c("circle", "circle", "square"))
)
# Compute matching coefficient
dbrobust::dist_categorical(df)