CMDistance: Constrained Minimum Distance
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

Description

Calculates the Constrained Minimum Distance (Tatti, 2007) between two datasets.

Usage

CMDistance(X1, X2, binary = NULL, cov = FALSE,
            S.fun = function(x) as.numeric(as.character(x)), 
            cov.S = NULL, Omega = NULL, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`binary`	Should the simplified form for binary data be used? (default: `NULL`, it is checked internally if each variable in the pooled dataset takes on exactly two distinct values)
`cov`	If the binary version is used, should covariances in addition to means be used as features? (default: `FALSE`, corresponds to example 3 in Tatti (2007), `TRUE` corresponds to example 4). Ignored if `binary = FALSE`.
`S.fun`	Feature function (default: `function(x) as.numeric(as.character(x))`, as shown under Usage). Should be supplied as a function that takes one observation vector as its input. Ignored if `binary = TRUE`.
`cov.S`	Covariance matrix of feature function (default: `NULL`). Ignored if `binary = TRUE`.
`Omega`	Sample space as matrix (default: `NULL`, the sample space is derived from the data internally). Each row represents one value in the sample space. Used for calculating the covariance matrix if `cov.S = NULL`. Either `cov.S` or `Omega` must be given. Ignored if `binary = TRUE`.
`seed`	Random seed (default: `NULL`). A random seed will only be set if one is provided.

Details

The constrained minimum (CM) distance is not a distance between distributions but rather a distance based on summaries. These summaries, called frequencies and denoted by \theta, are averages of feature functions S taken over the dataset. The constrained minimum distance of two datasets X_1 and X_2 can be calculated as

d_{CM}(X_1, X_2 |S)^2 = (\theta_1 - \theta_2)^T\text{Cov}^{-1}(S)(\theta_1 - \theta_2),

where \theta_i = S(X_i) is the frequency with respect to the i-th dataset, i = 1, 2, and

\text{Cov}(S) = \frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)S(\omega)^T - \left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)\left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)^T,

where \Omega denotes the sample space.
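
The general formula above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper name `cm_distance_sketch` is hypothetical, the sample space is passed explicitly with one row per element, and \text{Cov}(S) is computed under the uniform distribution on \Omega as in the formula.

```r
# Hypothetical helper (not part of DataSimilarity): CM distance from the
# general formula, with Cov(S) obtained by enumerating the sample space.
cm_distance_sketch <- function(X1, X2, S.fun, Omega) {
  # Apply the feature function row by row; return one row per observation
  feats <- function(X) {
    out <- apply(as.matrix(X), 1, S.fun)
    if (is.null(dim(out))) matrix(out, ncol = 1) else t(out)
  }
  theta1 <- colMeans(feats(X1))   # frequency of the first dataset
  theta2 <- colMeans(feats(X2))   # frequency of the second dataset
  S.om <- feats(Omega)            # feature values on the sample space
  # Cov(S) = mean(S S^T) - mean(S) mean(S)^T over all elements of Omega
  cov.S <- crossprod(S.om) / nrow(S.om) - tcrossprod(colMeans(S.om))
  diff <- theta1 - theta2
  sqrt(drop(t(diff) %*% solve(cov.S, diff)))
}

# Data from example 2 in Tatti (2007): theta1 = 0.75, theta2 = 0.25,
# Cov(S) = 1/3 - (1/3)^2 = 2/9
d <- cm_distance_sketch(
  X1 = data.frame(c("C", "C", "C", "A")),
  X2 = data.frame(c("C", "A", "B", "A")),
  S.fun = function(x) as.numeric(x == "C"),
  Omega = data.frame(c("A", "B", "C"))
)
d  # sqrt(0.25 / (2/9)), about 1.0607
```

Enumerating \Omega row by row keeps the sketch close to the formula, which is also why this approach is only feasible for small sample spaces.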

Note that the implementation can only handle limited dimensions of the sample space. The error message

"Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : invalid 'times' value"

occurs when the sample space becomes too large to enumerate all its elements. For binary data with S chosen as a conjunction or parity function T_{F} on a family of itemsets, the calculation of the CM distance simplifies to

d_{CM}(X_1, X_2 | S_{F}) = 2 ||\theta_1 - \theta_2||_2,

where \theta_i = T_{F}(X_i), i = 1, 2, as the sample space and covariance matrix are known. In case of more than two categories, either the sample space or the covariance matrix of the feature function must be supplied.
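
As a rough check of the simplification, the following base R sketch compares the simplified form with the general quadratic form. It assumes that only the column means are used as features (singleton itemsets) and takes the covariance matrix 0.25 I stated in the Note section; the simulated data and variable names are purely illustrative.

```r
set.seed(1)
# Two small simulated binary datasets with three variables each
X1 <- matrix(sample(0:1, 60, replace = TRUE), ncol = 3)
X2 <- matrix(sample(0:1, 60, replace = TRUE), ncol = 3)

# Frequencies: column means, i.e. averages of the features S(x) = x
theta1 <- colMeans(X1)
theta2 <- colMeans(X2)
diff <- theta1 - theta2

# Simplified form: twice the Euclidean norm of the frequency difference
d.simplified <- 2 * sqrt(sum(diff^2))

# General form with Cov(S) = 0.25 * I (assumed known for this feature choice)
cov.S <- 0.25 * diag(3)
d.general <- sqrt(drop(t(diff) %*% solve(cov.S) %*% diff))

all.equal(d.simplified, d.general)
```

Both values agree because inverting 0.25 I multiplies the squared norm by 4, which becomes the factor 2 after taking the square root.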

Small values of the CM Distance indicate similarity between the datasets. No test is conducted.

Value

An object of class `htest` with the following components:

`statistic`	Observed value of the CM Distance
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names
`binary`, `cov`, `S.fun`, `cov.S`, `Omega`	Input parameters

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	No	Yes	No

Note

Note that there is an error in the calculation of the covariance matrix in A.4 Proof of Lemma 8 in Tatti (2007). The correct covariance matrix has the form

\text{Cov}[T_{\mathcal{F}}] = 0.25I

since

\text{Var}[T_A] = \text{E}[T_A^2] - \text{E}[T_A]^2 = 0.5 - 0.5^2 = 0.25

following from the correct statement that \text{E}[T_A^2] = \text{E}[T_A] = 0.5. Therefore, formula (4) changes to

d_{CM}(D_1, D_2 | S_{\mathcal{F}}) = 2 ||\theta_1 - \theta_2||_2

and the formula in example 3 changes to

d_{CM}(D_1, D_2 | S_{I}) = 2 ||\theta_1 - \theta_2||_2.

Our implementation is based on these corrected formulas. If the original formula were used, the result of the binary special case and the result of the general formula on the same data would differ by a factor of \sqrt{2}.
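
The corrected covariance can be checked numerically by enumerating a small binary sample space. The sketch below is an illustration in base R under the assumption that the features are the three single variables (singleton itemsets) on \{0, 1\}^3; the difference vector `diff` is an arbitrary made-up example.

```r
# Enumerate the sample space {0, 1}^3 and use the identity as feature function
Omega <- as.matrix(expand.grid(0:1, 0:1, 0:1))

# Cov(S) = mean(S S^T) - mean(S) mean(S)^T under the uniform distribution
cov.S <- crossprod(Omega) / nrow(Omega) - tcrossprod(colMeans(Omega))
cov.S <- unname(cov.S)
all.equal(cov.S, 0.25 * diag(3))

# Arbitrary illustrative frequency difference
diff <- c(0.10, -0.20, 0.05)
d.corrected <- sqrt(drop(t(diff) %*% solve(0.25 * diag(3)) %*% diff))
d.original  <- sqrt(drop(t(diff) %*% solve(0.5 * diag(3)) %*% diff))

# The two versions differ by a factor of sqrt(2)
all.equal(d.corrected / d.original, sqrt(2))
```

The diagonal entries are 0.5 - 0.5^2 = 0.25 and the off-diagonal entries are 0.25 - 0.5 * 0.5 = 0, which matches the variance computation given above.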

References

Tatti, N. (2007). Distances between Data Sets Based on Summary Statistics. JMLR 8, 131-154.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Test example 2 in Tatti (2007)
CMDistance(X1 = data.frame(c("C", "C", "C", "A")), 
           X2 = data.frame(c("C", "A", "B", "A")),
           binary = FALSE, S.fun = function(x) as.numeric(x == "C"),
           Omega = data.frame(c("A", "B", "C")))

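# Demonstration of corrected calculation
set.seed(1234)
X1bin <- matrix(sample(0:1, 100 * 3, replace = TRUE), ncol = 3)
X2bin <- matrix(sample(0:1, 100 * 3, replace = TRUE, prob = 1:2), ncol = 3)
# Simplified binary formula (corrected), means only as features
CMDistance(X1bin, X2bin, binary = TRUE, cov = FALSE)
# General formula with the covariance computed from the enumerated sample space
Omega <- expand.grid(0:1, 0:1, 0:1)
S.fun <- function(x) x
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, Omega = Omega)
# Covariance 0.5 * I in place of the corrected 0.25 * I:
# the statistic is smaller by a factor of sqrt(2)
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, cov.S = 0.5 * diag(3))
# Rescaling by sqrt(2) recovers the value from the corrected formula
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, 
           cov.S = 0.5 * diag(3))$statistic * sqrt(2)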

# Example for non-binary data
set.seed(1234)
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = S.fun, 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x) as.numeric(x == 1), 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x){ 
           c(x, x[1] * x[2], x[1] * x[3], x[2] * x[3])}, 
           Omega = expand.grid(1:4, 1:4, 1:4))

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.