CMDistance: Constrained Minimum Distance


CMDistance    R Documentation

Constrained Minimum Distance

Description

Calculates the Constrained Minimum Distance (Tatti, 2007) between two datasets.

Usage

CMDistance(X1, X2, binary = NULL, cov = FALSE,
            S.fun = function(x) as.numeric(as.character(x)), 
            cov.S = NULL, Omega = NULL, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

binary

Should the simplified form for binary data be used? (default: NULL, in which case it is checked internally whether each variable in the pooled dataset takes on exactly two distinct values)

cov

If the binary version is used, should covariances be used as features in addition to means? (default: FALSE, which corresponds to example 3 in Tatti (2007); TRUE corresponds to example 4). Ignored if binary = FALSE.

S.fun

Feature function (default: function(x) as.numeric(as.character(x)), see Usage). Should be supplied as a function that takes one observation vector as its input. Ignored if binary = TRUE.

cov.S

Covariance matrix of the feature function (default: NULL). Ignored if binary = TRUE.

Omega

Sample space as matrix (default: NULL, the sample space is derived from the data internally). Each row represents one value in the sample space. Used for calculating the covariance matrix if cov.S = NULL. Either cov.S or Omega must be given. Ignored if binary = TRUE.

seed

Random seed (default: 42)
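
As an illustration of the Omega argument, the following base R sketch builds a sample space with one row per element from the values observed in the pooled data, and stops early when the space would be too large to enumerate. The helper name build_omega_sketch and the max_rows limit are assumptions made for this example; this is one plausible construction, not necessarily what CMDistance does internally.

```r
# Hypothetical helper (not part of DataSimilarity): build Omega from the
# values observed for each variable in the pooled data.
build_omega_sketch <- function(X1, X2, max_rows = 1e6) {
  pooled <- rbind(as.matrix(X1), as.matrix(X2))
  # Distinct observed values of each variable
  values <- lapply(seq_len(ncol(pooled)), function(j) sort(unique(pooled[, j])))
  # The sample space is the Cartesian product, so its size is a product
  n_rows <- prod(lengths(values))
  if (n_rows > max_rows) {
    stop("Sample space has ", n_rows, " elements; consider supplying cov.S.")
  }
  # One row per element of the sample space
  as.matrix(expand.grid(values))
}

X1 <- matrix(c(1, 2,  1, 1,  3, 4), ncol = 3)
X2 <- matrix(c(3, 1,  2, 2,  4, 4), ncol = 3)
Omega <- build_omega_sketch(X1, X2)
dim(Omega)  # 3 * 2 * 2 = 12 rows, 3 columns
```

Because the number of rows is the product of the numbers of distinct values per variable, it grows exponentially with the number of variables, which is why passing cov.S directly can be preferable for larger problems.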

Details

The constrained minimum (CM) distance is not a distance between distributions but rather a distance based on summaries. These summaries, called frequencies and denoted by \theta, are averages of feature functions S taken over the dataset. The constrained minimum distance of two datasets X_1 and X_2 can be calculated as

d_{CM}(X_1, X_2 |S)^2 = (\theta_1 - \theta_2)^T\text{Cov}^{-1}(S)(\theta_1 - \theta_2),

where \theta_i = S(X_i) is the frequency with respect to the i-th dataset, i = 1, 2, and

\text{Cov}(S) = \frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)S(\omega)^T - \left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)\left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)^T,

where \Omega denotes the sample space.
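
The two formulas above can be traced step by step in base R. The sketch below is a minimal illustration, assuming a sample space small enough to enumerate and a uniform distribution over it; the function name cm_distance_sketch is hypothetical and the code is not the package's implementation. It uses the same data as the first call in the Examples section.

```r
# Hypothetical helper (not part of DataSimilarity): CM distance computed
# directly from the definition.
cm_distance_sketch <- function(X1, X2, S, Omega) {
  # Apply S to every row; returns a matrix with one row per observation
  feats <- function(D) {
    D <- as.matrix(D)
    do.call(rbind, lapply(seq_len(nrow(D)), function(i) S(D[i, ])))
  }
  # Frequencies: averages of the feature function over each dataset
  theta1 <- colMeans(feats(X1))
  theta2 <- colMeans(feats(X2))
  # Cov(S) over the sample space: mean of S S^T minus outer product of means
  SO <- feats(Omega)
  mu <- colMeans(SO)
  cov_S <- crossprod(SO) / nrow(SO) - tcrossprod(mu)
  # Quadratic form (theta1 - theta2)^T Cov^{-1}(S) (theta1 - theta2)
  delta <- theta1 - theta2
  sqrt(sum(delta * solve(cov_S, delta)))
}

X1 <- data.frame(v = c("C", "C", "C", "A"))
X2 <- data.frame(v = c("C", "A", "B", "A"))
Omega <- data.frame(v = c("A", "B", "C"))
d <- cm_distance_sketch(X1, X2, S = function(x) as.numeric(x == "C"),
                        Omega = Omega)
# theta1 = 0.75, theta2 = 0.25, Cov(S) = 1/3 - 1/9 = 2/9,
# so d^2 = 0.25 / (2/9) = 1.125
d
```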

Note that the implementation can only handle limited dimensions of the sample space. The error message

"Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : invalid 'times' value"

occurs when the sample space becomes too large to enumerate all its elements. In case of binary data and S chosen as a conjunction or parity function T_{F} on a family of itemsets, the calculation of the CMD simplifies to

d_{CM}(D_1, D_2 | S_{F}) = 2 ||\theta_1 - \theta_2||_2,

where \theta_i = T_{F}(X_i), i = 1, 2, as the sample space and covariance matrix are known. In case of more than two categories, either the sample space or the covariance matrix of the feature function must be supplied.
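
The simplification can be checked numerically. The following base R sketch assumes S(x) = x (means only) on three binary variables and a uniform distribution on the sample space {0, 1}^3; the seed and sample sizes are arbitrary choices for the example. It computes Cov(S) by enumeration and compares the general quadratic form with 2 ||\theta_1 - \theta_2||_2.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
X1 <- matrix(sample(0:1, 60, replace = TRUE), ncol = 3)
X2 <- matrix(sample(0:1, 60, replace = TRUE, prob = c(1, 2)), ncol = 3)

# Sample space {0, 1}^3, one row per element
Omega <- as.matrix(expand.grid(0:1, 0:1, 0:1))

# Covariance of S(x) = x over the sample space
mu <- colMeans(Omega)
cov_S <- crossprod(Omega) / nrow(Omega) - tcrossprod(mu)
unname(cov_S)  # diagonal entries 0.25, off-diagonal entries 0

# General formula versus simplified binary formula
delta <- colMeans(X1) - colMeans(X2)
d_general <- sqrt(sum(delta * solve(cov_S, delta)))
d_binary <- 2 * sqrt(sum(delta^2))
all.equal(d_general, d_binary)
```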

Small values of the CM Distance indicate similarity between the datasets. No test is conducted.

Value

An object of class htest with the following components:

statistic

Observed value of the CM Distance

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

binary, cov, S.fun, cov.S, Omega

Input parameters

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 No         Yes            No

Note

Note that there is an error in the calculation of the covariance matrix in A.4 Proof of Lemma 8 in Tatti (2007). The correct covariance matrix has the form

\text{Cov}[T_{\mathcal{F}}] = 0.25I

since

\text{Var}[T_A] = \text{E}[T_A^2] - \text{E}[T_A]^2 = 0.5 - 0.5^2 = 0.25

following from the correct statement that \text{E}[T_A^2] = \text{E}[T_A] = 0.5. Therefore, formula (4) changes to

d_{CM}(D_1, D_2 | S_{\mathcal{F}}) = 2 ||\theta_1 - \theta_2||_2

and the formula in example 3 changes to

d_{CM}(D_1, D_2 | S_{I}) = 2 ||\theta_1 - \theta_2||_2.

Our implementation is based on these corrected formulas. If the original formula were used, the results calculated on the same data with the formula for the binary special case and with the general formula would differ by a factor of \sqrt{2}.
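
The factor \sqrt{2} can be derived in a few lines. The sketch below writes \Delta = \theta_1 - \theta_2 and assumes that the uncorrected covariance matrix is 0.5 I, as suggested by the call with cov.S = 0.5 * diag(3) in the Examples section; that value is an assumption of this derivation, not a quotation from Tatti (2007).

```latex
% Corrected covariance: Cov[T_F] = 0.25 I
d_{CM}^2 = \Delta^T (0.25 I)^{-1} \Delta = 4 \|\Delta\|_2^2
\quad\Rightarrow\quad d_{CM} = 2 \|\Delta\|_2

% Assumed uncorrected covariance: 0.5 I
\tilde{d}_{CM}^2 = \Delta^T (0.5 I)^{-1} \Delta = 2 \|\Delta\|_2^2
\quad\Rightarrow\quad \tilde{d}_{CM} = \sqrt{2} \|\Delta\|_2

% Ratio of the two results
\frac{d_{CM}}{\tilde{d}_{CM}} = \frac{2}{\sqrt{2}} = \sqrt{2}
```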

References

Tatti, N. (2007). Distances between Data Sets Based on Summary Statistics. JMLR 8, 131-154.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Test example 2 in Tatti (2007)
CMDistance(X1 = data.frame(c("C", "C", "C", "A")), 
           X2 = data.frame(c("C", "A", "B", "A")),
           binary = FALSE, S.fun = function(x) as.numeric(x == "C"),
           Omega = data.frame(c("A", "B", "C")))

# Demonstration of corrected calculation
X1bin <- matrix(sample(0:1, 100 * 3, replace = TRUE), ncol = 3)
X2bin <- matrix(sample(0:1, 100 * 3, replace = TRUE, prob = 1:2), ncol = 3)
# Simplified binary formula (corrected version)
CMDistance(X1bin, X2bin, binary = TRUE, cov = FALSE)
# General formula with the covariance computed from the sample space
Omega <- expand.grid(0:1, 0:1, 0:1)
S.fun <- function(x) x
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, Omega = Omega)
# General formula with cov.S = 0.5 * I instead of the corrected 0.25 * I
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, cov.S = 0.5 * diag(3))
# Multiplying that result by sqrt(2) recovers the corrected value
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, 
           cov.S = 0.5 * diag(3))$statistic * sqrt(2)

# Example for non-binary data
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = S.fun, 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x) as.numeric(x == 1), 
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x){ 
           c(x, x[1] * x[2], x[1] * x[3], x[2] * x[3])}, 
           Omega = expand.grid(1:4, 1:4, 1:4))

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.