CMDistance | R Documentation |
Calculates the Constrained Minimum Distance (Tatti, 2007) between two datasets.
CMDistance(X1, X2, binary = NULL, cov = FALSE,
S.fun = function(x) as.numeric(as.character(x)),
cov.S = NULL, Omega = NULL, seed = 42)
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
binary |
Should the simplified form for binary data be used? (default: NULL) |
cov |
If the binary version is used, should covariances in addition to means be used as features? (default: FALSE) |
S.fun |
Feature function (default: function(x) as.numeric(as.character(x))) |
cov.S |
Covariance matrix of feature function (default: NULL) |
Omega |
Sample space as matrix (default: NULL) |
seed |
Random seed (default: 42) |
The constrained minimum (CM) distance is not a distance between distributions but rather a distance based on summaries. These summaries, called frequencies and denoted by \theta, are averages of feature functions S taken over the dataset. The constrained minimum distance of two datasets X_1 and X_2 can be calculated as
d_{CM}(X_1, X_2 | S)^2 = (\theta_1 - \theta_2)^T \text{Cov}^{-1}(S) (\theta_1 - \theta_2),
where \theta_i = S(X_i) is the frequency with respect to the i-th dataset, i = 1, 2, and
\text{Cov}(S) = \frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)S(\omega)^T - \left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)\left(\frac{1}{|\Omega|}\sum_{\omega\in\Omega} S(\omega)\right)^T,
where \Omega denotes the sample space.
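The general formula can be sketched in base R as follows. This is a minimal illustration, not the package implementation: the helper name cm_distance_sketch and its argument names are hypothetical, and it assumes the sample space is small enough to enumerate and that Cov(S) is invertible.

```r
# Hypothetical helper illustrating the general CM distance formula.
# X1, X2: datasets (matrix or data.frame); S: feature function applied to
# one observation; Omega: enumerated sample space, one element per row.
cm_distance_sketch <- function(X1, X2, S, Omega) {
  # Apply S to every row; the result has one row per observation
  feats <- function(D) {
    D <- as.matrix(D)
    do.call(rbind, lapply(seq_len(nrow(D)), function(i) S(D[i, ])))
  }
  # Frequencies: averages of the feature function over each dataset
  theta1 <- colMeans(feats(X1))
  theta2 <- colMeans(feats(X2))
  # Covariance of S under the uniform distribution on Omega
  S_omega <- feats(Omega)
  cov_S <- crossprod(S_omega) / nrow(S_omega) - tcrossprod(colMeans(S_omega))
  # Quadratic form (theta1 - theta2)^T Cov(S)^{-1} (theta1 - theta2)
  delta <- theta1 - theta2
  sqrt(sum(delta * solve(cov_S, delta)))
}

# Usage with the data from the first example below
d <- cm_distance_sketch(
  X1 = data.frame(c("C", "C", "C", "A")),
  X2 = data.frame(c("C", "A", "B", "A")),
  S = function(x) as.numeric(x == "C"),
  Omega = data.frame(c("A", "B", "C"))
)
d
```

Here the frequencies are 0.75 and 0.25, and the variance of the indicator over the three-element sample space is 1/3 - 1/9 = 2/9, so the squared distance is 0.25 / (2/9).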
Note that the implementation can only handle limited dimensions of the sample space. The error message
"Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : invalid 'times' value"
occurs when the sample space becomes too large to enumerate all its elements.
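As a rough illustration of how quickly the sample space grows, the sketch below counts the elements of \Omega for categorical data without enumerating them. The variable names are hypothetical, and the comparison with .Machine$integer.max is only an assumption about when enumeration is likely to fail, not a documented limit of the function.

```r
# Number of elements in the sample space for k variables with
# n_categories levels each: n_categories^k
n_categories <- 4
n_vars <- c(3, 10, 20)
omega_size <- n_categories^n_vars
omega_size

# Assumption: enumeration is infeasible once the size exceeds the largest
# representable integer; memory may run out well before that point
omega_size > .Machine$integer.max
```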
In case of binary data and S chosen as a conjunction or parity function T_{F} on a family of itemsets, the calculation of the CMD simplifies to
d_{CM}(X_1, X_2 | S_{F}) = 2 ||\theta_1 - \theta_2||_2,
where \theta_i = T_{F}(X_i), i = 1, 2, as the sample space and covariance matrix are known. In case of more than two categories, either the sample space or the covariance matrix of the feature function must be supplied.
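The binary shortcut can be checked against the general formula in a few lines of base R. The sketch assumes the item means (singleton itemsets) as features; the seed and all variable names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
X1 <- matrix(sample(0:1, 300, replace = TRUE), ncol = 3)
X2 <- matrix(sample(0:1, 300, replace = TRUE, prob = c(1, 2)), ncol = 3)

# Frequencies: column means of the binary data
theta1 <- colMeans(X1)
theta2 <- colMeans(X2)
delta <- theta1 - theta2

# Simplified form: 2 * Euclidean norm of the frequency difference
d_short <- 2 * sqrt(sum(delta^2))

# General form with Cov(S) computed over the enumerated sample space {0,1}^3
Omega <- as.matrix(expand.grid(0:1, 0:1, 0:1))
cov_S <- crossprod(Omega) / nrow(Omega) - tcrossprod(colMeans(Omega))
d_general <- sqrt(sum(delta * solve(cov_S, delta)))

all.equal(d_short, d_general)
```

Both values agree because the covariance of the identity features over {0,1}^3 is 0.25 times the identity matrix.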
Small values of the CM Distance indicate similarity between the datasets. No test is conducted.
An object of class htest with the following components:
statistic |
Observed value of the CM Distance |
alternative |
The alternative hypothesis |
method |
Description of the test |
data.name |
The dataset names |
binary , cov , S.fun , cov.S , Omega |
Input parameters |
Target variable? | Numeric? | Categorical? | K-sample? |
No | No | Yes | No |
Note that there is an error in the calculation of the covariance matrix in A.4 Proof of Lemma 8 in Tatti (2007). The correct covariance matrix has the form
\text{Cov}[T_{\mathcal{F}}] = 0.25I
since
\text{Var}[T_A] = \text{E}[T_A^2] - \text{E}[T_A]^2 = 0.5 - 0.5^2 = 0.25,
following from the correct statement that \text{E}[T_A^2] = \text{E}[T_A] = 0.5. Therefore, formula (4) changes to
d_{CM}(D_1, D_2 | S_{\mathcal{F}}) = 2 ||\theta_1 - \theta_2||_2
and the formula in example 3 changes to
d_{CM}(D_1, D_2 | S_{I}) = 2 ||\theta_1 - \theta_2||_2.
Our implementation is based on these corrected formulas. If the original formula was used, the results on the same data calculated with the formula for the binary special case and the results calculated with the general formula differ by a factor of \sqrt{2}.
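The corrected covariance can be verified numerically by enumerating the sample space. The sketch below is an illustration under stated assumptions: it uses three binary items and the parity functions of all seven non-empty itemsets, and it takes 0.5 I as the uncorrected covariance, matching the cov.S = 0.5 * diag(3) value used in the examples. All variable names are hypothetical.

```r
# Sample space {0,1}^3 and all non-empty itemsets as 0/1 indicator rows
Omega <- as.matrix(expand.grid(0:1, 0:1, 0:1))
itemsets <- Omega[-1, ]  # the first row is the empty itemset (0, 0, 0)

# Parity function T_A(omega): number of items of A present in omega, modulo 2
parity <- (Omega %*% t(itemsets)) %% 2  # 8 sample points x 7 itemsets

# Covariance of the parity features under the uniform distribution on Omega
cov_T <- crossprod(parity) / nrow(parity) - tcrossprod(colMeans(parity))
all.equal(cov_T, 0.25 * diag(7), check.attributes = FALSE)

# Effect on the distance for an arbitrary frequency difference
delta <- c(0.1, -0.2, 0.05, 0, 0.3, -0.1, 0.2)
d_corrected <- sqrt(sum(delta * solve(0.25 * diag(7), delta)))
d_uncorrected <- sqrt(sum(delta * solve(0.5 * diag(7), delta)))
d_corrected / d_uncorrected  # ratio of the two results
```

The ratio equals \sqrt{2} regardless of the chosen difference vector, since the two covariance matrices differ only by a factor of 2.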
Tatti, N. (2007). Distances between Data Sets Based on Summary Statistics. JMLR 8, 131-154.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
# Test example 2 in Tatti (2007)
CMDistance(X1 = data.frame(c("C", "C", "C", "A")),
           X2 = data.frame(c("C", "A", "B", "A")),
           binary = FALSE, S.fun = function(x) as.numeric(x == "C"),
           Omega = data.frame(c("A", "B", "C")))
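# Demonstration of corrected calculation
X1bin <- matrix(sample(0:1, 100 * 3, replace = TRUE), ncol = 3)
X2bin <- matrix(sample(0:1, 100 * 3, replace = TRUE, prob = 1:2), ncol = 3)
# Simplified form for binary data, using only means as features
CMDistance(X1bin, X2bin, binary = TRUE, cov = FALSE)
# General form with an explicitly enumerated sample space
Omega <- expand.grid(0:1, 0:1, 0:1)
S.fun <- function(x) x
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, Omega = Omega)
# General form with the uncorrected covariance 0.5 * I from the original paper
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun, cov.S = 0.5 * diag(3))
# Rescaling the uncorrected result by sqrt(2) recovers the corrected value
CMDistance(X1bin, X2bin, binary = FALSE, S.fun = S.fun,
           cov.S = 0.5 * diag(3))$statistic * sqrt(2)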
# Example for non-binary data
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = S.fun,
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE, S.fun = function(x) as.numeric(x == 1),
           Omega = expand.grid(1:4, 1:4, 1:4))
CMDistance(X1cat, X2cat, binary = FALSE,
           S.fun = function(x) {
             c(x, x[1] * x[2], x[1] * x[3], x[2] * x[3])
           },
           Omega = expand.grid(1:4, 1:4, 1:4))