LHZ: Empirical Characteristic Distance
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

LHZ	R Documentation

Description

The function implements the Li et al. (2022) empirical characteristic distance between two datasets.

Usage

LHZ(X1, X2, n.perm = 0, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`n.perm`	Number of permutations for permutation test (default: 0, no permutation test performed)
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.

Details

The test statistic

T_{n, m} = \frac{1}{n^2} \sum_{j, q = 1}^n \left( \left\Vert \frac{1}{n} \sum_{k=1}^n e^{i\langle X_k, X_j-X_q \rangle} - \frac{1}{m} \sum_{l=1}^m e^{i\langle Y_l, X_j-X_q\rangle} \right\Vert^2 \right) + \frac{1}{m^2} \sum_{j, q = 1}^m \left( \left\Vert \frac{1}{n} \sum_{k=1}^n e^{i\langle X_k, Y_j-Y_q \rangle} - \frac{1}{m} \sum_{l=1}^m e^{i\langle Y_l, Y_j-Y_q\rangle} \right\Vert^2 \right)

is calculated according to Li et al. (2022). The datasets are denoted by X and Y with respective sample sizes n and m, and X_i denotes the i-th row of dataset X. Furthermore, \Vert \cdot \Vert denotes the Euclidean norm (for the complex-valued differences above, the modulus) and \langle X_i, X_j \rangle denotes the inner product of X_i and X_j.
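
To make the formula concrete, the following is a minimal base-R sketch that evaluates the statistic directly from its definition. The name `char_dist_sketch` is hypothetical and not part of DataSimilarity; the package's own implementation may differ in scaling and efficiency. The sketch stores an n x n^2 complex matrix, so it is only meant for small datasets.

```r
# Hypothetical helper (not part of DataSimilarity): direct evaluation of T_{n, m}
char_dist_sketch <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  # One of the two outer sums: average over all pairwise row differences of Z
  term <- function(Z) {
    k <- nrow(Z)
    # All k^2 differences Z_j - Z_q, one per row of D
    idx <- expand.grid(j = seq_len(k), q = seq_len(k))
    D <- Z[idx$j, , drop = FALSE] - Z[idx$q, , drop = FALSE]
    # Empirical characteristic functions of X and Y at each difference
    ecf_X <- colMeans(exp(1i * (X %*% t(D))))
    ecf_Y <- colMeans(exp(1i * (Y %*% t(D))))
    # Squared modulus of the difference, averaged over the k^2 pairs
    mean(Mod(ecf_X - ecf_Y)^2)
  }
  term(X) + term(Y)
}

set.seed(1234)
X <- matrix(rnorm(60), ncol = 3)
Y <- matrix(rnorm(60, mean = 0.5), ncol = 3)
char_dist_sketch(X, Y)
# Identical inputs: both empirical characteristic functions coincide, so the result is 0
char_dist_sketch(X, X)
```

The first outer sum evaluates both empirical characteristic functions at the differences of rows of X, the second at the differences of rows of Y, which is why the helper `term` is called once per dataset.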

Low values of the test statistic indicate similarity. Therefore, the permutation test rejects for large values of the test statistic.
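
The upper-tailed permutation test described here can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `perm_pvalue_sketch` and `mean_diff` are hypothetical names, the toy statistic stands in for the characteristic distance, and the "+1" convention in the p value is one common choice that may differ from what `LHZ` uses internally.

```r
# Hypothetical helper (not part of DataSimilarity): upper-tailed permutation p value
perm_pvalue_sketch <- function(X, Y, stat, n.perm = 200, seed = NULL) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  # Mirror the documented behaviour: only set a seed if one is provided
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(X)
  pooled <- rbind(X, Y)
  observed <- stat(X, Y)
  # Re-split the pooled rows at random and recompute the statistic each time
  permuted <- replicate(n.perm, {
    idx <- sample(nrow(pooled), n)
    stat(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })
  # Large values speak against similarity, so count permuted values >= observed.
  # The +1 terms include the observed split itself (an assumed convention).
  (1 + sum(permuted >= observed)) / (n.perm + 1)
}

# Toy statistic for illustration only: squared distance between column means
mean_diff <- function(A, B) sum((colMeans(A) - colMeans(B))^2)

set.seed(1)
A <- matrix(rnorm(60), ncol = 3)
B <- matrix(rnorm(60, mean = 2), ncol = 3)
p <- perm_pvalue_sketch(A, B, stat = mean_diff, n.perm = 199, seed = 42)
p
```

Any function of two matrices that returns a single number, with larger values indicating greater dissimilarity, can be passed as `stat`.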

Value

An object of class htest with the following components:

`method`	Description of the test
`statistic`	Observed value of the test statistic
`p.value`	Permutation p value (only if `n.perm` > 0)
`data.name`	The dataset names
`alternative`	The alternative hypothesis

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	No

References

Li, X., Hu, W. and Zhang, B. (2022). Measuring and testing homogeneity of distributions by characteristic distance, Statistical Papers 64 (2), 529-556, doi:10.1007/s00362-022-01327-7

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison, Statistics Surveys 18, 163-298, doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Calculate LHZ statistic
LHZ(X1, X2)

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.