KMD: Kernel Measure of Multi-Sample Dissimilarity (KMD)

View source: R/KMD.R

R Documentation

Kernel Measure of Multi-Sample Dissimilarity (KMD)

Description

Calculates the kernel measure of multi-sample dissimilarity (KMD) and performs a permutation multi-sample test (Huang and Sen, 2023). This function wraps the KMD and KMD_test functions from the KMD package.

Usage

KMD(X1, X2, ..., n.perm = 0, graph = "knn", k = ceiling(N/10), 
    kernel = "discrete", seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

n.perm

Number of permutations for permutation test (default: 0, no permutation test performed).

graph

Graph used in calculation of KMD. Possible options are "knn" (default) and "mst".

k

Number of neighbors for construction of k-nearest neighbor graph. Ignored for graph = "mst".

kernel

Kernel used in the calculation of the KMD. Either "discrete" (default) for the discrete kernel, or a kernel matrix whose number of rows and columns equals the number of datasets. In the latter case, the entry in the i-th row and j-th column is the kernel value k(i,j).

seed

Random seed (default: 42)
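As a hedged illustration of the kernel argument (the values below are hypothetical, not a recommendation): a user-supplied kernel matrix for three datasets could encode that samples 1 and 2 are considered more alike than either is to sample 3.

```r
# Hypothetical 3 x 3 kernel matrix for a comparison of three datasets;
# entry [i, j] is the kernel value k(i, j) between memberships i and j.
Kmat <- matrix(c(1.0, 0.5, 0.0,
                 0.5, 1.0, 0.0,
                 0.0, 0.0, 1.0), nrow = 3, byrow = TRUE)
# KMD(X1, X2, X3, kernel = Kmat)  # pass instead of kernel = "discrete"
```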

Details

Given the pooled sample Z_1, \dots, Z_N and the corresponding sample memberships \Delta_1,\dots, \Delta_N, let \mathcal{G} be a geometric graph on \mathcal{X} such that an edge between two points Z_i and Z_j in the pooled sample implies that Z_i and Z_j are close, e.g. the K-nearest neighbor graph with K\ge 1 or the MST. Denote by (Z_i,Z_j)\in\mathcal{E}(\mathcal{G}) that there is an edge in \mathcal{G} connecting Z_i and Z_j. Moreover, let o_i be the out-degree of Z_i in \mathcal{G}. Then an estimator for the KMD \eta is defined as

\hat{\eta} := \frac{\frac{1}{N} \sum_{i=1}^N \frac{1}{o_i} \sum_{j:(Z_i,Z_j)\in\mathcal{E}(\mathcal{G})} K(\Delta_i, \Delta_j) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}{\frac{1}{N}\sum_{i=1}^N K(\Delta_i, \Delta_i) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}.

Euclidean distances are used for computing the KNN graph (ties broken at random) and the MST.
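The estimator above can be sketched in a few lines of base R for the special case of the discrete kernel K(a, b) = 1\{a = b\} and a k-NN graph. This is an illustrative reimplementation, not the code used by the package (in particular, ties are broken deterministically by order() here, not at random):

```r
kmd_hat_discrete <- function(Z, delta, k = 1) {
  N <- nrow(Z)
  D <- as.matrix(dist(Z))  # Euclidean inter-point distances
  diag(D) <- Inf           # a point is never its own neighbor
  # Mean kernel agreement along the out-edges of the k-NN graph:
  # (1/N) sum_i (1/o_i) sum_{j: (Z_i, Z_j) edge} K(Delta_i, Delta_j)
  t1 <- mean(sapply(seq_len(N), function(i) {
    nn <- order(D[i, ])[seq_len(k)]
    mean(delta[nn] == delta[i])
  }))
  # Mean kernel over all ordered pairs i != j (the cross term)
  agree <- outer(delta, delta, "==")
  t2 <- (sum(agree) - N) / (N * (N - 1))  # drop the N diagonal entries
  # Denominator: K(Delta_i, Delta_i) = 1 for the discrete kernel
  (t1 - t2) / (1 - t2)
}
```

For two well-separated samples every neighbor shares its point's membership, so the estimate is 1; for indistinguishable samples it fluctuates around 0.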

For n.perm == 0, an asymptotic test based on the normal approximation of the null distribution is performed; for this, the KMD is standardized by its null mean and standard deviation. For n.perm > 0, a permutation test is performed, i.e. the observed KMD statistic is compared to the KMD statistics obtained on permuted sample memberships.
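The permutation branch follows the usual recipe: recompute the statistic after shuffling the sample memberships and compare. A minimal generic sketch (kmd_stat is a placeholder for any statistic function, and the +1 correction shown is one common convention, not necessarily the one used by KMD_test):

```r
perm_pvalue <- function(kmd_stat, Z, delta, n_perm = 100) {
  obs  <- kmd_stat(Z, delta)                             # observed statistic
  perm <- replicate(n_perm, kmd_stat(Z, sample(delta)))  # permuted memberships
  # One-sided p-value: the test rejects for large KMD values
  (1 + sum(perm >= obs)) / (1 + n_perm)
}
```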

The theoretical KMD of two distributions is zero if and only if the distributions coincide, and it is bounded above by one. Therefore, low values of the empirical KMD indicate similarity, and the test rejects for high values.

Huang and Sen (2023) recommend using the k-NN graph for its flexibility, but the choice of k is less clear-cut. Based on the simulation results in the original article, the recommended values are k = 0.1 N for testing and k = 1 for estimation. For increasing power, it is beneficial to choose large values of k; for consistency of the tests, k = o(N / \log(N)) together with a continuous distribution of the inter-point distances is sufficient, i.e. k cannot be chosen too large compared to N. In the context of estimating the KMD, on the other hand, the choice of k is a bias-variance trade-off: small values of k decrease the bias and larger values of k decrease the variance (for more details, see the discussion in Appendix D.3 of Huang and Sen (2023)).
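The recommendations above translate into the following rules of thumb; N and the variable names below are only for illustration:

```r
N <- 200                    # pooled sample size over all datasets
k_test <- ceiling(0.1 * N)  # ~0.1 * N for testing (the function's default)
k_est  <- 1                 # k = 1 when the goal is estimating the KMD itself
# e.g. KMD(X1, X2, k = k_test) for testing vs. KMD(X1, X2, k = k_est)
```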

This function is a wrapper around KMD and KMD_test that adapts their input and output to match the other functions provided in this package. For more details, see KMD and KMD_test.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation or asymptotic p-value

estimate

Estimated KMD value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

graph

Graph used for calculation

k

Number of neighbors used if graph is the KNN graph.

kernel

Kernel used for calculation

Applicability

Target variable?  Numeric?  Categorical?  K-sample?
No                Yes       No            Yes

References

Huang, Z. and Sen, B. (2023). A Kernel Measure of Dissimilarity between M Distributions. Journal of the American Statistical Association, 1-27. doi:10.1080/01621459.2023.2298036.

Huang, Z. (2022). KMD: Kernel Measure of Multi-Sample Dissimilarity. R package version 0.1.0, https://CRAN.R-project.org/package=KMD.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149.

See Also

MMD

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform KMD test 
if(requireNamespace("KMD", quietly = TRUE)) {
  KMD(X1, X2, n.perm = 100)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.