KMD: Kernel Measure of Multi-Sample Dissimilarity (KMD)

View source: R/KMD.R

R Documentation

Kernel Measure of Multi-Sample Dissimilarity (KMD)

Description

Calculates the kernel measure of multi-sample dissimilarity (KMD) and performs a permutation multi-sample test (Huang and Sen, 2023). This function wraps the KMD and KMD_test functions from the KMD package.

Usage

KMD(X1, X2, ..., n.perm = 0, graph = "knn", k = ceiling(N/10), 
    kernel = "discrete", seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

n.perm

Number of permutations for permutation test (default: 0, no permutation test performed).

graph

Graph used in calculation of KMD. Possible options are "knn" (default) and "mst".

k

Number of neighbors for construction of k-nearest neighbor graph. Ignored for graph = "mst".

kernel

Kernel used in the calculation of the KMD. Either "discrete" (default) for the discrete kernel, or a kernel matrix whose number of rows and columns equals the number of datasets. In the latter case, the entry in the i-th row and j-th column is the kernel value k(i,j).

seed

Random seed (default: 42)
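As a hedged illustration of the kernel argument (the values below are hypothetical, not a recommendation): a user-supplied kernel matrix for three datasets could encode that samples 1 and 2 are considered more alike than either is to sample 3.

```r
# Hypothetical 3 x 3 kernel matrix for a comparison of three datasets;
# entry [i, j] is the kernel value k(i, j) between memberships i and j.
Kmat <- matrix(c(1.0, 0.5, 0.0,
                 0.5, 1.0, 0.0,
                 0.0, 0.0, 1.0), nrow = 3, byrow = TRUE)
# KMD(X1, X2, X3, kernel = Kmat)  # pass instead of kernel = "discrete"
```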

Details

Given the pooled sample Z_1, \dots, Z_N and the corresponding sample memberships \Delta_1,\dots, \Delta_N, let \mathcal{G} be a geometric graph on \mathcal{X} such that an edge between two points Z_i and Z_j in the pooled sample implies that Z_i and Z_j are close, e.g. the K-nearest neighbor graph with K\ge 1 or the MST. Denote by (Z_i,Z_j)\in\mathcal{E}(\mathcal{G}) that there is an edge in \mathcal{G} connecting Z_i and Z_j. Moreover, let o_i be the out-degree of Z_i in \mathcal{G}. Then an estimator for the KMD \eta is defined as

\hat{\eta} := \frac{\frac{1}{N} \sum_{i=1}^N \frac{1}{o_i} \sum_{j:(Z_i,Z_j)\in\mathcal{E}(\mathcal{G})} K(\Delta_i, \Delta_j) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}{\frac{1}{N}\sum_{i=1}^N K(\Delta_i, \Delta_i) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}.

Euclidean distances are used for computing the KNN graph (ties broken at random) and the MST.
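The estimator above can be sketched in a few lines of base R for the special case of the discrete kernel K(a, b) = 1\{a = b\} and a k-NN graph. This is an illustrative reimplementation, not the code used by the package (in particular, ties are broken deterministically by order() here, not at random):

```r
kmd_hat_discrete <- function(Z, delta, k = 1) {
  N <- nrow(Z)
  D <- as.matrix(dist(Z))  # Euclidean inter-point distances
  diag(D) <- Inf           # a point is never its own neighbor
  # Mean kernel agreement along the out-edges of the k-NN graph:
  # (1/N) sum_i (1/o_i) sum_{j: (Z_i, Z_j) edge} K(Delta_i, Delta_j)
  t1 <- mean(sapply(seq_len(N), function(i) {
    nn <- order(D[i, ])[seq_len(k)]
    mean(delta[nn] == delta[i])
  }))
  # Mean kernel over all ordered pairs i != j (the cross term)
  agree <- outer(delta, delta, "==")
  t2 <- (sum(agree) - N) / (N * (N - 1))  # drop the N diagonal entries
  # Denominator: K(Delta_i, Delta_i) = 1 for the discrete kernel
  (t1 - t2) / (1 - t2)
}
```

For two well-separated samples every neighbor shares its point's membership, so the estimate is 1; for indistinguishable samples it fluctuates around 0.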

For n.perm == 0, an asymptotic test based on the normal approximation of the null distribution is performed; for this, the KMD is standardized by its null mean and standard deviation. For n.perm > 0, a permutation test is performed, i.e. the observed KMD statistic is compared to the KMD statistics obtained on permuted sample memberships.
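The permutation branch follows the usual recipe: recompute the statistic after shuffling the sample memberships and compare. A minimal generic sketch (kmd_stat is a placeholder for any statistic function, and the +1 correction shown is one common convention, not necessarily the one used by KMD_test):

```r
perm_pvalue <- function(kmd_stat, Z, delta, n_perm = 100) {
  obs  <- kmd_stat(Z, delta)                             # observed statistic
  perm <- replicate(n_perm, kmd_stat(Z, sample(delta)))  # permuted memberships
  # One-sided p-value: the test rejects for large KMD values
  (1 + sum(perm >= obs)) / (1 + n_perm)
}
```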

The theoretical KMD of two distributions is zero if and only if the distributions coincide, and it is bounded above by one. Therefore, low values of the empirical KMD indicate similarity, and the test rejects for high values.

Huang and Sen (2023) recommend using the k-NN graph for its flexibility, but the choice of k is less clear-cut. Based on the simulation results in the original article, the recommended values are k = 0.1 N for testing and k = 1 for estimation. For increasing power, it is beneficial to choose large values of k; for consistency of the tests, k = o(N / \log(N)) together with a continuous distribution of the inter-point distances is sufficient, i.e. k cannot be chosen too large compared to N. In the context of estimating the KMD, on the other hand, the choice of k is a bias-variance trade-off: small values of k decrease the bias and larger values of k decrease the variance (for more details, see the discussion in Appendix D.3 of Huang and Sen (2023)).
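The recommendations above translate into the following rules of thumb; N and the variable names below are only for illustration:

```r
N <- 200                    # pooled sample size over all datasets
k_test <- ceiling(0.1 * N)  # ~0.1 * N for testing (the function's default)
k_est  <- 1                 # k = 1 when the goal is estimating the KMD itself
# e.g. KMD(X1, X2, k = k_test) for testing vs. KMD(X1, X2, k = k_est)
```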

This function is a wrapper around KMD and KMD_test that adapts their input and output to match the other functions provided in this package. For more details, see KMD and KMD_test.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation or asymptotic p-value

estimate

Estimated KMD value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

graph

Graph used for calculation

k

Number of neighbors used if graph is the KNN graph.

kernel

Kernel used for calculation

Applicability

Target variable?  Numeric?  Categorical?  K-sample?
No                Yes       No            Yes

References

Huang, Z. and Sen, B. (2023). A Kernel Measure of Dissimilarity between M Distributions. Journal of the American Statistical Association, 1-27. doi:10.1080/01621459.2023.2298036.

Huang, Z. (2022). KMD: Kernel Measure of Multi-Sample Dissimilarity. R package version 0.1.0, https://CRAN.R-project.org/package=KMD.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149.

See Also

MMD

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform KMD test 
if(requireNamespace("KMD", quietly = TRUE)) {
  KMD(X1, X2, n.perm = 100)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.