KMD | R Documentation |
Calculates the kernel measure of multi-sample dissimilarity (KMD) and performs a permutation multi-sample test (Huang and Sen, 2023). The implementation here uses the KMD and KMD_test implementations from the KMD package.
KMD(X1, X2, ..., n.perm = 0, graph = "knn", k = ceiling(N/10),
kernel = "discrete", seed = 42)
X1 | First dataset as matrix or data.frame |
X2 | Second dataset as matrix or data.frame |
... | Optionally more datasets as matrices or data.frames |
n.perm | Number of permutations for permutation test (default: 0, no permutation test performed). |
graph | Graph used in calculation of KMD. Possible options are |
k | Number of neighbors for construction of |
kernel | Kernel used in calculation of KMD. Can either be |
seed | Random seed (default: 42) |
Given the pooled sample $Z_1, \dots, Z_N$ and the corresponding sample memberships $\Delta_1, \dots, \Delta_N$, let $\mathcal{G}$ be a geometric graph on $\mathcal{X}$ such that an edge between two points $Z_i$ and $Z_j$ in the pooled sample implies that $Z_i$ and $Z_j$ are close, e.g. the $k$-nearest neighbor graph with $k \ge 1$ or the minimum spanning tree (MST). Write $(Z_i, Z_j) \in \mathcal{E}(\mathcal{G})$ if there is an edge in $\mathcal{G}$ connecting $Z_i$ and $Z_j$. Moreover, let $o_i$ be the out-degree of $Z_i$ in $\mathcal{G}$ and let $K$ denote the kernel on the sample memberships. Then an estimator for the KMD $\eta$ is defined as
$$\hat{\eta} := \frac{\frac{1}{N} \sum_{i=1}^N \frac{1}{o_i} \sum_{j:(Z_i,Z_j)\in\mathcal{E}(\mathcal{G})} K(\Delta_i, \Delta_j) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}{\frac{1}{N}\sum_{i=1}^N K(\Delta_i, \Delta_i) - \frac{1}{N(N-1)} \sum_{i\ne j} K(\Delta_i, \Delta_j)}.$$
Euclidean distances are used for computing the KNN graph (ties broken at random) and the MST.
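The estimator can be traced by hand for a small case. The following is a minimal base-R sketch, not the KMD package's implementation: it assumes a k-NN graph (so every out-degree equals k) and assumes that the discrete kernel is the indicator of equal sample memberships. The name kmd_knn_sketch is hypothetical, and ties are broken by index order here rather than at random.

```r
# Sketch of the KMD estimator for a k-NN graph and the discrete kernel
# K(a, b) = 1 if a == b, else 0. `kmd_knn_sketch` is a hypothetical name.
kmd_knn_sketch <- function(Z, labels, k = 1) {
  Z <- as.matrix(Z)
  N <- nrow(Z)
  stopifnot(length(labels) == N, k >= 1, k < N)

  D <- as.matrix(dist(Z))   # pairwise Euclidean distances
  diag(D) <- Inf            # a point is never its own neighbour

  same <- outer(labels, labels, "==")  # kernel matrix K(Delta_i, Delta_j)

  # First numerator term: average kernel value over the k out-neighbours.
  # In a k-NN graph every out-degree o_i equals k.
  # Assumption: ties are broken by index order, not at random.
  nn_term <- mean(vapply(seq_len(N), function(i) {
    neighbours <- order(D[i, ])[seq_len(k)]
    mean(same[i, neighbours])
  }, numeric(1)))

  # Average kernel value over all ordered pairs i != j.
  pair_term <- (sum(same) - N) / (N * (N - 1))

  # For the discrete kernel, (1/N) * sum_i K(Delta_i, Delta_i) equals 1.
  (nn_term - pair_term) / (1 - pair_term)
}

# Two well-separated groups: every nearest neighbour shares its label.
Z <- matrix(c(0, 0.1, 0.3, 10, 10.1, 10.3), ncol = 1)
labels <- c(1, 1, 1, 2, 2, 2)
eta_hat <- kmd_knn_sketch(Z, labels, k = 1)
```

In this toy example the first numerator term equals one, so the estimate attains its upper bound. The estimate can be negative when neighbours tend to carry different labels.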
For n.perm == 0, an asymptotic test using the asymptotic normal approximation of the null distribution is performed. For this, the KMD is standardized by the null mean and standard deviation. For n.perm > 0, a permutation test is performed, i.e. the observed KMD statistic is compared to the permutation KMD statistics.
The theoretical KMD of two distributions is zero if and only if the distributions coincide. It is bounded above by one. Therefore, low values of the empirical KMD indicate similarity and the test rejects for high values.
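The permutation logic can be illustrated with a small base-R sketch. It does not call KMD_test. The helper names are hypothetical, and the add-one form of the p value is an assumption, since the exact convention is not stated here. As statistic it uses the share of points whose nearest neighbour has the same label. For the 1-NN graph and an indicator kernel, this is the only part of the estimator that changes when labels are shuffled, because the pairwise term depends only on the group sizes.

```r
# Hypothetical helper: share of points whose nearest neighbour has the same
# label. D must be a distance matrix with Inf on the diagonal.
nn_same_label <- function(D, labels) {
  nearest <- apply(D, 1, which.min)
  mean(labels[nearest] == labels)
}

# Hypothetical sketch of a permutation test that rejects for large values.
perm_p_value_sketch <- function(Z, labels, n_perm = 200, seed = 42) {
  set.seed(seed)  # note: this resets the global RNG state
  D <- as.matrix(dist(as.matrix(Z)))  # Euclidean distances
  diag(D) <- Inf

  observed <- nn_same_label(D, labels)
  # Shuffling the labels mimics the null hypothesis of equal distributions.
  permuted <- replicate(n_perm, nn_same_label(D, sample(labels)))

  # Assumption: add-one correction so that the p value is never zero.
  (1 + sum(permuted >= observed)) / (n_perm + 1)
}

# Two clearly separated groups should give a small p value.
Z <- matrix(c(seq(0, 0.9, by = 0.1), seq(10, 10.9, by = 0.1)), ncol = 1)
labels <- rep(c(1, 2), each = 10)
p_separated <- perm_p_value_sketch(Z, labels)
```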
Huang and Sen (2023) recommend using the $k$-NN graph for its flexibility, but the choice of $k$ is unclear. Based on the simulation results in the original article, the recommended values are $k = 0.1 N$ for testing and $k = 1$ for estimation. Large values of $k$ are beneficial for the power of the test. For consistency of the test, however, $k = o(N / \log(N))$ together with a continuous distribution of inter-point distances is sufficient, i.e. $k$ cannot be chosen too large compared to $N$. In the context of estimating the KMD, on the other hand, choosing $k$ is a bias-variance trade-off, with small values of $k$ decreasing the bias and larger values of $k$ decreasing the variance (for more details see the discussion in Appendix D.3 of Huang and Sen (2023)).
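As a usage sketch, the two recommended settings can be written out explicitly. The call below assumes that the wrapper documented on this page is available as DataSimilarity::KMD; the package name is an assumption and is not stated on this page, so the call is guarded and skipped if either package is missing.

```r
set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

N <- nrow(X1) + nrow(X2)    # pooled sample size
k_test <- ceiling(N / 10)   # about 0.1 N, recommended for testing (the default)
k_est <- 1                  # recommended for estimation

# Assumption: the wrapper documented here is exported as DataSimilarity::KMD.
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("KMD", quietly = TRUE)) {
  res_test <- DataSimilarity::KMD(X1, X2, k = k_test)
  res_est <- DataSimilarity::KMD(X1, X2, k = k_est)
  print(res_test$p.value)   # p value of the asymptotic test (n.perm = 0)
  print(res_est$estimate)   # estimated KMD value
}
```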
This implementation is a wrapper around the functions KMD and KMD_test of the KMD package that adapts their input and output to match the other functions provided in this package. For more details see KMD and KMD_test.
An object of class htest with the following components:
statistic | Observed value of the test statistic |
p.value | Permutation / asymptotic p value |
estimate | Estimated KMD value |
alternative | The alternative hypothesis |
method | Description of the test |
data.name | The dataset names |
graph | Graph used for calculation |
k | Number of neighbors used if |
kernel | Kernel used for calculation |
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | Yes |
Huang, Z. and Sen, B. (2023). A Kernel Measure of Dissimilarity between M Distributions. Journal of the American Statistical Association, 0, 1-27. doi:10.1080/01621459.2023.2298036.
Huang, Z. (2022). KMD: Kernel Measure of Multi-Sample Dissimilarity. R package version 0.1.0, https://CRAN.R-project.org/package=KMD.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149.
MMD
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform KMD test
if (requireNamespace("KMD", quietly = TRUE)) {
  KMD(X1, X2, n.perm = 100)
}