MMD: Maximum Mean Discrepancy (MMD) Test

Description

Performs a two-sample test based on the maximum mean discrepancy (MMD) using either the Rademacher bound, the asymptotic bound, or a permutation testing procedure. The implementation adds a permutation test to the kmmd implementation from the kernlab package.

Usage

MMD(X1, X2, n.perm = 0, alpha = 0.05, asymptotic = FALSE, replace = TRUE, 
    n.times = 150, frac = 1, seed = 42, ...)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for permutation test (default: 0, no permutation test is performed and the decision is based on the Rademacher or asymptotic bound).

alpha

Significance level of the test (default: 0.05). Used to calculate asymptotic or Rademacher bound.

asymptotic

Should the asymptotic bound be calculated? (default: FALSE, only the Rademacher bound is used; TRUE additionally calculates the asymptotic bound, which is suitable for smaller datasets)

replace

Should sampling with replacement be used in computation of asymptotic bounds? (default: TRUE)

n.times

Number of repetitions for sampling procedure (default: 150)

frac

Fraction of points to sample (default: 1)

seed

Random seed (default: 42)

...

Further arguments passed to kmmd that specify the kernel, e.g. kernel for passing the kernel as a character string (default: "rbfdot", the RBF kernel function) and kpar for passing the kernel parameter(s) as a named list (default: "automatic", which uses a heuristic for choosing a good bandwidth for the RBF or Laplace kernel). For details, see kmmd.
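As an illustration of passing kernel arguments through ..., the following sketch requests a Laplace kernel with a fixed bandwidth. The choice of "laplacedot" and the value sigma = 0.5 are assumptions made only for this example; see kmmd and the kernlab kernel documentation for the kernels and parameters that are actually supported.

```r
# Simulated data, as in the Examples section
set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Only run if both packages are installed
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("kernlab", quietly = TRUE)) {
  # kernel and kpar are forwarded to kmmd via '...';
  # sigma = 0.5 is an arbitrary illustrative bandwidth, not a recommendation
  res <- DataSimilarity::MMD(X1, X2, n.perm = 100,
                             kernel = "laplacedot", kpar = list(sigma = 0.5))
  print(res)
}
```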

Details

For a given kernel function k an unbiased estimator for MMD^2 is defined as

\widehat{\text{MMD}}^2(\mathcal{H}, X_1, X_2)_{U} = \frac{1}{n_1(n_1-1)}\sum_{i=1}^{n_1}\sum_{\substack{j=1 \\ j\neq i}}^{n_1} k\left(X_{1i}, X_{1j}\right) \\ + \frac{1}{n_2(n_2-1)}\sum_{i=1}^{n_2}\sum_{\substack{j=1 \\ j\neq i}}^{n_2} k\left(X_{2i}, X_{2j}\right)\\ - \frac{2}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} k\left(X_{1i}, X_{2j}\right).

Its square root is returned as the statistic here.

The theoretical MMD of two distributions is equal to zero if and only if the two distributions coincide. Therefore, low values indicate similarity of datasets and the test rejects for large values.
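The estimator can be sketched in a few lines of base R. This is a minimal illustration, not the code used by MMD or kmmd: the helper names rbf_kernel_matrix and mmd2_unbiased are hypothetical, the RBF kernel is assumed to be parametrized as k(a, b) = exp(-sigma * ||a - b||^2), the bandwidth sigma is fixed rather than chosen by a heuristic, and the cross term is assumed to sum over all pairs of points from the two samples.

```r
# RBF kernel matrix between the rows of A and B (hypothetical helper):
# k(a, b) = exp(-sigma * ||a - b||^2)
rbf_kernel_matrix <- function(A, B, sigma = 1) {
  sq_dist <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B)
  exp(-sigma * pmax(sq_dist, 0))  # guard against tiny negative distances
}

# Unbiased estimate of MMD^2 (hypothetical helper, fixed bandwidth assumed)
mmd2_unbiased <- function(X1, X2, sigma = 1) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  n1 <- nrow(X1)
  n2 <- nrow(X2)
  K11 <- rbf_kernel_matrix(X1, X1, sigma)
  K22 <- rbf_kernel_matrix(X2, X2, sigma)
  K12 <- rbf_kernel_matrix(X1, X2, sigma)
  # Within-sample terms exclude the diagonal (j != i)
  (sum(K11) - sum(diag(K11))) / (n1 * (n1 - 1)) +
    (sum(K22) - sum(diag(K22))) / (n2 * (n2 - 1)) -
    2 * sum(K12) / (n1 * n2)
}

set.seed(1)
A <- matrix(rnorm(200), ncol = 2)
B <- matrix(rnorm(200, mean = 1), ncol = 2)
# The unbiased estimate can be negative, so it is truncated at zero
# before taking the square root
sqrt(max(0, mmd2_unbiased(A, B, sigma = 0.5)))
```

Because the estimate of MMD^2 is unbiased, it can fall slightly below zero when both samples come from the same distribution, which is why the sketch truncates at zero before taking the square root.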

The original proposal of the test is based on critical values calculated asymptotically or using Rademacher bounds. Here, the option of calculating a permutation p value is added. The Rademacher bound is always returned. Additionally, the asymptotic bound can be returned, depending on the value of asymptotic.

This implementation is a wrapper around the function kmmd that modifies the input and output of that function to match the other functions provided in this package. Moreover, a permutation test is added. For more details, see kmmd.
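The idea behind a permutation p value can be sketched generically as follows. This is an illustration under stated assumptions, not the package's internal code: perm_p_value and mean_diff are hypothetical helpers, the simple mean-difference statistic merely stands in for the MMD statistic, and the p value convention (1 + number of permuted statistics at least as large as the observed one) / (n.perm + 1) is one common choice that may differ from what MMD uses.

```r
# Generic permutation p value for a two-sample statistic that rejects
# for large values (hypothetical helper)
perm_p_value <- function(X1, X2, stat_fun, n.perm = 199, seed = 42) {
  set.seed(seed)
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  pooled <- rbind(X1, X2)
  n1 <- nrow(X1)
  n <- nrow(pooled)
  observed <- stat_fun(X1, X2)
  # Recompute the statistic after randomly reassigning rows to the two samples
  perm_stats <- vapply(seq_len(n.perm), function(b) {
    idx <- sample.int(n, n1)
    stat_fun(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  }, numeric(1))
  (1 + sum(perm_stats >= observed)) / (n.perm + 1)
}

# Stand-in statistic: Euclidean distance between the column means
mean_diff <- function(A, B) sqrt(sum((colMeans(A) - colMeans(B))^2))

# Two clearly separated samples
Y1 <- matrix(1:10, ncol = 1)
Y2 <- matrix(101:110, ncol = 1)
p <- perm_p_value(Y1, Y2, mean_diff, n.perm = 199)
p
```

Any statistic that takes two datasets and returns a single number, such as an MMD estimate, could be supplied as stat_fun in this sketch.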

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p value

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis

H0

Is H_0 rejected according to the Rademacher bound?

asymp.H0

Is H_0 rejected according to the asymptotic bound?

kernel.fun

Kernel function used

Rademacher.bound

The Rademacher bound

asymp.bound

The asymptotic bound

Applicability

Target variable?   Numeric?   Categorical?                              K-sample?
No                 Yes        When suitable kernel function is passed   No

References

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B. and Smola, A. (2006). A Kernel Method for the Two-Sample-Problem. Neural Information Processing Systems 2006, Vancouver. https://papers.neurips.cc/paper/3110-a-kernel-method-for-the-two-sample-problem.pdf

Muandet, K., Fukumizu, K., Sriperumbudur, B. and Schölkopf, B. (2017). Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends® in Machine Learning, 10(1-2), 1-141. doi:10.1561/2200000060

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform MMD test 
if(requireNamespace("kernlab", quietly = TRUE)) {
  MMD(X1, X2, n.perm = 100)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.