GPK: Generalized Permutation-Based Kernel (GPK) Two-Sample Test

View source: R/GPK.R

GPK    R Documentation

Generalized Permutation-Based Kernel (GPK) Two-Sample Test

Description

Performs the generalized permutation-based kernel (GPK) two-sample test proposed by Song and Chen (2021). This implementation uses the function kertests from the kerTests package.

Usage

GPK(X1, X2, n.perm = 0, fast = (n.perm == 0), M = FALSE,
    sigma = findSigma(X1, X2), r1 = 1.2, r2 = 0.8, seed = 42)
findSigma(X1, X2)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for the permutation test (default: 0, fast test is performed). For fast = FALSE, only the permutation test is available and no asymptotic test. For fast = TRUE, either an asymptotic test (set n.perm = 0) or a permutation test (set n.perm > 0) can be performed.

fast

Should the fast test be performed? (default: TRUE if n.perm = 0, FALSE if n.perm > 0)

M

Should the MMD approximation test be performed? (default: FALSE). Ignored if fast = FALSE.

sigma

Bandwidth parameter of the kernel. By default the median heuristic is used to choose sigma.

r1

Constant in the test statistic Z_{W, r1} for the fast test (default: 1.2, proposed in original article)

r2

Constant in the test statistic Z_{W, r2} for the fast test (default: 0.8, proposed in original article)

seed

Random seed (default: 42)
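
The interplay of n.perm, fast and M described above can be summarized in a small sketch. The helper gpk_variant is hypothetical and not part of DataSimilarity; it only names the test variant that the argument descriptions imply. In particular, stopping for fast = FALSE with n.perm = 0 is an assumption of this sketch, since the behavior of GPK itself in that case is not documented here.

```r
# Hypothetical helper (not part of DataSimilarity): names the test variant
# implied by a combination of n.perm, fast and M, following the argument text.
gpk_variant <- function(n.perm = 0, fast = (n.perm == 0), M = FALSE) {
  if (!fast) {
    # Assumption of this sketch: without permutations there is no test to run.
    if (n.perm <= 0) stop("fast = FALSE requires n.perm > 0")
    return("GPK permutation test")
  }
  paste("fast", if (M) "MMD" else "GPK", "test,",
        if (n.perm > 0) "permutation version" else "asymptotic version")
}

v_default <- gpk_variant()               # fast defaults to TRUE since n.perm = 0
v_perm    <- gpk_variant(n.perm = 100)   # fast defaults to FALSE
v_mmd     <- gpk_variant(n.perm = 100, fast = TRUE, M = TRUE)
print(c(v_default, v_perm, v_mmd))
```

The default fast = (n.perm == 0) is evaluated lazily, so it picks up whatever value of n.perm the caller supplies.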

Details

The GPK test is motivated by the observation that the MMD test performs poorly for detecting differences in variances. The unbiased MMD^2 estimator for a given kernel function k can be written as

\text{MMD}_u^2 = \alpha + \beta - 2\gamma, \text{ where}

\alpha = \frac{1}{n_1^2 - n_1}\sum_{i=1}^{n_1}\sum_{j=1, j\ne i}^{n_1} k(X_{1i}, X_{1j}),

\beta = \frac{1}{n_2^2 - n_2}\sum_{i=1}^{n_2}\sum_{j=1, j\ne i}^{n_2} k(X_{2i}, X_{2j}),

\gamma = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} k(X_{1i}, X_{2j}).
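
As a rough illustration of these three averages, the following base R sketch computes \alpha, \beta, \gamma and \text{MMD}_u^2 for two small samples. It assumes a Gaussian kernel of the form k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the exact kernel parameterization used by kertests may differ, and the helper names gauss_kernel and mmd_components are hypothetical.

```r
# Gaussian kernel matrix between the rows of A and B (assumed kernel form).
gauss_kernel <- function(A, B, sigma) {
  sq_dist <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B)
  exp(-pmax(sq_dist, 0) / (2 * sigma^2))  # pmax guards against rounding below 0
}

# alpha, beta, gamma and the unbiased MMD^2 estimate.
mmd_components <- function(X1, X2, sigma) {
  n1 <- nrow(X1)
  n2 <- nrow(X2)
  K11 <- gauss_kernel(X1, X1, sigma)
  K22 <- gauss_kernel(X2, X2, sigma)
  K12 <- gauss_kernel(X1, X2, sigma)
  alpha <- (sum(K11) - sum(diag(K11))) / (n1^2 - n1)  # within sample 1, i != j
  beta  <- (sum(K22) - sum(diag(K22))) / (n2^2 - n2)  # within sample 2, i != j
  gamma <- sum(K12) / (n1 * n2)                       # between the samples
  c(alpha = alpha, beta = beta, gamma = gamma,
    MMD2u = alpha + beta - 2 * gamma)
}

set.seed(1)
X1 <- matrix(rnorm(200), ncol = 2)
X2 <- matrix(rnorm(200, mean = 0.5), ncol = 2)
res <- mmd_components(X1, X2, sigma = 1)
print(res)
```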

The GPK test statistic is defined as

\text{GPK} = (\alpha - \text{E}(\alpha), \beta - \text{E}(\beta))\Sigma^{-1} \binom{\alpha - \text{E}(\alpha)}{\beta - \text{E}(\beta)}

= Z_{W,1}^2 + Z_D^2\text{ with}

Z_{W,r} = \frac{W_r - \text{E}(W_r)}{\sqrt{\text{Var}(W_r)}}, W_r = r\frac{n_1 \alpha}{n_1 + n_2} + \frac{n_2 \beta}{n_1 + n_2},

Z_D = \frac{D - \text{E}(D)}{\sqrt{\text{Var}(D)}}, D = n_1(n_1 - 1)\alpha - n_2(n_2 - 1)\beta,

where the expectations are calculated under the null and \Sigma is the covariance matrix of \alpha and \beta under the null.

The asymptotic null distribution for GPK is unknown. Therefore, only a permutation test can be performed.
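
The permutation idea can be sketched in base R as follows. This is not the kertests implementation: the null mean and covariance of (\alpha, \beta) are estimated from the permutations themselves rather than computed analytically, a Gaussian kernel of the form exp(-||x - y||^2 / (2 sigma^2)) is assumed, the p value uses the common (1 + count) / (n.perm + 1) convention, and gpk_perm_sketch is a hypothetical name.

```r
gpk_perm_sketch <- function(X1, X2, sigma, n.perm = 200) {
  Z <- rbind(X1, X2)
  n1 <- nrow(X1)
  N <- nrow(Z)
  # Gaussian kernel matrix of the pooled sample (assumed kernel form).
  K <- exp(-as.matrix(dist(Z))^2 / (2 * sigma^2))
  diag(K) <- 0  # drop the i = j terms once, for alpha and beta alike

  # (alpha, beta) when the rows in idx form the first sample.
  ab <- function(idx) {
    n2 <- N - length(idx)
    c(sum(K[idx, idx]) / (n1 * (n1 - 1)),
      sum(K[-idx, -idx]) / (n2 * (n2 - 1)))
  }

  obs <- ab(seq_len(n1))
  perm <- t(replicate(n.perm, ab(sample.int(N, n1))))

  # Monte Carlo estimates of the null mean and covariance (an approximation).
  mu <- colMeans(perm)
  Sigma_inv <- solve(cov(perm))
  quad <- function(v) drop(t(v - mu) %*% Sigma_inv %*% (v - mu))

  stat_obs <- quad(obs)
  stat_perm <- apply(perm, 1, quad)
  list(statistic = stat_obs,
       p.value = (1 + sum(stat_perm >= stat_obs)) / (n.perm + 1))
}

set.seed(1)
X1 <- matrix(rnorm(300), ncol = 3)
X2 <- matrix(rnorm(300, sd = 2), ncol = 3)  # same mean, larger variance
out <- gpk_perm_sketch(X1, X2, sigma = 2, n.perm = 200)
print(out)
```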

For r \ne 1, the asymptotic null distribution of Z_{W,r} is normal, but for r further away from 1, the test performance decreases. Therefore, r_1 = 1.2 and r_2 = 0.8 are proposed as a compromise.

For the fast GPK test, three (asymptotic or permutation) tests based on Z_{W, r1}, Z_{W, r2} and Z_{D} are conducted, and the overall p value is calculated as 3 times the minimum of the three p values.

For the fast MMD test, only the two (asymptotic or permutation) tests based on Z_{W, r1} and Z_{W, r2} are used, and the p value is 2 times the minimum of the two p values. This is an approximation of the MMD permutation test, see MMD.
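
The two combination rules amount to a Bonferroni correction over the component tests. A minimal sketch with placeholder p values (not results of any real analysis) is given below; capping the combined value at 1 is an assumption of this sketch.

```r
# Placeholder p values for the three component tests (not real results).
p_comp <- c(ZW_r1 = 0.030, ZW_r2 = 0.200, ZD = 0.004)

# Fast GPK: 3 times the smallest of the three p values, capped at 1.
p_fast_gpk <- min(1, 3 * min(p_comp))

# Fast MMD: 2 times the smaller of the two Z_W p values, capped at 1.
p_fast_mmd <- min(1, 2 * min(p_comp[c("ZW_r1", "ZW_r2")]))

# The same fast GPK value via the Bonferroni adjustment in p.adjust().
p_bonf <- min(p.adjust(p_comp, method = "bonferroni"))

print(c(fast_GPK = p_fast_gpk, fast_MMD = p_fast_mmd, bonferroni = p_bonf))
```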

This implementation is a wrapper around the function kertests that modifies the input and output of that function to match the other functions provided in this package. For more details, see kertests.

findSigma chooses the bandwidth parameter of the kernel function using the median heuristic and is a wrapper around med_sigma.
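
For orientation, one common form of the median heuristic sets the bandwidth to the median Euclidean distance between all pairs of observations in the pooled sample. The sketch below implements that form only; the exact definition and scaling used by med_sigma may differ, and median_heuristic is a hypothetical name.

```r
# One common variant of the median heuristic (may differ from med_sigma).
median_heuristic <- function(X1, X2) {
  Z <- rbind(as.matrix(X1), as.matrix(X2))
  # Median Euclidean distance over all pairs of pooled rows.
  median(as.numeric(dist(Z)))
}

# Three pooled points (0,0), (3,4), (6,8) with pairwise distances 5, 10, 5.
A <- matrix(c(0, 0), nrow = 1)
B <- matrix(c(3, 4, 6, 8), ncol = 2, byrow = TRUE)
sigma_hat <- median_heuristic(A, B)
print(sigma_hat)
```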

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Asymptotic or permutation p value

null.value

Needed for pretty printing of results

alternative

Needed for pretty printing of results

method

Description of the test

data.name

Needed for pretty printing of results

Applicability

Target variable?    Numeric?    Categorical?    K-sample?
No                  Yes         No              No

References

Song, H. and Chen, H. (2021). Generalized Kernel Two-Sample Tests. arXiv preprint. doi:10.1093/biomet/asad068.

Song, H. and Chen, H. (2023). kerTests: Generalized Kernel Two-Sample Tests. R package version 0.1.4, https://CRAN.R-project.org/package=kerTests.

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149.

See Also

MMD, kerTests

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
if(requireNamespace("kerTests", quietly = TRUE)) {
  # Perform GPK test
  GPK(X1, X2, n.perm = 100)
  # Perform fast GPK test (permutation version)
  GPK(X1, X2, n.perm = 100, fast = TRUE)
  # Perform fast GPK test (asymptotic version)
  GPK(X1, X2, n.perm = 0, fast = TRUE)
  # Perform fast MMD test (permutation version)
  GPK(X1, X2, n.perm = 100, fast = TRUE, M = TRUE)
  # Perform fast MMD test (asymptotic version)
  GPK(X1, X2, n.perm = 0, fast = TRUE, M = TRUE)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.