Wasserstein: Wasserstein Distance based Test

View source: R/Wasserstein.R

Wasserstein    R Documentation

Wasserstein Distance based Test

Description

Performs a permutation two-sample test based on the Wasserstein distance. The implementation uses the function wasserstein_permut from the Ecume package.

Usage

Wasserstein(X1, X2, n.perm = 0, fast = (nrow(X1) + nrow(X2)) > 1000, 
            S = max(1000, (nrow(X1) + nrow(X2))/2), seed = 42, ...)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for permutation test (default: 0, no test is performed).

fast

Should the approximate function subwasserstein be used? (default: TRUE if the pooled sample size exceeds 1000)

S

Number of samples to use for the approximation if fast = TRUE. See subwasserstein.

seed

Random seed (default: 42)

...

Other parameters passed to wasserstein or wasserstein1d, e.g. the power p \ge 1.
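The defaults of fast and S depend on the pooled sample size. The following base R sketch evaluates the default expressions from the Usage signature for given row counts; default_settings is a hypothetical helper introduced here for illustration and is not part of the package.

```r
# Hypothetical helper (not part of DataSimilarity): mirrors the default
# expressions for `fast` and `S` shown in the Usage signature.
default_settings <- function(n1, n2) {
  pooled <- n1 + n2
  list(
    fast = pooled > 1000,          # approximation only for large pooled samples
    S    = max(1000, pooled / 2)   # number of samples used if fast = TRUE
  )
}

small <- default_settings(100, 100)    # fast = FALSE, S = 1000
large <- default_settings(3000, 2000)  # fast = TRUE,  S = 2500
print(small)
print(large)
```

With 100 rows per dataset, as in the Examples section, the exact computation is used by default.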

Details

A permutation test for the p-Wasserstein distance is performed. By default, the 1-Wasserstein distance is calculated using Euclidean distances. The p-Wasserstein distance between two probability measures \mu and \nu on a Euclidean space M is defined as

W_p(\mu, \nu) = \left(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{M\times M} ||x - y||^p \text{d} \gamma(x, y)\right)^{\frac{1}{p}},

where \Gamma(\mu,\nu) is the set of probability measures on M\times M such that \mu and \nu are the marginal distributions.
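For intuition, the definition simplifies considerably in one dimension: for two univariate samples of equal size, an optimal coupling matches the order statistics, so the distance reduces to a mean over sorted differences. The sketch below uses base R only; wasserstein_1d_equal is a hypothetical name chosen for illustration, and the package itself delegates the computation to wasserstein and wasserstein1d.

```r
# p-Wasserstein distance between two equal-sized univariate samples.
# Assumption: both samples have the same length, so sorted values can be
# matched one to one. This is an illustration, not the package's code.
wasserstein_1d_equal <- function(x, y, p = 1) {
  stopifnot(length(x) == length(y), p >= 1)
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
}

x <- c(0, 1, 3)
y <- c(5, 4, 2)
d1 <- wasserstein_1d_equal(x, y)             # (2 + 3 + 2) / 3 = 7/3
d2 <- wasserstein_1d_equal(x, x + 2, p = 2)  # a pure shift by 2 gives 2
print(c(d1, d2))
```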

As the Wasserstein distance is a metric, it is zero if and only if the two distributions coincide. Therefore, low values of the statistic indicate similarity of the datasets and the test rejects for high values.
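The idea of rejecting for high values can be sketched as a generic permutation test in base R. This is a minimal illustration under stated assumptions: perm_test and w1 are hypothetical names, the data are univariate with equal group sizes, and the p value uses the common (1 + count) / (n.perm + 1) convention, which is not necessarily the formula used by wasserstein_permut.

```r
# Generic one-sided permutation test: reject for large values of `stat`.
# Assumption: illustration only, not the Ecume implementation.
perm_test <- function(x, y, stat, n.perm = 200, seed = 42) {
  set.seed(seed)
  observed <- stat(x, y)
  pooled <- c(x, y)
  n1 <- length(x)
  perm_stats <- replicate(n.perm, {
    idx <- sample(length(pooled), n1)   # random relabelling of the pooled data
    stat(pooled[idx], pooled[-idx])
  })
  # Share of permuted statistics at least as large as the observed one
  p_value <- (1 + sum(perm_stats >= observed)) / (n.perm + 1)
  list(statistic = observed, p.value = p_value)
}

# 1-Wasserstein distance for equal-sized univariate samples
w1 <- function(x, y) mean(abs(sort(x) - sort(y)))

set.seed(1)
x <- rnorm(50)
y_same <- rnorm(50)
y_shift <- rnorm(50, mean = 2)

same <- perm_test(x, y_same, w1)      # same distribution
shifted <- perm_test(x, y_shift, w1)  # clear location shift
print(same)
print(shifted)
```

The shifted sample yields a much larger statistic and a small p value, while the p value for two samples from the same distribution carries no such evidence against equality.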

This implementation is a wrapper around the function wasserstein_permut that modifies the input and output of that function to match the other functions provided in this package. For more details, see wasserstein_permut.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names
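The components listed above can be accessed like elements of a list. The following sketch assumes that the DataSimilarity and Ecume packages are installed and skips the call otherwise; the data dimensions and seed are arbitrary choices for illustration.

```r
res <- NULL
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("Ecume", quietly = TRUE)) {
  set.seed(1)
  X1 <- matrix(rnorm(200), ncol = 2)
  X2 <- matrix(rnorm(200, mean = 0.5), ncol = 2)
  res <- DataSimilarity::Wasserstein(X1, X2, n.perm = 100)
  print(res$statistic)  # observed value of the test statistic
  print(res$p.value)    # p value of the test
  print(res$method)     # description of the test
}
```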

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             No

References

Rachev, S. T. (1991). Probability metrics and the stability of stochastic models. John Wiley & Sons, Chichester.

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume

Schuhmacher, D., Bähre, B., Gottschlich, C., Hartmann, V., Heinemann, F., Schmitzer, B. and Schrieber, J. (2019). transport: Computation of Optimal Transport Plans and Wasserstein Distances. R package version 0.15-0. https://cran.r-project.org/package=transport

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Wasserstein distance based test 
if(requireNamespace("Ecume", quietly = TRUE)) {
  Wasserstein(X1, X2, n.perm = 100)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.