OTDD: Optimal Transport Dataset Distance

View source: R/OTDD.R

OTDDR Documentation

Optimal Transport Dataset Distance

Description

The function implements the optimal transport dataset distance (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.

Usage

OTDD(X1, X2, target1 = "y", target2 = "y", method = "precomputed.labeldist", 
      feature.cost = stats::dist, lambda.x = 1, lambda.y = 1, p = 2, ground.p = 2,
      sinkhorn = FALSE, debias = FALSE, inner.ot.method = "exact", inner.ot.p = 2, 
      inner.ot.ground.p = 2, inner.ot.sinkhorn = FALSE, inner.ot.debias = FALSE,
      seed = 42)
hammingDist(x) 

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

target1

Character specifying the column name of the class variable in the first dataset (default: "y")

target2

Character specifying the column name of the class variable in the second dataset (default: "y")

method

Character specifying the method for computing the OTDD. Possible options are "augmentation", i.e. computing the optimal transport on the augmented dataset, and "precomputed.labeldist", i.e. the usual computation of label distances (default).

feature.cost

Function that calculates the distance matrix on the pooled feature dataset (default: stats::dist). Ignored if method = precomputed.labeldist, then L_p-distance is used as feature cost.

lambda.x, lambda.y

Weights of the feature distances and label distances in the overall cost (default: 1, equally weighted). Note that values unequal to one are only supported for method = precomputed.labeldist.

p

Power p of the p-Wasserstein distance for the outer optimal transport problem (default: 2).

ground.p

Power p of the L_p-norm used in calculation of Wasserstein distance for the outer optimal transport problem (default: 2). Ignored if method = "precomputed.labeldist".

sinkhorn

Should the Sinkhorn approximation be used for solving the outer optimal transport problem? (default: FALSE, exact solution is used)

debias

Should debiased estimator be used when using Sinkhorn approximation for outer optimal transport problem? (default: FALSE)

inner.ot.method

Method for computing the label distances. Possible options are "exact" (the default), i.e. calculating the solution to the optimal transport of the label distributions, "gaussian.approx", i.e. calculating the Wasserstein distance of the labels using a Gaussian approximation of the label distributions, "naive.upperbound", i.e. calculating the upperbound d_{UB}, "only.means", i.e. approximating the label distance by computing the Euclidean distance of the mean vectors of the label distributions. Ignored if method = "augmentation".

inner.ot.p

Power p of the p-Wasserstein distance for the inner optimal transport problem (default: 2). Used only if method = "precomputed.labeldist" and inner.ot.method = "exact".

inner.ot.ground.p

Power p of the L_p-norm used in calculation of Wasserstein distance for the outer optimal transport problem (default: 2). Used only if method = "precomputed.labeldist" and inner.ot.method = "exact".

inner.ot.sinkhorn

Should the Sinkhorn approximation be used for solving the inner optimal transport problem? (default: FALSE, exact solution is used). Used only if method = "precomputed.labeldist" and inner.ot.method = "exact".

inner.ot.debias

Should debiased estimator be used when using Sinkhorn approximation for inner optimal transport problem? (default: FALSE). Used only if method = "precomputed.labeldist" and inner.ot.method = "exact".

seed

Random seed (default: 42)

x

Dataset for which the distance matrix of pairwise Hamming distances is calculated.

Details

Alvarez-Melis and Fusi (2020) define a dataset distance that takes into account both the feature variables as well as a target (label) variable. The idea is to compute the optimal transport based on a cost function that is a combination of the feature distance and the Wasserstein distance between the label distributions. The label distribution refers to the distribution of features for a given label. With this, the distance between feature-label pairs z:= (x, y) can be defined as

d_{\mathcal{Z}}(z, z^\prime) := (d_{\mathcal{X}}(x, x^{\prime})^p + W_p^p(\alpha_y, \alpha_{y^{\prime}}))^{1/p},

where \alpha_y denotes the distribution \text{P}(X|Y=y) for label y over the feature space. With this, the optimal transport dataset distance is defined as

d_{OT}(\mathcal{D}_1, \mathcal{D}_2) = \min_{\pi\in\Pi(\alpha, \beta)} \int_{\mathcal{Z}\times\mathcal{Z}} d_{\mathcal{Z}}(z, z^\prime)^p \text{d }\pi(z, z^\prime),

where

\Pi(\alpha, \beta) := \{\pi_{1,2}\in\mathcal{P}(\mathcal{X}\times\mathcal{X}) | \pi_1 = \alpha, \pi_2 = \beta\}

is the set of joint distributions with \alpha and \beta as marginals.

Here, we use the Wasserstein distance implementation from the approxOT package for solving the optimal transport problems.

There are multiple simplifications implemented. First, under the assumption that the metric on the feature space coincides with the ground metric in the optimal transport problem on the labels and that all covariance matrices of the label distributions commute (rarely fulfilled in practice), the computation reduces to solving the optimal transport problem on the datasets augmented with the means and covariance matrices of the label distributions. This simplification is used when setting method = "augmentation". Next, the Sinkhorn approximation can be utilized both for calculating the solution of the overall (outer) optimal transport problem (sinkhorn = TRUE) and for the inner optimal transport problem for computing the label distances (inner.ot.sinkhorn = TRUE). The solution of the inner problem can also be sped up by using a normal approximation of the label distributions (inner.ot.method = "gaussian.approx") which results in a closed form expression of the solution. inner.ot.method = "only.means" further simplifies the calculation by using only the means of these Gaussians, which corresponds to assuming equal covariances in all Gaussian approximations of the label distributions. Using inner.ot.method = "upper.bound" uses a distribution-agnostic upper bound to bypass the solution of the inner optimal transport problem.

For categorical data, specify an appropriate feature.cost and use method = "precomputed.labeldist" and inner.ot.method = "exact". A pre-implemented option is setting feature.cost = hammingDist for using the Hamming distance for categorical data. When implementing an appropriate function that takes the pooled dataset without the target column as input and gives a distance matrix as the output, a mix of categorical and numerical data is also possible.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis

Applicability

Target variable? Numeric? Categorical? K-sample?
Yes Yes Yes No

Note

Especially for large numbers of variables and low numbers of observations, it can happen that the Gaussian approximation of the inner OT problem fails since the estimated covariance matrix for one label distribution is numerically no longer psd. An error is thrown in that case.

Author(s)

Original python implementation: David Alvarez-Melis, Chengrun Yang

R implementation: Marieke Stolte

References

Interactive visualizations: https://www.microsoft.com/en-us/research/blog/measuring-dataset-similarity-using-optimal-transport/

Alvarez-Melis, D. and Fusi, N. (2020). Geometric Dataset Distances via Optimal Transport. In Advances in Neural Information Processing Systems 33 21428-21439.

Original python implementation: Alvarez-Melis, D., and Yang, C. (2024). Optimal Transport Dataset Distance (OTDD). https://github.com/microsoft/otdd

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163 - 298. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1214/24-SS149")}

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
# Calculate OTDD

if(requireNamespace("approxOT", quietly = TRUE) & 
    requireNamespace("expm", quietly = TRUE)) {
  OTDD(X1, X2)
  OTDD(X1, X2, sinkhorn = TRUE, inner.ot.sinkhorn = TRUE)
  OTDD(X1, X2, method = "augmentation") 
  OTDD(X1, X2, inner.ot.method = "gaussian.approx")
  OTDD(X1, X2, inner.ot.method = "means.only")
  OTDD(X1, X2, inner.ot.method = "naive.upperbound")
}


# For categorical data
X1cat <- matrix(sample(LETTERS[1:4], 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(LETTERS[1:4], 300, replace = TRUE, prob = 1:4), ncol = 3)
y1 <- sample(0:1, 300, TRUE)
y2 <- sample(0:1, 300, TRUE)
X1 <- data.frame(X = X1cat, y = y1)
X2 <- data.frame(X = X2cat, y = y2)

if(requireNamespace("approxOT", quietly = TRUE) &
    requireNamespace("expm", quietly = TRUE)) {
  OTDD(X1, X2, feature.cost = hammingDist)
  OTDD(X1, X2, sinkhorn = TRUE, inner.ot.sinkhorn = TRUE, feature.cost = hammingDist)
}


DataSimilarity documentation built on April 3, 2025, 9:39 p.m.