OTDD: Optimal Transport Dataset Distance
The function implements the optimal transport dataset distance (OTDD) of Alvarez-Melis and Fusi (2020). It combines the distance between the features with the distance between the label distributions.
OTDD(X1, X2, target1 = "y", target2 = "y", method = "precomputed.labeldist",
feature.cost = stats::dist, lambda.x = 1, lambda.y = 1, p = 2, ground.p = 2,
sinkhorn = FALSE, debias = FALSE, inner.ot.method = "exact", inner.ot.p = 2,
inner.ot.ground.p = 2, inner.ot.sinkhorn = FALSE, inner.ot.debias = FALSE,
seed = 42)
hammingDist(x)
X1: First dataset as matrix or data.frame.
X2: Second dataset as matrix or data.frame.
target1: Character specifying the column name of the class variable in the first dataset (default: "y").
target2: Character specifying the column name of the class variable in the second dataset (default: "y").
method: Character specifying the method for computing the OTDD. Possible options are "precomputed.labeldist" (default) and "augmentation".
feature.cost: Function that calculates the distance matrix on the pooled feature dataset (default: stats::dist).
lambda.x, lambda.y: Weights of the feature distances and label distances in the overall cost (default: 1, equally weighted). Note that values unequal to one are only supported for method = "precomputed.labeldist".
p: Power of the Wasserstein distance for the outer optimal transport problem (default: 2).
ground.p: Power of the ground distance for the outer optimal transport problem (default: 2).
sinkhorn: Should the Sinkhorn approximation be used for solving the outer optimal transport problem? (default: FALSE)
debias: Should the debiased estimator be used when using the Sinkhorn approximation for the outer optimal transport problem? (default: FALSE)
inner.ot.method: Method for computing the label distances. Possible options are "exact" (default), "gaussian.approx", "means.only", and "naive.upperbound".
inner.ot.p: Power of the Wasserstein distance for the inner optimal transport problem (default: 2).
inner.ot.ground.p: Power of the ground distance for the inner optimal transport problem (default: 2).
inner.ot.sinkhorn: Should the Sinkhorn approximation be used for solving the inner optimal transport problem? (default: FALSE)
inner.ot.debias: Should the debiased estimator be used when using the Sinkhorn approximation for the inner optimal transport problem? (default: FALSE)
seed: Random seed (default: 42).
x: Dataset for which the distance matrix of pairwise Hamming distances is calculated.
Alvarez-Melis and Fusi (2020) define a dataset distance that takes into account both the feature variables and a target (label) variable. The idea is to compute the optimal transport with respect to a cost function that combines the feature distance and the Wasserstein distance between the label distributions, where the label distribution refers to the distribution of the features for a given label. With this, the distance between feature-label pairs z := (x, y) can be defined as

d_{\mathcal{Z}}(z, z^\prime) := \left(d_{\mathcal{X}}(x, x^\prime)^p + W_p^p(\alpha_y, \alpha_{y^\prime})\right)^{1/p},

where \alpha_y denotes the distribution \text{P}(X \mid Y = y) of the features for label y. The optimal transport dataset distance is then defined as

d_{OT}(\mathcal{D}_1, \mathcal{D}_2) = \min_{\pi \in \Pi(\alpha, \beta)} \int_{\mathcal{Z} \times \mathcal{Z}} d_{\mathcal{Z}}(z, z^\prime)^p \,\text{d}\pi(z, z^\prime),

where \Pi(\alpha, \beta) := \{\pi_{1,2} \in \mathcal{P}(\mathcal{Z} \times \mathcal{Z}) \mid \pi_1 = \alpha, \pi_2 = \beta\} is the set of joint distributions over \mathcal{Z} \times \mathcal{Z} with \alpha and \beta as marginals.
Here, we use the Wasserstein distance implementation from the approxOT package for solving the optimal transport problems.
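For illustration, the cost between feature-label pairs can be assembled from a precomputed feature distance matrix and a matrix of label-to-label Wasserstein distances following the definition of d_{\mathcal{Z}} above. The following minimal base-R sketch uses hypothetical names (pairCost, dist.x, dist.lab) that are not part of the package and assumes that the labels are coded as integers 1, ..., K so that they index the label distance matrix:

# Illustration only, not the package internals:
# dist.x:   n1 x n2 matrix of feature distances d_X(x, x')
# dist.lab: K1 x K2 matrix of label distances W_p(alpha_y, alpha_y')
# y1, y2:   integer label vectors (coded 1, ..., K) of the two datasets
pairCost <- function(dist.x, dist.lab, y1, y2, p = 2) {
  (dist.x^p + dist.lab[y1, y2, drop = FALSE]^p)^(1 / p)
}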
There are multiple simplifications implemented. First, under the assumption that the metric on the feature space coincides with the ground metric of the optimal transport problem on the labels, and that all covariance matrices of the label distributions commute (rarely fulfilled in practice), the computation reduces to solving the optimal transport problem on the datasets augmented with the means and covariance matrices of the label distributions. This simplification is used when setting method = "augmentation". Next, the Sinkhorn approximation can be used both for solving the overall (outer) optimal transport problem (sinkhorn = TRUE) and for the inner optimal transport problem that computes the label distances (inner.ot.sinkhorn = TRUE). The inner problem can also be sped up by using a Gaussian approximation of the label distributions (inner.ot.method = "gaussian.approx"), which yields a closed-form solution (a sketch is given below). inner.ot.method = "means.only" simplifies the calculation further by using only the means of these Gaussians, which corresponds to assuming equal covariances in all Gaussian approximations of the label distributions. inner.ot.method = "naive.upperbound" uses a distribution-agnostic upper bound to bypass solving the inner optimal transport problem.
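For inner.ot.method = "gaussian.approx", the 2-Wasserstein distance between the Gaussian approximations N(m_1, \Sigma_1) and N(m_2, \Sigma_2) of two label distributions has the closed form

W_2^2 = \|m_1 - m_2\|^2 + \text{tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\right)^{1/2}\right).

A minimal sketch of this formula (a hypothetical helper, not the package internals), using expm::sqrtm for the matrix square roots, is:

# Illustration only: closed-form 2-Wasserstein distance between the Gaussian
# approximations of two label distributions, estimated from samples x1 and x2.
gaussW2 <- function(x1, x2) {
  m1 <- colMeans(x1); S1 <- stats::cov(x1)
  m2 <- colMeans(x2); S2 <- stats::cov(x2)
  S2root <- expm::sqrtm(S2)
  cross <- expm::sqrtm(S2root %*% S1 %*% S2root)
  sqrt(sum((m1 - m2)^2) + sum(diag(S1 + S2 - 2 * cross)))
}
# e.g. distance between P(X | Y = 0) and P(X | Y = 1) within one dataset:
# gaussW2(X[y == 0, ], X[y == 1, ])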
For categorical data, specify an appropriate feature.cost and use method = "precomputed.labeldist" together with inner.ot.method = "exact". A pre-implemented option is feature.cost = hammingDist, which uses the Hamming distance for categorical data. By supplying a suitable function that takes the pooled dataset without the target column as input and returns a distance matrix, a mix of categorical and numerical data can also be handled; a sketch of such a function is given below.
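The following is one illustrative choice of such a mixed-data feature.cost (the helper name mixedDist and the additive combination of the two distances are assumptions, not part of the package); it assumes that the pooled feature data is passed as a data.frame and that hammingDist accepts the categorical sub-data.frame:

# Illustration only: Euclidean distance on the numeric columns plus Hamming
# distance on the categorical columns of the pooled feature dataset.
mixedDist <- function(x) {
  is.cat <- vapply(x, function(col) is.character(col) || is.factor(col), logical(1))
  d.num <- as.matrix(stats::dist(x[, !is.cat, drop = FALSE]))
  d.cat <- as.matrix(hammingDist(x[, is.cat, drop = FALSE]))
  d.num + d.cat
}
# OTDD(X1, X2, feature.cost = mixedDist)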
An object of class htest with the following components:

statistic: Observed value of the test statistic.
method: Description of the test.
data.name: The dataset names.
alternative: The alternative hypothesis.
Target variable? | Numeric? | Categorical? | K-sample? |
Yes | Yes | Yes | No |
Especially for large numbers of variables and low numbers of observations, the Gaussian approximation of the inner OT problem can fail because the estimated covariance matrix of a label distribution is no longer numerically positive semi-definite. An error is thrown in that case.
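A quick illustrative pre-check (not part of the package) is to look at the smallest eigenvalue of each per-label covariance matrix before relying on the Gaussian approximation; here feats and labels are hypothetical names for the feature data and the corresponding label vector:

# Illustration only: clearly negative values indicate a numerically
# non-psd covariance estimate for the corresponding label.
sapply(split(as.data.frame(feats), labels), function(f) {
  min(eigen(stats::cov(f), symmetric = TRUE, only.values = TRUE)$values)
})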
Original Python implementation: David Alvarez-Melis, Chengrun Yang
R implementation: Marieke Stolte
Interactive visualizations: https://www.microsoft.com/en-us/research/blog/measuring-dataset-similarity-using-optimal-transport/
Alvarez-Melis, D. and Fusi, N. (2020). Geometric Dataset Distances via Optimal Transport. In Advances in Neural Information Processing Systems 33, 21428-21439.
Alvarez-Melis, D. and Yang, C. (2024). Optimal Transport Dataset Distance (OTDD). Original Python implementation. https://github.com/microsoft/otdd
Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
# Calculate OTDD
if(requireNamespace("approxOT", quietly = TRUE) &
requireNamespace("expm", quietly = TRUE)) {
OTDD(X1, X2)
OTDD(X1, X2, sinkhorn = TRUE, inner.ot.sinkhorn = TRUE)
OTDD(X1, X2, method = "augmentation")
OTDD(X1, X2, inner.ot.method = "gaussian.approx")
OTDD(X1, X2, inner.ot.method = "means.only")
OTDD(X1, X2, inner.ot.method = "naive.upperbound")
}
# For categorical data
X1cat <- matrix(sample(LETTERS[1:4], 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(LETTERS[1:4], 300, replace = TRUE, prob = 1:4), ncol = 3)
y1 <- sample(0:1, 100, TRUE)
y2 <- sample(0:1, 100, TRUE)
X1 <- data.frame(X = X1cat, y = y1)
X2 <- data.frame(X = X2cat, y = y2)
if(requireNamespace("approxOT", quietly = TRUE) &
requireNamespace("expm", quietly = TRUE)) {
OTDD(X1, X2, feature.cost = hammingDist)
OTDD(X1, X2, sinkhorn = TRUE, inner.ot.sinkhorn = TRUE, feature.cost = hammingDist)
}