YMRZL: Yu et al. (2007) Two-Sample Test
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

Description

Performs the Yu et al. (2007) two-sample test. The implementation here uses the classifier_test implementation from the Ecume package.

Usage

YMRZL(X1, X2, n.perm = 0, split = 0.7, control = NULL, 
       train.args = NULL, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`n.perm`	Number of permutations for permutation test (default: 0, asymptotic test is performed).
`split`	Proportion of observations used for training (default: 0.7).
`control`	Control parameters for fitting. See `trainControl`. If `control = NULL`, defaults to `caret::trainControl(method = "boot")`, as recommended. The number of bootstrap samples defaults to 25 and can be changed via the `number` argument of `caret::trainControl`.
`train.args`	Further arguments passed to `train` as a named list.
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.
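
The arguments above can be combined as in the following minimal sketch. It assumes that the DataSimilarity, Ecume, caret and rpart packages are installed; the simulated data and the chosen values (50 bootstrap samples, an 80% training split, seed 42) are purely illustrative.

```r
# Run only if the required packages are available
pkgs <- c("DataSimilarity", "Ecume", "caret", "rpart")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  set.seed(1234)
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

  # Use 50 bootstrap samples for tuning instead of the default 25
  ctrl <- caret::trainControl(method = "boot", number = 50)

  # 80% of the pooled data for training, asymptotic test, fixed seed
  res <- DataSimilarity::YMRZL(X1, X2, n.perm = 0, split = 0.8,
                               control = ctrl, seed = 42)
  print(res)
}
```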

Details

The two-sample test proposed by Yu et al. (2007) first combines the datasets into a pooled dataset and creates a target variable that records the dataset membership of each observation. The pooled sample is then split into a training set and a test set, and a classification tree is trained on the training data. The classification error on the test set is used as the test statistic. If the distributions of the datasets do not differ, the classifier cannot distinguish between the datasets, so the test error will be close to chance level. The test rejects if the test error is significantly smaller than chance level.
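
The steps described above can be sketched directly with rpart. This is a simplified illustration, not the code used by `YMRZL`: it skips the caret-based tuning, uses simulated data, and hard-codes a 70% training split.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(1)
# Two illustrative samples that differ in their mean
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# 1. Pool the data and record dataset membership as the target variable
pooled <- data.frame(rbind(X1, X2))
pooled$dataset <- factor(rep(c("first", "second"),
                             times = c(nrow(X1), nrow(X2))))

# 2. Split the pooled sample into training (70%) and test data
train_idx <- sample(nrow(pooled), size = floor(0.7 * nrow(pooled)))
train <- pooled[train_idx, ]
test <- pooled[-train_idx, ]

# 3. Fit a classification tree on the training data (no tuning here)
fit <- rpart(dataset ~ ., data = train, method = "class")

# 4. The classification error on the test set is the test statistic
pred <- predict(fit, newdata = test, type = "class")
test_error <- mean(pred != test$dataset)
test_error
```

A test error clearly below 0.5 (the chance level for two equally sized samples) indicates that the tree can tell the two datasets apart.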

The tree model is fitted with rpart, and the classification error used for tuning is by default estimated with the Bootstrap .632+ estimator, as recommended by Yu et al. (2007).

For n.perm > 0, a permutation test is conducted. Otherwise, an asymptotic binomial test is performed.
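
To illustrate the difference between the two options, the following sketch computes both kinds of p value in base R. All numbers are hypothetical, a chance level of 0.5 is assumed (equally sized samples), and the permuted errors are simulated rather than obtained by refitting the classifier. The exact computation inside the package may differ.

```r
# Hypothetical numbers, for illustration only
n_test <- 60    # number of test observations
n_wrong <- 21   # number of misclassified test observations
obs_error <- n_wrong / n_test

# Binomial p value: probability of at most n_wrong errors if every test
# observation is misclassified with probability 0.5 (assumed chance level)
p_binom <- pbinom(n_wrong, size = n_test, prob = 0.5)

# Permutation p value: share of permuted test errors that are at least as
# small as the observed one. In a real permutation test, each permuted error
# comes from refitting after shuffling the dataset labels; here the errors
# are simulated as a stand-in for that step.
set.seed(1)
perm_errors <- rbinom(200, size = n_test, prob = 0.5) / n_test
p_perm <- (1 + sum(perm_errors <= obs_error)) / (1 + length(perm_errors))

c(binomial = p_binom, permutation = p_perm)
```

The permutation variant makes no distributional assumption about the test error but requires one model fit per permutation, which is why `n.perm = 0` is the cheaper default.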

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`p.value`	p value of the test (asymptotic if `n.perm = 0`, permutation-based otherwise)
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names
`classifier`	Chosen classification method (tree)
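
As a brief sketch of how the returned components might be used, the following code stores the result and extracts individual elements. It assumes that DataSimilarity and the packages it relies on for this test (Ecume, caret, rpart) are installed; the data are simulated for illustration.

```r
# Run only if the required packages are available
pkgs <- c("DataSimilarity", "Ecume", "caret", "rpart")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  set.seed(1234)
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

  res <- DataSimilarity::YMRZL(X1, X2)

  # Components are accessed like elements of a list
  res$statistic    # observed value of the test statistic
  res$p.value      # p value of the test
  res$classifier   # classification method used

  # Example decision at the 5% level
  reject <- res$p.value < 0.05
  print(reject)
}
```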

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	Yes	No

Note

As the idea of the test is very similar to that of the classifier two-sample test (C2ST) by Lopez-Paz and Oquab (2022), the implementation here is based on that C2ST. Note that Lopez-Paz and Oquab (2022) use the classification accuracy instead of the classification error. Moreover, they propose a binomial test instead of the permutation test proposed by Yu et al. (2007). Here, both the binomial and the permutation test are implemented.

References

Yu, K., Martin, R., Rothman, N., Zheng, T., Lan, Q. (2007). Two-sample Comparison Based on Prediction Error, with Applications to Candidate Gene Association Studies. Annals of Human Genetics, 71(1). doi:10.1111/j.1469-1809.2006.00306.x

Lopez-Paz, D., and Oquab, M. (2022). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform the Yu et al. test
YMRZL(X1, X2)

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.