YMRZL: Yu et al. (2007) Two-Sample Test

YMRZL R Documentation

Yu et al. (2007) Two-Sample Test

Description

Performs the two-sample test of Yu et al. (2007). The implementation relies on classifier_test from the Ecume package.

Usage

YMRZL(X1, X2, n.perm = 0, split = 0.7, control = NULL, 
       train.args = NULL, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for the permutation test (default: 0, in which case an asymptotic test is performed).

split

Proportion of observations used for training (default: 0.7)

control

Control parameters for fitting. See trainControl. If control = NULL, defaults to caret::trainControl(method = "boot"), as recommended. The number of bootstrap samples defaults to 25 and can be changed via the number argument of caret::trainControl.

train.args

Further arguments passed to train as a named list.

seed

Random seed (default: 42)

Details

The two-sample test proposed by Yu et al. (2007) first combines the datasets into a pooled dataset and creates a target variable that records the dataset membership of each observation. The pooled sample is then split into a training set and a test set, and a classification tree is trained on the training data. The classification error on the test set is used as the test statistic. If the distributions of the datasets do not differ, the classifier cannot distinguish between the datasets, so the test error will be close to chance level. The test rejects if the test error is significantly smaller than chance level.
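The procedure described above can be sketched by hand in a few lines. This is a minimal illustration, not the code used by YMRZL or Ecume: it calls rpart directly without any tuning, assumes equal sample sizes so that chance level is taken to be 0.5, and uses a one-sided binom.test on the number of misclassified test observations. All object names (pooled, train.idx, n.err, and so on) are made up for this example.

```r
library(rpart)  # recommended package, assumed to be installed

set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Pool the datasets and record dataset membership as the target variable
pooled <- data.frame(rbind(X1, X2))
pooled$y <- factor(rep(c("X1", "X2"), times = c(nrow(X1), nrow(X2))))

# Split the pooled sample into a training set (70%) and a test set
n <- nrow(pooled)
train.idx <- sample(n, size = floor(0.7 * n))
train <- pooled[train.idx, ]
test <- pooled[-train.idx, ]

# Fit a classification tree on the training data (no tuning in this sketch)
fit <- rpart(y ~ ., data = train, method = "class")

# The classification error on the test set is the test statistic
pred <- predict(fit, newdata = test, type = "class")
n.test <- nrow(test)
n.err <- sum(as.character(pred) != as.character(test$y))
test.error <- n.err / n.test

# Assumption: chance level 0.5; reject for errors well below chance
res <- binom.test(n.err, n.test, p = 0.5, alternative = "less")
res$p.value
```

With a mean shift between the two samples, the test error would typically fall below 0.5 and the p value would be small; for identically distributed samples the error should hover around 0.5.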

The tree model is fit by rpart and the classification error for tuning is by default predicted using the Bootstrap .632+ estimator as recommended by Yu et al. (2007).

For n.perm > 0, a permutation test is conducted. Otherwise, an asymptotic binomial test is performed.
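To illustrate the idea behind the permutation variant, the following sketch recomputes the test error after randomly permuting the dataset labels and compares the observed error with the resulting permutation distribution. It is an assumption-laden illustration, not the computation performed inside YMRZL or Ecume: the helper tree.error is hypothetical, the tree is fit with rpart without tuning, and the p value uses the common (1 + count) / (n.perm + 1) convention.

```r
library(rpart)  # recommended package, assumed to be installed

set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
X <- data.frame(rbind(X1, X2))
y <- factor(rep(c("X1", "X2"), times = c(nrow(X1), nrow(X2))))

# Hypothetical helper: test error of a classification tree for given labels
tree.error <- function(X, y, split = 0.7) {
  n <- nrow(X)
  train.idx <- sample(n, size = floor(split * n))
  dat <- cbind(X, y = y)
  fit <- rpart(y ~ ., data = dat[train.idx, ], method = "class")
  pred <- predict(fit, newdata = dat[-train.idx, ], type = "class")
  mean(as.character(pred) != as.character(dat$y[-train.idx]))
}

# Observed test error with the true dataset labels
obs.error <- tree.error(X, y)

# Test errors under random relabelling (null distribution)
n.perm <- 50
perm.errors <- replicate(n.perm, tree.error(X, sample(y)))

# Small errors speak against the null, so count permutation errors
# that are at least as small as the observed one
p.perm <- (1 + sum(perm.errors <= obs.error)) / (n.perm + 1)
p.perm
```

The permutation approach avoids relying on the binomial approximation, at the cost of refitting the classifier n.perm times in this sketch.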

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

p value of the test (asymptotic if n.perm = 0, permutation-based if n.perm > 0)

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

classifier

Chosen classification method (tree)

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        Yes            No

Note

As the idea of the test is very similar to that of the classifier two-sample test (C2ST) by Lopez-Paz and Oquab (2022), the implementation here is based on that of C2ST. Note that Lopez-Paz and Oquab (2022) use the classification accuracy instead of the classification error. Moreover, they propose a binomial test instead of the permutation test proposed by Yu et al. (2007). Both the binomial and the permutation test are implemented here.

References

Yu, K., Martin, R., Rothman, N., Zheng, T., Lan, Q. (2007). Two-sample Comparison Based on Prediction Error, with Applications to Candidate Gene Association Studies. Annals of Human Genetics, 71(1). doi:10.1111/j.1469-1809.2006.00306.x

Lopez-Paz, D., and Oquab, M. (2022). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

C2ST, HMN

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform the Yu et al. test
YMRZL(X1, X2)
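The documented arguments can also be varied. The calls below are a sketch of how n.perm and control might be used, based only on the argument descriptions above; they are guarded so that they run only when the DataSimilarity and caret packages are available. The values 100 and 50 are arbitrary choices for illustration, and run time may grow noticeably with them.

```r
# Draw some data (seeded so that the draws are reproducible)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("caret", quietly = TRUE)) {
  # Permutation test instead of the asymptotic binomial test
  res.perm <- DataSimilarity::YMRZL(X1, X2, n.perm = 100)
  print(res.perm$p.value)

  # Use 50 instead of the default 25 bootstrap samples for tuning
  ctrl <- caret::trainControl(method = "boot", number = 50)
  res.boot <- DataSimilarity::YMRZL(X1, X2, control = ctrl)
  print(res.boot$statistic)
}
```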

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.