C2ST: Classifier Two-Sample Test

Description

The function implements the Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Comparison of multiple (two or more) samples is also possible. The implementation here uses the classifier_test implementation from the Ecume package.

Usage

C2ST(X1, X2, ..., split = 0.7, thresh = 0, method = "knn", control = NULL, 
      train.args = NULL, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

split

Proportion of observations used for training

thresh

Value to add to the null hypothesis value (default: 0). The null hypothesis tested can be formulated as H_0: t = p_0 + thresh, where t denotes the test accuracy of the classifier and p_0 is the chance level, i.e. the proportion of the largest dataset in the pooled sample.

method

Classifier to use during training (default: "knn"). See details for possible options.

control

Control parameters for fitting. See caret::trainControl. Defaults to NULL, in which case it is set to caret::trainControl(method = "cv").

train.args

Further arguments passed to caret::train as a named list; see the sketch after this argument list.

seed

Random seed (default: 42)
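
As an illustration of the control and train.args arguments, the following sketch passes custom resampling settings and a tuning argument through to caret. The specific values (5-fold CV, tuneLength = 3) are assumptions chosen for illustration, not package defaults.

# Illustrative values: 5-fold CV instead of the default, plus a train() argument
if (requireNamespace("Ecume", quietly = TRUE)) {
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  C2ST(X1, X2, method = "knn",
       control = caret::trainControl(method = "cv", number = 5),
       train.args = list(tuneLength = 3))
}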

Details

The classifier two-sample test works by first combining the datasets into a pooled dataset and creating a target variable that records the dataset membership of each observation. The pooled sample is then split into training and test sets, and a classifier is trained on the training data. The classification accuracy on the test data serves as the test statistic. If the distributions of the datasets do not differ, the classifier is unable to distinguish between the datasets, and the test accuracy will therefore be close to chance level. The test rejects if the test accuracy is significantly greater than chance level.
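
The following is a minimal sketch of this procedure, written directly against caret rather than through C2ST(). It is an illustration under assumptions, not the package's implementation; in particular, the binomial test at the end is a simple stand-in for the asymptotic test actually reported.

library(caret)
set.seed(42)
X1 <- matrix(rnorm(500), ncol = 5)
X2 <- matrix(rnorm(500, mean = 0.5), ncol = 5)
# Pool the datasets and record dataset membership as the target variable
pooled <- data.frame(rbind(X1, X2),
                     y = factor(rep(c("X1", "X2"), c(nrow(X1), nrow(X2)))))
# Split the pooled sample into training and test sets (70% training)
idx <- createDataPartition(pooled$y, p = 0.7, list = FALSE)
# Train a classifier (here knn, the default method) on the training data
fit <- train(y ~ ., data = pooled[idx, ], method = "knn",
             trControl = trainControl(method = "cv"))
# Classification accuracy on the test data is the test statistic
pred <- predict(fit, newdata = pooled[-idx, ])
acc <- mean(pred == pooled$y[-idx])
# Chance level p_0: proportion of the largest dataset in the pooled sample
p0 <- max(table(pooled$y)) / nrow(pooled)
# Reject if the accuracy is significantly greater than p_0 (one-sided)
binom.test(sum(pred == pooled$y[-idx]), length(pred), p = p0,
           alternative = "greater")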

All methods available for classification within the caret framework can be used. A list of possible models can, for example, be retrieved via

names(caret::getModelInfo())[sapply(caret::getModelInfo(), function(x) "Classification" %in% x$type)]

This implementation is a wrapper around the function classifier_test that modifies the input and output of that function to match the other functions provided in this package. For more details see classifier_test.
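
A classifier other than the default can be chosen via the method argument. For instance, assuming caret's "rf" model (backed by the randomForest package) is installed, an illustrative call could look as follows:

# Illustrative only: requires the randomForest package behind caret's "rf"
if (requireNamespace("Ecume", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  C2ST(X1, X2, method = "rf")
}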

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Asymptotic p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

classifier

Chosen classification method

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        Yes            Yes

References

Lopez-Paz, D., and Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx.

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.

Stolte, M., Kappenberg, F., Rahnenführer, J., and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149.

See Also

HMN, YMRZL

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform classifier two-sample test 
if(requireNamespace("Ecume", quietly = TRUE)) {
  C2ST(X1, X2)
}
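
# The test also applies to more than two samples (see Description);
# the third dataset here is purely illustrative.
X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)
if (requireNamespace("Ecume", quietly = TRUE)) {
  C2ST(X1, X2, X3)
}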
