C2ST: Classifier Two-Sample Test

Description

The function implements the Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Comparison of multiple (two or more) samples is also possible. The implementation here uses the classifier_test implementation from the Ecume package.

Usage

C2ST(X1, X2, ..., split = 0.7, thresh = 0, method = "knn", control = NULL, 
      train.args = NULL, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

split

Proportion of observations used for training

thresh

Value to add to the null hypothesis value (default: 0). The null hypothesis tested can be formulated as H_0: t = p_0 + thresh, where t denotes the test accuracy of the classifier and p_0 is the chance level, i.e. the proportion of the largest dataset in the pooled sample.

method

Classifier to use during training (default: "knn"). See details for possible options.

control

Control parameters for fitting. See caret::trainControl. Defaults to NULL, in which case it is set to caret::trainControl(method = "cv").

train.args

Further arguments passed to caret::train as a named list; see the sketch after this argument list.

seed

Random seed (default: 42)
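
As an illustration of the control and train.args arguments, the following sketch passes custom resampling settings and a tuning argument through to caret. The specific values (5-fold CV, tuneLength = 3) are assumptions chosen for illustration, not package defaults.

# Illustrative values: 5-fold CV instead of the default, plus a train() argument
if (requireNamespace("Ecume", quietly = TRUE)) {
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  C2ST(X1, X2, method = "knn",
       control = caret::trainControl(method = "cv", number = 5),
       train.args = list(tuneLength = 3))
}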

Details

The classifier two-sample test works by first combining the datasets into a pooled dataset and creating a target variable that records the dataset membership of each observation. The pooled sample is then split into training and test sets, and a classifier is trained on the training data. The classification accuracy on the test data serves as the test statistic. If the distributions of the datasets do not differ, the classifier is unable to distinguish between the datasets, and the test accuracy will therefore be close to chance level. The test rejects if the test accuracy is significantly greater than chance level.
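
The following is a minimal sketch of this procedure, written directly against caret rather than through C2ST(). It is an illustration under assumptions, not the package's implementation; in particular, the binomial test at the end is a simple stand-in for the asymptotic test actually reported.

library(caret)
set.seed(42)
X1 <- matrix(rnorm(500), ncol = 5)
X2 <- matrix(rnorm(500, mean = 0.5), ncol = 5)
# Pool the datasets and record dataset membership as the target variable
pooled <- data.frame(rbind(X1, X2),
                     y = factor(rep(c("X1", "X2"), c(nrow(X1), nrow(X2)))))
# Split the pooled sample into training and test sets (70% training)
idx <- createDataPartition(pooled$y, p = 0.7, list = FALSE)
# Train a classifier (here knn, the default method) on the training data
fit <- train(y ~ ., data = pooled[idx, ], method = "knn",
             trControl = trainControl(method = "cv"))
# Classification accuracy on the test data is the test statistic
pred <- predict(fit, newdata = pooled[-idx, ])
acc <- mean(pred == pooled$y[-idx])
# Chance level p_0: proportion of the largest dataset in the pooled sample
p0 <- max(table(pooled$y)) / nrow(pooled)
# Reject if the accuracy is significantly greater than p_0 (one-sided)
binom.test(sum(pred == pooled$y[-idx]), length(pred), p = p0,
           alternative = "greater")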

All methods available for classification within the caret framework can be used. A list of possible models can, for example, be retrieved via

names(caret::getModelInfo())[sapply(caret::getModelInfo(), function(x) "Classification" %in% x$type)]

This implementation is a wrapper around the function classifier_test that modifies the input and output of that function to match the other functions provided in this package. For more details see classifier_test.
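
A classifier other than the default can be chosen via the method argument. For instance, assuming caret's "rf" model (backed by the randomForest package) is installed, an illustrative call could look as follows:

# Illustrative only: requires the randomForest package behind caret's "rf"
if (requireNamespace("Ecume", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  C2ST(X1, X2, method = "rf")
}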

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Asymptotic p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

classifier

Chosen classification method

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        Yes            Yes

References

Lopez-Paz, D., and Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx.

Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.

Stolte, M., Kappenberg, F., Rahnenführer, J., and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149.

See Also

HMN, YMRZL

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform classifier two-sample test 
if(requireNamespace("Ecume", quietly = TRUE)) {
  C2ST(X1, X2)
}
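
# The test also applies to more than two samples (see Description);
# the third dataset here is purely illustrative.
X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)
if (requireNamespace("Ecume", quietly = TRUE)) {
  C2ST(X1, X2, X3)
}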
