C2ST | R Documentation |
The function implements the Classifier Two-Sample Test (C2ST) of Lopez-Paz & Oquab (2017). The comparison of multiple (≥ 2) samples is also possible. The implementation here uses the classifier_test implementation from the Ecume package.
C2ST(X1, X2, ..., split = 0.7, thresh = 0, method = "knn", control = NULL,
train.args = NULL, seed = 42)
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
... |
Optionally more datasets as matrices or data.frames |
split |
Proportion of observations used for training |
thresh |
Value to add to the null hypothesis value (default:0). The null hypothesis tested can be formulated as |
method |
Classifier to use during training (default: "knn") |
control |
Control parameters for fitting. See |
train.args |
Further arguments passed to |
seed |
Random seed (default: 42) |
The classifier two-sample test works by first combining the datasets into a pooled dataset and creating a target variable that records the dataset membership of each observation. The pooled sample is then split into a training set and a test set, and a classifier is trained on the training data. The classification accuracy on the test data is used as the test statistic. If the distributions of the datasets do not differ, the classifier cannot distinguish between the datasets, so the test accuracy will be close to chance level. The test rejects the null hypothesis if the test accuracy is significantly greater than chance level.
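The steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by C2ST or classifier_test: the helper name c2st_sketch is hypothetical, logistic regression (stats::glm) stands in for the caret classifier, and the p value comes from a one-sided normal approximation to the binomial distribution of the test accuracy under the null hypothesis.

```r
# Hypothetical helper illustrating the two-sample case; not part of any package.
c2st_sketch <- function(X1, X2, split = 0.7, seed = 42) {
  set.seed(seed)

  # 1. Pool the datasets and record dataset membership as the target variable
  pooled <- data.frame(rbind(X1, X2))
  pooled$y <- rep(c(0, 1), times = c(nrow(X1), nrow(X2)))

  # 2. Split the pooled sample into training and test sets
  n <- nrow(pooled)
  train_idx <- sample.int(n, size = floor(split * n))
  train <- pooled[train_idx, ]
  test <- pooled[-train_idx, ]

  # 3. Train a classifier on the training data
  #    (assumption: logistic regression instead of a caret model)
  fit <- glm(y ~ ., data = train, family = binomial())

  # 4. Test accuracy is the test statistic
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  acc <- mean(pred == test$y)

  # 5. Compare with chance level (share of the larger dataset in the pool)
  #    using a one-sided normal approximation (assumption)
  p0 <- max(mean(pooled$y), 1 - mean(pooled$y))
  z <- (acc - p0) / sqrt(p0 * (1 - p0) / nrow(test))
  list(accuracy = acc, chance = p0, p.value = pnorm(z, lower.tail = FALSE))
}

set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
res <- c2st_sketch(X1, X2)
res$accuracy
res$p.value
```

Because the second sample is shifted in every coordinate, the accuracy should lie clearly above the chance level of 0.5 here. When both samples are drawn from the same distribution, it should stay close to 0.5.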
All classification methods available within the caret framework can be used as methods. A list of possible models can, for example, be retrieved via:
models <- caret::getModelInfo()
names(models)[sapply(models, function(x) "Classification" %in% x$type)]
This implementation is a wrapper around the function classifier_test that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of classifier_test.
An object of class htest with the following components:
statistic |
Observed value of the test statistic |
p.value |
Asymptotic p value |
alternative |
The alternative hypothesis |
method |
Description of the test |
data.name |
The dataset names |
classifier |
Chosen classification method |
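Objects of class htest are plain lists, so the components listed above can be read with `$`. The sketch below uses stats::t.test purely as a stand-in that also returns an htest object. It is not the output of C2ST, and the classifier component, which is specific to this function, is not present in it.

```r
# Stand-in example: t.test() also returns an object of class "htest"
set.seed(1)
res <- t.test(rnorm(50), rnorm(50, mean = 0.5))

class(res)        # class of the result
res$statistic     # observed value of the test statistic
res$p.value       # p value
res$alternative   # alternative hypothesis
res$method        # description of the test
res$data.name     # names of the data
```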
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | Yes | Yes |
Lopez-Paz, D., and Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR 2017. https://openreview.net/forum?id=SJkXfE5xx.
Roux de Bezieux, H. (2021). Ecume: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions. R package version 0.9.1, https://CRAN.R-project.org/package=Ecume.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
HMN, YMRZL
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform classifier two-sample test
if (requireNamespace("Ecume", quietly = TRUE)) {
  C2ST(X1, X2)
}