View source: R/DataSimilarity.R
DataSimilarity: R Documentation
Calculate the similarity of two or more datasets
DataSimilarity(X1, X2, method, ...)
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
method: Name of the method for calculating the similarity of the supplied datasets as a character string. See Details.
...: Further arguments passed on to the function that implements the chosen method
The package includes various methods for calculating the similarity of two or more datasets.
Appropriate methods for a specific situation can be found using the findSimilarityMethod
function.
In the following, the available methods are listed sorted by their applicability to datasets with different characteristics.
We differentiate between the number of datasets (two vs. more than two), the type of data (numeric vs. categorical) and the presence of a target variable in each dataset (with vs. without target variable).
Typically, this target variable has to be categorical.
A method may be applicable in multiple cases; it is then listed once in each case for which it can be used.
For the list of methods, see below.
Typically an object of class htest with the following components:
statistic: Observed value of the test statistic
parameter: Parameter of the null distribution of the test statistic (where applicable)
p.value: Permutation or asymptotic p value (where applicable)
estimate: Sample estimate (where applicable)
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
Further components specific to the method might be included. For details see the help pages of the respective methods.
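For illustration, the components above can be accessed like in any htest object. A minimal sketch, assuming the DataSimilarity package and the backend needed for method = "Energy" are installed:

```r
# Compare two iris species and inspect components of the returned htest object.
# Assumes DataSimilarity (and the backend for method = "Energy") is installed.
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  data("iris")
  setosa <- iris[iris$Species == "setosa", -5]
  versicolor <- iris[iris$Species == "versicolor", -5]
  res <- DataSimilarity::DataSimilarity(setosa, versicolor, method = "Energy")
  res$statistic  # observed value of the test statistic
  res$p.value    # permutation p value
  res$data.name  # the dataset names
}
```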
Methods for comparing two numeric datasets without target variables:
Bahr
The Bahr (1996) two-sample test. Compares two numeric datasets based on inter-point distances; special case of the test of Baringhaus and Franz (2010) (BF).
BallDivergence
Ball divergence based two- or k-sample test for numeric datasets. The Ball Divergence is the square of the measure difference over a given closed ball collection.
BF
The Baringhaus and Franz (2010) test. Compares two numeric datasets based on inter-point distances using a kernel function. Different kernel functions are tailored to detecting certain alternatives, e.g. shift or scale.
BG
The Biau and Gyorfi (2005) two-sample homogeneity test. Generalization of the Kolmogorov-Smirnov test for multivariate data, uses the L_1-distance between two empirical distribution functions restricted to a finite partition.
BG2
The Biswas and Ghosh (2014) two-sample test for high-dimensional data. Compares two numeric datasets based on inter-point distances by comparing the means of the distributions of the within-sample and between-sample distances of both samples.
BMG
The Biswas, Mukhopadhyay and Ghosh (2014) distribution-free two-sample runs test. Compares two numeric datasets using the Shortest Hamiltonian Path in the pooled sample.
BQS
The nearest-neighbor-based multivariate two-sample test of Barakat et al. (1996). Modifies the Schilling-Henze nearest neighbor test (SH) such that the number of nearest neighbors does not have to be chosen.
C2ST
Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can be used for multiple samples and categorical data also. Uses the classification accuracy of a classifier that is trained to distinguish between the datasets.
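The classifier-accuracy idea behind C2ST can be sketched in a few lines of base R. The sketch below is illustrative only, not the package's implementation: it uses a logistic regression as a stand-in classifier, with dataset membership as the class label.

```r
# Sketch of the classifier two-sample test idea: label each observation with
# its dataset of origin, train a classifier (here: logistic regression), and
# check whether its held-out accuracy exceeds the chance level of 0.5.
c2st.sketch <- function(X1, X2) {
  dat <- data.frame(rbind(as.matrix(X1), as.matrix(X2)),
                    y = rep(0:1, c(nrow(X1), nrow(X2))))
  test.idx <- sample(nrow(dat), floor(nrow(dat) / 2))
  fit <- glm(y ~ ., data = dat[-test.idx, ], family = binomial)
  acc <- mean((predict(fit, dat[test.idx, ], type = "response") > 0.5) ==
                dat$y[test.idx])
  acc  # near 0.5 if the distributions agree, larger otherwise
}
set.seed(1)
X1 <- matrix(rnorm(200), ncol = 2)
X2 <- matrix(rnorm(200, mean = 2), ncol = 2)
c2st.sketch(X1, X2)
```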
CCS
Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test (FR).
CF
Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test (FR).
Cramer
The Cramér two-sample test (Baringhaus and Franz, 2004). Compares two numeric datasets based on inter-point distances; special case of the test of Baringhaus and Franz (2010) (BF), equivalent to the Energy distance (Energy).
DiProPerm
Direction Projection Permutation test. Compares two numeric datasets using a linear classifier that distinguishes between the two datasets by projecting all observations onto the normal vector of that classifier and performing a permutation test using a univariate two-sample statistic on these projected scores.
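The projection-plus-permutation recipe described above can be sketched with the mean-difference direction standing in for the linear classifier's normal vector (the actual test typically fits a classifier such as DWD or an SVM) and the absolute difference of mean projected scores as the univariate statistic; all names below are illustrative.

```r
# Illustrative sketch of the DiProPerm recipe, not the package implementation:
# project onto a linear direction, compare projected scores, and recompute the
# statistic under permutations of the pooled sample.
diproperm.sketch <- function(X1, X2, B = 199) {
  X1 <- as.matrix(X1); X2 <- as.matrix(X2)
  proj.stat <- function(A, B2) {
    w <- colMeans(A) - colMeans(B2)      # stand-in for the classifier's normal vector
    abs(mean(A %*% w) - mean(B2 %*% w))  # univariate two-sample statistic
  }
  obs <- proj.stat(X1, X2)
  pooled <- rbind(X1, X2)
  perm <- replicate(B, {
    idx <- sample(nrow(pooled), nrow(X1))
    proj.stat(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })
  mean(c(perm, obs) >= obs)  # permutation p value
}
set.seed(1)
X1 <- matrix(rnorm(100), ncol = 2)
X2 <- matrix(rnorm(100, mean = 1.5), ncol = 2)
diproperm.sketch(X1, X2)  # small p value: distributions differ
```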
DISCOB, DISCOF
Energy statistics distance components (DISCO) (Rizzo and Székely, 2010). Compares two or more numeric datasets based on a decomposition of the total variance similar to ANOVA but using inter-point distances. DISCOB uses the between-sample inter-point distances; DISCOF uses an F-type statistic that takes the within- and between-sample inter-point distances into account.
DS
Multivariate rank-based two-sample test using measure transportation by Deb and Sen (2021). Uses a rank version of the Energy statistic.
Energy
The Energy statistic multi-sample test (Székely and Rizzo, 2004). Compares two or more numeric datasets based on inter-point distances. Equivalent to the Cramer test.
engineerMetric
The L_q-engineer metric for comparing two multivariate distributions.
FR
The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979). Compares two numeric datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.
FStest
Modified/multiscale/aggregated FS test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on a Fisher test for the independence of a clustering of the data and the true dataset membership.
GPK
Generalized permutation-based kernel two-sample test proposed by Song and Chen (2021). Modification of the MMD test intended to better detect differences in variances.
HMN
Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.
Jeffreys
Jeffreys divergence. Symmetrized version of the Kullback-Leibler divergence.
KMD
Kernel measure of multi-sample dissimilarity (KMD) by Huang and Sen (2023). Uses the association between the features and the sample membership to quantify the dissimilarity of the distributions of two or more numeric datasets.
LHZ
Characteristic distance by Li et al. (2022). Compares two numeric datasets using their empirical characteristic functions.
MMCM
Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.
MMD
Maximum Mean Discrepancy (MMD). Compares two numeric datasets using a kernel function. Measures the difference between distributions in the reproducing kernel Hilbert space induced by the chosen kernel function.
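To make the RKHS formulation concrete, here is a minimal base-R computation of the (biased) squared-MMD estimate with a Gaussian kernel; the bandwidth choice and the permutation test provided by the actual MMD method are omitted, and the bandwidth sigma is an assumed fixed value.

```r
# Biased squared-MMD estimate with a Gaussian kernel (sigma assumed fixed;
# the package's MMD method additionally provides a permutation test).
gauss.cross <- function(A, B, sigma = 1) {
  D2 <- as.matrix(dist(rbind(A, B)))^2  # squared Euclidean distances
  K <- exp(-D2 / (2 * sigma^2))
  K[seq_len(nrow(A)), nrow(A) + seq_len(nrow(B)), drop = FALSE]
}
mmd2 <- function(X, Y, sigma = 1) {
  mean(gauss.cross(X, X, sigma)) + mean(gauss.cross(Y, Y, sigma)) -
    2 * mean(gauss.cross(X, Y, sigma))
}
set.seed(1)
X <- matrix(rnorm(100), ncol = 2)
Y <- matrix(rnorm(100, mean = 1), ncol = 2)
mmd2(X, X)  # exactly 0 for identical samples
mmd2(X, Y)  # positive for shifted samples
```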
MW
Nonparametric graph-based LP (GLP) multi-sample test proposed by Mukhopadhyay and Wang (2020). Compares two or more numeric datasets based on learning an LP graph kernel using a pre-specified number of LP components and performing clustering on the eigenvectors of the Laplacian matrix for this learned kernel.
Petrie
Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.
RItest
Modified/multiscale/aggregated RI test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on the Rand index of a clustering of the data and the true dataset membership.
Rosenbaum
Rosenbaum (2005) two-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Petrie and to the MMCM test.
SC
Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.
SH
Schilling-Henze nearest neighbor test (Schilling, 1986; Henze, 1988). Uses the number of edges connecting points from different samples in a K-nearest neighbor graph on the pooled sample.
Wasserstein
Wasserstein distance. Permutation two-sample test for numeric data using the p-Wasserstein distance.
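In one dimension the empirical p-Wasserstein distance has a closed form as an L_p distance between order statistics, which makes a quick sanity check easy (equal sample sizes assumed for simplicity; the package handles the general multivariate case).

```r
# Empirical p-Wasserstein distance in one dimension via sorted samples.
wasserstein.1d <- function(x, y, p = 1) {
  stopifnot(length(x) == length(y))  # unequal sizes need quantile interpolation
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
}
set.seed(1)
x <- rnorm(100)
wasserstein.1d(x, x + 1)  # exactly 1 for a unit shift
```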
YMRZL
Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.
ZC
Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).
Methods for comparing two numeric datasets with target variables:
GGRL
Decision-tree based measure of dataset distance and two-sample test (Ganti et al., 2002). Compares the proportions of datapoints of the two datasets falling into each section of the intersection of the partitions induced by fitting a decision tree on each dataset.
NKT
Decision–tree based measure of dataset similarity by Ntoutsi et al. (2008). Uses density estimates based on the intersection of the partitions induced by fitting a decision tree on each dataset.
OTDD
Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.
Methods for comparing more than two numeric datasets without target variables:
BallDivergence
Ball divergence based two- or k-sample test for numeric datasets. The Ball Divergence is the square of the measure difference over a given closed ball collection.
C2ST
Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can be used for multiple samples and categorical data also. Uses the classification accuracy of a classifier that is trained to distinguish between the datasets.
DISCOB, DISCOF
Energy statistics distance components (DISCO) (Rizzo and Székely, 2010). Compares two or more numeric datasets based on a decomposition of the total variance similar to ANOVA but using inter-point distances. DISCOB uses the between-sample inter-point distances; DISCOF uses an F-type statistic that takes the within- and between-sample inter-point distances into account.
Energy
The Energy statistic multi-sample test (Székely and Rizzo, 2004). Compares two or more numeric datasets based on inter-point distances. Equivalent to the Cramer test.
FStest
Modified/multiscale/aggregated FS test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on a Fisher test for the independence of a clustering of the data and the true dataset membership.
KMD
Kernel measure of multi-sample dissimilarity (KMD) by Huang and Sen (2023). Uses the association between the features and the sample membership to quantify the dissimilarity of the distributions of two or more numeric datasets.
MMCM
Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.
MW
Nonparametric graph-based LP (GLP) multi-sample test proposed by Mukhopadhyay and Wang (2020). Compares two or more numeric datasets based on learning an LP graph kernel using a pre-specified number of LP components and performing clustering on the eigenvectors of the Laplacian matrix for this learned kernel.
Petrie
Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.
RItest
Modified/multiscale/aggregated RI test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on the Rand index of a clustering of the data and the true dataset membership.
SC
Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.
Methods for comparing two categorical datasets without target variables:
C2ST
Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can be used for multiple samples and categorical data also. Uses the classification accuracy of a classifier that is trained to distinguish between the datasets.
CCS_cat
Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test (FR_cat).
CF_cat
Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test (FR_cat).
CMDistance
Constrained Minimum (CM) distance (Tatti, 2007). Compares two categorical datasets using the distance of summaries.
FR_cat
The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979). Compares two categorical datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.
HMN
Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.
MMCM
Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.
Petrie
Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.
YMRZL
Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.
ZC_cat
Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR_cat) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF_cat).
Methods for comparing two categorical datasets with target variables:
GGRLCat
Decision-tree based measure of dataset distance and two-sample test (Ganti et al., 2002). Compares the proportions of datapoints of the two datasets falling into each section of the intersection of the partitions induced by fitting a decision tree on each dataset.
OTDD
Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.
Methods for comparing more than two categorical datasets without target variables:
C2ST
Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can be used for multiple samples and categorical data also. Uses the classification accuracy of a classifier that is trained to distinguish between the datasets.
MMCM
Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.
Petrie
Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.
Methods for comparing two datasets with both numeric and categorical variables, without target variables:
BMG
(in case of no ties, appropriate distance function has to be specified) The Biswas, Mukhopadhyay and Ghosh (2014) distribution-free two-sample runs test. Compares two numeric datasets using the Shortest Hamiltonian Path in the pooled sample.
BQS
(in case of no ties, appropriate distance function has to be specified) The nearest-neighbor-based multivariate two-sample test of Barakat et al. (1996). Modifies the Schilling-Henze nearest neighbor test (SH) such that the number of nearest neighbors does not have to be chosen.
C2ST
Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can be used for multiple samples and categorical data also. Uses the classification accuracy of a classifier that is trained to distinguish between the datasets.
CCS
(in case of no ties, appropriate distance function has to be specified) Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test (FR).
CF
(in case of no ties, appropriate distance function has to be specified) Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test (FR).
FR
(in case of no ties, appropriate distance function has to be specified) The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979). Compares two numeric datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.
HMN
Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.
MMCM
(in case of no ties, appropriate distance function has to be specified) Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.
Petrie
(in case of no ties, appropriate distance function has to be specified) Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.
Rosenbaum
(in case of no ties, appropriate distance function has to be specified) Rosenbaum (2005) two-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Petrie and to the MMCM test.
SC
(in case of no ties, appropriate distance function has to be specified) Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.
SH
(in case of no ties, appropriate distance function has to be specified) Schilling-Henze nearest neighbor test (Schilling, 1986; Henze, 1988). Uses the number of edges connecting points from different samples in a K-nearest neighbor graph on the pooled sample.
YMRZL
Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.
ZC
(in case of no ties, appropriate distance function has to be specified) Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).
Methods for comparing two datasets with both numeric and categorical variables, with target variables:
OTDD
(appropriate distance function has to be specified) Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.
Methods for comparing more than two datasets with both numeric and categorical variables, without target variables:
C2ST
Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can be used for multiple samples and categorical data also. Uses the classification accuracy of a classifier that is trained to distinguish between the datasets.
MMCM
(in case of no ties, appropriate distance function has to be specified) Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.
Petrie
(in case of no ties, appropriate distance function has to be specified) Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.
SC
(in case of no ties, appropriate distance function has to be specified) Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys 18, 163-298. doi:10.1214/24-SS149
See also: method.table, findSimilarityMethod
# Workflow for using the DataSimilarity package:
# Prepare data example: comparing species in iris dataset
data("iris")
iris.split <- split(iris[, -5], iris$Species)
setosa <- iris.split$setosa
versicolor <- iris.split$versicolor
virginica <- iris.split$virginica
# 1. Find appropriate methods that can be used to compare 3 numeric datasets:
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE)
# get more information
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE, only.names = FALSE)
# 2. Choose a method and apply it:
# All suitable methods
possible.methods <- findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE,
                                         only.names = FALSE)
# Select, e.g., the method with the highest number of fulfilled criteria
possible.methods$Implementation[which.max(possible.methods$Number.Fulfilled)]
set.seed(1234)
if(requireNamespace("KMD")) {
DataSimilarity(setosa, versicolor, virginica, method = "KMD")
}
# or directly
set.seed(1234)
if(requireNamespace("KMD")) {
KMD(setosa, versicolor, virginica)
}