DataSimilarity: Dataset Similarity

View source: R/DataSimilarity.R

DataSimilarity: R Documentation

Dataset Similarity

Description

Calculate the similarity of two or more datasets

Usage

DataSimilarity(X1, X2, method, ...)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

method

Name of the method for calculating the similarity of the supplied datasets, as a character string. See Details.

...

Further arguments passed on to method

Details

The package includes various methods for calculating the similarity of two or more datasets. Appropriate methods for a specific situation can be found using the findSimilarityMethod function. In the following, the available methods are listed sorted by their applicability to datasets with different characteristics. We differentiate between the number of datasets (two vs. more than two), the type of data (numeric vs. categorical) and the presence of a target variable in each dataset (with vs. without target variable). Typically, this target variable has to be categorical. Methods might be applicable in multiple cases. Then, they are listed once in each case for which they can be used. For the list of methods, see below.
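
For example, two simulated datasets can be compared as follows (a minimal sketch; it assumes the DataSimilarity package is installed together with the backing package of the chosen method, here assumed to be the energy package for method = "Energy"):

```r
# Two numeric datasets with the same dimension but shifted means
set.seed(42)
X1 <- matrix(rnorm(100 * 3), ncol = 3)
X2 <- matrix(rnorm(100 * 3, mean = 0.5), ncol = 3)

# Compare them with, e.g., the Energy statistic test
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("energy", quietly = TRUE)) {
  res <- DataSimilarity::DataSimilarity(X1, X2, method = "Energy")
  print(res)
}
```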

Value

Typically an object of class htest with the following components:

statistic

Observed value of the test statistic

parameter

Parameter of the null distribution of the test statistic (where applicable)

p.value

Permutation or asymptotic p-value (where applicable)

estimate

Sample estimate (where applicable)

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Further components specific to the method might be included. For details see the help pages of the respective methods.
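
The components listed above can be extracted with `$`, as with any object of class htest; this is shown here with base R's t.test as a stand-in for a DataSimilarity result:

```r
# Any htest object exposes the components listed above via $
set.seed(1)
tt <- t.test(rnorm(50), rnorm(50, mean = 1))
tt$statistic   # observed value of the test statistic
tt$p.value     # p-value
tt$method      # description of the test
tt$data.name   # dataset names
```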

Methods for two numeric datasets without target variables

Bahr

The Bahr (1996) two-sample test. Compares two numeric datasets based on inter-point distances; special case of the test of Baringhaus and Franz (2010) (BF).

BallDivergence

Ball divergence based two- or k-sample test for numeric datasets. The Ball Divergence is the square of the measure difference over a given closed ball collection.

BF

The Baringhaus and Franz (2010) test. Compares two numeric datasets based on inter-point distances using a kernel function. Different kernel functions are tailored to detecting certain alternatives, e.g. shift or scale.

BG

The Biau and Gyorfi (2005) two-sample homogeneity test. Generalization of the Kolmogorov-Smirnov test for multivariate data, uses the L_1-distance between two empirical distribution functions restricted to a finite partition.

BG2

The Biswas and Ghosh (2014) two-sample test for high-dimensional data. Compares two numeric datasets based on inter-point distances by comparing the means of the distributions of the within-sample and between-sample distances of both samples.
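
The intuition behind such inter-point distance tests can be sketched in a few lines of base R (an illustration of the within- vs. between-sample distance comparison only, not the actual BG2 statistic):

```r
set.seed(1)
X1 <- matrix(rnorm(30 * 5), ncol = 5)            # N(0, I)
X2 <- matrix(rnorm(30 * 5, mean = 1), ncol = 5)  # shifted mean
n1 <- nrow(X1); n2 <- nrow(X2)
D <- as.matrix(dist(rbind(X1, X2)))  # pooled inter-point distances
D11 <- D[1:n1, 1:n1]
D22 <- D[(n1 + 1):(n1 + n2), (n1 + 1):(n1 + n2)]
within1 <- mean(D11[lower.tri(D11)])
within2 <- mean(D22[lower.tri(D22)])
between <- mean(D[1:n1, (n1 + 1):(n1 + n2)])
# For equal distributions all three means would be similar;
# the mean shift inflates the between-sample distances
c(within1 = within1, within2 = within2, between = between)
```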

BMG

The Biswas, Mukhopadhyay and Ghosh (2014) distribution-free two-sample runs test. Compares two numeric datasets using the Shortest Hamiltonian Path in the pooled sample.

BQS

The nearest-neighbor-based multivariate two-sample test of Barakat et al. (1996). Modifies the Schilling-Henze nearest neighbor tests (SH) such that the number of nearest neighbors does not have to be chosen.

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.
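
The underlying idea can be sketched with a plain logistic regression in base R (an illustration only; the package's implementation and choice of classifier may differ):

```r
set.seed(1)
X1 <- matrix(rnorm(100 * 2), ncol = 2)
X2 <- matrix(rnorm(100 * 2, mean = 1), ncol = 2)
dat <- data.frame(rbind(X1, X2), y = rep(0:1, each = 100))

# Train a classifier to tell the datasets apart, then check its
# held-out accuracy: accuracy near 0.5 suggests similar datasets
idx <- sample(nrow(dat), 100)
fit <- glm(y ~ ., data = dat[idx, ], family = binomial)
pred <- as.numeric(predict(fit, dat[-idx, ], type = "response") > 0.5)
acc <- mean(pred == dat$y[-idx])
acc
```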

CCS

Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test FR.

CF

Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test FR.

Cramer

The Cramér two-sample test (Baringhaus and Franz, 2004). Compares two numeric datasets based on inter-point distances; special case of the test of Baringhaus and Franz (2010) (BF), equivalent to the Energy distance Energy.

DiProPerm

Direction Projection Permutation test. Compares two numeric datasets using a linear classifier that distinguishes between the two datasets by projecting all observations onto the normal vector of that classifier and performing a permutation test using a univariate two-sample statistic on these projected scores.

DISCOB, DISCOF

Energy statistics distance components (DISCO) (Rizzo and Székely, 2010). Compares two or more numeric datasets based on a decomposition of the total variance similar to ANOVA but using inter-point distances. DISCOB uses the between-sample inter-point distances, DISCOF uses an F-type statistic that takes the within- and between-sample inter-point distances into account.

DS

Multivariate rank-based two-sample test using measure transportation by Deb and Sen (2021). Uses a rank version of the Energy statistic.

Energy

The Energy statistic multi-sample test (Székely and Rizzo, 2004). Compares two or more numeric datasets based on inter-point distances. Equivalent to the Cramer test.
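
The two-sample energy statistic itself is simple to write down (a base-R sketch of the statistic only; the package's test additionally provides a permutation or asymptotic p-value):

```r
set.seed(1)
X1 <- matrix(rnorm(40 * 3), ncol = 3)
X2 <- matrix(rnorm(40 * 3, sd = 2), ncol = 3)
n1 <- nrow(X1); n2 <- nrow(X2)
D <- as.matrix(dist(rbind(X1, X2)))
D12 <- D[1:n1, (n1 + 1):(n1 + n2)]
D11 <- D[1:n1, 1:n1]
D22 <- D[(n1 + 1):(n1 + n2), (n1 + 1):(n1 + n2)]
# E-statistic: 2*E||X - Y|| - E||X - X'|| - E||Y - Y'||
e <- 2 * mean(D12) - sum(D11) / n1^2 - sum(D22) / n2^2
e   # nonnegative; large values indicate differing distributions
```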

engineerMetric

The L_q-engineer metric for comparing two multivariate distributions.

FR

The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979). Compares two numeric datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.

FStest

Modified/multiscale/aggregated FS test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on a Fisher test for the independence of a clustering of the data and the true dataset membership.

GPK

Generalized permutation-based kernel two-sample test proposed by Song and Chen (2021). Modification of the MMD test intended to better detect differences in variances.

HMN

Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.

Jeffreys

Jeffreys divergence. Symmetrized version of the Kullback-Leibler divergence.
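
For two discrete distributions p and q it is simply KL(p || q) + KL(q || p); the package estimates the divergence from data, but the definition can be sketched directly in base R:

```r
# Jeffreys divergence J(p, q) = KL(p || q) + KL(q || p)
kl <- function(a, b) sum(a * log(a / b))
p <- c(0.2, 0.5, 0.3)
q <- c(0.3, 0.3, 0.4)
jeffreys <- kl(p, q) + kl(q, p)
jeffreys   # symmetric: same value as kl(q, p) + kl(p, q)
```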

KMD

Kernel measure of multi-sample dissimilarity (KMD) by Huang and Sen (2023). Uses the association between the features and the sample membership to quantify the dissimilarity of the distributions of two or more numeric datasets.

LHZ

Characteristic distance by Li et al. (2022). Compares two numeric datasets using their empirical characteristic functions.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

MMD

Maximum Mean Discrepancy (MMD). Compares two numeric datasets using a kernel function. Measures the difference between distributions in the reproducing kernel Hilbert space induced by the chosen kernel function.

MW

Nonparametric graph-based LP (GLP) multi-sample test proposed by Mukhopadhyay and Wang (2020). Compares two or more numeric datasets based on learning an LP graph kernel using a pre-specified number of LP components and performing clustering on the eigenvectors of the Laplacian matrix for this learned kernel.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

RItest

Modified/multiscale/aggregated RI test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on the Rand index of a clustering of the data and the true dataset membership.

Rosenbaum

Rosenbaum (2005) two-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Petrie and to the MMCM test.

SC

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

SH

Schilling-Henze nearest neighbor test (Schilling, 1986; Henze, 1988). Uses the number of edges connecting points from different samples in a K-nearest neighbor graph on the pooled sample.
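
The nearest-neighbor count at the heart of the test can be sketched in base R (an illustration of the count itself, not the standardized test statistic):

```r
set.seed(1)
X1 <- matrix(rnorm(25 * 2), ncol = 2)
X2 <- matrix(rnorm(25 * 2, mean = 2), ncol = 2)
pooled <- rbind(X1, X2)
lab <- rep(1:2, each = 25)
K <- 3
D <- as.matrix(dist(pooled))
diag(D) <- Inf  # a point is not its own neighbor
# Count, for each point, how many of its K nearest neighbors
# come from the same sample
same <- sapply(seq_len(nrow(pooled)), function(i)
  sum(lab[order(D[i, ])[1:K]] == lab[i]))
mean(same) / K  # near 1: well separated; near 0.5: well mixed
```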

Wasserstein

Wasserstein distance. Permutation two-sample test for numeric data using the p-Wasserstein distance.
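
In one dimension with equal sample sizes, the empirical p-Wasserstein distance reduces to matching order statistics (a base-R sketch; the package handles the multivariate case and the permutation test):

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100, mean = 1)
# Empirical p-Wasserstein distance between equal-size univariate
# samples: mean of |x_(i) - y_(i)|^p over matched order
# statistics, taken to the power 1/p
wasserstein1d <- function(x, y, p = 1)
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
wasserstein1d(x, y)   # close to the true mean shift of 1
```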

YMRZL

Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.

ZC

Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).

Methods for two numeric datasets with target variables

GGRL

Decision-tree based measure of dataset distance and two-sample test (Ganti et al., 2002). Compares the proportions of datapoints of the two datasets falling into each section of the intersection of the partitions induced by fitting a decision tree on each dataset.

NKT

Decision–tree based measure of dataset similarity by Ntoutsi et al. (2008). Uses density estimates based on the intersection of the partitions induced by fitting a decision tree on each dataset.

OTDD

Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.

Methods for more than two numeric datasets without target variables

BallDivergence

Ball divergence based two- or k-sample test for numeric datasets. The Ball Divergence is the square of the measure difference over a given closed ball collection.

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

DISCOB, DISCOF

Energy statistics distance components (DISCO) (Rizzo and Székely, 2010). Compares two or more numeric datasets based on a decomposition of the total variance similar to ANOVA but using inter-point distances. DISCOB uses the between-sample inter-point distances, DISCOF uses an F-type statistic that takes the within- and between-sample inter-point distances into account.

Energy

The Energy statistic multi-sample test (Székely and Rizzo, 2004). Compares two or more numeric datasets based on inter-point distances. Equivalent to the Cramer test.

FStest

Modified/multiscale/aggregated FS test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on a Fisher test for the independence of a clustering of the data and the true dataset membership.

KMD

Kernel measure of multi-sample dissimilarity (KMD) by Huang and Sen (2023). Uses the association between the features and the sample membership to quantify the dissimilarity of the distributions of two or more numeric datasets.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

MW

Nonparametric graph-based LP (GLP) multi-sample test proposed by Mukhopadhyay and Wang (2020). Compares two or more numeric datasets based on learning an LP graph kernel using a pre-specified number of LP components and performing clustering on the eigenvectors of the Laplacian matrix for this learned kernel.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

RItest

Modified/multiscale/aggregated RI test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on the Rand index of a clustering of the data and the true dataset membership.

SC

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

Methods for two categorical datasets without target variables

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

CCS_cat

Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test FR_cat.

CF_cat

Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test FR_cat.

CMDistance

Constrained Minimum (CM) distance (Tatti, 2007). Compares two categorical datasets using the distance of summaries.

FR_cat

The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979), applied to categorical data. Compares two datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.

HMN

Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

YMRZL

Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.

ZC_cat

Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).

Methods for two categorical datasets with target variables

GGRLCat

Decision-tree based measure of dataset distance and two-sample test (Ganti et al., 2002). Compares the proportions of datapoints of the two datasets falling into each section of the intersection of the partitions induced by fitting a decision tree on each dataset.

OTDD

Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.

Methods for more than two categorical datasets without target variables

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

Methods for two datasets with both categorical and numeric variables but without target variables

BMG (in case of no ties, appropriate distance function has to be specified)

The Biswas, Mukhopadhyay and Ghosh (2014) distribution-free two-sample runs test. Compares two numeric datasets using the Shortest Hamiltonian Path in the pooled sample.

BQS (in case of no ties, appropriate distance function has to be specified)

The nearest-neighbor-based multivariate two-sample test of Barakat et al. (1996). Modifies the Schilling-Henze nearest neighbor tests (SH) such that the number of nearest neighbors does not have to be chosen.

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

CCS (in case of no ties, appropriate distance function has to be specified)

Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test FR.

CF (in case of no ties, appropriate distance function has to be specified)

Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test FR.

FR (in case of no ties, appropriate distance function has to be specified)

The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979). Compares two numeric datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.

HMN

Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.

MMCM (in case of no ties, appropriate distance function has to be specified)

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie (in case of no ties, appropriate distance function has to be specified)

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

Rosenbaum (in case of no ties, appropriate distance function has to be specified)

Rosenbaum (2005) two-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Petrie and to the MMCM test.

SC (in case of no ties, appropriate distance function has to be specified)

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

SH (in case of no ties, appropriate distance function has to be specified)

Schilling-Henze nearest neighbor test (Schilling, 1986; Henze, 1988). Uses the number of edges connecting points from different samples in a K-nearest neighbor graph on the pooled sample.

YMRZL

Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.

ZC (in case of no ties, appropriate distance function has to be specified)

Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).

Methods for two datasets with both categorical and numeric variables and target variables

OTDD (appropriate distance function has to be specified)

Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.

Methods for more than two datasets with both categorical and numeric variables but without target variables

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

MMCM (in case of no ties, appropriate distance function has to be specified)

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie (in case of no ties, appropriate distance function has to be specified)

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

SC (in case of no ties, appropriate distance function has to be specified)

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

References

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistical Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

method.table, findSimilarityMethod

Examples

# Workflow for using the DataSimilarity package: 
# Prepare data example: comparing species in iris dataset
data("iris")
iris.split <- split(iris[, -5], iris$Species)
setosa <- iris.split$setosa
versicolor <- iris.split$versicolor
virginica <- iris.split$virginica

# 1. Find appropriate methods that can be used to compare 3 numeric datasets:
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE)

# get more information 
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE, only.names = FALSE)

# 2. Choose a method and apply it:
# All suitable methods
possible.methods <- findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE, 
                                          only.names = FALSE)
# Select, e.g., method with highest number of fulfilled criteria
possible.methods$Implementation[which.max(possible.methods$Number.Fulfilled)]

set.seed(1234)
if(requireNamespace("KMD")) {
  DataSimilarity(setosa, versicolor, virginica, method = "KMD")
}

# or directly 
set.seed(1234)
if(requireNamespace("KMD")) {
  KMD(setosa, versicolor, virginica)
}

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.