DataSimilarity: Dataset Similarity

View source: R/DataSimilarity.R

DataSimilarity: R Documentation

Dataset Similarity

Description

Calculate the similarity of two or more datasets

Usage

DataSimilarity(X1, X2, method, ...)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

method

Name of the method for calculating the similarity of the supplied datasets, as a character string. See Details.

...

Further arguments passed on to method

Details

The package includes various methods for calculating the similarity of two or more datasets. Appropriate methods for a specific situation can be found using the findSimilarityMethod function. In the following, the available methods are listed sorted by their applicability to datasets with different characteristics. We differentiate between the number of datasets (two vs. more than two), the type of data (numeric vs. categorical) and the presence of a target variable in each dataset (with vs. without target variable). Typically, this target variable has to be categorical. Methods might be applicable in multiple cases. Then, they are listed once in each case for which they can be used. For the list of methods, see below.
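
For example, two simulated datasets can be compared as follows (a minimal sketch; it assumes the DataSimilarity package is installed together with the backing package of the chosen method, here assumed to be the energy package for method = "Energy"):

```r
# Two numeric datasets with the same dimension but shifted means
set.seed(42)
X1 <- matrix(rnorm(100 * 3), ncol = 3)
X2 <- matrix(rnorm(100 * 3, mean = 0.5), ncol = 3)

# Compare them with, e.g., the Energy statistic test
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("energy", quietly = TRUE)) {
  res <- DataSimilarity::DataSimilarity(X1, X2, method = "Energy")
  print(res)
}
```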

Value

Typically an object of class htest with the following components:

statistic

Observed value of the test statistic

parameter

Parameter of the null distribution of the test statistic (where applicable)

p.value

Permutation or asymptotic p-value (where applicable)

estimate

Sample estimate (where applicable)

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Further components specific to the method might be included. For details see the help pages of the respective methods.
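
The components listed above can be extracted with `$`, as with any object of class htest; this is shown here with base R's t.test as a stand-in for a DataSimilarity result:

```r
# Any htest object exposes the components listed above via $
set.seed(1)
tt <- t.test(rnorm(50), rnorm(50, mean = 1))
tt$statistic   # observed value of the test statistic
tt$p.value     # p-value
tt$method      # description of the test
tt$data.name   # dataset names
```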

Methods for two numeric datasets without target variables

Bahr

The Bahr (1996) two-sample test. Compares two numeric datasets based on inter-point distances; special case of the test of Baringhaus and Franz (2010) (BF).

BallDivergence

Ball divergence based two- or k-sample test for numeric datasets. The Ball Divergence is the square of the measure difference over a given closed ball collection.

BF

The Baringhaus and Franz (2010) test. Compares two numeric datasets based on inter-point distances using a kernel function. Different kernel functions are tailored to detecting certain alternatives, e.g. shift or scale.

BG

The Biau and Gyorfi (2005) two-sample homogeneity test. Generalization of the Kolmogorov-Smirnov test for multivariate data, uses the L_1-distance between two empirical distribution functions restricted to a finite partition.

BG2

The Biswas and Ghosh (2014) two-sample test for high-dimensional data. Compares two numeric datasets based on inter-point distances by comparing the means of the distributions of the within-sample and between-sample distances of both samples.
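
The intuition behind such inter-point distance tests can be sketched in a few lines of base R (an illustration of the within- vs. between-sample distance comparison only, not the actual BG2 statistic):

```r
set.seed(1)
X1 <- matrix(rnorm(30 * 5), ncol = 5)            # N(0, I)
X2 <- matrix(rnorm(30 * 5, mean = 1), ncol = 5)  # shifted mean
n1 <- nrow(X1); n2 <- nrow(X2)
D <- as.matrix(dist(rbind(X1, X2)))  # pooled inter-point distances
D11 <- D[1:n1, 1:n1]
D22 <- D[(n1 + 1):(n1 + n2), (n1 + 1):(n1 + n2)]
within1 <- mean(D11[lower.tri(D11)])
within2 <- mean(D22[lower.tri(D22)])
between <- mean(D[1:n1, (n1 + 1):(n1 + n2)])
# For equal distributions all three means would be similar;
# the mean shift inflates the between-sample distances
c(within1 = within1, within2 = within2, between = between)
```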

BMG

The Biswas, Mukhopadhyay and Ghosh (2014) distribution-free two-sample runs test. Compares two numeric datasets using the Shortest Hamiltonian Path in the pooled sample.

BQS

The nearest-neighbor-based multivariate two-sample test of Barakat et al. (1996). Modifies the Schilling-Henze nearest neighbor tests (SH) such that the number of nearest neighbors does not have to be chosen.

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.
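
The underlying idea can be sketched with a plain logistic regression in base R (an illustration only; the package's implementation and choice of classifier may differ):

```r
set.seed(1)
X1 <- matrix(rnorm(100 * 2), ncol = 2)
X2 <- matrix(rnorm(100 * 2, mean = 1), ncol = 2)
dat <- data.frame(rbind(X1, X2), y = rep(0:1, each = 100))

# Train a classifier to tell the datasets apart, then check its
# held-out accuracy: accuracy near 0.5 suggests similar datasets
idx <- sample(nrow(dat), 100)
fit <- glm(y ~ ., data = dat[idx, ], family = binomial)
pred <- as.numeric(predict(fit, dat[-idx, ], type = "response") > 0.5)
acc <- mean(pred == dat$y[-idx])
acc
```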

CCS

Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test FR.

CF

Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test FR.

Cramer

The Cramér two-sample test (Baringhaus and Franz, 2004). Compares two numeric datasets based on inter-point distances; special case of the test of Baringhaus and Franz (2010) (BF), equivalent to the Energy distance Energy.

DiProPerm

Direction Projection Permutation test. Compares two numeric datasets using a linear classifier that distinguishes between the two datasets by projecting all observations onto the normal vector of that classifier and performing a permutation test using a univariate two-sample statistic on these projected scores.

DISCOB, DISCOF

Energy statistics distance components (DISCO) (Rizzo and Székely, 2010). Compares two or more numeric datasets based on a decomposition of the total variance similar to ANOVA but using inter-point distances. DISCOB uses the between-sample inter-point distances, DISCOF uses an F-type statistic that takes the within- and between-sample inter-point distances into account.

DS

Multivariate rank-based two-sample test using measure transportation by Deb and Sen (2021). Uses a rank version of the Energy statistic.

Energy

The Energy statistic multi-sample test (Székely and Rizzo, 2004). Compares two or more numeric datasets based on inter-point distances. Equivalent to the Cramer test.
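
The two-sample energy statistic itself is simple to write down (a base-R sketch of the statistic only; the package's test additionally provides a permutation or asymptotic p-value):

```r
set.seed(1)
X1 <- matrix(rnorm(40 * 3), ncol = 3)
X2 <- matrix(rnorm(40 * 3, sd = 2), ncol = 3)
n1 <- nrow(X1); n2 <- nrow(X2)
D <- as.matrix(dist(rbind(X1, X2)))
D12 <- D[1:n1, (n1 + 1):(n1 + n2)]
D11 <- D[1:n1, 1:n1]
D22 <- D[(n1 + 1):(n1 + n2), (n1 + 1):(n1 + n2)]
# E-statistic: 2*E||X - Y|| - E||X - X'|| - E||Y - Y'||
e <- 2 * mean(D12) - sum(D11) / n1^2 - sum(D22) / n2^2
e   # nonnegative; large values indicate differing distributions
```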

engineerMetric

The L_q-engineer metric for comparing two multivariate distributions.

FR

The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979). Compares two numeric datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.

FStest

Modified/multiscale/aggregated FS test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on a Fisher test for the independence of a clustering of the data and the true dataset membership.

GPK

Generalized permutation-based kernel two-sample test proposed by Song and Chen (2021). Modification of the MMD test intended to better detect differences in variances.

HMN

Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.

Jeffreys

Jeffreys divergence. Symmetrized version of the Kullback-Leibler divergence.
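
For two discrete distributions p and q it is simply KL(p || q) + KL(q || p); the package estimates the divergence from data, but the definition can be sketched directly in base R:

```r
# Jeffreys divergence J(p, q) = KL(p || q) + KL(q || p)
kl <- function(a, b) sum(a * log(a / b))
p <- c(0.2, 0.5, 0.3)
q <- c(0.3, 0.3, 0.4)
jeffreys <- kl(p, q) + kl(q, p)
jeffreys   # symmetric: same value as kl(q, p) + kl(p, q)
```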

KMD

Kernel measure of multi-sample dissimilarity (KMD) by Huang and Sen (2023). Uses the association between the features and the sample membership to quantify the dissimilarity of the distributions of two or more numeric datasets.

LHZ

Characteristic distance by Li et al. (2022). Compares two numeric datasets using their empirical characteristic functions.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

MMD

Maximum Mean Discrepancy (MMD). Compares two numeric datasets using a kernel function. Measures the difference between distributions in the reproducing kernel Hilbert space induced by the chosen kernel function.

MW

Nonparametric graph-based LP (GLP) multi-sample test proposed by Mukhopadhyay and Wang (2020). Compares two or more numeric datasets based on learning an LP graph kernel using a pre-specified number of LP components and performing clustering on the eigenvectors of the Laplacian matrix for this learned kernel.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

RItest

Modified/multiscale/aggregated RI test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on the Rand index of a clustering of the data and the true dataset membership.

Rosenbaum

Rosenbaum (2005) two-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Petrie and to the MMCM test.

SC

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

SH

Schilling-Henze nearest neighbor test (Schilling, 1986; Henze, 1988). Uses the number of edges connecting points from different samples in a K-nearest neighbor graph on the pooled sample.
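
The nearest-neighbor count at the heart of the test can be sketched in base R (an illustration of the count itself, not the standardized test statistic):

```r
set.seed(1)
X1 <- matrix(rnorm(25 * 2), ncol = 2)
X2 <- matrix(rnorm(25 * 2, mean = 2), ncol = 2)
pooled <- rbind(X1, X2)
lab <- rep(1:2, each = 25)
K <- 3
D <- as.matrix(dist(pooled))
diag(D) <- Inf  # a point is not its own neighbor
# Count, for each point, how many of its K nearest neighbors
# come from the same sample
same <- sapply(seq_len(nrow(pooled)), function(i)
  sum(lab[order(D[i, ])[1:K]] == lab[i]))
mean(same) / K  # near 1: well separated; near 0.5: well mixed
```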

Wasserstein

Wasserstein distance. Permutation two-sample test for numeric data using the p-Wasserstein distance.
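
In one dimension with equal sample sizes, the empirical p-Wasserstein distance reduces to matching order statistics (a base-R sketch; the package handles the multivariate case and the permutation test):

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100, mean = 1)
# Empirical p-Wasserstein distance between equal-size univariate
# samples: mean of |x_(i) - y_(i)|^p over matched order
# statistics, taken to the power 1/p
wasserstein1d <- function(x, y, p = 1)
  mean(abs(sort(x) - sort(y))^p)^(1 / p)
wasserstein1d(x, y)   # close to the true mean shift of 1
```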

YMRZL

Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.

ZC

Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).

Methods for two numeric datasets with target variables

GGRL

Decision-tree based measure of dataset distance and two-sample test (Ganti et al., 2002). Compares the proportions of datapoints of the two datasets falling into each section of the intersection of the partitions induced by fitting a decision tree on each dataset.

NKT

Decision–tree based measure of dataset similarity by Ntoutsi et al. (2008). Uses density estimates based on the intersection of the partitions induced by fitting a decision tree on each dataset.

OTDD

Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.

Methods for more than two numeric datasets without target variables

BallDivergence

Ball divergence based two- or k-sample test for numeric datasets. The Ball Divergence is the square of the measure difference over a given closed ball collection.

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

DISCOB, DISCOF

Energy statistics distance components (DISCO) (Rizzo and Székely, 2010). Compares two or more numeric datasets based on a decomposition of the total variance similar to ANOVA but using inter-point distances. DISCOB uses the between-sample inter-point distances, DISCOF uses an F-type statistic that takes the within- and between-sample inter-point distances into account.

Energy

The Energy statistic multi-sample test (Székely and Rizzo, 2004). Compares two or more numeric datasets based on inter-point distances. Equivalent to the Cramer test.

FStest

Modified/multiscale/aggregated FS test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on a Fisher test for the independence of a clustering of the data and the true dataset membership.

KMD

Kernel measure of multi-sample dissimilarity (KMD) by Huang and Sen (2023). Uses the association between the features and the sample membership to quantify the dissimilarity of the distributions of two or more numeric datasets.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

MW

Nonparametric graph-based LP (GLP) multi-sample test proposed by Mukhopadhyay and Wang (2020). Compares two or more numeric datasets based on learning an LP graph kernel using a pre-specified number of LP components and performing clustering on the eigenvectors of the Laplacian matrix for this learned kernel.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

RItest

Modified/multiscale/aggregated RI test (Paul et al., 2021). Compares two or more datasets in the high dimension low sample size (HDLSS) setting based on the Rand index of a clustering of the data and the true dataset membership.

SC

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

Methods for two categorical datasets without target variables

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

CCS_cat

Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test FR_cat.

CF_cat

Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test FR_cat.

CMDistance

Constrained Minimum (CM) distance (Tatti, 2007). Compares two categorical datasets using the distance of summaries.

FR_cat

The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979), applied to categorical data. Compares two datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.

HMN

Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

YMRZL

Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.

ZC_cat

Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).

Methods for two categorical datasets with target variables

GGRLCat

Decision-tree based measure of dataset distance and two-sample test (Ganti et al., 2002). Compares the proportions of datapoints of the two datasets falling into each section of the intersection of the partitions induced by fitting a decision tree on each dataset.

OTDD

Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.

Methods for more than two categorical datasets without target variables

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

MMCM

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

Methods for two datasets with both categorical and numeric variables but without target variables

BMG (in case of no ties, appropriate distance function has to be specified)

The Biswas, Mukhopadhyay and Ghosh (2014) distribution-free two-sample runs test. Compares two numeric datasets using the Shortest Hamiltonian Path in the pooled sample.

BQS (in case of no ties, appropriate distance function has to be specified)

The nearest-neighbor-based multivariate two-sample test of Barakat et al. (1996). Modifies the Schilling-Henze nearest neighbor tests (SH) such that the number of nearest neighbors does not have to be chosen.

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

CCS (in case of no ties, appropriate distance function has to be specified)

Weighted edge-count two-sample test for multivariate data proposed by Chen, Chen and Su (2018). The test is intended for comparing two samples with unequal sample sizes. It is a modification of the graph-based Friedman-Rafsky test FR.

CF (in case of no ties, appropriate distance function has to be specified)

Generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). The test is intended for better simultaneous detection of shift and scale alternatives. It is a modification of the graph-based Friedman-Rafsky test FR.

FR (in case of no ties, appropriate distance function has to be specified)

The Friedman-Rafsky two-sample test (original edge-count test) for multivariate data (Friedman and Rafsky, 1979). Compares two numeric datasets using the number of edges connecting points from different samples in a similarity graph (e.g. MST) on the pooled sample.

HMN

Random-forest based two-sample test by Hediger et al. (2021). Uses the (OOB) classification error of a random forest that is trained to distinguish between two datasets. Can also be used with categorical data.

MMCM (in case of no ties, appropriate distance function has to be specified)

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie (in case of no ties, appropriate distance function has to be specified)

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

Rosenbaum (in case of no ties, appropriate distance function has to be specified)

Rosenbaum (2005) two-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Petrie and to the MMCM test.

SC (in case of no ties, appropriate distance function has to be specified)

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

SH (in case of no ties, appropriate distance function has to be specified)

Schilling-Henze nearest neighbor test (Schilling, 1986; Henze, 1988). Uses the number of edges connecting points from different samples in a K-nearest neighbor graph on the pooled sample.

YMRZL

Tree-based test of Yu et al. (2007). Uses the classification error of a decision tree trained to distinguish between two datasets. Can also be used with categorical data.

ZC (in case of no ties, appropriate distance function has to be specified)

Max-type edge-count test (Zhang and Chen, 2019). Enhancement of the Friedman-Rafsky test (original edge-count test, FR) that aims at detecting both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017) (CF).

Methods for two datasets with both categorical and numeric variables and target variables

OTDD (appropriate distance function has to be specified)

Optimal transport dataset distance (OTDD) (Alvarez-Melis and Fusi, 2020). The distance combines the distance between features and the distance between label distributions.

Methods for more than two datasets with both categorical and numeric variables but without target variables

C2ST

Classifier Two-Sample Test (C2ST) of Lopez-Paz and Oquab (2017). Can also be used for multiple samples and for categorical data. Uses the classification accuracy of a classifier trained to distinguish between the datasets.

MMCM (in case of no ties, appropriate distance function has to be specified)

Multisample Mahalanobis crossmatch (MMCM) test (Mukherjee et al., 2022). Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the Petrie test.

Petrie (in case of no ties, appropriate distance function has to be specified)

Petrie (2016) multi-sample cross-match test. Uses the optimal non-bipartite matching for comparing two or more numeric or categorical samples. In the two-sample case equivalent to the Rosenbaum and to the MMCM test.

SC (in case of no ties, appropriate distance function has to be specified)

Graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). Uses the within- and between-sample edge counts in a similarity graph to compare two or more numeric datasets.

References

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistical Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

method.table, findSimilarityMethod

Examples

# Workflow for using the DataSimilarity package: 
# Prepare data example: comparing species in iris dataset
data("iris")
iris.split <- split(iris[, -5], iris$Species)
setosa <- iris.split$setosa
versicolor <- iris.split$versicolor
virginica <- iris.split$virginica

# 1. Find appropriate methods that can be used to compare 3 numeric datasets:
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE)

# get more information 
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE, only.names = FALSE)

# 2. Choose a method and apply it:
# All suitable methods
possible.methods <- findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE, 
                                          only.names = FALSE)
# Select, e.g., method with highest number of fulfilled criteria
possible.methods$Implementation[which.max(possible.methods$Number.Fulfilled)]

set.seed(1234)
if(requireNamespace("KMD")) {
  DataSimilarity(setosa, versicolor, virginica, method = "KMD")
}

# or directly 
set.seed(1234)
if(requireNamespace("KMD")) {
  KMD(setosa, versicolor, virginica)
}

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.