DISCOB: Distance Components (DISCO) Tests

View source: R/DISCO.R

DISCOB R Documentation

Distance Components (DISCO) Tests

Description

Performs Energy statistics distance components (DISCO) multi-sample tests (Rizzo and Székely, 2010). The implementation here uses the disco implementation from the energy package.

Usage

DISCOB(X1, X2, ..., n.perm = 0, alpha = 1, seed = NULL)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Further datasets as matrices or data.frames

n.perm

Number of permutations for Bootstrap test (default: 0, no Bootstrap test performed)

alpha

Power of the distance used for generalized Energy statistic (default: 1). Has to lie in (0,2]. For values in (0, 2), consistency of the resulting test has been shown. Rizzo and Székely (2010) recommend larger values for (close to) normal data and smaller values for heavy-tailed distributions. In Stolte et al. (2026), good performance for alpha = 0.5 was observed across scenarios but performance generally did not differ much depending on alpha.

seed

Random seed (default: NULL). A random seed will only be set if one is provided.

Details

DISCO is a method for multi-sample testing based on all pairwise between-sample distances. It is analogous to the classical ANOVA. Instead of decomposing squared differences from the sample mean, the total dispersion (generalized Energy statistic) is decomposed into distance components (DISCO) consisting of the within-sample and between-sample measures of dispersion.

DISCOB computes the between-sample DISCO statistic, i.e., the between-sample component of this decomposition.
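
The decomposition can be illustrated with a minimal base R sketch. It follows the definitions of Rizzo and Székely (2010) as understood here: with g(A, B) denoting the mean of all pairwise Euclidean distances, raised to the power alpha, between rows of A and B, the total dispersion is N/2 * g(pooled, pooled), the within-sample component sums n_j/2 * g(A_j, A_j) over samples, and the between-sample component sums n_j * n_k / (2N) * (2 g(A_j, A_k) - g(A_j, A_j) - g(A_k, A_k)) over all pairs of samples. The helper names g_alpha and disco_components are hypothetical and are not part of this package or of energy; that disco uses exactly this scaling internally is an assumption.

```r
# Hypothetical helper: mean of |a_i - b_m|^alpha over all row pairs of A and B
g_alpha <- function(A, B, alpha = 1) {
  n1 <- nrow(A)
  n2 <- nrow(B)
  # dist() on the stacked rows yields all pairwise Euclidean distances
  D <- as.matrix(dist(rbind(A, B)))
  mean(D[seq_len(n1), n1 + seq_len(n2), drop = FALSE]^alpha)
}

# Hypothetical helper: total, within- and between-sample dispersion
disco_components <- function(samples, alpha = 1) {
  sizes <- vapply(samples, nrow, integer(1), USE.NAMES = FALSE)
  N <- sum(sizes)
  K <- length(samples)
  pooled <- do.call(rbind, samples)

  total <- N / 2 * g_alpha(pooled, pooled, alpha)
  within <- sum(vapply(seq_len(K), function(j) {
    sizes[j] / 2 * g_alpha(samples[[j]], samples[[j]], alpha)
  }, numeric(1)))

  # Sum the weighted pairwise energy distances over all pairs j < k
  between <- 0
  for (j in seq_len(K - 1)) {
    for (k in (j + 1):K) {
      e_jk <- 2 * g_alpha(samples[[j]], samples[[k]], alpha) -
        g_alpha(samples[[j]], samples[[j]], alpha) -
        g_alpha(samples[[k]], samples[[k]], alpha)
      between <- between + sizes[j] * sizes[k] / (2 * N) * e_jk
    }
  }
  c(total = total, within = within, between = between)
}

set.seed(1)
samples <- list(
  matrix(rnorm(60), ncol = 3),
  matrix(rnorm(90, mean = 0.5), ncol = 3),
  matrix(rnorm(45, sd = 2), ncol = 3)
)
comp <- disco_components(samples)
# The between- and within-sample components add up to the total dispersion
all.equal(comp[["total"]], comp[["within"]] + comp[["between"]])
```

The sketch recomputes distances for every pair of samples for readability; an efficient implementation would compute the pooled distance matrix once.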

Small values of the statistic indicate similarity of the datasets. Therefore, the null hypothesis of equal distributions is rejected for large values of the statistic.
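
How "large" is judged can be sketched with a label-permutation test: the between-sample component is recomputed after randomly reassigning observations to datasets, and the p value is the proportion of permuted statistics that are at least as large as the observed one. The following base R code is an illustration under that assumption only. The names between_stat and perm_pvalue are hypothetical, and the resampling scheme actually used by disco when n.perm > 0 may differ in detail.

```r
# Hypothetical helper: between-sample component as total minus within,
# computed from a matrix D of pairwise distances (already raised to alpha)
between_stat <- function(D, labels) {
  N <- nrow(D)
  total <- sum(D) / (2 * N)
  groups <- split(seq_len(N), labels)
  within <- sum(vapply(groups, function(idx) {
    sum(D[idx, idx, drop = FALSE]) / (2 * length(idx))
  }, numeric(1)))
  total - within
}

# Hypothetical helper: permutation p value for the between-sample component
perm_pvalue <- function(pooled, labels, n_perm = 99, alpha = 1) {
  # Distances are computed once; only the labels are shuffled
  D <- as.matrix(dist(pooled))^alpha
  observed <- between_stat(D, labels)
  permuted <- replicate(n_perm, between_stat(D, sample(labels)))
  # Count the observed value itself so the p value is never exactly zero
  (1 + sum(permuted >= observed)) / (n_perm + 1)
}

set.seed(1)
X1 <- matrix(rnorm(200), ncol = 5)
X2 <- matrix(rnorm(200, mean = 2), ncol = 5)
pooled <- rbind(X1, X2)
labels <- rep(c(1, 2), times = c(nrow(X1), nrow(X2)))
p_shift <- perm_pvalue(pooled, labels)
p_shift
```

With a clear mean shift between the two datasets, the observed statistic should exceed most or all permuted values, so the p value is small.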

This implementation is a wrapper around the function disco that modifies the in- and output of that function to match the other functions provided in this package. For more details, see the documentation of disco.

Value

An object of class htest with the following components:

call

The function call

statistic

Observed value of the test statistic

p.value

Bootstrap p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             Yes

Note

Because this method cannot handle missing data, any missing values are removed automatically and a warning is issued.

References

Székely, G. J. and Rizzo, M. L. (2004). Testing for Equal Distributions in High Dimension, InterStat, November (5).

Rizzo, M. L. and Székely, G. J. (2010). DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics, 4(2), 1034-1055. doi:10.1214/09-AOAS245

Székely, G. J. (2000). Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

Rizzo, M., Székely, G. (2022). energy: E-Statistics: Multivariate Inference via the Energy of Data. R package version 1.7-11, https://CRAN.R-project.org/package=energy.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Stolte, M., Rahnenführer, J., Bommert, A. (2026). An Empirical Comparison of Methods for Quantifying the Similarity of Numeric Datasets. doi:10.48550/arXiv.2604.12327

See Also

DISCOF, Energy

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform DISCO tests
if(requireNamespace("energy", quietly = TRUE)) {
  DISCOB(X1, X2, n.perm = 100)
}
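
Because further datasets can be passed via ..., a comparison of more than two datasets might look as follows. This is a sketch based only on the usage shown on this page: the third dataset X3, the choice alpha = 0.5 and the seed value 42 are arbitrary illustrations, and the code assumes that DISCOB is available from the DataSimilarity package.

```r
set.seed(1234)
# Three datasets: standard normal, shifted mean, larger spread
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
X3 <- matrix(rnorm(1000, sd = 2), ncol = 10)
# Run only if both packages are installed
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("energy", quietly = TRUE)) {
  res <- DataSimilarity::DISCOB(X1, X2, X3, n.perm = 100, alpha = 0.5, seed = 42)
  # Components documented in the Value section
  print(res$statistic)
  print(res$p.value)
}
```

Passing seed makes the resampling reproducible without relying on an earlier call to set.seed.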

DataSimilarity documentation built on May 15, 2026, 9:07 a.m.