BallDivergence: Ball Divergence based two- or k-sample test

BallDivergence    R Documentation

Ball Divergence based two- or k-sample test

Description

The function implements the multivariate two- or k-sample test of Pan et al. (2018), which is based on the Ball Divergence. It relies on the function bd.test from the Ball package.

Usage

BallDivergence(X1, X2, ..., n.perm = 0, seed = 42, num.threads = 0, 
                kbd.type = "sum", weight = c("constant", "variance"), 
                args.bd.test = NULL)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

n.perm

Number of permutations for permutation test (default: 0, no permutation test performed). Note that for more than two samples, no test is performed.

seed

Random seed (default: 42)

num.threads

Number of threads (default: 0, all available cores are used)

kbd.type

Character specifying which k-sample test statistic will be used. Must be one of "sum" (default), "maxsum", or "max".

weight

Character specifying the weight form of the Ball Divergence test statistic. Must be one of "constant" (default) or "variance".

args.bd.test

Further arguments passed to bd.test as a named list.

Details

For n.perm = 0, the asymptotic test is performed. For n.perm > 0, a permutation test is performed.
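To illustrate what a permutation test does in general, the following base R sketch relabels the pooled observations at random and compares the observed statistic with the permuted ones. The names perm_pvalue_sketch and mean_shift_stat are hypothetical, the statistic is a simple stand-in rather than the Ball Divergence, and the scheme is a generic one that is not taken from the internals of bd.test.

```r
# Generic permutation p value for a two-sample statistic (illustrative sketch).
# perm_pvalue_sketch and mean_shift_stat are hypothetical helpers.
perm_pvalue_sketch <- function(X, Y, stat_fun, n.perm = 100, seed = 42) {
  set.seed(seed)                       # note: this resets the global RNG state
  pooled <- rbind(X, Y)
  n <- nrow(X)
  N <- nrow(pooled)
  observed <- stat_fun(X, Y)
  perm_stats <- replicate(n.perm, {
    idx <- sample(N, n)                # random relabelling of the pooled rows
    stat_fun(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })
  # Reject for large values; the +1 terms count the observed statistic itself
  (1 + sum(perm_stats >= observed)) / (1 + n.perm)
}

# Stand-in statistic: Euclidean distance between the column means
mean_shift_stat <- function(X, Y) sqrt(sum((colMeans(X) - colMeans(Y))^2))

set.seed(1)
X1 <- matrix(rnorm(200), ncol = 2)
X2 <- matrix(rnorm(200, mean = 1), ncol = 2)
p <- perm_pvalue_sketch(X1, X2, mean_shift_stat, n.perm = 99)
p
```

With a fixed seed the permutation p value is reproducible, which is the role the seed argument plays in the function documented here.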

The Ball Divergence is defined as the square of the measure difference over a given closed ball collection. The empirical test performed here is based on the difference between averages of metric ranks. It is robust to outliers and heavy-tailed data and suitable for imbalanced sample sizes.
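For reference, the population quantity described above can be written as follows. This is the definition as we read it in Pan et al. (2018), for two Borel probability measures \mu and \nu on a metric space (V, \rho), where \bar{B}(u, r) denotes the closed ball with center u and radius r; consult the paper for the exact conditions.

```latex
D(\mu, \nu) = \iint_{V \times V}
  \left[ \mu - \nu \right]^2 \bigl( \bar{B}(u, \rho(u, v)) \bigr)
  \, \bigl( \mu(du)\,\mu(dv) + \nu(du)\,\nu(dv) \bigr)
```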

The Ball Divergence of two distributions is zero if and only if the distributions coincide. Therefore, low values of the test statistic indicate similarity and the test rejects for large values of the test statistic.
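The idea can be made concrete with a small base R sketch of the empirical two-sample statistic with constant weight: for every ball whose center and radius are defined by two points of the same sample, compare the proportion of each sample that falls inside the ball. The function ball_divergence_sketch is a hypothetical helper written for illustration, it assumes Euclidean distances, and it is neither optimized nor guaranteed to reproduce the values returned by bd.test.

```r
# Empirical two-sample Ball Divergence, constant weight (illustrative sketch).
# ball_divergence_sketch is a hypothetical helper, not part of Ball or DataSimilarity.
ball_divergence_sketch <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  n <- nrow(X)
  m <- nrow(Y)
  # Euclidean distances between all pooled observations (an assumption)
  dmat <- as.matrix(dist(rbind(X, Y)))
  ix <- seq_len(n)
  iy <- n + seq_len(m)

  # Balls are centered at point i with radius equal to the distance to point j,
  # where i and j come from the same sample `idx`.
  one_side <- function(idx) {
    total <- 0
    for (i in idx) {
      for (j in idx) {
        r <- dmat[i, j]
        prop_x <- mean(dmat[i, ix] <= r)   # share of X inside the closed ball
        prop_y <- mean(dmat[i, iy] <= r)   # share of Y inside the closed ball
        total <- total + (prop_x - prop_y)^2
      }
    }
    total / length(idx)^2
  }
  one_side(ix) + one_side(iy)
}

set.seed(1)
A <- matrix(rnorm(60), ncol = 2)
B <- matrix(rnorm(60, mean = 3), ncol = 2)
bd_same <- ball_divergence_sketch(A, A)   # identical samples
bd_diff <- ball_divergence_sketch(A, B)   # shifted sample
```

Identical samples give a value of zero, while the shifted sample gives a strictly positive value, in line with the rejection rule for large values of the statistic.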

For the k-sample problem, the pairwise Ball Divergences can be summarized in different ways. First, one can simply sum up all pairwise Ball Divergences (kbd.type = "sum"). Next, one can find the sample with the largest difference to the others, i.e. take the maximum of the sums of all Ball Divergences of each sample with all other samples (kbd.type = "maxsum"). Last, one can sum up the largest k-1 pairwise Ball Divergences (kbd.type = "max").
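The three summaries can be sketched in base R as follows, assuming the pairwise Ball Divergences are already available as a symmetric k x k matrix with zero diagonal. The helper combine_kbd and the numbers in the matrix are hypothetical and serve only to show how the three options differ.

```r
# Combine pairwise Ball Divergences into one k-sample statistic (sketch).
# `pairwise` is a symmetric k x k matrix with zero diagonal;
# combine_kbd is a hypothetical helper for illustration only.
combine_kbd <- function(pairwise, kbd.type = c("sum", "maxsum", "max")) {
  kbd.type <- match.arg(kbd.type)
  k <- nrow(pairwise)
  pair_values <- pairwise[upper.tri(pairwise)]   # each pair counted once
  switch(kbd.type,
    sum    = sum(pair_values),
    maxsum = max(rowSums(pairwise)),
    max    = sum(sort(pair_values, decreasing = TRUE)[seq_len(k - 1)])
  )
}

# Made-up pairwise values for k = 4 samples:
# D12 = 1, D13 = 2, D23 = 4, D14 = 3, D24 = 5, D34 = 6
pairwise <- matrix(0, 4, 4)
pairwise[upper.tri(pairwise)] <- c(1, 2, 4, 3, 5, 6)
pairwise <- pairwise + t(pairwise)

combine_kbd(pairwise, "sum")      # 21
combine_kbd(pairwise, "maxsum")   # 14
combine_kbd(pairwise, "max")      # 15
```

For k = 3 the options "maxsum" and "max" coincide, because any two of the three pairs share a sample; they can only differ for four or more samples.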

This implementation is a wrapper around the function bd.test that modifies its input and output to match the other functions provided in this package. For more details see bd.test and bd.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p value (only if n.perm > 0 and for two datasets)

n.perm

Number of permutations for permutation test

size

Number of observations for each dataset

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis
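Since the return value is a list of class htest, its components can be accessed by name. The following sketch assumes that both the DataSimilarity and the Ball package are installed, and it only uses the arguments and components documented on this page.

```r
# Access components of the returned htest object (sketch).
set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("Ball", quietly = TRUE)) {
  res <- DataSimilarity::BallDivergence(X1, X2, n.perm = 100)
  res$statistic   # observed value of the test statistic
  res$p.value     # permutation p value (two datasets, n.perm > 0)
  res$size        # number of observations per dataset
}
```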

Applicability

Target variable?    Numeric?    Categorical?    K-sample?
No                  Yes         No              Yes

References

Pan, W., Tian, Y., Wang, X., Zhang, H. (2018). Ball Divergence: Nonparametric two sample test. Annals of Statistics, 46(3), 1109-1137. doi:10.1214/17-AOS1579

Zhu, J., Pan, W., Zheng, W., Wang, X. (2021). Ball: An R Package for Detecting Distribution Difference and Association in Metric Spaces. Journal of Statistical Software, 97(6). doi:10.18637/jss.v097.i06

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Calculate Ball Divergence and perform test 
if(requireNamespace("Ball", quietly = TRUE)) {
  BallDivergence(X1, X2, n.perm = 100)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.