BG: Biau and Gyorfi (2005) two-sample homogeneity test

BG R Documentation

Biau and Gyorfi (2005) two-sample homogeneity test

Description

The function implements the Biau and Gyorfi (2005) two-sample homogeneity test. This test uses the L_1-distance between two empirical distribution functions restricted to a finite partition.

Usage

BG(X1, X2, partition = rectPartition, exponent = 0.8, eps = 0.01, seed = 42, ...)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame of the same sample size as X1

partition

Function that creates a finite partition for the subspace spanned by the two datasets (default: rectPartition, see Details)

exponent

Exponent used in the partition function; should be between 0 and 1 (default: 0.8)

eps

Small threshold to guarantee edge points are included (default: 0.01)

seed

Random seed (default: 42)

...

Further arguments to be passed to the partition function

Details

The Biau and Gyorfi (2005) two-sample homogeneity test is defined for two datasets of the same sample size.

By default, a rectangular partition (rectPartition) is calculated under the assumption of approximately equal cell probabilities. The exponent argument sets the number of elements of the partition m_n according to the convergence criteria in Biau and Gyorfi (2005); by default, m_n = n^{0.8}. For each of the p variables of the datasets, m_n^{1/p} + 1 cutpoints are created along the range of both datasets to define the partition, with at least three cutpoints per variable (the minimum, the maximum, and one point splitting the data into two bins).
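The cutpoint rule described above can be sketched in a few lines of R. This is an illustration only, not the code of rectPartition: the helper name sketch_cutpoints is hypothetical, rounding m_n^{1/p} up with ceiling() is an assumption, and so is widening the pooled range by eps on both sides.

```r
# Hypothetical helper illustrating the cutpoint rule; NOT the package's rectPartition.
sketch_cutpoints <- function(X1, X2, exponent = 0.8, eps = 0.01) {
  pooled <- rbind(as.matrix(X1), as.matrix(X2))
  n <- nrow(X1)                 # common sample size of both datasets
  p <- ncol(pooled)             # number of variables
  m_n <- n^exponent             # target number of cells, e.g. n^0.8
  # m_n^(1/p) + 1 cutpoints per variable, but never fewer than three
  # (assumption: m_n^(1/p) is rounded up)
  n_cut <- max(3, ceiling(m_n^(1 / p)) + 1)
  lapply(seq_len(p), function(j) {
    # assumption: eps widens the pooled range so edge points fall inside a cell
    seq(min(pooled[, j]) - eps, max(pooled[, j]) + eps, length.out = n_cut)
  })
}

set.seed(1)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
cuts <- sketch_cutpoints(X1, X2)
lengths(cuts)   # number of cutpoints per variable
```

For n = 100 and p = 10, n^{0.8} is about 39.8 and its tenth root is about 1.45, so under these assumptions the minimum of three cutpoints per variable applies.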

The test statistic is the L_1-distance between the vectors of the proportions of points falling into each cell of the partition for each dataset. An asymptotic test is performed using a standardized version of the L_1-distance that is approximately standard normally distributed under the null hypothesis (Corollary to Theorem 2 in Biau and Gyorfi (2005)). Low values of the test statistic indicate similarity, so the test rejects for large values of the test statistic.
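A minimal sketch of this computation follows. It is not the package implementation: the function name sketch_bg is hypothetical, m_n is taken to be the total number of cells of the partition (an assumption), and the centering term 2/sqrt(pi) * sqrt(m_n/n) and variance 2 * (1 - 2/pi) are the constants assumed here for the standardization from Biau and Gyorfi (2005); they should be checked against the paper and the source in R/BG.R.

```r
# Hypothetical sketch of the L_1 statistic and its standardization;
# NOT the internal code of BG().
sketch_bg <- function(X1, X2, cuts) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  n <- nrow(X1)
  # Label each observation by the cell it falls into (one bin index per variable)
  cell_id <- function(X) {
    bins <- lapply(seq_len(ncol(X)), function(j) {
      findInterval(X[, j], cuts[[j]], rightmost.closed = TRUE)
    })
    do.call(paste, c(bins, sep = "-"))
  }
  id1 <- cell_id(X1)
  id2 <- cell_id(X2)
  # Empty cells contribute zero, so only occupied cells need to be counted
  cells <- union(id1, id2)
  mu <- table(factor(id1, levels = cells)) / n
  nu <- table(factor(id2, levels = cells)) / n
  L1 <- sum(abs(mu - nu))
  m_n <- prod(lengths(cuts) - 1)       # assumption: total number of cells
  c1 <- 2 / sqrt(pi)                   # assumed centering constant
  sigma <- sqrt(2 * (1 - 2 / pi))      # assumed asymptotic standard deviation
  z <- sqrt(n) * (L1 - c1 * sqrt(m_n / n)) / sigma
  # One-sided test: large standardized values speak against homogeneity
  list(L1 = L1, statistic = z, p.value = pnorm(z, lower.tail = FALSE))
}

set.seed(1)
X1 <- matrix(rnorm(400), ncol = 2)
X2 <- matrix(rnorm(400, mean = 0.5), ncol = 2)
pooled <- rbind(X1, X2)
# Equal-width cutpoints over the pooled range, five bins per variable
cuts <- lapply(1:2, function(j) {
  seq(min(pooled[, j]) - 0.01, max(pooled[, j]) + 0.01, length.out = 6)
})
sketch_bg(X1, X2, cuts)
```

The L_1-distance itself ranges from 0 (identical cell proportions) to 2 (no cell shared by the two datasets); only the standardized value is compared with the standard normal distribution.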

Value

An object of class htest with the following components:

statistic

Observed value of the (asymptotic) test statistic

p.value

p value

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis
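Because the result is a standard htest object, its components can be read with the usual list accessors. The short sketch below assumes the DataSimilarity package is installed and is skipped otherwise; the 5% level used for the decision is an illustrative choice, not a package default.

```r
# Accessing the components of the returned htest object.
res <- NULL
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  set.seed(1)
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  res <- DataSimilarity::BG(X1, X2)
  print(res$statistic)       # standardized test statistic
  print(res$p.value)         # asymptotic p value
  print(res$p.value < 0.05)  # illustrative decision at the 5% level
}
```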

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             No

References

Biau, G. and Gyorfi, L. (2005). On the asymptotic properties of a nonparametric L_1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11), 3965-3973. doi:10.1109/TIT.2005.856979

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

rectPartition

Examples

# Draw some data (seed fixed so the simulated data are reproducible)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform BG test
BG(X1, X2)

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.