ZC_cat: Maxtype Edge-Count Test for Discrete Data
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

ZC_cat

R Documentation

Description

Performs the maxtype edge-count two-sample test for discrete multivariate data proposed by Zhang and Chen (2022). This implementation relies on the g.tests implementation from the gTests package.

Usage

ZC_cat(X1, X2, dist.fun, agg.type, graph.type = "mstree", K = 1, n.perm = 0, 
        maxtype.kappa = 1.14, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`dist.fun`	Function for calculating the distance of two observations. Should take two vectors as its input and return their distance as a scalar value.
`agg.type`	Character giving the method for aggregating over possible similarity graphs. Options are `"u"` for union of possible similarity graphs and `"a"` for averaging over test statistics calculated on possible similarity graphs.
`graph.type`	Character specifying which similarity graph to use. Possible options are `"mstree"` (default, Minimum Spanning Tree) and `"nnlink"` (Nearest Neighbor Graph).
`K`	Parameter for graph (default: 1). If `graph.type = "mstree"`, a `K`-MST is constructed (`K=1` is the classical MST). If `graph.type = "nnlink"`, `K` gives the number of neighbors considered in the `K`-NN graph.
`n.perm`	Number of permutations for permutation test (default: 0, asymptotic test is performed).
`maxtype.kappa`	Parameter `\kappa` of the test (default: 1.14). See details.
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.
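As an illustration of the contract expected of `dist.fun`, here is a minimal sketch of a Hamming-type distance for categorical vectors. The name `hamming_dist` is hypothetical and not part of the package; any function that takes two vectors and returns a scalar distance should be usable in the same way.

```r
# Hypothetical helper (not part of DataSimilarity):
# counts the positions at which two observations differ.
hamming_dist <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x != y)
}

# Two observations that differ only in the second variable
hamming_dist(c(1, 2, 3), c(1, 3, 3))  # 1
```

A function like this could then be passed as `dist.fun = hamming_dist`.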

Details

The test is an enhancement of the Friedman-Rafsky test (original edge-count test) that aims to detect both location and scale alternatives and is more flexible than the generalized edge-count test of Chen and Friedman (2017). The test statistic is the maximum of two statistics. The first statistic is the weighted edge-count statistic multiplied by a factor \kappa. The second statistic is the absolute value of the standardized difference of edge-counts within the first and within the second sample.
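The way the two statistics are combined can be sketched in a few lines. This is only an illustration of the maximum described above, not the package's code: the function `maxtype_stat` and its arguments `z_w` (the weighted edge-count statistic) and `z_diff` (the standardized difference of within-sample edge counts) are hypothetical names, and the input values are made up.

```r
# Hypothetical sketch: combine two already-computed statistics
# into the maxtype statistic max(kappa * z_w, |z_diff|).
maxtype_stat <- function(z_w, z_diff, kappa = 1.14) {
  max(kappa * z_w, abs(z_diff))
}

# Made-up values: here the difference statistic dominates
maxtype_stat(z_w = 1.5, z_diff = -2.0)  # 2

# Made-up values: here the scaled weighted statistic dominates
maxtype_stat(z_w = 2.0, z_diff = 0.5)   # 2.28
```

Larger values of `kappa` give more weight to the weighted edge-count statistic relative to the difference statistic.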

Low values of the test statistic indicate similarity of the datasets. Thus, the null hypothesis of equal distributions is rejected for high values.

For discrete data, the similarity graph used in the test is not necessarily unique, because tied distances allow several graphs to be optimal. This is handled by either taking the union of all optimal similarity graphs (`agg.type = "u"`) or averaging the test statistics over all optimal similarity graphs (`agg.type = "a"`). For details, see Zhang and Chen (2022).
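A small base R sketch may help to show why the similarity graph can fail to be unique. It uses four made-up categorical observations and a Hamming-type distance (an assumption for illustration, not the package's internals) and counts how many pairs attain the smallest distance.

```r
# Four distinct categorical observations on two variables (made-up data)
X <- rbind(c(1, 1), c(1, 2), c(2, 1), c(2, 2))
n <- nrow(X)

# Pairwise Hamming-type distances
D <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D[i, j] <- sum(X[i, ] != X[j, ])
  }
}

# Each pair counted once
d_upper <- D[upper.tri(D)]
n_min_edges <- sum(d_upper == min(d_upper))

n_min_edges  # 4 pairs share the smallest distance
n - 1        # 3 edges are needed for a spanning tree
```

The four tied pairs form a cycle through all observations, so dropping any one of them leaves a spanning tree of the same total length. There are therefore several minimum spanning trees, which is the situation the two aggregation options address.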

For n.perm = 0, an asymptotic test using the asymptotic normal approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
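The following sketch requests a permutation test instead of the asymptotic one. It reuses the data-generating code from the Examples section; the values `n.perm = 200` and `seed = 42` are arbitrary choices for illustration, and the call is only run if both DataSimilarity and gTests are installed.

```r
set.seed(1234)
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)

res_perm <- NULL
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  # n.perm > 0 switches to the permutation test;
  # seed is passed so that the permutations are reproducible
  res_perm <- DataSimilarity::ZC_cat(X1cat, X2cat,
                                     dist.fun = function(x, y) sum(x != y),
                                     agg.type = "a",
                                     n.perm = 200, seed = 42)
  print(res_perm$p.value)
}
```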

This implementation is a wrapper around the function g.tests that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of g.tests.

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`p.value`	Asymptotic or permutation p value
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names
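The components listed above can be read from the returned object like list elements. The sketch below uses made-up data and the union aggregation (`agg.type = "u"`) with the default asymptotic test; it only runs if both DataSimilarity and gTests are installed.

```r
set.seed(1234)
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)

res <- NULL
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  res <- DataSimilarity::ZC_cat(X1cat, X2cat,
                                dist.fun = function(x, y) sum(x != y),
                                agg.type = "u")
  print(res$statistic)  # observed value of the test statistic
  print(res$p.value)    # asymptotic p value, since n.perm = 0
  print(res$method)     # description of the test
}
```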

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	No	Yes	No

References

Zhang, J. and Chen, H. (2022). Graph-Based Two-Sample Tests for Data with Repeated Observations. Statistica Sinica 32, 391-415, doi:10.5705/ss.202019.0116.

Chen, H., and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298, doi:10.1214/24-SS149.

Examples

set.seed(1234)
# Draw some data
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
# Perform maxtype edge-count test
if(requireNamespace("gTests", quietly = TRUE)) {
  ZC_cat(X1cat, X2cat, dist.fun = function(x, y) sum(x != y), agg.type = "a")
}

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.