CF: Generalized Edge-Count Test
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

CF	R Documentation

Generalized Edge-Count Test

Description

Performs the generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017). This implementation relies on the g.tests function from the gTests package.

Usage

CF(X1, X2, dist.fun = stats::dist, graph.fun = MST5, n.perm = 0, 
    dist.args = NULL, graph.args = NULL, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`dist.fun`	Function for calculating a distance matrix on the pooled dataset (default: `stats::dist`, Euclidean distance).
`graph.fun`	Function for calculating a similarity graph using the distance matrix on the pooled sample (default: `MST5`, 5-Minimum Spanning Tree).
`n.perm`	Number of permutations for permutation test (default: 0, asymptotic test is performed).
`dist.args`	Named list of further arguments passed to `dist.fun` (default: `NULL`).
`graph.args`	Named list of further arguments passed to `graph.fun` (default: `NULL`).
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.
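
The following is a minimal sketch of how the optional arguments might be combined. It assumes that `dist.args` is forwarded unchanged to `stats::dist`, so that `method = "manhattan"` selects the metric. The data and the object names `dist_args`, `res_asym` and `res_perm` are made up for illustration. The calls only run if both DataSimilarity and gTests are installed.

```r
set.seed(42)
# Two illustrative samples that differ in scale only
X1 <- matrix(rnorm(200), ncol = 4)
X2 <- matrix(rnorm(200, sd = 2), ncol = 4)

# Assumption: these arguments are passed on to stats::dist
dist_args <- list(method = "manhattan")

if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  # Asymptotic test (n.perm = 0) with Manhattan distances and the default graph
  res_asym <- DataSimilarity::CF(X1, X2, dist.args = dist_args)
  # Permutation test with a fixed seed for reproducibility
  res_perm <- DataSimilarity::CF(X1, X2, dist.args = dist_args,
                                 n.perm = 1000, seed = 1)
  print(res_asym$p.value)
  print(res_perm$p.value)
}
```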

Details

The test is an enhancement of the Friedman-Rafsky test (original edge-count test) that aims at detecting both location and scale alternatives. The test statistic is given as

S = (R_1 - \mu_1, R_2 - \mu_2) \Sigma^{-1} \begin{pmatrix} R_1 - \mu_1 \\ R_2 - \mu_2 \end{pmatrix},

where R_1 and R_2 denote the number of edges in the similarity graph connecting points within the first and second sample X_1 and X_2, respectively, \mu_1 = \text{E}_{H_0}(R_1), \mu_2 = \text{E}_{H_0}(R_2) and \Sigma is the covariance matrix of R_1 and R_2 under the null.
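
To make the quantities in the statistic concrete, here is a base R sketch that builds a 1-MST on the pooled sample, counts the within-sample edges R_1 and R_2, and evaluates the quadratic form. It is not how gTests computes the statistic: \mu and \Sigma are estimated here by permuting the sample labels (an assumption made for illustration) instead of using the analytic formulas of Chen and Friedman (2017). The helper names `prim_mst` and `edge_counts` are hypothetical.

```r
# Prim's algorithm on a full distance matrix; returns an (N - 1) x 2 edge list
prim_mst <- function(D) {
  N <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, N - 1))
  edges <- matrix(NA_integer_, nrow = N - 1, ncol = 2)
  for (k in seq_len(N - 1)) {
    # Shortest edge between the current tree and the remaining points
    sub <- D[in_tree, !in_tree, drop = FALSE]
    idx <- arrayInd(which.min(sub), dim(sub))
    edges[k, ] <- c(which(in_tree)[idx[1]], which(!in_tree)[idx[2]])
    in_tree[edges[k, 2]] <- TRUE
  }
  edges
}

# Number of edges joining two points of sample 1 (R1) and of sample 2 (R2)
edge_counts <- function(edges, labels) {
  a <- labels[edges[, 1]]
  b <- labels[edges[, 2]]
  c(R1 = sum(a == 1 & b == 1), R2 = sum(a == 2 & b == 2))
}

set.seed(1)
X1 <- matrix(rnorm(60), ncol = 2)
X2 <- matrix(rnorm(60, mean = 1), ncol = 2)
labels <- rep(1:2, times = c(nrow(X1), nrow(X2)))
edges <- prim_mst(as.matrix(dist(rbind(X1, X2))))

# Observed within-sample edge counts
R <- edge_counts(edges, labels)

# Monte Carlo estimates of mu and Sigma under H0 (labels permuted, graph fixed)
R_perm <- t(replicate(2000, edge_counts(edges, sample(labels))))
mu <- colMeans(R_perm)
Sigma <- cov(R_perm)

# Quadratic form (R - mu)' Sigma^{-1} (R - mu)
S <- drop(t(R - mu) %*% solve(Sigma) %*% (R - mu))
S
```

The quadratic form is the squared Mahalanobis distance of (R_1, R_2) from its null expectation, so `mahalanobis(R, mu, Sigma)` would give the same value.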

High values of the test statistic indicate dissimilarity of the datasets: a high number of edges connecting points within the same sample means that points are more similar within the datasets than between them.

For n.perm = 0, an asymptotic test using the asymptotic \chi^2 approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
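
The two ways of obtaining a p value can be sketched as follows. The observed statistic `S_obs` is a made-up number, and `S_perm` is simulated as a stand-in for statistics recomputed on permuted sample labels. The asymptotic reference distribution is \chi^2 with 2 degrees of freedom, as derived by Chen and Friedman (2017). The permutation p value uses one common convention (the share of permuted statistics at least as large as the observed one), which may differ in detail from what g.tests does.

```r
# Hypothetical observed value of the test statistic, for illustration only
S_obs <- 7.3

# Asymptotic p value: upper tail of the chi-square distribution with 2 df
p_asym <- pchisq(S_obs, df = 2, lower.tail = FALSE)

# Stand-in for statistics recomputed after permuting the sample labels
set.seed(1)
S_perm <- rchisq(999, df = 2)

# Permutation p value: share of permuted statistics >= the observed one
p_perm <- mean(S_perm >= S_obs)

c(asymptotic = p_asym, permutation = p_perm)
```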

This implementation is a wrapper around the function g.tests that adapts the input and output of that function to match the other functions provided in this package. For more details, see the documentation of g.tests.

The chosen defaults are based on the simulation results of Stolte et al. (2026).

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`parameter`	Degrees of freedom for `\chi^2` distribution under `H_0` (only for asymptotic test)
`p.value`	Asymptotic or permutation p value
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	No

Note

Because this method cannot handle missing data, any missing values are removed automatically and a warning is issued.

References

Chen, H. and Friedman, J.H. (2017). A New Graph-Based Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 112(517), 397-409. doi:10.1080/01621459.2016.1147356

Chen, H., and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Stolte, M., Rahnenführer, J., Bommert, A. (2026). An Empirical Comparison of Methods for Quantifying the Similarity of Numeric Datasets. doi:10.48550/arXiv.2604.12327

Stolte, M., Rahnenführer, J., Bommert, A. (2026). An Empirical Comparison of Methods for Quantifying the Similarity of Categorical Datasets. doi:10.48550/arXiv.2604.11458

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform generalized edge-count test
if(requireNamespace("gTests", quietly = TRUE)) {
  # Using 5-MST
  CF(X1, X2)
  # Using 3-MST
  CF(X1, X2, graph.fun = MST, graph.args = list(K = 3))
}

DataSimilarity documentation built on May 15, 2026, 9:07 a.m.