CF_cat: Generalized Edge-Count Test for Discrete Data

CF_cat    R Documentation

Description

Performs the generalized edge-count two-sample test for multivariate data proposed by Chen and Friedman (2017), applied here to discrete data. This implementation relies on g.tests from the gTests package.

Usage

CF_cat(X1, X2, dist.fun, agg.type, graph.type = "mstree", K = 1, n.perm = 0, 
        seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

dist.fun

Function for calculating a distance matrix on the pooled dataset.

agg.type

Character giving the method for aggregating over possible similarity graphs. Options are "u" for union of possible similarity graphs and "a" for averaging over test statistics calculated on possible similarity graphs.

graph.type

Character specifying which similarity graph to use. Possible options are "mstree" (default, Minimum Spanning Tree) and "nnlink" (Nearest Neighbor Graph).

K

Parameter for graph (default: 1). If graph.type = "mstree", a K-MST is constructed (K=1 is the classical MST). If graph.type = "nnlink", K gives the number of neighbors considered in the K-NN graph.

n.perm

Number of permutations for permutation test (default: 0, asymptotic test is performed).

seed

Random seed (default: 42)
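
To illustrate what a distance matrix on the pooled dataset looks like, the following minimal base R sketch applies a pairwise Hamming distance (the same kind of function used in the Examples section) to every pair of pooled rows. How CF_cat applies dist.fun internally is not shown on this page, so the pairwise application, the helper name hamming, and the toy matrices are assumptions made purely for illustration.

```r
# Pairwise distance between two observations: number of differing categories
# (hypothetical helper name, mirrors the function used in the Examples section)
hamming <- function(x, y) sum(x != y)

# Two tiny categorical samples (illustrative values only)
X1 <- matrix(c(1, 2, 3,
               1, 2, 4), ncol = 3, byrow = TRUE)
X2 <- matrix(c(4, 4, 1,
               1, 2, 3), ncol = 3, byrow = TRUE)

# Pool the samples and evaluate the distance for every pair of rows
pooled <- rbind(X1, X2)
n <- nrow(pooled)
D <- outer(seq_len(n), seq_len(n),
           Vectorize(function(i, j) hamming(pooled[i, ], pooled[j, ])))
D
```

Rows 1 and 4 of the pooled data are identical, so their distance is zero. Such repeated observations are typical for discrete data.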

Details

The test is an enhancement of the Friedman-Rafsky test (original edge-count test) that aims at detecting both location and scale alternatives. The test statistic is given as

S = (R_1 - \mu_1, R_2 - \mu_2) \Sigma^{-1} \begin{pmatrix} R_1 - \mu_1 \\ R_2 - \mu_2 \end{pmatrix},

where R_1 and R_2 denote the number of edges in the similarity graph connecting points within the first and second sample X_1 and X_2, respectively, \mu_1 = \text{E}_{H_0}(R_1), \mu_2 = \text{E}_{H_0}(R_2), and \Sigma is the covariance matrix of R_1 and R_2 under the null.
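
As a worked illustration of the statistic, the sketch below computes S on a toy graph in base R. It is not the algorithm used by g.tests. It assumes a path graph as a stand-in for the similarity graph, and it obtains \mu_1, \mu_2 and \Sigma by enumerating every relabeling of the pooled points (the exact permutation null), which is only feasible for very small samples. All names (edges, count_edges, null_R) are hypothetical.

```r
# Toy setup (all values are illustrative assumptions)
N  <- 10                          # pooled sample size
n1 <- 5                           # size of the first sample
edges <- cbind(1:(N - 1), 2:N)    # path graph standing in for the similarity graph

# Count within-sample edges given the pooled indices of the first sample
count_edges <- function(idx1) {
  in1 <- seq_len(N) %in% idx1
  a <- in1[edges[, 1]]
  b <- in1[edges[, 2]]
  c(R1 = sum(a & b), R2 = sum(!a & !b))
}

# Exact permutation null: every way of assigning n1 points to the first sample
null_R   <- t(combn(N, n1, FUN = count_edges))  # one row per relabeling
mu       <- colMeans(null_R)                    # (mu_1, mu_2)
centered <- sweep(null_R, 2, mu)
Sigma    <- crossprod(centered) / nrow(null_R)  # covariance of (R_1, R_2) under H_0

# Observed labeling: the first n1 pooled points form the first sample
R_obs <- count_edges(1:n1)
d     <- R_obs - mu
S     <- drop(t(d) %*% solve(Sigma, d))
S
```

In this toy labeling both samples occupy contiguous stretches of the path, so both R_1 and R_2 exceed their null expectations and S is positive.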

For discrete data, the similarity graph used in the test is not necessarily unique. This can be solved by either taking a union of all optimal similarity graphs or averaging the test statistics over all optimal similarity graphs. For details, see Zhang and Chen (2022).
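
The non-uniqueness can be seen in a small base R sketch. With categorical data, several points are often at exactly the same distance from a given point, so its nearest neighbor, and hence the similarity graph, is not uniquely determined. The data and the helper name hamming are illustrative assumptions. The sketch only exposes the ties and does not implement the union or averaging approach.

```r
# Four categorical observations; rows 1 and 4 are identical (illustrative values)
X <- rbind(c(1, 1),
           c(1, 2),
           c(2, 1),
           c(1, 1))

# Hamming distance between two observations (hypothetical helper name)
hamming <- function(x, y) sum(x != y)

# Pairwise distance matrix; the diagonal is set to Inf to exclude self-matches
n <- nrow(X)
D <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D[i, j] <- hamming(X[i, ], X[j, ])
  }
}
diag(D) <- Inf

# All candidates attaining the minimal distance for each observation
nearest <- lapply(seq_len(n), function(i) which(D[i, ] == min(D[i, ])))
nearest[[2]]  # observation 2 is equally close to observations 1 and 4
```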

High values of the test statistic indicate dissimilarity of the datasets: a high number of edges connecting points within the same sample means that points are more similar within the datasets than between the datasets.

For n.perm = 0, an asymptotic test using the asymptotic normal approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
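
The two kinds of p values can be sketched as follows. The observed statistic and the permuted statistics are made-up stand-ins, not output of CF_cat. The asymptotic p value uses the \chi^2 reference distribution with 2 degrees of freedom that results for the quadratic form S when (R_1, R_2) is approximately bivariate normal. The permutation p value is taken here as the plain proportion of permuted statistics at least as large as the observed one, which is an assumption about the exact formula used by g.tests.

```r
set.seed(42)

# Illustrative observed value of the test statistic (made up)
S_obs <- 7.5

# Asymptotic p value: upper tail of the chi-squared distribution with 2 df
p_asym <- pchisq(S_obs, df = 2, lower.tail = FALSE)

# Stand-in for statistics recomputed on permuted sample labels;
# in a real permutation test these would come from re-running the test
S_perm <- rchisq(999, df = 2)

# Permutation p value: share of permuted statistics at least as large as S_obs
p_perm <- mean(S_perm >= S_obs)

c(asymptotic = p_asym, permutation = p_perm)
```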

This implementation is a wrapper around the function g.tests that modifies the input and output of that function to match the other functions provided in this package. For more details, see g.tests.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

parameter

Degrees of freedom for \chi^2 distribution under H_0 (only for asymptotic test)

p.value

Asymptotic or permutation p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names
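
The components listed above are accessed like list elements. The sketch below builds a stand-in htest object by hand so that it runs without any package. Every value in it, including the names attached to statistic and parameter, is a placeholder and not actual CF_cat output.

```r
# Hand-built stand-in for a test result (all values are placeholders)
res <- structure(
  list(
    statistic   = c(S = 7),              # placeholder test statistic
    parameter   = c(df = 2),             # placeholder degrees of freedom
    p.value     = 0.03,                  # placeholder p value
    alternative = "placeholder alternative",
    method      = "placeholder method description",
    data.name   = "X1 and X2"
  ),
  class = "htest"
)

# Access individual components
res$statistic
res$p.value

# Example decision at a 5% significance level
reject <- res$p.value < 0.05
reject
```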

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 No         Yes            No

References

Chen, H. and Friedman, J.H. (2017). A New Graph-Based Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 112(517), 397-409. doi:10.1080/01621459.2016.1147356

Zhang, J. and Chen, H. (2022). Graph-Based Two-Sample Tests for Data with Repeated Observations. Statistica Sinica, 32, 391-415. doi:10.5705/ss.202019.0116

Chen, H., and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

FR_cat for the original edge-count test, CCS_cat for the weighted edge-count test, ZC_cat for the maxtype edge-count test, gTests_cat for performing all these edge-count tests at once, CCS, FR, CF, ZC, and gTests for versions of the tests for continuous data, and SH for performing the Schilling-Henze nearest neighbor test.

Examples

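# Draw some categorical data: 100 observations of 3 variables with 4 categories each
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
# Second sample: categories drawn with unequal probabilities proportional to 1:4
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
# Perform generalized edge-count test using the number of differing entries
# as distance and averaging the test statistics over the similarity graphs
if(requireNamespace("gTests", quietly = TRUE)) {
  CF_cat(X1cat, X2cat, dist.fun = function(x, y) sum(x != y), agg.type = "a")
}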

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.