gTests_cat: Graph-Based Tests for Discrete Data

View source: R/gTests.R

gTests_cat    R Documentation

Graph-Based Tests for Discrete Data

Description

Performs the edge-count two-sample tests for multivariate categorical data implemented in g.tests from the gTests package. This function is intended for use in settings such as comparison studies, where all four graph-based tests need to be calculated at the same time. Since large parts of the calculation coincide, using this function should be faster than computing all four statistics individually.

Usage

gTests_cat(X1, X2, dist.fun = function(x, y) sum(x != y), graph.type = "mstree", 
            K = 1, n.perm = 0, maxtype.kappa = 1.14, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

dist.fun

Function for calculating a distance matrix on the pooled dataset (default: number of unequal components, i.e. the Hamming distance).

graph.type

Character specifying which similarity graph to use. Possible options are "mstree" (default, Minimum Spanning Tree) and "nnlink" (Nearest Neighbor Graph).

K

Parameter for graph (default: 1). If graph.type = "mstree", a K-MST is constructed (K=1 is the classical MST). If graph.type = "nnlink", K gives the number of neighbors considered in the K-NN graph.

n.perm

Number of permutations for the permutation test (default: 0, in which case the asymptotic test is performed).

maxtype.kappa

Parameter κ of the maxtype test (default: 1.14). See ZC.

seed

Random seed (default: 42)
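
As an illustration of the dist.fun argument, the following base R sketch evaluates the default distance from the Usage section on a tiny pooled dataset. It assumes that dist.fun is applied to pairs of observations (rows), as the signature of the default suggests. The helper pairwise_dist and the alternative prop_mismatch are hypothetical names introduced only for this illustration; they are not part of the package.

```r
# Default distance from the Usage section: number of unequal components
hamming <- function(x, y) sum(x != y)

# Hypothetical alternative: proportion of unequal components
prop_mismatch <- function(x, y) mean(x != y)

# Small pooled dataset of three categorical observations
pooled <- rbind(c(1, 2, 3),
                c(1, 3, 3),
                c(4, 4, 4))

# Illustrative helper (not part of the package): apply a pairwise
# distance function to all pairs of rows
pairwise_dist <- function(data, dist.fun) {
  n <- nrow(data)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      D[i, j] <- dist.fun(data[i, ], data[j, ])
    }
  }
  D
}

D_hamming <- pairwise_dist(pooled, hamming)
D_prop <- pairwise_dist(pooled, prop_mismatch)
print(D_hamming)
```

Any function with the same two-argument form could presumably be passed as dist.fun, for example gTests_cat(X1, X2, dist.fun = prop_mismatch).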

Details

The original, weighted, generalized and maxtype edge-count tests are performed.

For discrete data, the similarity graph used in the test is not necessarily unique. This can be solved by either taking a union ("u") of all optimal similarity graphs or averaging ("a") the test statistics over all optimal similarity graphs. For details, see Zhang and Chen (2022). Both options are performed here.
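
The non-uniqueness described above comes from ties in the pairwise distances. The following base R sketch, using simulated data and the default distance from the Usage section, counts how few distinct distance values occur among all pairs. The variable names are chosen for illustration only.

```r
set.seed(1)

# Two small categorical samples with 3 variables and 4 categories each
X1 <- matrix(sample(1:4, 30, replace = TRUE), ncol = 3)
X2 <- matrix(sample(1:4, 30, replace = TRUE), ncol = 3)
pooled <- rbind(X1, X2)
n <- nrow(pooled)

# Pairwise distances: number of unequal components
D <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D[i, j] <- sum(pooled[i, ] != pooled[j, ])
  }
}

# Distances over all unordered pairs of observations
d <- D[upper.tri(D)]
n_pairs <- length(d)
n_distinct <- length(unique(d))

# With 3 variables, the distance can only take the values 0, 1, 2 or 3,
# so many pairs necessarily share the same distance
print(table(d))
```

Because so many pairs are equidistant, a graph construction that ranks pairs by distance, such as a minimum spanning tree, will typically have several equally good solutions.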

For n.perm = 0, an asymptotic test using the asymptotic normal approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
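
A minimal sketch of the two modes, assuming the DataSimilarity and gTests packages are installed. It uses only arguments listed in the Usage section and components listed in the Value section; the number of permutations (200) is an arbitrary choice for illustration.

```r
set.seed(123)

# Simulated categorical data; the second sample favours higher categories
X1cat <- matrix(sample(1:4, 150, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 150, replace = TRUE, prob = 1:4), ncol = 3)

res_asymp <- NULL
res_perm <- NULL

if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTests", quietly = TRUE)) {
  # n.perm = 0 (default): p values from the asymptotic approximation
  res_asymp <- DataSimilarity::gTests_cat(X1cat, X2cat, n.perm = 0)

  # n.perm > 0: p values from a permutation test
  res_perm <- DataSimilarity::gTests_cat(X1cat, X2cat, n.perm = 200,
                                         seed = 42)

  print(res_asymp$p.value)
  print(res_perm$p.value)
}
```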

This implementation is a wrapper around the function g.tests that modifies the input and output of that function to match the other functions provided in this package. For more details, see g.tests.

Value

A list with the following components:

statistic

Observed values of the test statistics

p.value

Asymptotic or permutation p values

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Applicability

Target variable?    Numeric?    Categorical?    K-sample?
No                  No          Yes             No

References

Friedman, J. H., and Rafsky, L. C. (1979). Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests. The Annals of Statistics, 7(4), 697-717.

Chen, H. and Friedman, J. H. (2017). A New Graph-Based Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 112(517), 397-409. doi:10.1080/01621459.2016.1147356

Chen, H., Chen, X. and Su, Y. (2018). A Weighted Edge-Count Two-Sample Test for Multivariate and Object Data. Journal of the American Statistical Association, 113(523), 1146-1155. doi:10.1080/01621459.2017.1307757

Zhang, J. and Chen, H. (2022). Graph-Based Two-Sample Tests for Data with Repeated Observations. Statistica Sinica, 32, 391-415. doi:10.5705/ss.202019.0116

Chen, H., and Zhang, J. (2017). gTests: Graph-Based Two-Sample Tests. R package version 0.2, https://CRAN.R-project.org/package=gTests.

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

FR_cat for the original edge-count test, CF_cat for the generalized edge-count test, CCS_cat for the weighted edge-count test, and ZC_cat for the maxtype edge-count test; gTests, FR, CF, CCS, and ZC for versions of the tests for continuous data.

Examples

# Draw some data
X1cat <- matrix(sample(1:4, 300, replace = TRUE), ncol = 3)
X2cat <- matrix(sample(1:4, 300, replace = TRUE, prob = 1:4), ncol = 3)
# Perform edge-count tests
if(requireNamespace("gTests", quietly = TRUE)) {
  gTests_cat(X1cat, X2cat)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.