gTestsMulti: Graph-Based Multi-Sample Test
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

Description

Performs both graph-based multi-sample tests for high-dimensional data proposed by Song and Chen (2022). The implementation uses the gtestsmulti function from the gTestsMulti package. This function is intended to be used, e.g., in comparison studies where both tests need to be calculated at the same time. Since large parts of the calculation coincide, using this function should be faster than computing both statistics individually.

Usage

gTestsMulti(X1, X2, ..., n.perm = 0, dist.fun = stats::dist, graph.fun = MST, 
              dist.args = NULL, graph.args = NULL, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`...`	Optionally more datasets as matrices or data.frames
`n.perm`	Number of permutations for permutation test (default: 0, no permutation test performed)
`dist.fun`	Function for calculating a distance matrix on the pooled dataset (default: `stats::dist`, Euclidean distance).
`graph.fun`	Function for calculating a similarity graph using the distance matrix on the pooled sample (default: `MST`, Minimum Spanning Tree).
`dist.args`	Named list of further arguments passed to `dist.fun` (default: `NULL`).
`graph.args`	Named list of further arguments passed to `graph.fun` (default: `NULL`).
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.
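
The following minimal sketch illustrates how `dist.fun` and `dist.args` might be combined. It assumes that the named list is forwarded to the distance function roughly like a `do.call` on the pooled sample; this forwarding mechanism is an assumption made for illustration, not a statement about the package internals. The `method` argument is the one documented for `stats::dist`.

```r
set.seed(42)
X1 <- matrix(rnorm(50), ncol = 5)
X2 <- matrix(rnorm(50, mean = 1), ncol = 5)

# Pooled sample on which the distance matrix is calculated
pooled <- rbind(X1, X2)

# Further arguments for stats::dist: Manhattan instead of Euclidean distance
dist.args <- list(method = "manhattan")

# Assumed forwarding mechanism (illustration only)
D <- do.call(stats::dist, c(list(pooled), dist.args))

# Corresponding call of the test, run only if both packages are installed
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTestsMulti", quietly = TRUE)) {
  DataSimilarity::gTestsMulti(X1, X2, dist.args = dist.args)
}
```

Further arguments for a user-supplied `graph.fun` would be passed analogously via `graph.args`.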

Details

Two multi-sample test statistics are defined by Song and Chen (2022) based on a similarity graph. The first one is defined as

S = S_W + S_B, \text{ where}

S_W = (R_W - \text{E}(R_W))^T \Sigma_W^{-1} (R_W - \text{E}(R_W)),

S_B = (R_B - \text{E}(R_B))^T \Sigma_B^{-1} (R_B - \text{E}(R_B)),

with R_W denoting the vector of within-sample edge counts, R_B the vector of between-sample edge counts, and \Sigma_W and \Sigma_B their respective covariance matrices. Expectations and covariance matrices are calculated under the null hypothesis.

The second statistic is defined as

S_A = (R_A - \text{E}(R_A))^T \Sigma_A^{-1} (R_A - \text{E}(R_A)),

where R_A is the vector of all linearly independent edge counts, i.e. the edge counts for all pairs of samples except the last pair k-1 and k, and \Sigma_A is its covariance matrix under the null hypothesis.
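
To make the edge-count vectors concrete, the following base R sketch builds a minimum spanning tree on a pooled sample of three datasets and counts the within- and between-sample edges. It is an illustration only and not the code used by gtestsmulti: the helper `prim_mst` is a hypothetical stand-in for the graph construction, and the ordering of the entries in `R_B` and `R_A` is an assumption based on the description above.

```r
set.seed(1)
X1 <- matrix(rnorm(40), ncol = 2)            # sample 1 (20 points)
X2 <- matrix(rnorm(40, mean = 1), ncol = 2)  # sample 2
X3 <- matrix(rnorm(40, mean = 2), ncol = 2)  # sample 3
pooled <- rbind(X1, X2, X3)
labels <- rep(1:3, times = c(nrow(X1), nrow(X2), nrow(X3)))
k <- 3
D <- as.matrix(stats::dist(pooled))

# Hypothetical helper: Prim's algorithm, returns an (N - 1) x 2 edge matrix
prim_mst <- function(D) {
  n <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best_dist <- D[1, ]        # cheapest known connection of each node to the tree
  best_from <- rep(1L, n)    # tree node realizing that connection
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (i in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[i, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- !in_tree & D[j, ] < best_dist
    best_dist[closer] <- D[j, closer]
    best_from[closer] <- j
  }
  edges
}

edges <- prim_mst(D)

# k x k matrix of edge counts, indexed by the sample labels of both endpoints
counts <- matrix(0L, k, k)
for (e in seq_len(nrow(edges))) {
  a <- min(labels[edges[e, ]])
  b <- max(labels[edges[e, ]])
  counts[a, b] <- counts[a, b] + 1L
}

R_W <- diag(counts)                  # within-sample edge counts
R_B <- counts[upper.tri(counts)]     # between-sample counts: (1,2), (1,3), (2,3)
R_A <- c(R_W, R_B[-length(R_B)])     # all counts except the last pair (k-1, k)
```

Because the tree on N pooled observations has N - 1 edges, the entries of `R_W` and `R_B` sum to N - 1, which is why one of the counts is linearly dependent on the others and is left out of `R_A`.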

This implementation is a wrapper around the function gtestsmulti that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of gtestsmulti.

Value

A list with the following components:

`statistic`	Observed value of the test statistic
`p.value`	Permutation p value (only if `n.perm` > 0)
`estimate`	Estimated KMD value
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	Yes

References

Song, H., Chen, H. (2022). New graph-based multi-sample tests for high-dimensional and non-Euclidean data. arXiv:2205.13787, doi:10.48550/arXiv.2205.13787

Song, H., Chen, H. (2023). gTestsMulti: New Graph-Based Multi-Sample Tests. R package version 0.1.1, https://CRAN.R-project.org/package=gTestsMulti.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Song and Chen tests
if(requireNamespace("gTestsMulti", quietly = TRUE)) {
  gTestsMulti(X1, X2, n.perm = 100)
}
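
Since the test accepts more than two datasets via `...`, a call with three samples might look like the following sketch. The third dataset `X3` and the object name `res` are introduced here for illustration only; the components accessed are those listed in the Value section.

```r
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)  # illustrative third sample

res <- NULL
# Run only if both required packages are installed
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTestsMulti", quietly = TRUE)) {
  res <- DataSimilarity::gTestsMulti(X1, X2, X3, n.perm = 100, seed = 1234)
  print(res$statistic)  # observed value of the test statistic
  print(res$p.value)    # permutation p value, available because n.perm > 0
}
```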

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.