SC: Graph-Based Multi-Sample Test
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing


SC	R Documentation

Graph-Based Multi-Sample Test

Description

Performs the graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). The implementation relies on the gtestsmulti function from the gTestsMulti package.

Usage

SC(X1, X2, ..., n.perm = 0, dist.fun = stats::dist, graph.fun = MST, 
    dist.args = NULL, graph.args = NULL, type = "S", seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame.
`X2`	Second dataset as matrix or data.frame.
`...`	Optionally more datasets as matrices or data.frames.
`n.perm`	Number of permutations for the permutation test (default: 0, no permutation test performed).
`dist.fun`	Function for calculating a distance matrix on the pooled dataset (default: `stats::dist`, Euclidean distance).
`graph.fun`	Function for calculating a similarity graph using the distance matrix on the pooled sample (default: `MST`, Minimum Spanning Tree).
`dist.args`	Named list of further arguments passed to `dist.fun` (default: `NULL`).
`graph.args`	Named list of further arguments passed to `graph.fun` (default: `NULL`).
`type`	Character specifying the test statistic to use. Possible options are `"S"` (default) and `"SA"`. See Details.
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.
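
The role of `dist.fun` and `dist.args` can be illustrated with a minimal base R sketch. It assumes that the named list of extra arguments is forwarded to the distance function via `do.call`; this mechanism is an assumption for illustration, not taken from the package source, and `call_with_args` is a hypothetical helper.

```r
# Hypothetical helper: forward a named list of extra arguments to a function.
call_with_args <- function(fun, x, args = NULL) {
  do.call(fun, c(list(x), args))
}

set.seed(1)
X1 <- matrix(rnorm(20), ncol = 2)
X2 <- matrix(rnorm(30), ncol = 2)
pooled <- rbind(X1, X2)  # distances are computed on the pooled sample

# Euclidean distance, mirroring dist.fun = stats::dist with dist.args = NULL
d_euclid <- call_with_args(stats::dist, pooled)

# Manhattan distance, mirroring dist.args = list(method = "manhattan")
d_manhattan <- call_with_args(stats::dist, pooled, list(method = "manhattan"))
```

Given the usage shown above, the corresponding call would presumably be `SC(X1, X2, dist.args = list(method = "manhattan"))`.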

Details

Two multi-sample test statistics are defined by Song and Chen (2022) based on a similarity graph. The first one is defined as

S = S_W + S_B, \text{ where}

S_W = (R_W - \text{E}(R_W))^T \Sigma_W^{-1}(R_W - \text{E}(R_W)),

S_B = (R_B - \text{E}(R_B))^T \Sigma_B^{-1}(R_B - \text{E}(R_B)),

with R_W denoting the vector of within-sample edge counts and R_B the vector of between-sample edge counts. Expectations and covariance matrices are calculated under the null hypothesis that all samples come from the same distribution.
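
The within-sample and between-sample edge counts can be made concrete with a small base R sketch. It builds a minimum spanning tree on the pooled sample with a hand-written Prim's algorithm (`prim_mst` is a hypothetical stand-in for the package's `MST` function, not its actual implementation) and counts edges by the sample labels of their endpoints. Only the observed counts are computed; the null expectations and covariance matrices are not.

```r
# Prim's algorithm on a full distance matrix; returns an (n - 1) x 2 edge matrix.
prim_mst <- function(D) {
  n <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best_dist <- D[1, ]      # cheapest connection of each node to the tree
  best_from <- rep(1L, n)  # tree node realising that connection
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (e in seq_len(n - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]
    edges[e, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- !in_tree & D[j, ] < best_dist
    best_dist[closer] <- D[j, closer]
    best_from[closer] <- j
  }
  edges
}

set.seed(1234)
X1 <- matrix(rnorm(60), ncol = 3)
X2 <- matrix(rnorm(60, mean = 0.5), ncol = 3)
X3 <- matrix(rnorm(60, mean = 1), ncol = 3)
pooled <- rbind(X1, X2, X3)
labels <- rep(1:3, times = c(nrow(X1), nrow(X2), nrow(X3)))

D <- as.matrix(stats::dist(pooled))
edges <- prim_mst(D)

# 3 x 3 table of edge counts, indexed by the sample labels of the two endpoints
lab_lo <- pmin(labels[edges[, 1]], labels[edges[, 2]])
lab_hi <- pmax(labels[edges[, 1]], labels[edges[, 2]])
counts <- table(factor(lab_lo, levels = 1:3), factor(lab_hi, levels = 1:3))

R_W <- diag(counts)               # within-sample edge counts, one per sample
R_B <- counts[upper.tri(counts)]  # between-sample counts for pairs 1-2, 1-3, 2-3
```

Large within-sample counts relative to the between-sample counts indicate that observations cluster by sample, which is evidence against the null.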

The second statistic is defined as

S_A = (R_A - \text{E}(R_A))^T \Sigma_A^{-1}(R_A - \text{E}(R_A)),

where R_A is the vector of all linearly independent edge counts, i.e., the edge counts for all pairs of samples except the last pair, k-1 and k, with k denoting the number of samples.
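
The linear dependence among the edge counts follows because every edge of the similarity graph is either a within-sample or a between-sample edge. As a sketch, write R_{ii} for the within-sample edge count of sample i, R_{ij} with i < j for the between-sample edge count of samples i and j, and |G| for the total number of edges. This index notation is assumed here for illustration and does not appear elsewhere on this page.

```latex
\sum_{i=1}^{k} R_{ii} + \sum_{1 \le i < j \le k} R_{ij} = |G|
```

Since |G| does not depend on the sample labels, any single count is determined by the remaining ones, so one count can be dropped without losing information. For a minimum spanning tree on N pooled observations, |G| = N - 1.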

This implementation is a wrapper around the function gtestsmulti that adapts its input and output to match the other functions provided in this package. For more details, see the documentation of gtestsmulti.

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`p.value`	Permutation p value (only if `n.perm` > 0)
`estimate`	Estimated KMD value
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names
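
How a permutation p value arises can be shown with a generic base R sketch. It is not the computation performed by gtestsmulti: `perm_p_value` and `mean_shift_stat` are hypothetical helpers, a simple mean-difference statistic stands in for S or S_A, and the exact p value formula used by the package is not stated on this page.

```r
# Generic permutation p value: share of permuted statistics that are at least
# as large as the observed one. `stat_fun` is a stand-in for the test statistic.
perm_p_value <- function(pooled, labels, stat_fun, n.perm = 100) {
  observed <- stat_fun(pooled, labels)
  permuted <- replicate(n.perm, stat_fun(pooled, sample(labels)))
  list(statistic = observed, p.value = mean(permuted >= observed))
}

# Toy statistic: squared Euclidean distance between the two sample means.
mean_shift_stat <- function(pooled, labels) {
  m1 <- colMeans(pooled[labels == 1, , drop = FALSE])
  m2 <- colMeans(pooled[labels == 2, , drop = FALSE])
  sum((m1 - m2)^2)
}

set.seed(1234)
X1 <- matrix(rnorm(200), ncol = 4)
X2 <- matrix(rnorm(200, mean = 1), ncol = 4)
pooled <- rbind(X1, X2)
labels <- rep(1:2, times = c(nrow(X1), nrow(X2)))

res <- perm_p_value(pooled, labels, mean_shift_stat, n.perm = 200)
res$statistic
res$p.value
```

The returned list mimics the `statistic` and `p.value` components only; it is not an object of class htest.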

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	Yes

References

Song, H. and Chen, H. (2022). New graph-based multi-sample tests for high-dimensional and non-Euclidean data. doi:10.48550/arXiv.2205.13787

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
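# Perform the Song and Chen test (requires the gTestsMulti package)
if (requireNamespace("gTestsMulti", quietly = TRUE)) {
  # Statistic S with a permutation test based on 100 permutations
  print(SC(X1, X2, n.perm = 100))
  # Statistic S_A with the same number of permutations
  print(SC(X1, X2, n.perm = 100, type = "SA"))
}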
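
Since the test is a multi-sample test, further datasets can be passed through `...`. The following sketch adds a third dataset; `X3` and the seed value are illustrative choices, not part of the original example.

```r
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)  # illustrative third dataset

# Run only if both packages are available
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTestsMulti", quietly = TRUE)) {
  res <- DataSimilarity::SC(X1, X2, X3, n.perm = 100, seed = 42)
  print(res$statistic)  # observed value of the test statistic
  print(res$p.value)    # permutation p value, available because n.perm > 0
}
```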

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.