SC: Graph-Based Multi-Sample Test

Description

Performs the graph-based multi-sample test for high-dimensional data proposed by Song and Chen (2022). The test is computed by calling the gtestsmulti function from the gTestsMulti package.

Usage

SC(X1, X2, ..., n.perm = 0, dist.fun = stats::dist, graph.fun = MST, 
    dist.args = NULL, graph.args = NULL, type = "S", seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

n.perm

Number of permutations for permutation test (default: 0, no permutation test performed)

dist.fun

Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).

graph.fun

Function for calculating a similarity graph using the distance matrix on the pooled sample (default: MST, Minimum Spanning Tree).

dist.args

Named list of further arguments passed to dist.fun (default: NULL).

graph.args

Named list of further arguments passed to graph.fun (default: NULL).

type

Character specifying the test statistic to use. Possible options are "S" (default) and "SA". See Details.

seed

Random seed (default: 42)
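As a rough sketch of how the arguments above fit together, the following call compares three samples and swaps the default Euclidean distance for the Manhattan distance through dist.args. It assumes that the DataSimilarity and gTestsMulti packages are installed and that dist.args is forwarded to stats::dist as described above; the data and the seed values are made up for illustration.

```r
# Sketch only: requires the DataSimilarity and gTestsMulti packages
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTestsMulti", quietly = TRUE)) {
  set.seed(1)  # illustrative seed for the simulated data
  X1 <- matrix(rnorm(200), ncol = 4)
  X2 <- matrix(rnorm(200, mean = 0.5), ncol = 4)
  X3 <- matrix(rnorm(200, mean = 1), ncol = 4)
  # Third dataset is passed via '...'; 'method' is an argument of stats::dist
  res <- DataSimilarity::SC(X1, X2, X3, n.perm = 100,
                            dist.args = list(method = "manhattan"),
                            seed = 1)
  print(res)
}
```

All datasets should have the same number of columns, since the distance matrix is computed on the pooled sample.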

Details

Two multi-sample test statistics are defined by Song and Chen (2022) based on a similarity graph. The first one is defined as

S = S_W + S_B, \text{ where}

S_W = (R_W - \text{E}(R_W))^T \Sigma_W^{-1}(R_W - \text{E}(R_W)),

S_B = (R_B - \text{E}(R_B))^T \Sigma_B^{-1}(R_B - \text{E}(R_B)),

with R_W denoting the vector of within-sample edge counts, R_B the vector of between-sample edge counts, and \Sigma_W and \Sigma_B the covariance matrices of R_W and R_B, respectively. Expectations and covariance matrices are calculated under the null hypothesis.
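To make the edge counts concrete, here is a minimal base R sketch that builds a minimum spanning tree on a pooled sample and counts within-sample and between-sample edges. The helper prim_mst is hypothetical and written only for this illustration; it is not the MST function used by SC, and the sketch stops at the raw counts without computing expectations, covariances, or the test statistic.

```r
# Illustration only: within- and between-sample MST edge counts in base R
set.seed(1)
X1 <- matrix(rnorm(40), ncol = 2)            # sample 1: 20 points
X2 <- matrix(rnorm(40, mean = 1), ncol = 2)  # sample 2: 20 points
X3 <- matrix(rnorm(40, mean = 2), ncol = 2)  # sample 3: 20 points
pooled <- rbind(X1, X2, X3)
labels <- rep(1:3, times = c(nrow(X1), nrow(X2), nrow(X3)))
D <- unname(as.matrix(stats::dist(pooled)))  # Euclidean distance matrix

# Hypothetical helper: Prim's algorithm, returns an (N - 1) x 2 edge matrix
prim_mst <- function(D) {
  N <- nrow(D)
  in_tree <- c(TRUE, rep(FALSE, N - 1))  # start the tree at node 1
  best_dist <- D[1, ]                    # cheapest link of each node to the tree
  best_from <- rep(1L, N)                # tree node realising that link
  edges <- matrix(NA_integer_, nrow = N - 1, ncol = 2)
  for (e in seq_len(N - 1)) {
    cand <- which(!in_tree)
    j <- cand[which.min(best_dist[cand])]  # closest node outside the tree
    edges[e, ] <- c(best_from[j], j)
    in_tree[j] <- TRUE
    closer <- (!in_tree) & (D[j, ] < best_dist)
    best_dist[closer] <- D[j, closer]
    best_from[closer] <- j
  }
  edges
}

edges <- prim_mst(D)
# Sample labels of both endpoints, ordered so that lo <= hi
lo <- pmin(labels[edges[, 1]], labels[edges[, 2]])
hi <- pmax(labels[edges[, 1]], labels[edges[, 2]])
counts <- table(factor(lo, levels = 1:3), factor(hi, levels = 1:3))

R_W <- diag(counts)               # edges within samples 1, 2, 3
R_B <- counts[upper.tri(counts)]  # edges between pairs (1,2), (1,3), (2,3)
print(R_W)
print(R_B)
```

Because a spanning tree on N points has N - 1 edges, the entries of R_W and R_B always sum to N - 1, which is a quick sanity check on the counts.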

The second statistic is defined as

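S_A = (R_A - \text{E}(R_A))^T \Sigma_A^{-1}(R_A - \text{E}(R_A)),

where R_A is the vector of all linearly independent edge counts, i.e. the edge counts for all pairs of samples except the last pair k-1 and k, with k denoting the number of samples, and \Sigma_A is the covariance matrix of R_A under the null hypothesis.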

This implementation is a wrapper around the function gtestsmulti. It adapts the input and output of that function to match the other functions provided in this package. For more details, see gtestsmulti.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p value (only if n.perm > 0)

estimate

Estimated KMD value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names
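The components listed above can be read from the returned object like any other htest list. The following sketch assumes that the DataSimilarity and gTestsMulti packages are installed; the simulated data are illustrative only.

```r
# Sketch only: requires the DataSimilarity and gTestsMulti packages
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("gTestsMulti", quietly = TRUE)) {
  set.seed(1)  # illustrative seed for the simulated data
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  res <- DataSimilarity::SC(X1, X2, n.perm = 100)
  res$statistic  # observed value of the test statistic
  res$p.value    # permutation p value, since n.perm > 0
  res$method     # description of the test
  res$data.name  # the dataset names
}
```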

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             Yes

References

Song, H. and Chen, H. (2022). New graph-based multi-sample tests for high-dimensional and non-Euclidean data. doi:10.48550/arXiv.2205.13787

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

See Also

gTestsMulti for performing both tests at once, MST

Examples

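# Draw two samples of 100 observations in 10 dimensions;
# the second sample has a mean shift of 0.5
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Song and Chen test (requires the gTestsMulti package)
if(requireNamespace("gTestsMulti", quietly = TRUE)) {
  # Statistic S (default) with 100 permutations
  SC(X1, X2, n.perm = 100)
  # Statistic S_A with 100 permutations
  SC(X1, X2, n.perm = 100, type = "SA")
}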

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.