MW: Nonparametric Graph-Based LP (GLP) Test

MW R Documentation

Nonparametric Graph-Based LP (GLP) Test

Description

Performs the nonparametric graph-based LP (GLP) multisample test proposed by Mukhopadhyay and Wang (2020). The test is computed with the GLP implementation from the LPKsample package.

Usage

MW(X1, X2, ..., sum.all = FALSE, m.max = 4, components = NULL, alpha = 0.05, 
    c.poly = 0.5, clust.alg = "kmeans", n.perm = 0, combine.criterion = "kernel", 
    multiple.comparison = TRUE, compress.algorithm = FALSE, nbasis = 8, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

sum.all

Should all components be summed up for calculating the test statistic? (default: FALSE, only significant components are summed up)

m.max

Maximum order of LP components to investigate (default: 4)

components

Vector specifying which components to test (default: NULL). If components is not NULL, only the specified components are examined and m.max is ignored.

alpha

Significance level α (default: 0.05)

c.poly

Parameter for polynomial kernel (default: 0.5)

clust.alg

Character specifying the cluster algorithm used in graph community detection. Possible options are "kmeans" (default) and "mclust".

n.perm

Number of permutations for permutation test (default: 0, asymptotic test is performed).

combine.criterion

Character specifying how to obtain the overall test result based on the component-wise results. Possible options are "kernel" (default), meaning that an overall kernel W is computed based on the significant components and the LP graph test is run on W, and "pvalue", which uses Fisher's method to combine the p values from each component.

multiple.comparison

Should an adjustment for multiple comparisons be used when determining which components are significant? (default: TRUE)

compress.algorithm

Should smooth compression of Laplacian spectra be used for testing? (default: FALSE). It is recommended to set this to TRUE for large sample sizes.

nbasis

Number of bases used for approximation when compress.algorithm = TRUE (default: 8)

seed

Random seed (default: 42)

Details

The GLP statistic is based on learning an LP graph kernel from a pre-specified number of LP components and clustering the eigenvectors of the Laplacian matrix of this learned kernel. For each component of the LP graph kernel, the cluster assignment is tested for association with the true dataset memberships. The component-wise results are then combined either by constructing a super-kernel from specific components and repeating the clustering and testing step, or by combining the results of the significant components after adjustment for multiple testing.

Small values of the GLP statistic indicate dataset similarity. Therefore, the test rejects for large values.
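
The "pvalue" option of combine.criterion is documented as using Fisher's method to combine component-wise p values. The following is a minimal base R sketch of Fisher's method in general, not the code used by LPKsample; the helper name fisher_combine and the p values are hypothetical and only serve as illustration.

```r
# Fisher's method: under the null hypothesis and for independent p values
# p_1, ..., p_k, the statistic -2 * sum(log(p_i)) follows a chi-squared
# distribution with 2k degrees of freedom.
# fisher_combine is a hypothetical helper, not part of DataSimilarity.
fisher_combine <- function(p) {
  stopifnot(is.numeric(p), length(p) >= 1, all(p > 0 & p <= 1))
  stat <- -2 * sum(log(p))
  df <- 2 * length(p)
  list(statistic = stat,
       df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Made-up component-wise p values, for illustration only
p_components <- c(0.01, 0.20, 0.03, 0.60)
combined <- fisher_combine(p_components)
print(combined$p.value)
```

Large values of the combined statistic correspond to small combined p values, which matches the rejection rule stated above.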

Value

An object of class htest with the following components:

statistic

Observed value of the GLP test statistic

p.value

Asymptotic or permutation overall p value

null.value

Needed for pretty printing of results

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis
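
Since the return value is an htest object, its components can be extracted with the usual list accessors. The sketch below assumes that both DataSimilarity and LPKsample are installed and skips the call otherwise; the decision threshold of 0.05 is only an illustrative choice.

```r
# Simulated data, as in the Examples section
set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Only run the test if the required packages are available
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("LPKsample", quietly = TRUE)) {
  res <- DataSimilarity::MW(X1, X2)
  print(res$statistic)  # observed GLP statistic
  print(res$p.value)    # overall p value (asymptotic, since n.perm = 0)
  # Reject the null hypothesis at the (illustrative) 5% level?
  print(res$p.value < 0.05)
}
```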

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             Yes

Note

When sum.all = FALSE and no components are significant, the test statistic value is always set to zero.

Note that the implementation cannot handle univariate data.
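
Because univariate data are not supported, it can help to check the inputs before calling the test. The following is a minimal sketch; check_multivariate is a hypothetical helper that is not part of DataSimilarity, and it assumes that "univariate" means fewer than two columns.

```r
# Hypothetical helper (not part of DataSimilarity): stops with an error if
# any supplied dataset has fewer than two columns.
check_multivariate <- function(...) {
  datasets <- list(...)
  # NCOL() treats plain vectors as one-column data
  n.cols <- vapply(datasets, function(d) NCOL(d), integer(1))
  if (any(n.cols < 2L)) {
    stop("All datasets must have at least two columns; found column counts: ",
         paste(n.cols, collapse = ", "))
  }
  invisible(TRUE)
}

# A matrix and a data.frame with two columns each pass the check
check_multivariate(matrix(rnorm(20), ncol = 2),
                   data.frame(a = rnorm(5), b = rnorm(5)))
```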

References

Mukhopadhyay, S. and Wang, K. (2020). A nonparametric approach to high-dimensional k-sample comparison problems, Biometrika, 107(3), 555-572, doi:10.1093/biomet/asaa015

Mukhopadhyay, S. and Wang, K. (2019). Towards a unified statistical theory of spectral graph analysis, doi:10.48550/arXiv.1901.07090

Mukhopadhyay, S. and Wang, K. (2020). LPKsample: LP Nonparametric High Dimensional K-Sample Comparison. R package version 2.1, https://CRAN.R-project.org/package=LPKsample

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform GLP test 
if(requireNamespace("LPKsample", quietly = TRUE)) {
  MW(X1, X2, n.perm = 100)
}
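
The usage above lists further datasets via `...` and the Applicability table marks the test as K-sample. The following sketch shows a three-sample call with the "pvalue" combination criterion, assuming that DataSimilarity and LPKsample are installed; the simulated data and the seed are illustrative choices.

```r
# Three simulated datasets; only the third has a shifted mean
set.seed(1)
X1 <- matrix(rnorm(500), ncol = 5)
X2 <- matrix(rnorm(500), ncol = 5)
X3 <- matrix(rnorm(500, mean = 1), ncol = 5)

# Only run the test if the required packages are available
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("LPKsample", quietly = TRUE)) {
  # Combine the component-wise results via Fisher's method
  res <- DataSimilarity::MW(X1, X2, X3, combine.criterion = "pvalue")
  print(res)
}
```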

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.