Petrie: Multisample Crossmatch (MCM) Test

Petrie R Documentation

Description

Performs the multisample crossmatch (MCM) test (Petrie, 2016).

Usage

Petrie(X1, X2, ..., dist.fun = stats::dist, dist.args = NULL, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

dist.fun

Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).

dist.args

Named list of further arguments passed to dist.fun (default: NULL).

seed

Random seed (default: 42)
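
How dist.fun and dist.args combine can be illustrated in base R. The sketch below assumes that the datasets are pooled row-wise and that dist.fun is called on the pooled data with the entries of dist.args appended as further arguments. The do.call pattern is an assumption about the internals, not taken from the package source, and the toy matrices are made up for illustration.

```r
# Two tiny made-up datasets with two columns each
X1 <- matrix(c(0, 0, 1, 1), ncol = 2, byrow = TRUE)
X2 <- matrix(c(3, 4, 6, 8), ncol = 2, byrow = TRUE)

# Pool the datasets row-wise (assumed to mirror what the test does internally)
pooled <- rbind(X1, X2)

# Distance function and its extra arguments, as they would be supplied by the user
dist.fun <- stats::dist
dist.args <- list(method = "manhattan")

# Call dist.fun on the pooled data, appending the named extra arguments
D <- as.matrix(do.call(dist.fun, c(list(pooled), dist.args)))
print(D)
```

Under that assumption, Petrie(X1, X2, dist.args = list(method = "manhattan")) would run the test on Manhattan rather than Euclidean distances.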

Details

The test extends the crossmatch test of Rosenbaum (2005) to more than two samples by using the crossmatch counts of all pairs of samples.

The observed cross-counts are calculated using the functions distancematrix and nonbimatch from the nbpMatching package.

High values of the multisample crossmatch statistic indicate similarity between the datasets. Thus, the test rejects the null hypothesis of equal distributions for low values of the test statistic.
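
To make the crossmatch count concrete, the following sketch computes it for a tiny pooled sample of three groups. It is not the package's implementation: the hypothetical helper min_matching finds the optimal matching by brute force in place of nbpMatching::nonbimatch, which is only feasible for a handful of observations. The null expectation follows from the permutation argument that a given matched pair joins samples i and j with probability 2 n_i n_j / (N (N - 1)); whether the returned mu0 is computed in exactly this way is an assumption.

```r
# Brute-force minimum-distance perfect matching (hypothetical helper).
# Pairs the first unmatched index with every possible partner and recurses.
min_matching <- function(D, idx = seq_len(nrow(D))) {
  if (length(idx) == 0) {
    return(list(cost = 0, pairs = matrix(integer(0), ncol = 2)))
  }
  first <- idx[1]
  best <- list(cost = Inf, pairs = NULL)
  for (partner in idx[-1]) {
    rest <- min_matching(D, setdiff(idx, c(first, partner)))
    cost <- D[first, partner] + rest$cost
    if (cost < best$cost) {
      best <- list(cost = cost, pairs = rbind(c(first, partner), rest$pairs))
    }
  }
  best
}

# Made-up univariate pooled sample with three groups (N = 8, even)
x <- c(0, 1, 10, 11, 20, 21, 30, 31)
labels <- c(1, 1, 1, 2, 2, 2, 3, 3)

D <- as.matrix(stats::dist(x))
m <- min_matching(D)

# A matched pair is a crossmatch if its two members come from different samples
is_cross <- labels[m$pairs[, 1]] != labels[m$pairs[, 2]]
total_cross <- sum(is_cross)

# Expected total crossmatch count under the null: sum over i < j of n_i n_j / (N - 1)
n <- as.numeric(table(labels))
N <- length(labels)
mu0 <- (N^2 - sum(n^2)) / (2 * (N - 1))

print(c(observed = total_cross, expected = mu0))
```

Here only one of the four matched pairs crosses samples, compared with three expected under the null, which is the direction in which the test rejects.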

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Asymptotic p value

estimate

Observed multisample edge-count

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

stderr

Standard deviation under the null

mu0

Expectation under the null
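
The components can be related to each other as follows. This is a sketch under the assumption that statistic is the observed count standardized by mu0 and stderr, and that the asymptotic p value is the lower tail of a standard normal distribution, consistent with rejecting for low values. The numbers are made up for illustration and are not output of Petrie.

```r
# Hypothetical values standing in for the estimate, mu0 and stderr components
estimate <- 40
mu0 <- 50
stderr <- 4

# Assumed standardization of the observed count
z <- (estimate - mu0) / stderr

# Assumed one-sided (lower-tail) normal approximation
p_value <- stats::pnorm(z)
print(c(statistic = z, p.value = p_value))
```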

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        Yes            Yes

Note

If the distance matrix contains ties, the optimal non-bipartite matching may not be unique. In that case, observations are matched in the order in which the samples are supplied, and the search for a match starts at the end of the pooled sample. With many ties (e.g., for categorical data), observations from the first dataset are therefore often matched with observations from the last dataset, and so on. This can compromise the validity of the test.
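
Before relying on the test for categorical or heavily discretized data, it can help to check how many ties the distance matrix contains. The sketch below uses a made-up binary dataset and base R only; the 0/1 coding and the Manhattan distance are illustrative choices, not package defaults.

```r
# Made-up categorical data, coded as 0/1 indicators (six observations)
X <- rbind(c(0, 1), c(0, 1), c(1, 0), c(1, 1), c(0, 0), c(1, 0))

# All pairwise distances (lower triangle of the distance matrix)
d <- as.vector(stats::dist(X, method = "manhattan"))

# Number of distinct distance values and share of distances that repeat an earlier value
n_distinct <- length(unique(d))
tie_share <- mean(duplicated(d))
print(c(pairs = length(d), distinct = n_distinct, tie_share = tie_share))
```

A large share of tied distances suggests that the matching, and hence the test result, may depend on the order in which the datasets are supplied.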

References

Mukherjee, S., Agarwal, D., Zhang, N. R. and Bhattacharya, B. B. (2022). Distribution-Free Multisample Tests Based on Optimal Matchings With Applications to Single Cell Genomics. Journal of the American Statistical Association, 117(538), 627-638. doi:10.1080/01621459.2020.1791131

Rosenbaum, P. R. (2005). An Exact Distribution-Free Test Comparing Two Multivariate Distributions Based on Adjacency. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(4), 515-530.

Petrie, A. (2016). Graph-theoretic multisample tests of equality in distribution for high dimensional data. Computational Statistics & Data Analysis, 96, 145-158. doi:10.1016/j.csda.2015.11.003

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

MMCM, Rosenbaum

Examples

# Draw some data
set.seed(1234)  # make the simulated data reproducible
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform the MCM test (requires the nbpMatching package)
if (requireNamespace("nbpMatching", quietly = TRUE)) {
  Petrie(X1, X2)
}
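
Additional datasets are passed through the ... argument. The following sketch adds a third sample; the sample sizes and mean shifts are arbitrary illustrative choices, and the call is guarded so that it only runs if both DataSimilarity and nbpMatching are installed.

```r
# Draw three samples with increasing mean shift (illustrative values)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)

# Perform the MCM test on all three samples if the required packages are available
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("nbpMatching", quietly = TRUE)) {
  res <- DataSimilarity::Petrie(X1, X2, X3)
  print(res)
}
```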

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.