SH: Schilling-Henze Nearest Neighbor Test

SH R Documentation

Schilling-Henze Nearest Neighbor Test

Description

Performs the Schilling-Henze two-sample test for multivariate data (Schilling, 1986; Henze, 1988).

Usage

SH(X1, X2, K = 1, graph.fun = knn.bf, dist.fun = stats::dist, n.perm = 0, 
    dist.args = NULL, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

K

Number of nearest neighbors to consider (default: 1)

graph.fun

Function for calculating a similarity graph from the distance matrix of the pooled sample. The default, knn.bf, finds the K nearest neighbors by ranking all pairwise distances. Alternatives are knn, a wrapper that extracts the edge matrix from the result of kNN in dbscan; knn.fast, a wrapper for the approximate KNN implementation get.knn in FNN; or any other function that calculates the KNN edge matrix from a distance matrix and the number of nearest neighbors K.

dist.fun

Function for calculating a distance matrix on the pooled dataset (default: stats::dist, Euclidean distance).

n.perm

Number of permutations for permutation test (default: 0, asymptotic test is performed).

dist.args

Named list of further arguments passed to dist.fun.

seed

Random seed (default: 42)
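
The interplay of dist.fun and dist.args can be illustrated with a minimal base R sketch. How SH forwards the list internally is an assumption here; the sketch only shows the usual do.call pattern for passing a named list of extra arguments to a distance function. The toy matrix X is hypothetical.

```r
# Illustrative sketch; the forwarding mechanism inside SH is an assumption.
X <- matrix(c(0, 0,
              3, 4), ncol = 2, byrow = TRUE)   # two toy points in the plane

dist.fun  <- stats::dist
dist.args <- list(method = "manhattan")        # extra arguments as a named list

# Call dist.fun with the data first, followed by the named extra arguments
d <- do.call(dist.fun, c(list(X), dist.args))
d
```

With the package loaded, the corresponding call would be of the form SH(X1, X2, dist.args = list(method = "manhattan")).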

Details

The test statistic is the proportion of edges that connect points from the same dataset in a K-nearest neighbor graph calculated on the pooled sample, standardized with its expectation and standard deviation under the null hypothesis.
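
The unstandardized part of this statistic can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the variable names (pooled, labels, nn, prop_within, expected) are made up for the example, the neighbor search is a simple brute-force ranking of distances, and the standardization by the null standard deviation is omitted. The null expectation shown follows from the sample labels being exchangeable under the null hypothesis.

```r
# Illustrative sketch only (base R); not the DataSimilarity implementation.
set.seed(1)
X1 <- matrix(rnorm(40), ncol = 2)             # sample 1: 20 points
X2 <- matrix(rnorm(60, mean = 1), ncol = 2)   # sample 2: 30 points
K  <- 3

pooled <- rbind(X1, X2)
labels <- rep(c(1, 2), times = c(nrow(X1), nrow(X2)))
n <- nrow(pooled)

# Pairwise Euclidean distances; a point must not be its own neighbor
D <- as.matrix(stats::dist(pooled))
diag(D) <- Inf

# n x K matrix: row i holds the indices of the K nearest neighbors of point i
nn <- matrix(apply(D, 1, function(d) order(d)[seq_len(K)]),
             nrow = n, byrow = TRUE)

# Proportion of the n * K directed edges that join points of the same sample
same <- matrix(labels[nn], nrow = n) == labels
prop_within <- mean(same)

# Expected proportion when the labels are exchangeable (null hypothesis)
n1 <- nrow(X1)
n2 <- nrow(X2)
expected <- (n1 * (n1 - 1) + n2 * (n2 - 1)) / (n * (n - 1))

c(observed = prop_within, expected = expected)
```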

Low values of the test statistic indicate similarity of the datasets. Thus, the null hypothesis of equal distributions is rejected for high values.

For n.perm = 0, an asymptotic test based on the normal approximation of the conditional null distribution is performed. For n.perm > 0, a permutation test is performed.
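
The idea behind the permutation variant can be sketched in base R. This is a hedged illustration, not the code used by SH: within_prop is a hypothetical helper, the statistic is left unstandardized, and the p value formula with the added one in numerator and denominator is a common convention assumed here rather than taken from the package.

```r
# Illustrative sketch only (base R); not the DataSimilarity implementation.
# Hypothetical helper: proportion of K-NN edges joining same-sample points.
within_prop <- function(nn, labels) {
  mean(matrix(labels[nn], nrow = nrow(nn)) == labels)
}

set.seed(1)
X1 <- matrix(rnorm(40), ncol = 2)
X2 <- matrix(rnorm(60, mean = 1), ncol = 2)
K <- 3
n.perm <- 199

pooled <- rbind(X1, X2)
labels <- rep(c(1, 2), times = c(nrow(X1), nrow(X2)))
n <- nrow(pooled)

# The neighbor graph depends only on the pooled points, so build it once
D <- as.matrix(stats::dist(pooled))
diag(D) <- Inf
nn <- matrix(apply(D, 1, function(d) order(d)[seq_len(K)]),
             nrow = n, byrow = TRUE)

observed <- within_prop(nn, labels)

# Shuffle the sample labels and recompute the statistic for each permutation
permuted <- replicate(n.perm, within_prop(nn, sample(labels)))

# One-sided p value: large values of the statistic speak against the null
p.perm <- (sum(permuted >= observed) + 1) / (n.perm + 1)
p.perm
```

Because the graph is fixed across permutations, only the labels are reshuffled, which keeps each permutation cheap.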

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Asymptotic or permutation p value

estimate

The number of within-sample edges

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             No

Note

The default of K = 1 is chosen rather arbitrarily, based on computational speed, as no good rule for choosing K has been proposed in the literature so far. Typical values for K in the literature are 1 and 5.

References

Schilling, M. F. (1986). Multivariate Two-Sample Tests Based on Nearest Neighbors. Journal of the American Statistical Association, 81(395), 799-806. doi:10.2307/2289012

Henze, N. (1988). A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences. The Annals of Statistics, 16(2), 772-783.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

knn for the nearest neighbor search; BQS, FR, CF, CCS, ZC for other graph-based tests; FR_cat, CF_cat, CCS_cat, and ZC_cat for versions of these tests for categorical data

Examples

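library(DataSimilarity)
# Draw some data (seed fixed so that the example is reproducible)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Schilling-Henze test (asymptotic version, K = 1)
SH(X1, X2)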

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.