SH: Schilling-Henze Nearest Neighbor Test
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

SH	R Documentation

Schilling-Henze Nearest Neighbor Test

Description

Performs the Schilling-Henze two-sample test for multivariate data (Schilling, 1986; Henze, 1988).

Usage

SH(X1, X2, K = 1, graph.fun = knn.bf, dist.fun = stats::dist, n.perm = 0, 
    dist.args = NULL, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`K`	Number of nearest neighbors to consider (default: 1)
`graph.fun`	Function for calculating a similarity graph from the distance matrix of the pooled sample. The default, `knn.bf`, finds the `K` nearest neighbors by ranking all pairwise distances. Alternatives are `knn`, a wrapper that extracts the edge matrix from the result of `kNN` in dbscan; `knn.fast`, a wrapper for the approximate KNN implementation `get.knn` in FNN; or any other function that calculates the KNN edge matrix from a distance matrix and the number of nearest neighbors `K`.
`dist.fun`	Function for calculating a distance matrix on the pooled dataset (default: `stats::dist`, Euclidean distance).
`n.perm`	Number of permutations for permutation test (default: 0, asymptotic test is performed).
`dist.args`	Named list of further arguments passed to `dist.fun`.
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.

Details

The test statistic is the proportion of edges connecting points from the same dataset in a K-nearest neighbor graph calculated on the pooled sample, standardized by its expectation and standard deviation under the null hypothesis.
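The raw quantity behind the statistic can be sketched in base R as follows. This is only an illustration, not the package's implementation: `knn_same_prop` is a hypothetical helper, it uses Euclidean distances via `stats::dist`, and it returns the unstandardized proportion of same-sample edges (the standardization with the null expectation and standard deviation is omitted).

```r
# Hypothetical helper, not part of DataSimilarity: unstandardized proportion of
# K-NN edges that join points from the same dataset.
knn_same_prop <- function(X1, X2, K = 1) {
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  pooled <- rbind(X1, X2)
  labels <- rep(c(1L, 2L), times = c(nrow(X1), nrow(X2)))
  D <- as.matrix(stats::dist(pooled))  # Euclidean distances on the pooled sample
  diag(D) <- Inf                       # exclude each point from its own neighbors
  same <- lapply(seq_len(nrow(D)), function(i) {
    nn <- order(D[i, ])[seq_len(K)]    # indices of the K nearest neighbors of i
    labels[nn] == labels[i]            # TRUE where the neighbor shares i's dataset
  })
  mean(unlist(same))
}

set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
knn_same_prop(X1, X2, K = 1)
```

Values close to 1 mean that neighbors mostly come from the same dataset, which is the pattern expected when the two distributions differ.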

Low values of the test statistic indicate similarity of the datasets. Thus, the null hypothesis of equal distributions is rejected for high values.

For n.perm = 0, the test uses the asymptotic normal approximation of the conditional null distribution. For n.perm > 0, a permutation test is performed.
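The permutation variant can be illustrated with a minimal base R sketch. The helper `perm_knn_test` is hypothetical and the p value formula (counting permuted statistics at least as large as the observed one, with a plus-one correction) is an assumption, not necessarily what `SH` computes internally. The sketch relies on the fact that the K-nearest neighbor graph of the pooled sample does not depend on the labels, so it is built once and only the labels are permuted.

```r
# Hypothetical helper, not part of DataSimilarity: permutation test on the
# unstandardized proportion of same-sample K-NN edges.
perm_knn_test <- function(X1, X2, K = 1, n.perm = 199, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # only set a seed if one is provided
  X1 <- as.matrix(X1)
  X2 <- as.matrix(X2)
  pooled <- rbind(X1, X2)
  labels <- rep(c(1L, 2L), times = c(nrow(X1), nrow(X2)))
  D <- as.matrix(stats::dist(pooled))
  diag(D) <- Inf
  # K x N matrix: column i holds the indices of the K nearest neighbors of i
  nn <- vapply(seq_len(nrow(D)),
               function(i) order(D[i, ])[seq_len(K)],
               integer(K))
  nn <- matrix(nn, nrow = K)
  # Proportion of edges whose endpoints carry the same label
  stat <- function(lab) mean(lab[nn] == rep(lab, each = K))
  obs <- stat(labels)
  perm <- replicate(n.perm, stat(sample(labels)))
  # High values speak against the null, so count permutations >= observed
  list(statistic = obs,
       p.value = (1 + sum(perm >= obs)) / (n.perm + 1))
}

set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
perm_knn_test(X1, X2, K = 1, n.perm = 199, seed = 1234)
```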

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`p.value`	Asymptotic or permutation p value
`estimate`	The number of within-sample edges
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names
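Because the result is a list of class htest, its components can be read with `$`. The following sketch assumes the DataSimilarity package is installed (the call is skipped otherwise) and uses only arguments listed under Usage; the values of `K`, `n.perm` and `seed` are arbitrary choices for illustration.

```r
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Run only if the package is available
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  res <- DataSimilarity::SH(X1, X2, K = 5, n.perm = 100, seed = 1234)
  res$statistic  # observed value of the test statistic
  res$p.value    # permutation p value, since n.perm > 0
  res$estimate   # number of within-sample edges
}
```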

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	No

Note

The default of K = 1 is chosen rather arbitrarily, based on computational speed, as no good rule for choosing K has been proposed in the literature so far. Typical values for K in the literature are 1 and 5.

References

Schilling, M. F. (1986). Multivariate Two-Sample Tests Based on Nearest Neighbors. Journal of the American Statistical Association, 81(395), 799-806. doi:10.2307/2289012

Henze, N. (1988). A Multivariate Two-Sample Test Based on the Number of Nearest Neighbor Type Coincidences. The Annals of Statistics, 16(2), 772-783.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

Examples

library(DataSimilarity)
set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform Schilling-Henze test
SH(X1, X2)

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.