HMN: Random Forest Based Two-Sample Test


HMN R Documentation

Random Forest Based Two-Sample Test

Description

Performs the random forest based two-sample test proposed by Hediger et al. (2022). This implementation relies on the function hypoRF from the hypoRF package.

Usage

HMN(X1, X2, n.perm = 0, statistic = "PerClassOOB", normal.approx = FALSE, 
    seed = 42, ...)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for the permutation test (default: 0, in which case the asymptotic test is performed instead).

statistic

Character specifying the test statistic. Possible options are "PerClassOOB" (default) corresponding to the sum of out-of-bag (OOB) per class errors, and "OverallOOB" corresponding to the overall OOB error.

normal.approx

Should a normal approximation be used in the permutation test procedure? (default: FALSE)

seed

Random seed (default: 42)

...

Arguments passed to ranger

Details

For the test, a random forest is fitted to the pooled dataset, with the original dataset membership as the target variable. For the permutation test (n.perm > 0), the test statistic is either the overall out-of-bag (OOB) classification error or the sum or mean of the per-class OOB errors. For the asymptotic test (n.perm = 0), the pooled dataset is split into a training and a test set, and the test statistic is either the overall classification error on the test set or the mean of the per-class classification errors on the test set. In the former case a binomial test is performed; in the latter case a Wald test is performed. If the underlying distributions coincide, classification errors close to chance level are expected. The test rejects for small classification errors.
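The asymptotic variant with the overall test-set error can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used by hypoRF: the helper heldout_binomial_test is hypothetical, the 50/50 train/test split and num.trees = 200 are arbitrary choices, and a chance level of 0.5 assumes roughly balanced samples.

```r
# Hypothetical helper (not part of DataSimilarity or hypoRF):
# one-sided binomial test of the held-out misclassification count
# against an assumed chance level.
heldout_binomial_test <- function(truth, pred, chance = 0.5) {
  stopifnot(length(truth) == length(pred))
  n_err <- sum(as.character(truth) != as.character(pred))
  # Few errors are evidence against equal distributions
  binom.test(n_err, length(truth), p = chance, alternative = "less")
}

set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

if (requireNamespace("ranger", quietly = TRUE)) {
  # Pool the data; the target is the dataset membership
  pooled <- data.frame(rbind(X1, X2),
                       y = factor(rep(c("X1", "X2"),
                                      times = c(nrow(X1), nrow(X2)))))
  # Split into training and test halves (assumed split ratio)
  idx <- sample(nrow(pooled), nrow(pooled) %/% 2)
  fit <- ranger::ranger(y ~ ., data = pooled[idx, ], num.trees = 200)
  pred <- predict(fit, data = pooled[-idx, ])$predictions
  print(heldout_binomial_test(pooled$y[-idx], pred))
}
```

A small p value here indicates that the forest separates the two datasets better than chance, which matches the rejection rule stated above.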

Note that the per-class OOB statistic differs between the permutation test and the asymptotic test: for the permutation test, the sum of the per-class OOB errors is returned; for the asymptotic test, the standardized sum is returned.
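To illustrate how the overall error and the sum of the per-class errors can differ, the following base R sketch computes both from a vector of true labels and a vector of predicted labels. The helper class_errors and the toy labels are hypothetical and only serve as an illustration; the standardization applied in the asymptotic test is not reproduced here.

```r
# Hypothetical helper: overall error and sum of per-class errors
class_errors <- function(truth, pred) {
  truth <- as.character(truth)
  pred <- as.character(pred)
  wrong <- truth != pred
  # Misclassification rate within each true class
  per_class <- tapply(wrong, truth, mean)
  list(overall = mean(wrong),
       per_class = per_class,
       per_class_sum = sum(per_class))
}

# Toy example with unbalanced classes: 80 points from X1, 20 from X2
truth <- rep(c("X1", "X2"), times = c(80, 20))
# A classifier that mostly predicts the larger class
pred <- c(rep("X1", 80), rep("X1", 15), rep("X2", 5))

err <- class_errors(truth, pred)
err$overall        # 0.15
err$per_class_sum  # 0.75
```

For a classifier whose predictions are independent of the true class, the two per-class errors sum to 1 regardless of the class proportions, whereas the chance level of the overall error depends on how unbalanced the samples are.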

This implementation is a wrapper around the function hypoRF that adapts its input and output to match the other functions provided in this package. For more details, see hypoRF.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

parameter

Parameter(s) of the null distribution

p.value

p value of the test (asymptotic for n.perm = 0, permutation-based for n.perm > 0)

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

val

The OOB statistic values for the permuted data (for n.perm > 0)

varest

The estimated variance of the OOB statistic values for the permuted data (for n.perm > 0)

importance_ranking

Variable importance (for importance = "impurity")

cutoff

The quantile of the importance distribution at level α
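As a brief usage sketch, the components listed above can be read from the returned htest object with the usual $ accessor. The example assumes that the DataSimilarity and hypoRF packages are installed; the seed and the small number of permutations are arbitrary choices made to keep the runtime low.

```r
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("hypoRF", quietly = TRUE)) {
  set.seed(1)
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  # Permutation test with a deliberately small number of permutations
  res <- DataSimilarity::HMN(X1, X2, n.perm = 10)
  print(res$statistic)  # observed value of the test statistic
  print(res$p.value)    # p value of the test
  print(res$method)     # description of the test
  print(res$val)        # OOB statistic values for the permuted data
}
```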

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        Yes            No

References

Hediger, S., Michel, L., Näf, J. (2022). On the use of random forest for two-sample testing. Computational Statistics & Data Analysis, 170, 107435, doi:10.1016/j.csda.2022.107435.

Hediger, S., Michel, L., Näf, J. (2021). hypoRF: Random Forest Two-Sample Tests. R package version 1.0.0, https://CRAN.R-project.org/package=hypoRF.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298, doi:10.1214/24-SS149.

See Also

ranger, C2ST, YMRZL

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform random forest based test (low number of permutations due to runtime, 
# should be chosen considerably higher in practice) 
if(requireNamespace("hypoRF", quietly = TRUE)) {
  HMN(X1, X2, n.perm = 10)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.