HMN | R Documentation |
Performs the random forest based two-sample test proposed by Hediger et al. (2022). The implementation wraps the hypoRF function from the hypoRF package.
HMN(X1, X2, n.perm = 0, statistic = "PerClassOOB", normal.approx = FALSE,
seed = 42, ...)
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
n.perm |
Number of permutations for permutation test (default: 0, binomial test is performed). |
statistic |
Character specifying the test statistic. Possible options are |
normal.approx |
Should a normal approximation be used in the permutation test procedure? (default: |
seed |
Random seed (default: 42) |
... |
Arguments passed to |
For the test, a random forest is fitted to the pooled dataset, with the original dataset membership as the target variable. For the permutation test, the test statistic is either the overall out-of-bag classification accuracy or the sum or mean of the per-class out-of-bag errors. For the asymptotic test (n.perm = 0), the pooled dataset is split into a training and a test set, and the test statistic is either the overall classification error on the test set or the mean of the per-class classification errors on the test set. In the former case a binomial test is performed; in the latter case a Wald test is performed. If the underlying distributions coincide, classification errors close to chance level are expected. The test rejects for small classification errors.
Note that the per-class OOB statistic differs between the permutation test and the asymptotic test: for the permutation test, the sum of the per-class OOB errors is returned; for the asymptotic test, the standardized sum is returned.
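The asymptotic variant with the overall classification error can be sketched in a few lines of R. This is only an illustration of the idea, not the code used by hypoRF: the 50/50 split, the ranger default settings, the chance level of 0.5 (which assumes equally sized samples) and all variable names are assumptions made for this sketch.

```r
# Illustrative sketch only; split ratio, forest settings and names are assumptions.
set.seed(42)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

# Pool both datasets; the target variable is the dataset membership
pooled <- data.frame(rbind(X1, X2))
pooled$membership <- factor(rep(c("X1", "X2"), c(nrow(X1), nrow(X2))))

# Split the pooled data into a training and a test set (assumed 50/50)
test.idx <- sample(nrow(pooled), size = nrow(pooled) %/% 2)
train <- pooled[-test.idx, ]
test <- pooled[test.idx, ]

if (requireNamespace("ranger", quietly = TRUE)) {
  # Fit a random forest classifier for the dataset membership
  fit <- ranger::ranger(membership ~ ., data = train)
  pred <- predict(fit, data = test)$predictions
  n.errors <- sum(pred != test$membership)
  # Under H0 the error rate should be near chance level (0.5 for balanced
  # samples); few errors are evidence against H0, hence alternative = "less"
  res <- binom.test(n.errors, nrow(test), p = 0.5, alternative = "less")
  print(res$p.value)
}
```

A one-sided binomial test is used here because the test rejects only for small classification errors.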
This implementation is a wrapper function around the function hypoRF that modifies the in- and output of that function to match the other functions provided in this package. For more details see hypoRF.
An object of class htest with the following components:
statistic |
Observed value of the test statistic |
parameter |
Parameter(s) of the null distribution |
p.value |
Asymptotic p value |
alternative |
The alternative hypothesis |
method |
Description of the test |
data.name |
The dataset names |
val |
The OOB statistic values for the permuted data (for |
varest |
The estimated variance of the OOB statistic values for the permuted data (for |
importance_ranking |
Variable importance (for |
cutoff |
The quantile of the importance distribution at level |
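How the statistic values for the permuted data can be turned into a p value may be sketched as follows. This is a generic permutation p value together with a normal approximation, shown under the assumption that small values of the statistic speak against the null hypothesis; it is not taken from the hypoRF source, and the numbers and variable names are made up for illustration.

```r
# Hypothetical values for illustration only (not output of HMN)
obs <- 0.62                                  # observed sum of per-class OOB errors
set.seed(1)
perm.vals <- rnorm(99, mean = 1, sd = 0.1)   # stand-in for permuted statistics

# The test rejects for small errors, so count permuted values <= observed.
# The +1 terms treat the observed value as one of the permutations.
p.perm <- (sum(perm.vals <= obs) + 1) / (length(perm.vals) + 1)

# Normal approximation based on mean and variance of the permuted values
p.norm <- pnorm(obs, mean = mean(perm.vals), sd = sqrt(var(perm.vals)))

print(c(permutation = p.perm, normal = p.norm))
```

With few permutations the permutation p value cannot fall below 1 / (n.perm + 1), which is one reason to choose a considerably larger number of permutations in practice.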
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | Yes | No |
Hediger, S., Michel, L., Näf, J. (2022). On the use of random forest for two-sample testing. Computational Statistics & Data Analysis, 170, 107435, doi:10.1016/j.csda.2022.107435.
Hediger, S., Michel, L., Näf, J. (2021). hypoRF: Random Forest Two-Sample Tests. R package version 1.0.0, https://CRAN.R-project.org/package=hypoRF.
Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149.
ranger, C2ST, YMRZL
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform random forest based test (low number of permutations due to runtime,
# should be chosen considerably higher in practice)
if (requireNamespace("hypoRF", quietly = TRUE)) {
  HMN(X1, X2, n.perm = 10)
}