RItest: Multisample RI Test
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

RItest

R Documentation

Multisample RI Test

Description

Performs the (modified/multiscale/aggregated) RI test (Paul et al., 2021). The implementation is based on the RItest, MTRItest, and ARItest implementations from the HDLSSkST package.

Usage

RItest(X1, X2, ..., n.clust, randomization = TRUE, version = "original", 
        mult.test = "Holm", kmax = 2 * n.clust, s.psi = 1, s.h = 1, 
        lb = 1, n.perm = 1/alpha, alpha = 0.05, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`...`	Optionally more datasets as matrices or data.frames
`n.clust`	Number of clusters (only applicable for `version = "original"`).
`randomization`	Should a randomized test be performed? (default: `TRUE`, a randomized test is performed)
`version`	Which version of the test should be performed? Possible options are `"original"` (default) for the RI test, `"modified"` for the MRI test (number of clusters is estimated), `"multiscale"` for the MSRI test (all numbers of clusters up to `kmax` are tried and results are summarized), `"aggregated-knw"` (all pairwise comparisons are tested with the RI test and results are aggregated), and `"aggregated-est"` (all pairwise comparisons are tested with the MRI test and results are aggregated).
`mult.test`	Multiple testing adjustment for the ARI test and the MSRI test. Possible options are `"Holm"` (default) and `"BenHoch"` (Benjamini-Hochberg).
`kmax`	Maximum number of clusters to try for estimating the number of clusters (default: `2*n.clust`).
`s.psi`	Numeric code for the function `\psi` used in calculating the distance for `K`-means clustering. The value `1` corresponds to `\psi(t) = t^2` (the default), `2` corresponds to `\psi(t) = 1 - \exp(-t)`, `3` corresponds to `\psi(t) = 1 - \exp(-t^2)`, `4` corresponds to `\psi(t) = \log(1 + t)`, and `5` corresponds to `\psi(t) = t`.
`s.h`	Numeric code for the function `h` used in calculating the distance for `K`-means clustering. The value `1` corresponds to `h(t) = \sqrt{t}` (the default), and `2` corresponds to `h(t) = t`.
`lb`	Length of smaller vectors into which each observation is partitioned (default: 1).
`n.perm`	Number of permutations used to simulate the test statistic (default: `1/alpha`, which is the minimum number required for running the test; set a higher value for meaningful test results).
`alpha`	Test level (default: 0.05).
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.

Details

The tests are intended for the high dimension low sample size (HDLSS) setting. The idea is to cluster the pooled sample using a clustering algorithm that is suitable for the HDLSS setting and then to compare the clustering to the true dataset membership using the Rand index. For the original RI test, the number of clusters has to be specified. If no number is specified it is set to the number of samples. This is a reasonable number of clusters in many cases.

However, in some cases, a different number of clusters might be needed. For example, if the distributions in the datasets are multimodal, there might be multiple clusters within each dataset. Therefore, the modified (MRI) test estimates the number of clusters from the data.

If the number of clusters is unclear, the multiscale (MSRI) test can be applied. It calculates the test for each number of clusters up to kmax and then summarizes the test results using the multiple testing adjustment chosen via mult.test.

These three tests take into account all samples simultaneously. The aggregated (ARI) test instead performs all pairwise RI or MRI tests on the samples and aggregates those results by taking the minimum test statistic value and applying a multiple testing procedure.
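The multiple testing adjustments named in mult.test can be illustrated with base R's p.adjust. This is only a sketch: the p values below are made-up numbers for illustration, and rejecting overall when any adjusted p value falls below alpha is an assumption about how individual results could be combined, not a description of the package's internal procedure.

```r
# Hypothetical p values from tests with k = 2, ..., 5 clusters (illustrative only)
p_per_k <- c(k2 = 0.030, k3 = 0.004, k4 = 0.200, k5 = 0.012)
alpha <- 0.05

# Holm (1979) step-down adjustment and Benjamini-Hochberg (1995) adjustment
p_holm <- p.adjust(p_per_k, method = "holm")
p_bh <- p.adjust(p_per_k, method = "BH")

# Assumed combination rule: reject overall if any adjusted p value is <= alpha
reject_holm <- any(p_holm <= alpha)
reject_bh <- any(p_bh <= alpha)

print(rbind(raw = p_per_k, holm = p_holm, BH = p_bh))
```

Holm controls the family-wise error rate, while Benjamini-Hochberg controls the false discovery rate and is typically less conservative, as the smaller adjusted values in the last row suggest.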

For clustering, a K-means algorithm using the generalized version of the Mean Absolute Difference of Distances (MADD) (Sarkar and Ghosh, 2020) is applied. The MADD is defined as

\rho_{h,\psi}(z_i, z_j) = \frac{1}{N-2} \sum_{m\in \{1,\dots, N\}\setminus\{i,j\}} \left| \varphi_{h,\psi}(z_i, z_m) - \varphi_{h,\psi}(z_j, z_m)\right|,

where z_i \in\mathbb{R}^p, i = 1,\dots,N, denote points from the pooled sample and

\varphi_{h,\psi}(z_i, z_j) = h\left(\frac{1}{p}\sum_{l=1}^p\psi\left(|z_{il} - z_{jl}|\right)\right),

with h:\mathbb{R}^{+} \to\mathbb{R}^{+} and \psi:\mathbb{R}^{+} \to\mathbb{R}^{+} continuous and strictly increasing functions. The functions h and \psi can be chosen via s.h and s.psi, respectively.
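To make the MADD definition concrete, the following base R sketch computes the dissimilarity for every pair of rows of a pooled data matrix. The function names phi_matrix and madd_matrix are hypothetical helpers written for this illustration and are not part of DataSimilarity or HDLSSkST; the defaults \psi(t) = t^2 and h(t) = \sqrt{t} mirror s.psi = 1 and s.h = 1.

```r
# phi_{h,psi}(z_i, z_j) = h( (1/p) * sum_l psi(|z_il - z_jl|) ) for all pairs of rows
phi_matrix <- function(Z, psi = function(t) t^2, h = sqrt) {
  N <- nrow(Z)
  out <- matrix(0, N, N)
  for (i in seq_len(N)) {
    for (j in seq_len(N)) {
      out[i, j] <- h(mean(psi(abs(Z[i, ] - Z[j, ]))))
    }
  }
  out
}

# MADD: average absolute difference of the phi-distances of z_i and z_j
# to all remaining points z_m, m not in {i, j}
madd_matrix <- function(Z, psi = function(t) t^2, h = sqrt) {
  N <- nrow(Z)
  stopifnot(N >= 3)
  phi <- phi_matrix(Z, psi, h)
  rho <- matrix(0, N, N)
  for (i in seq_len(N)) {
    for (j in seq_len(N)) {
      if (i == j) next
      others <- setdiff(seq_len(N), c(i, j))
      rho[i, j] <- sum(abs(phi[i, others] - phi[j, others])) / (N - 2)
    }
  }
  rho
}

# Toy pooled sample: 5 + 5 observations in 200 dimensions (HDLSS-like setting)
set.seed(1)
Z <- rbind(matrix(rnorm(5 * 200), nrow = 5),
           matrix(rnorm(5 * 200, mean = 1), nrow = 5))
rho <- madd_matrix(Z)
round(rho[1:3, c(1:3, 6:8)], 3)
```

The double loop favors readability over speed; for the small sample sizes typical of the HDLSS setting this is usually acceptable.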

In all cases, high values of the test statistic correspond to similarity between the datasets. Therefore, the null hypothesis of equal distributions is rejected for low values.
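The comparison of a clustering with the true dataset membership can be sketched in base R. The helper rand_index below is hypothetical and implements the classic agreement form of the Rand index (Rand, 1971), that is, the share of observation pairs treated the same way by both partitions. Since low values of the test statistic lead to rejection here, the statistic used by the test is presumably oriented as a disagreement measure; the line computing 1 minus the agreement reflects that assumption only and is not taken from the package code.

```r
# Classic Rand index: proportion of pairs on which two partitions agree
# (both place the pair together, or both keep it apart)
rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b), length(labels_a) >= 2)
  same_a <- outer(labels_a, labels_a, "==")
  same_b <- outer(labels_b, labels_b, "==")
  pairs <- upper.tri(same_a)  # each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

membership <- rep(1:2, each = 4)                 # true dataset membership
clusters_perfect <- rep(c("a", "b"), each = 4)   # clustering recovers the datasets
clusters_mixed <- rep(c("a", "b"), times = 4)    # clustering unrelated to the datasets

ri_perfect <- rand_index(membership, clusters_perfect)
ri_mixed <- rand_index(membership, clusters_mixed)

# Assumed orientation: disagreement proportion, small when the datasets separate well
disagreement_perfect <- 1 - ri_perfect
disagreement_mixed <- 1 - ri_mixed
```

Under this assumed orientation, clearly separated datasets give a disagreement close to 0, which matches the rule of rejecting the null hypothesis for low values.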

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`p.value`	Asymptotic p value
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names
`est.cluster.label`	The estimated cluster label (not for ARI and MSRI)
`observed.cont.table`	The observed contingency table of dataset membership and estimated cluster label (not for ARI)
`crit.value`	The critical value of the test (not for MSRI)
`random.gamma`	The randomization constant of the test (not for MSRI)
`decision`	The (overall) test decision
`decision.per.k`	The test decisions of all individual tests (only for MSRI)
`est.cluster.no`	The estimated number of clusters (not for MSRI)
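The crit.value and random.gamma components relate to the randomized version of the test. As a rough illustration of how a critical value and a randomization constant can make a lower-tail test attain level alpha exactly when the null distribution is discrete, here is a textbook-style sketch in base R. The functions randomized_rule and decide and the toy null distribution are hypothetical; the package may compute these quantities differently.

```r
# Find critical value c and constant gamma such that
# P(T < c) + gamma * P(T = c) = alpha under the (simulated) null distribution
randomized_rule <- function(null_stats, alpha = 0.05) {
  vals <- sort(unique(null_stats))
  cdf <- vapply(vals, function(v) mean(null_stats <= v), numeric(1))
  crit <- vals[which(cdf > alpha)[1]]    # smallest value whose CDF exceeds alpha
  p_below <- mean(null_stats < crit)
  p_at <- mean(null_stats == crit)
  list(crit = crit, gamma = (alpha - p_below) / p_at)
}

# Reject for values below crit; at crit itself, reject with probability gamma
decide <- function(stat, rule) {
  if (stat < rule$crit) return(TRUE)
  if (stat == rule$crit) return(runif(1) < rule$gamma)
  FALSE
}

# Toy null distribution with few distinct values (illustrative only)
null_stats <- c(rep(0.2, 2), rep(0.4, 3), rep(0.5, 15))
rule <- randomized_rule(null_stats, alpha = 0.05)

# Rejection probability under the null implied by the rule
size <- mean(null_stats < rule$crit) + rule$gamma * mean(null_stats == rule$crit)
```

Without randomization, a test on this toy distribution could never reject at level 0.05, because the smallest value already carries probability 0.1; randomizing at the critical value recovers the nominal level.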

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	Yes

Note

For version = "multiscale", the output is a list rather than an object of class htest, because there are multiple test statistic values and corresponding p values.

Note that the aggregated test cannot handle univariate data.

References

Paul, B., De, S. K. and Ghosh, A. K. (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data, Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66(336):846-850, doi:10.1080/01621459.1971.10482356

Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics, 65-70

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.1111/j.2517-6161.1995.tb02031.x

Sarkar, S. and Ghosh, A. K. (2020). On perfect clustering of high dimension, low sample size data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2257-2272, doi:10.1109/TPAMI.2019.2912599

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison, Statistics Surveys, 18, 163-298, doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
if(requireNamespace("HDLSSkST", quietly = TRUE)) {
  # Perform RI test 
  RItest(X1, X2, n.clust = 2)
  # Perform MRI test
  RItest(X1, X2, version = "modified")
  # Perform MSRI
  RItest(X1, X2, version = "multiscale")
  # Perform ARI test 
  RItest(X1, X2, n.clust = 2, version = "aggregated-knw")
  RItest(X1, X2, version = "aggregated-est")
}

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.