RItest: Multisample RI Test

RItest    R Documentation

Multisample RI Test

Description

Performs the original, modified, multiscale, or aggregated RI test (Paul et al., 2021). The implementation is based on the RItest, MTRItest, and ARItest implementations from the HDLSSkST package.

Usage

RItest(X1, X2, ..., n.clust, randomization = TRUE, version = "original", 
        mult.test = "Holm", kmax = 2 * n.clust, s.psi = 1, s.h = 1, 
        lb = 1, n.perm = 1/alpha, alpha = 0.05, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

n.clust

Number of clusters (only applicable for version = "original").

randomization

Should a randomized test be performed? (default: TRUE, randomized test is performed)

version

Which version of the test should be performed? Possible options are "original" (default) for the RI test, "modified" for the MRI test (number of clusters is estimated), "multiscale" for the MSRI test (all numbers of clusters up to kmax are tried and results are summarized), "aggregated-knw" (all pairwise comparisons are tested with the RI test and results are aggregated), and "aggregated-est" (all pairwise comparisons are tested with the MRI test and results are aggregated).

mult.test

Multiple testing adjustment for the ARI test and the MSRI test. Possible options are "Holm" (default) and "BenHoch".

kmax

Maximum number of clusters to try for estimating the number of clusters (default: 2*n.clust).

s.psi

Numeric code for function required for calculating the distance for K-means clustering. The value 1 corresponds to \psi(t) = t^2 (the default), 2 corresponds to \psi(t) = 1 - \exp(-t), 3 corresponds to \psi(t) = 1 - \exp(-t^2), 4 corresponds to \psi(t) = \log(1 + t), 5 corresponds to \psi(t) = t.

s.h

Numeric code for function required for calculating the distance for K-means clustering. The value 1 corresponds to h(t) = \sqrt{t} (the default), and 2 corresponds to h(t) = t.

lb

Length of smaller vectors into which each observation is partitioned (default: 1).

n.perm

Number of simulations of the test statistic (default: 1/alpha, the minimum number required for running the test; set to a higher value for meaningful test results).

alpha

Test level (default: 0.05).

seed

Random seed (default: 42)
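
To see why 1/alpha is only the minimum for n.perm, consider the resolution of a simulation-based p value. The sketch below is plain arithmetic and assumes (this is not confirmed by this page) that the p value is a proportion out of n.perm simulated statistics; p_resolution is a hypothetical helper and not part of the package.

```r
alpha <- 0.05
n_perm_default <- 1 / alpha  # default number of simulations, as in Usage

# Assumption: the p value is a proportion out of n.perm simulated
# statistics, so it can only take multiples of 1 / n.perm.
p_resolution <- function(n.perm) 1 / n.perm

p_resolution(n_perm_default)  # smallest nonzero p value with the default
p_resolution(1000)            # much finer resolution with more simulations
```

Under this assumption, the default setting yields a smallest nonzero p value equal to alpha itself, which is why a larger n.perm is advisable in practice.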

Details

The tests are intended for the high dimension low sample size (HDLSS) setting. The idea is to cluster the pooled sample using a clustering algorithm that is suitable for the HDLSS setting and then to compare the clustering to the true dataset membership using the Rand index. For the original RI test, the number of clusters has to be specified. If no number is specified it is set to the number of samples. This is a reasonable number of clusters in many cases.
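
The comparison step can be illustrated in base R. The sketch below computes the classical Rand index (Rand, 1971), i.e. the proportion of observation pairs on which two partitions agree, together with its complement. The helpers rand_agreement and rand_disagreement are hypothetical and not part of the package, and kmeans on raw Euclidean distances merely stands in for the HDLSS-suited clustering. Which orientation of the index RItest uses internally is not stated on this page.

```r
# Proportion of observation pairs on which two labelings agree (Rand, 1971).
rand_agreement <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  same_a <- outer(a, a, "==")   # TRUE if a pair shares a label in a
  same_b <- outer(b, b, "==")   # TRUE if a pair shares a label in b
  upper <- upper.tri(same_a)    # count each unordered pair once
  mean(same_a[upper] == same_b[upper])
}

# Complement: proportion of pairs on which the labelings disagree.
rand_disagreement <- function(a, b) 1 - rand_agreement(a, b)

set.seed(1)
X1 <- matrix(rnorm(40), ncol = 4)
X2 <- matrix(rnorm(40, mean = 3), ncol = 4)
pooled <- rbind(X1, X2)
membership <- rep(1:2, times = c(nrow(X1), nrow(X2)))

# Stand-in clustering (assumption: plain k-means instead of MADD-based k-means)
clusters <- kmeans(pooled, centers = 2, nstart = 10)$cluster

rand_agreement(membership, clusters)
rand_disagreement(membership, clusters)
```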

However, in some cases, a different number of clusters might be needed. For example, if the distributions underlying the datasets are multimodal, there might be multiple clusters within each dataset. Therefore, the modified (MRI) test allows the number of clusters to be estimated from the data.

If the number of clusters is unclear, the multiscale (MSRI) test can be applied. It calculates the test for each number of clusters up to kmax and then summarizes the test results using an adjustment for multiple testing.

These three tests take into account all samples simultaneously. The aggregated (ARI) test instead performs all pairwise RI or MRI tests on the samples and aggregates those results by taking the minimum test statistic value and applying a multiple testing procedure.
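
The pairwise structure and the multiple testing step can be sketched with base R. The p values below are made up for illustration, and mapping "Holm" and "BenHoch" to the p.adjust methods "holm" and "BH" is an assumption about what the options mean, not a description of the package internals.

```r
dataset_names <- c("X1", "X2", "X3")

# One column per pairwise comparison of the datasets
pairs <- combn(dataset_names, 2)

# Made-up raw p values, one per pair, purely for illustration
p_raw <- c(0.010, 0.040, 0.300)
names(p_raw) <- apply(pairs, 2, paste, collapse = " vs ")

alpha <- 0.05
p_holm <- p.adjust(p_raw, method = "holm")  # Holm (1979)
p_bh <- p.adjust(p_raw, method = "BH")      # Benjamini and Hochberg (1995)

# Assumed overall rule: reject if any adjusted pairwise p value is below alpha
reject_overall <- any(p_holm < alpha)
```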

For clustering, a K-means algorithm using the generalized version of the Mean Absolute Difference of Distances (MADD) (Sarkar and Ghosh, 2020) is applied. The MADD is defined as

\rho_{h,\psi}(z_i, z_j) = \frac{1}{N-2} \sum_{m\in \{1,\dots, N\}\setminus\{i,j\}} \left| \varphi_{h,\psi}(z_i, z_m) - \varphi_{h,\psi}(z_j, z_m)\right|,

where z_i \in\mathbb{R}^p, i = 1,\dots,N, denote points from the pooled sample and

\varphi_{h,\psi}(z_i, z_j) = h\left(\frac{1}{p}\sum_{l=1}^p\psi\left(|z_{il} - z_{jl}|\right)\right),

with h:\mathbb{R}^{+} \to\mathbb{R}^{+} and \psi:\mathbb{R}^{+} \to\mathbb{R}^{+} continuous and strictly increasing functions. The functions h and \psi can be chosen via s.h and s.psi, respectively.
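
The two formulas translate directly into base R. The following is a minimal sketch, not the package's implementation: psi_funs, h_funs, phi_matrix, and madd_matrix are hypothetical names, and the numeric codes follow the descriptions of s.psi and s.h in the Arguments section.

```r
# Candidate functions indexed by the numeric codes of s.psi and s.h
psi_funs <- list(
  function(t) t^2,
  function(t) 1 - exp(-t),
  function(t) 1 - exp(-t^2),
  function(t) log(1 + t),
  function(t) t
)
h_funs <- list(sqrt, function(t) t)

# phi_{h,psi}(z_i, z_j) for all pairs of rows of the pooled sample Z
phi_matrix <- function(Z, s.psi = 1, s.h = 1) {
  psi <- psi_funs[[s.psi]]
  h <- h_funs[[s.h]]
  N <- nrow(Z)
  phi <- matrix(0, N, N)
  for (i in seq_len(N)) {
    for (j in seq_len(N)) {
      # average psi over the p coordinates, then apply h
      phi[i, j] <- h(mean(psi(abs(Z[i, ] - Z[j, ]))))
    }
  }
  phi
}

# MADD rho(z_i, z_j): mean absolute difference of phi over all other points
madd_matrix <- function(Z, s.psi = 1, s.h = 1) {
  N <- nrow(Z)
  stopifnot(N >= 3)
  phi <- phi_matrix(Z, s.psi, s.h)
  rho <- matrix(0, N, N)
  for (i in seq_len(N - 1)) {
    for (j in (i + 1):N) {
      others <- setdiff(seq_len(N), c(i, j))  # the N - 2 remaining points
      rho[i, j] <- rho[j, i] <- mean(abs(phi[i, others] - phi[j, others]))
    }
  }
  rho
}

set.seed(1)
Z <- rbind(matrix(rnorm(30), ncol = 10), matrix(rnorm(30, mean = 1), ncol = 10))
round(madd_matrix(Z), 3)
```

With the defaults s.psi = 1 and s.h = 1, \varphi reduces to the Euclidean distance scaled by 1/\sqrt{p}.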

In all cases, high values of the test statistic correspond to similarity between the datasets. Therefore, the null hypothesis of equal distributions is rejected for low values.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Asymptotic p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

est.cluster.label

The estimated cluster label (not for ARI and MSRI)

observed.cont.table

The observed contingency table of dataset membership and estimated cluster label (not for ARI)

crit.value

The critical value of the test (not for MSRI)

random.gamma

The randomization constant of the test (not for MSRI)

decision

The (overall) test decision

decision.per.k

The test decisions of all individual tests (only for MSRI)

est.cluster.no

The estimated number of clusters (not for MSRI)

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             Yes

Note

In case of version = "multiscale" the output is a list object and not of class htest as there are multiple test statistic values and corresponding p values.

Note that the aggregated test cannot handle univariate data.
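
Because the return type depends on version, downstream code may want to branch on the class of the result. The sketch below uses a hypothetical helper, summarise_result, and hand-built stand-in objects instead of real RItest output, so the component values are invented.

```r
# Hypothetical helper: handle both an htest object and a plain list
summarise_result <- function(res) {
  if (inherits(res, "htest")) {
    c(statistic = unname(res$statistic), p.value = res$p.value)
  } else {
    names(res)  # for a plain list, just report which components exist
  }
}

# Stand-ins for illustration only (not real RItest output)
htest_like <- structure(list(statistic = c(RI = 0.4), p.value = 0.2),
                        class = "htest")
list_like <- list(statistic = c(0.4, 0.5), decision.per.k = c(FALSE, TRUE))

summarise_result(htest_like)
summarise_result(list_like)
```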

References

Paul, B., De, S. K. and Ghosh, A. K. (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data, Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66(336):846-850, doi:10.1080/01621459.1971.10482356

Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics, 65-70

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.1111/j.2517-6161.1995.tb02031.x

Sarkar, S. and Ghosh, A. K. (2020). On perfect clustering of high dimension, low sample size data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:2257-2272, doi:10.1109/TPAMI.2019.2912599

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison, Statistics Surveys, 18:163-298, doi:10.1214/24-SS149

See Also

FStest

Examples

# Draw some data (seed fixed so that the simulated data are reproducible)
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
if (requireNamespace("HDLSSkST", quietly = TRUE)) {
  # Perform RI test
  RItest(X1, X2, n.clust = 2)
  # Perform MRI test
  RItest(X1, X2, version = "modified")
  # Perform MSRI test
  RItest(X1, X2, version = "multiscale")
  # Perform ARI test with known number of clusters
  RItest(X1, X2, n.clust = 2, version = "aggregated-knw")
  # Perform ARI test with estimated number of clusters
  RItest(X1, X2, version = "aggregated-est")
}
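
Further datasets can be passed through the ... argument. The following sketch compares three samples and raises n.perm above its default, as the argument description recommends. It assumes that both the DataSimilarity and the HDLSSkST packages are installed; the third dataset X3 and the choice n.perm = 100 are illustrative.

```r
set.seed(123)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
X3 <- matrix(rnorm(1000, mean = 1), ncol = 10)

if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("HDLSSkST", quietly = TRUE)) {
  # Original RI test on three samples with one cluster per sample
  res <- DataSimilarity::RItest(X1, X2, X3, n.clust = 3, n.perm = 100)
  print(res$p.value)   # p value component of the htest object
  print(res$decision)  # overall test decision
}
```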

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.