RItest | R Documentation |
Performs the (modified/multiscale/aggregated) RI test (Paul et al., 2021). The implementation is based on the RItest, MTRItest, and ARItest implementations from the HDLSSkST package.
RItest(X1, X2, ..., n.clust, randomization = TRUE, version = "original",
mult.test = "Holm", kmax = 2 * n.clust, s.psi = 1, s.h = 1,
lb = 1, n.perm = 1/alpha, alpha = 0.05, seed = 42)
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
... |
Optionally more datasets as matrices or data.frames |
n.clust |
Number of clusters (only applicable for |
randomization |
Should a randomized test be performed? (default: |
version |
Which version of the test should be performed? Possible options are |
mult.test |
Multiple testing adjustment for ARI test and MSRI test. Possible options are |
kmax |
Maximum number of clusters to try for estimating the number of clusters (default: |
s.psi |
Numeric code for function required for calculating the distance for |
s.h |
Numeric code for function required for calculating the distance for |
lb |
Length of smaller vectors into which each observation is partitioned (default: 1). |
n.perm |
Number of simulations of the test statistic (default: 1/alpha, the minimum number required for running the test; set to a higher value for meaningful test results). |
alpha |
Test level (default: 0.05). |
seed |
Random seed (default: 42) |
The tests are intended for the high dimension, low sample size (HDLSS) setting. The idea is to cluster the pooled sample using a clustering algorithm that is suitable for the HDLSS setting and then to compare the clustering to the true dataset membership using the Rand index. For the original RI test, the number of clusters has to be specified. If no number is specified, it is set to the number of samples, which is a reasonable number of clusters in many cases.
However, in some cases, a different number of clusters might be needed. For example, if the distributions in the datasets are multimodal, there might be multiple clusters within each dataset. Therefore, the modified (MRI) test allows the number of clusters to be estimated from the data.
If the number of clusters is very unclear, the multiscale (MSRI) test can be applied, which calculates the test for each number of clusters up to kmax and then summarizes the test results using an adjustment for multiple testing.
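As a rough illustration of the multiple testing step, the following sketch adjusts a set of per-cluster-number p values with base R's p.adjust. The p values are invented for illustration, and the rule "reject overall if any adjusted p value is at most alpha" is an assumption about how such results are typically summarized, not a description of the package internals.

```r
# Hypothetical p values, one per candidate number of clusters (invented numbers)
p_per_k <- c(0.030, 0.004, 0.200, 0.012)
alpha <- 0.05

# Holm (1979) step-down adjustment from base R
p_holm <- p.adjust(p_per_k, method = "holm")

# Benjamini-Hochberg (1995) adjustment, for comparison
p_bh <- p.adjust(p_per_k, method = "BH")

# Assumed summary rule: reject overall if any adjusted p value is <= alpha
reject_overall <- any(p_holm <= alpha)
```

Holm controls the family-wise error rate, whereas Benjamini-Hochberg controls the false discovery rate, so the latter is never more conservative on the same p values.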
These three tests take into account all samples simultaneously. The aggregated (ARI) test instead performs all pairwise RI or MRI tests on the samples and aggregates those results by taking the minimum test statistic value and applying a multiple testing procedure.
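The comparison of a clustering with the true dataset membership can be sketched with the classical Rand index (Rand, 1971), i.e. the proportion of observation pairs on which two labelings agree. The helper rand_index below is a hypothetical name written for this illustration; the statistic used by the test may be scaled or oriented differently.

```r
# Classical Rand index: share of pairs treated the same way by both labelings
# (both "same group" or both "different group"). Hypothetical helper.
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  same_a <- outer(a, a, "==")    # TRUE where labeling a puts i and j together
  same_b <- outer(b, b, "==")    # TRUE where labeling b puts i and j together
  pairs <- upper.tri(same_a)     # each unordered pair i < j once
  mean(same_a[pairs] == same_b[pairs])
}

# Two datasets with three observations each
membership <- rep(c(1, 2), each = 3)

ri_perfect  <- rand_index(membership, c(1, 1, 1, 2, 2, 2))  # clusters match datasets
ri_switched <- rand_index(membership, c(2, 2, 2, 1, 1, 1))  # same partition, labels swapped
ri_mixed    <- rand_index(membership, c(1, 2, 1, 2, 1, 2))  # clusters cut across datasets
```

The index depends only on the partitions and not on the label names, which is why swapping the cluster labels leaves it unchanged.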
For clustering, a K-means algorithm using the generalized version of the Mean Absolute Difference of Distances (MADD) (Sarkar and Ghosh, 2020) is applied.
The MADD is defined as
\rho_{h,\psi}(z_i, z_j) = \frac{1}{N-2} \sum_{m\in \{1,\dots, N\}\setminus\{i,j\}} \left| \varphi_{h,\psi}(z_i, z_m) - \varphi_{h,\psi}(z_j, z_m)\right|,
where z_i \in\mathbb{R}^p, i = 1,\dots,N, denote points from the pooled sample and
\varphi_{h,\psi}(z_i, z_j) = h\left(\frac{1}{p}\sum_{l=1}^p\psi\left(|z_{il} - z_{jl}|\right)\right),
with h:\mathbb{R}^{+} \to\mathbb{R}^{+} and \psi:\mathbb{R}^{+} \to\mathbb{R}^{+} continuous and strictly increasing functions.
The functions h and \psi can be set via changing s.h and s.psi.
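The MADD definition can be written out directly in R. This is a minimal sketch, not the package implementation: the function name madd is hypothetical, and the defaults h = sqrt and psi = t^2 are chosen only for illustration (they make \varphi a scaled Euclidean distance) and are not claimed to correspond to any particular value of s.h or s.psi.

```r
# Generalized MADD between all rows of Z (N x p matrix, N >= 3).
# h and psi are user-supplied increasing functions; defaults are illustrative.
madd <- function(Z, h = sqrt, psi = function(t) t^2) {
  N <- nrow(Z)
  stopifnot(N >= 3)
  # phi[i, j] = h( (1/p) * sum_l psi(|z_il - z_jl|) )
  phi <- matrix(0, N, N)
  for (i in seq_len(N)) {
    for (j in seq_len(N)) {
      phi[i, j] <- h(mean(psi(abs(Z[i, ] - Z[j, ]))))
    }
  }
  # rho[i, j] = average over m not in {i, j} of |phi[i, m] - phi[j, m]|
  rho <- matrix(0, N, N)
  for (i in seq_len(N - 1)) {
    for (j in (i + 1):N) {
      m <- setdiff(seq_len(N), c(i, j))
      rho[i, j] <- rho[j, i] <- sum(abs(phi[i, m] - phi[j, m])) / (N - 2)
    }
  }
  rho
}

# Tiny example: rows 1 and 2 are identical, row 3 is far away
Z <- rbind(c(0, 0), c(0, 0), c(3, 4))
rho <- madd(Z)
```

Two observations get a small MADD when they have similar distances to all remaining points, so the identical rows in the example have dissimilarity zero.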
In all cases, high values of the test statistic correspond to similarity between the datasets. Therefore, the null hypothesis of equal distributions is rejected for low values.
An object of class htest
with the following components:
statistic |
Observed value of the test statistic |
p.value |
Asymptotic p value |
alternative |
The alternative hypothesis |
method |
Description of the test |
data.name |
The dataset names |
est.cluster.label |
The estimated cluster label (not for ARI and MSRI) |
observed.cont.table |
The observed contingency table of dataset membership and estimated cluster label (not for ARI) |
crit.value |
The critical value of the test (not for MSRI) |
random.gamma |
The randomization constant of the test (not for MSRI) |
decision |
The (overall) test decision |
decision.per.k |
The test decisions of all individual tests (only for MSRI) |
est.cluster.no |
The estimated number of clusters (not for MSRI) |
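To illustrate how a critical value and a randomization constant such as crit.value and random.gamma usually interact, the following sketch builds a lower-tailed randomized test from a discrete null distribution. It follows the textbook construction and assumes that the null hypothesis is rejected for small values; the function names and the null distribution are hypothetical, and the exact inequality conventions used by the package are not stated in this help page.

```r
# Textbook randomized test, lower-tailed (hypothetical helpers).
# null_stats: simulated values of the statistic under the null hypothesis.
randomized_constants <- function(null_stats, alpha) {
  vals <- sort(unique(null_stats))
  p_below_each <- sapply(vals, function(v) mean(null_stats < v))  # P(T < v)
  crit <- max(vals[p_below_each <= alpha])  # largest v with P(T < v) <= alpha
  p_below <- mean(null_stats < crit)
  p_at <- mean(null_stats == crit)
  # gamma makes the overall rejection probability exactly alpha
  list(crit = crit, gamma = (alpha - p_below) / p_at)
}

# Reject below crit; at crit, reject with probability gamma
randomized_decision <- function(stat, crit, gamma, u = runif(1)) {
  stat < crit || (stat == crit && u < gamma)
}

# Invented discrete null distribution
null_stats <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
const <- randomized_constants(null_stats, alpha = 0.25)
size <- mean(null_stats < const$crit) + const$gamma * mean(null_stats == const$crit)
```

Without the randomization step, a test based on a discrete null distribution generally cannot attain the level alpha exactly and becomes conservative.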
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | Yes |
In case of version = "multiscale" the output is a list object and not of class htest as there are multiple test statistic values and corresponding p values.
Note that the aggregated test cannot handle univariate data.
Paul, B., De, S. K. and Ghosh, A. K. (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data, Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66(336):846-850, doi:10.1080/01621459.1971.10482356
Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics, 6(2):65-70
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.1111/j.2517-6161.1995.tb02031.x
Sarkar, S. and Ghosh, A. K. (2020). On Perfect Clustering of High Dimension, Low Sample Size Data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:2257-2272, doi:10.1109/TPAMI.2019.2912599
Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison, Statistics Surveys, 18:163-298, doi:10.1214/24-SS149
FStest
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
if (requireNamespace("HDLSSkST", quietly = TRUE)) {
  # Perform RI test
  RItest(X1, X2, n.clust = 2)
  # Perform MRI test
  RItest(X1, X2, version = "modified")
  # Perform MSRI test
  RItest(X1, X2, version = "multiscale")
  # Perform ARI test with known and with estimated number of clusters
  RItest(X1, X2, n.clust = 2, version = "aggregated-knw")
  RItest(X1, X2, version = "aggregated-est")
}