FStest | R Documentation |
Performs the (modified/multiscale/aggregated) FS test (Paul et al., 2021). The implementation is based on the FStest, MTFStest, and AFStest implementations from the HDLSSkST package.
FStest(X1, X2, ..., n.clust, randomization = TRUE, version = "original",
mult.test = "Holm", kmax = 2 * n.clust, s.psi = 1, s.h = 1,
lb = 1, n.perm = 1/alpha, alpha = 0.05, seed = 42)
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
... |
Optionally more datasets as matrices or data.frames |
n.clust |
Number of clusters (only applicable for |
randomization |
Should a randomized test be performed? (default: |
version |
Which version of the test should be performed? Possible options are |
mult.test |
Multiple testing adjustment for AFS test and MSFS test. Possible options are |
kmax |
Maximum number of clusters to try for estimating the number of clusters (default: |
s.psi |
Numeric code for function required for calculating the distance for |
s.h |
Numeric code for function required for calculating the distance for |
lb |
Length of smaller vectors into which each observation is partitioned (default: 1). |
n.perm |
Number of simulations of the test statistic (default: 1/alpha, minimum number required for running the test, set to a higher value for meaningful test results). |
alpha |
Test level (default: 0.05). |
seed |
Random seed (default: 42) |
The tests are intended for the high dimension low sample size (HDLSS) setting. The idea is to cluster the pooled sample using a clustering algorithm that is suitable for the HDLSS setting, to compare the resulting clustering to the true dataset membership, and to test for dependence between the two using a generalized Fisher test on the contingency table of cluster label and dataset membership. For the original FS test, the number of clusters has to be specified. If n.clust is not specified, it is set to the number of samples (i.e., datasets), which is a reasonable number of clusters in many cases.
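The idea described above can be sketched in base R. This is only an illustration under simplifying assumptions, not the procedure used by FStest: ordinary kmeans stands in for the MADD-based clustering, and fisher.test stands in for the randomized generalized Fisher test. The objects A, B, pooled, membership, cl, and tab are hypothetical names introduced for this sketch.

```r
set.seed(1)

# Two small samples with many variables (toy HDLSS-like data)
A <- matrix(rnorm(10 * 200), nrow = 10)
B <- matrix(rnorm(10 * 200, mean = 1), nrow = 10)

# Pool the samples and remember the true dataset membership
pooled <- rbind(A, B)
membership <- factor(rep(c("A", "B"), times = c(nrow(A), nrow(B))))

# Stand-in clustering (assumption): ordinary k-means instead of the
# MADD-based algorithm, with as many clusters as datasets
cl <- kmeans(pooled, centers = 2, nstart = 10)$cluster

# Contingency table of dataset membership and estimated cluster label
tab <- table(membership, cluster = cl)
tab

# Test for dependence between membership and cluster label
# (stand-in for the generalized Fisher test; also works for r x c tables)
fisher.test(tab)$p.value
```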
However, in some cases, a different number of clusters might be needed. For example, if the distributions underlying the datasets are multimodal, there might be multiple clusters within each dataset. Therefore, the modified (MFS) test estimates the number of clusters from the data.
If the number of clusters is very unclear, the multiscale (MSFS) test can be applied. It performs the test for each number of clusters up to kmax and then combines the individual test results using an adjustment for multiple testing.
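As a rough illustration of combining several tests with a multiple testing adjustment, the following sketch applies the Holm and Benjamini-Hochberg adjustments via p.adjust from the stats package. The p values in p.per.k are made up for illustration, and how the package combines the individual results internally is not specified here, so this should be read as an assumption-laden sketch only.

```r
# Hypothetical p values of the individual tests for k = 2, ..., 6 clusters
p.per.k <- c("2" = 0.004, "3" = 0.020, "4" = 0.030, "5" = 0.200, "6" = 0.450)
alpha <- 0.05

# Adjust for multiple testing
p.holm <- p.adjust(p.per.k, method = "holm")
p.bh <- p.adjust(p.per.k, method = "BH")

# Individual decisions after adjustment
p.holm <= alpha
p.bh <= alpha

# Overall decision: reject if any adjusted p value is at most alpha
any(p.holm <= alpha)
any(p.bh <= alpha)
```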
These three tests take into account all samples simultaneously. The aggregated (AFS) test instead performs all pairwise FS or MFS tests on the samples and aggregates those results by taking the minimum test statistic value and applying a multiple testing procedure.
For clustering, a K-means algorithm using the generalized version of the Mean Absolute Difference of Distances (MADD) (Sarkar and Ghosh, 2020) is applied.
The MADD is defined as
\rho_{h,\varphi}(z_i, z_j) = \frac{1}{N-2} \sum_{m\in \{1,\dots, N\}\setminus\{i,j\}} \left| \varphi_{h,\psi}(z_i, z_m) - \varphi_{h,\psi}(z_j, z_m)\right|,
where z_i \in\mathbb{R}^p, i = 1,\dots,N, denote points from the pooled sample and
\varphi_{h,\psi}(z_i, z_j) = h\left(\frac{1}{p}\sum_{l=1}^p\psi\left(|z_{il} - z_{jl}|\right)\right),
with h:\mathbb{R}^{+} \to\mathbb{R}^{+} and \psi:\mathbb{R}^{+} \to\mathbb{R}^{+} continuous and strictly increasing functions.
The functions h and \psi can be set via s.h and s.psi, respectively.
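The MADD definition can be made concrete with a small base R sketch. The function madd below is a hypothetical helper written for illustration and is not part of the package; it uses the identity for both h and \psi by default, which is an assumption and does not necessarily match any particular value of s.h or s.psi.

```r
# MADD dissimilarity matrix for the rows of Z (requires at least 3 rows).
# h and psi default to the identity (assumption for this sketch).
madd <- function(Z, h = identity, psi = identity) {
  N <- nrow(Z)
  # phi[i, j] = h( (1/p) * sum_l psi(|z_il - z_jl|) )
  phi <- matrix(0, N, N)
  for (i in seq_len(N)) {
    for (j in seq_len(N)) {
      phi[i, j] <- h(mean(psi(abs(Z[i, ] - Z[j, ]))))
    }
  }
  # rho[i, j] = average of |phi(z_i, z_m) - phi(z_j, z_m)| over m != i, j
  rho <- matrix(0, N, N)
  for (i in seq_len(N - 1)) {
    for (j in (i + 1):N) {
      m <- setdiff(seq_len(N), c(i, j))
      rho[i, j] <- rho[j, i] <- sum(abs(phi[i, m] - phi[j, m])) / (N - 2)
    }
  }
  rho
}

set.seed(1)
# Toy pooled sample: 5 + 5 observations in 500 dimensions
Z <- rbind(matrix(rnorm(5 * 500), nrow = 5),
           matrix(rnorm(5 * 500, mean = 1), nrow = 5))
R <- madd(Z)
round(R, 2)
```

The resulting matrix is symmetric with a zero diagonal, so it could be passed to as.dist if a dissimilarity-based clustering method is to be tried on it.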
In all cases, high values of the test statistic correspond to similarity between the datasets. Therefore, the null hypothesis of equal distributions is rejected for low values.
An object of class htest with the following components:
statistic |
Observed value of the test statistic |
p.value |
Asymptotic p value |
alternative |
The alternative hypothesis |
method |
Description of the test |
data.name |
The dataset names |
est.cluster.label |
The estimated cluster label (not for AFS and MSFS) |
observed.cont.table |
The observed contingency table of dataset membership and estimated cluster label (not for AFS) |
crit.value |
The critical value of the test (not for MSFS) |
random.gamma |
The randomization constant of the test (not for MSFS) |
decision |
The (overall) test decision |
decision.per.k |
The test decisions of all individual tests (only for MSFS) |
est.cluster.no |
The estimated number of clusters (not for MSFS) |
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | No | Yes |
In case of version = "multiscale", the output is a list object and not of class htest, as there are multiple test statistic values and corresponding p values.
Note that the aggregated test cannot handle univariate data.
Paul, B., De, S. K. and Ghosh, A. K. (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data, Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897
Mehta, C. R. and Patel, N. R. (1983). A network algorithm for performing Fisher's exact test in r x c contingency tables, Journal of the American Statistical Association, 78(382):427-434, doi:10.2307/2288652
Holm, S. (1979). A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics, 65-70
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, doi:10.1111/j.2517-6161.1995.tb02031.x
Sarkar, S. and Ghosh, A. K. (2020). On Perfect Clustering of High Dimension, Low Sample Size Data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:2257-2272, doi:10.1109/TPAMI.2019.2912599
Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison, Statist. Surv., 18:163-298, doi:10.1214/24-SS149
RItest
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
if (requireNamespace("HDLSSkST", quietly = TRUE)) {
  # Perform FS test
  FStest(X1, X2, n.clust = 2)
  # Perform MFS test
  FStest(X1, X2, version = "modified")
  # Perform MSFS test
  FStest(X1, X2, version = "multiscale")
  # Perform AFS test
  FStest(X1, X2, n.clust = 2, version = "aggregated-knw")
  FStest(X1, X2, version = "aggregated-est")
}