FStest: Multisample FS Test

View source: R/FStest.R

FStest R Documentation

Multisample FS Test

Description

Performs the (modified/multiscale/aggregated) FS test (Paul et al., 2021). The implementation is based on the FStest, MTFStest, and AFStest implementations from the HDLSSkST package.

Usage

FStest(X1, X2, ..., n.clust, randomization = TRUE, version = "original", 
        mult.test = "Holm", kmax = 2 * n.clust, s.psi = 1, s.h = 1, 
        lb = 1, n.perm = 1/alpha, alpha = 0.05, seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

...

Optionally more datasets as matrices or data.frames

n.clust

Number of clusters (only applicable for version = "original" and version = "aggregated-knw").

randomization

Should a randomized test be performed? (default: TRUE, i.e., a randomized test is performed)

version

Which version of the test should be performed? Possible options are:
"original" (default): the FS test.
"modified": the MFS test (the number of clusters is estimated).
"multiscale": the MSFS test (all numbers of clusters up to kmax are tried and the results are summarized).
"aggregated-knw": all pairwise comparisons are tested with the FS test and the results are aggregated.
"aggregated-est": all pairwise comparisons are tested with the MFS test and the results are aggregated.

mult.test

Multiple testing adjustment for the AFS test and the MSFS test. Possible options are "Holm" (default) and "BenHoch" (Benjamini-Hochberg).

kmax

Maximum number of clusters to try for estimating the number of clusters (default: 2*n.clust).

s.psi

Numeric code for the function \psi used in calculating the distance for K-means clustering. The value 1 corresponds to \psi(t) = t^2 (the default), 2 corresponds to \psi(t) = 1 - \exp(-t), 3 corresponds to \psi(t) = 1 - \exp(-t^2), 4 corresponds to \psi(t) = \log(1 + t), and 5 corresponds to \psi(t) = t.

s.h

Numeric code for the function h used in calculating the distance for K-means clustering. The value 1 corresponds to h(t) = \sqrt{t} (the default), and 2 corresponds to h(t) = t.

lb

Length of smaller vectors into which each observation is partitioned (default: 1).

n.perm

Number of simulations of the test statistic (default: 1/alpha, the minimum number required for running the test; set to a higher value for meaningful test results).

alpha

Test level (default: 0.05).

seed

Random seed (default: 42)
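The s.psi and s.h codes above can be thought of as selecting the functions \psi and h listed in the argument descriptions. A base-R sketch of this mapping (the helper names get.psi and get.h are hypothetical and not part of the package API):

```r
# Hypothetical helpers illustrating the s.psi and s.h codes;
# these names are not part of the DataSimilarity API.
get.psi <- function(s.psi) {
  switch(s.psi,
         function(t) t^2,            # 1 (default)
         function(t) 1 - exp(-t),    # 2
         function(t) 1 - exp(-t^2),  # 3
         function(t) log(1 + t),     # 4
         function(t) t)              # 5
}
get.h <- function(s.h) {
  switch(s.h,
         function(t) sqrt(t),        # 1 (default)
         function(t) t)              # 2
}
psi <- get.psi(1)
h <- get.h(1)
psi(2)  # 4
h(4)    # 2
```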

Details

The tests are intended for the high dimension low sample size (HDLSS) setting. The idea is to cluster the pooled sample using a clustering algorithm that is suitable for the HDLSS setting, to compare the clustering to the true dataset membership, and to test for dependence using a generalized Fisher test on the contingency table of clustering and dataset membership. For the original FS test, the number of clusters has to be specified. If no number is specified, it is set to the number of samples, which is a reasonable choice in many cases.

However, in some cases, a different number of clusters might be needed. For example, in case of multimodal distributions, there might be multiple clusters within each dataset. Therefore, the modified (MFS) test estimates the number of clusters from the data.

When the appropriate number of clusters is entirely unclear, the multiscale (MSFS) test can be applied: it performs the test for each number of clusters up to kmax and then summarizes the test results using an adjustment for multiple testing.

These three tests take into account all samples simultaneously. The aggregated (AFS) test instead performs all pairwise FS or MFS tests on the samples and aggregates those results by taking the minimum test statistic value and applying a multiple testing procedure.
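The core idea of clustering the pooled sample and then testing the contingency table can be sketched in base R. This is only an illustration of the principle: it uses ordinary Euclidean k-means and stats::fisher.test instead of the MADD-based clustering and the generalized Fisher test that FStest actually applies.

```r
# Illustration of the clustering-then-contingency-table idea;
# not the FStest implementation (plain k-means, plain Fisher test).
set.seed(1)
X1 <- matrix(rnorm(200), ncol = 10)            # sample 1: 20 obs., 10 dims
X2 <- matrix(rnorm(200, mean = 2), ncol = 10)  # sample 2: shifted mean
pooled <- rbind(X1, X2)
membership <- rep(1:2, each = 20)              # true dataset labels
cl <- kmeans(pooled, centers = 2)$cluster      # cluster the pooled sample
tab <- table(membership, cl)                   # clustering vs. membership
p <- fisher.test(tab)$p.value                  # test for dependence
```

Under equal distributions, the clustering should be roughly independent of the dataset membership, so the contingency table shows no association; under a shift as above, clusters align with datasets and the p value becomes small.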

For clustering, a K-means algorithm using the generalized version of the Mean Absolute Difference of Distances (MADD) (Sarkar and Ghosh, 2020) is applied. The MADD is defined as

\rho_{h,\varphi}(z_i, z_j) = \frac{1}{N-2} \sum_{m\in \{1,\dots, N\}\setminus\{i,j\}} \left| \varphi_{h,\psi}(z_i, z_m) - \varphi_{h,\psi}(z_j, z_m)\right|,

where z_i \in\mathbb{R}^p, i = 1,\dots,N, denote points from the pooled sample and

\varphi_{h,\psi}(z_i, z_j) = h\left(\frac{1}{p}\sum_{l=1}^p \psi\left(|z_{il} - z_{jl}|\right)\right),

with h:\mathbb{R}^{+} \to\mathbb{R}^{+} and \psi:\mathbb{R}^{+} \to\mathbb{R}^{+} continuous and strictly increasing functions. The functions \psi and h can be chosen via the arguments s.psi and s.h, respectively.
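The MADD definition above translates directly into base R. A minimal sketch with the default choices \psi(t) = t^2 and h(t) = \sqrt{t} (this is not the package's implementation, which is optimized and embedded in the clustering):

```r
# Minimal sketch of the generalized MADD dissimilarity; defaults
# psi(t) = t^2, h(t) = sqrt(t). Not the package implementation.
madd <- function(Z, psi = function(t) t^2, h = function(t) sqrt(t)) {
  N <- nrow(Z)
  # varphi_{h,psi}(z_i, z_j) for all pairs: h(mean over coordinates of psi(|.|))
  phi <- matrix(0, N, N)
  for (i in 1:N) for (j in 1:N)
    phi[i, j] <- h(mean(psi(abs(Z[i, ] - Z[j, ]))))
  # rho_{h,phi}(z_i, z_j): mean absolute difference of distances to the
  # remaining N - 2 points
  rho <- matrix(0, N, N)
  for (i in 1:N) for (j in 1:N) {
    if (i == j) next
    m <- setdiff(1:N, c(i, j))
    rho[i, j] <- sum(abs(phi[i, m] - phi[j, m])) / (N - 2)
  }
  rho
}
Z <- matrix(rnorm(50), ncol = 5)
D <- madd(Z)
all(abs(D - t(D)) < 1e-12)  # TRUE: MADD is symmetric
```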

In all cases, high values of the test statistic correspond to similarity between the datasets. Therefore, the null hypothesis of equal distributions is rejected for low values.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Asymptotic p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

est.cluster.label

The estimated cluster label (not for AFS and MSFS)

observed.cont.table

The observed contingency table of dataset membership and estimated cluster label (not for AFS)

crit.value

The critical value of the test (not for MSFS)

random.gamma

The randomization constant of the test (not for MSFS)

decision

The (overall) test decision

decision.per.k

The test decisions of all individual tests (only for MSFS)

est.cluster.no

The estimated number of clusters (not for MSFS)

Applicability

Target variable?  No
Numeric?          Yes
Categorical?      No
K-sample?         Yes

Note

In case of version = "multiscale", the output is a list and not an object of class htest, as there are multiple test statistic values and corresponding p values.

Note that the aggregated test cannot handle univariate data.

References

Paul, B., De, S. K. and Ghosh, A. K. (2021). Some clustering based exact distribution-free k-sample tests applicable to high dimension, low sample size data. Journal of Multivariate Analysis, doi:10.1016/j.jmva.2021.104897

Mehta, C. R. and Patel, N. R. (1983). A network algorithm for performing Fisher's exact test in r x c contingency tables. Journal of the American Statistical Association, 78(382), 427-434, doi:10.2307/2288652

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300, doi:10.1111/j.2517-6161.1995.tb02031.x

Sarkar, S. and Ghosh, A. K. (2020). On perfect clustering of high dimension, low sample size data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2257-2272, doi:10.1109/TPAMI.2019.2912599

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistical Surveys, 18, 163-298, doi:10.1214/24-SS149

See Also

RItest

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
if(requireNamespace("HDLSSkST", quietly = TRUE)) {
  # Perform FS test 
  FStest(X1, X2, n.clust = 2)
  # Perform MFS test
  FStest(X1, X2, version = "modified")
  # Perform MSFS
  FStest(X1, X2, version = "multiscale")
  # Perform AFS test 
  FStest(X1, X2, n.clust = 2, version = "aggregated-knw")
  FStest(X1, X2, version = "aggregated-est")
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.