RISE: Rank In Similarity Graph Edge-count two-sample test (RISE)

RISE R Documentation

Rank In Similarity Graph Edge-count two-sample test (RISE)

Description

Performs the Rank In Similarity Graph Edge-count two-sample test (RISE) for multivariate data (Zhou and Chen, 2023). This function wraps the RISE implementation from the GraphRankTest package.

Usage

RISE(X1, X2, sim.fun = function(x, ...) -as.matrix(stats::dist(x, ...)), K = 10, 
     rank.type = "RgNN", n.perm = 0, dist.args = NULL, seed = NULL)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

sim.fun

Function for calculating a similarity matrix on the pooled dataset (default: negative value of stats::dist, Euclidean distance).

K

Parameter K of the chosen graph (default: 10). Must be a positive number.

rank.type

Character specifying the similarity graph (K-nearest neighbor graph, K-NN; K-minimum spanning tree, K-MST; K-minimum-distance non-bipartite matching, K-MDP) and the rank definition (graph-induced ranks, Rg; overall ranks, Ro; see Details). Possible options are all combinations of the aforementioned options: "RgNN" (default), "RoNN", "RgMST", "RoMST", "RgMDP", "RoMDP".

n.perm

Number of permutations for permutation test (default: 0, asymptotic test is performed).

dist.args

Named list of further arguments passed to sim.fun (default: NULL).

seed

Random seed (default: NULL). A random seed will only be set if one is provided.
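The interplay of sim.fun and dist.args can be sketched in base R as follows. This is only an illustration under the assumption that the entries of dist.args are forwarded to sim.fun through its `...` argument; the toy matrix `Z` is hypothetical and the internal call used by the package may differ.

```r
# Default similarity function from the Usage section: negative distances
sim.fun <- function(x, ...) -as.matrix(stats::dist(x, ...))

# Further arguments for stats::dist, here the Manhattan metric
dist.args <- list(method = "manhattan")

# Hypothetical pooled dataset with three observations in two dimensions
Z <- rbind(c(0, 0), c(1, 2), c(3, 1))

# Assumed forwarding: sim.fun(Z, method = "manhattan")
S <- do.call(sim.fun, c(list(Z), dist.args))
print(S)
```

Under that assumption, a call such as RISE(X1, X2, dist.args = list(method = "manhattan")) would base the similarity graphs on negative Manhattan distances instead of negative Euclidean distances.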

Details

Zhou and Chen (2023) define the following two graph-based rank matrices R = (R_{ij})_{i,j=1}^N using sequences of similarity graphs G_l based on the similarity matrix S of the pooled dataset. The graph-induced ranks are defined as

R_{ij} = \sum_{l=1}^K \boldsymbol{1}\left(\left(i,j\right)\in G_l\right).

They can be interpreted as the number of graphs that contain the edge (i,j) in the sequence of graphs. The overall ranks are defined as

R_{ij} = \text{rank}\left(S\left(Z_i, Z_j\right), G_K\right),

where \text{rank}\left(S\left(Z_i, Z_j\right), G_K\right) denotes the rank of S\left(Z_i, Z_j\right) among the values \{S\left(Z_u, Z_v\right)\}_{(u,v)\in G_K} if (i,j)\in G_K and zero otherwise. The overall rank can be interpreted as the rank of the similarity of edges in the graph G_K. Both rank definitions depend on the choice of the parameter K that defines the length of the graph sequence. For the test, the symmetrized rank matrix \frac{1}{2}(R+R^T) is used, which is also denoted by R for convenience.
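The graph-induced ranks can be illustrated for nearest neighbor graphs with a few lines of base R. The following is a minimal sketch, not the GraphRankTest implementation: it assumes that G_l is the directed l-nearest neighbor graph built from negative Euclidean distances, so that the m-th nearest neighbor of a point appears in K - m + 1 graphs of the sequence. The function name `graph_induced_ranks_nn` is hypothetical.

```r
# Illustrative sketch: graph-induced ranks from l-NN graphs, l = 1, ..., K
graph_induced_ranks_nn <- function(Z, K) {
  S <- -as.matrix(stats::dist(Z))  # similarity = negative Euclidean distance
  diag(S) <- -Inf                  # a point is never its own neighbor
  N <- nrow(S)
  R <- matrix(0, N, N)
  for (i in seq_len(N)) {
    # Indices of the K most similar points to i, most similar first
    nn <- order(S[i, ], decreasing = TRUE)[seq_len(K)]
    # The m-th nearest neighbor lies in G_m, ..., G_K, i.e. in K - m + 1 graphs
    R[i, nn] <- K - seq_len(K) + 1
  }
  (R + t(R)) / 2                   # symmetrized rank matrix
}

set.seed(1)
Z <- rbind(matrix(rnorm(40), ncol = 2), matrix(rnorm(40, mean = 1), ncol = 2))
R <- graph_induced_ranks_nn(Z, K = 3)
```

Large entries of `R` mark pairs of observations that are close neighbors; pairs that are not connected in any of the K graphs keep the value zero.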

For the test statistic, the within-sample rank sums of the first and second sample are defined as

U_{X1} = \sum_{i,j=1}^{n_1} R_{ij}, \quad U_{X2} = \sum_{i,j=n_1 + 1}^{N} R_{ij}.

Using these, the rank in similarity graph edge-count two-sample test (RISE) statistic is defined as

T_R = (U_{X1} - \mu_{X1}, U_{X2} - \mu_{X2})\Sigma^{-1}(U_{X1} - \mu_{X1}, U_{X2} - \mu_{X2})^T,

where \mu_{X1} = \mathbb{E}(U_{X1}), \mu_{X2} = \mathbb{E}(U_{X2}), and \Sigma = \mathbb{C}\text{ov}((U_{X1}, U_{X2})^T) can be calculated explicitly under the permutation null hypothesis.
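To make the structure of the statistic concrete, the sketch below computes a RISE-type quadratic form from a given symmetric rank matrix. As a deliberate simplification, the mean vector and covariance matrix are approximated by Monte Carlo permutation of the sample labels rather than by the explicit formulas mentioned above, so the result only approximates T_R. The function name `rise_stat_mc`, its arguments, and the toy matrix are hypothetical.

```r
# Sketch: quadratic-form statistic with mu and Sigma estimated by
# Monte Carlo label permutations (not the closed-form expressions)
rise_stat_mc <- function(R, n1, n.mc = 2000) {
  N <- nrow(R)
  # Within-sample rank sums for a given index set of the first sample
  rank_sums <- function(idx1) {
    idx2 <- setdiff(seq_len(N), idx1)
    c(sum(R[idx1, idx1]), sum(R[idx2, idx2]))
  }
  U_obs <- rank_sums(seq_len(n1))
  # Rank sums under random relabeling, one row per permutation
  U_perm <- t(replicate(n.mc, rank_sums(sample(N, n1))))
  mu <- colMeans(U_perm)
  Sigma <- stats::cov(U_perm)
  d <- U_obs - mu
  drop(t(d) %*% solve(Sigma) %*% d)
}

# Toy symmetric "rank matrix" with zero diagonal, for illustration only
set.seed(42)
A <- matrix(rpois(30 * 30, lambda = 1), 30, 30)
diag(A) <- 0
R <- (A + t(A)) / 2
T_R <- rise_stat_mc(R, n1 = 15)
```

Because the statistic is a quadratic form in an inverse covariance matrix, it is non-negative and grows when either within-sample rank sum deviates from its permutation expectation.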

For small samples, the exact permutation null distribution can be used for testing. For large samples and under several assumptions on the similarity graphs, the asymptotic \chi^2_2-distribution of T_R can be used for testing.

High values of the test statistic indicate dissimilarity of the datasets. Thus, the null hypothesis of equal distributions is rejected for large values.

For n.perm = 0, an asymptotic test using the asymptotic \chi^2_2 approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
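The asymptotic p value is the upper tail probability of the \chi^2_2 distribution at the observed statistic. A minimal sketch, using a hypothetical observed value `T_obs`:

```r
# Hypothetical observed value of the test statistic
T_obs <- 7.3

# Upper tail of the chi-square distribution with 2 degrees of freedom
p_asym <- stats::pchisq(T_obs, df = 2, lower.tail = FALSE)

# With 2 degrees of freedom this tail probability equals exp(-T_obs / 2)
p_closed <- exp(-T_obs / 2)
print(c(p_asym, p_closed))
```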

This implementation is a wrapper around the function RISE from the GraphRankTest package that adapts the input and output of that function to match the other functions provided in this package. For more details, see the documentation of RISE in the GraphRankTest package.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

parameter

Degrees of freedom for asymptotic test

p.value

Asymptotic or permutation p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names
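The components listed above can be read from the returned object like any list element. The following sketch assumes that both the DataSimilarity and GraphRankTest packages are installed and skips the call otherwise; the simulated data mirror the Examples section.

```r
set.seed(1234)
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)

if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("GraphRankTest", quietly = TRUE)) {
  res <- DataSimilarity::RISE(X1, X2)
  print(res$statistic)  # observed value of the test statistic
  print(res$parameter)  # degrees of freedom of the asymptotic test
  print(res$p.value)    # asymptotic p value, since n.perm = 0
}
```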

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
No                 Yes        No             No

Note

Because this method cannot handle missing data, any missing values are removed automatically and a warning is issued.
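To see in advance how much data would be affected, the number of incomplete observations can be checked before calling the test. This sketch assumes that removal happens row-wise, which the note above does not state explicitly; the small matrix is a hypothetical example.

```r
# Hypothetical dataset with one missing value in the third row
X1 <- matrix(c(1, 2, NA, 4, 5, 6), ncol = 2)

# Rows without any missing value
complete <- stats::complete.cases(X1)
n_incomplete <- sum(!complete)

# Dataset restricted to complete rows (assumed row-wise removal)
X1_complete <- X1[complete, , drop = FALSE]
print(n_incomplete)
```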

References

Zhou, D. and Chen, H. (2023). A new ranking scheme for modern data and its application to two-sample hypothesis testing. In Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023), PMLR, pp. 3615–3668.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163–298. doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform RISE
if(requireNamespace("GraphRankTest", quietly = TRUE)) {
  # Using 10-NNG and graph-induced ranks
  RISE(X1, X2)
  # Using 10-NNG and overall ranks
  RISE(X1, X2, rank.type = "RoNN")
  # Using 5-MST and graph-induced ranks
  RISE(X1, X2, K = 5, rank.type = "RgMST")
}

DataSimilarity documentation built on May 15, 2026, 9:07 a.m.