RISE: Rank In Similarity Graph Edge-count two-sample test (RISE)
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

RISE	R Documentation

Description

Performs the Rank In Similarity Graph Edge-count two-sample test (RISE) for multivariate data (Zhou and Chen, 2023). The implementation here uses the RISE implementation from the GraphRankTest package.

Usage

RISE(X1, X2, sim.fun = function(x, ...) -as.matrix(stats::dist(x, ...)), K = 10, 
     rank.type = "RgNN", n.perm = 0, dist.args = NULL, seed = NULL)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`sim.fun`	Function for calculating a similarity matrix on the pooled dataset (default: negative value of `stats::dist`, Euclidean distance).
`K`	Parameter `K` of the chosen graph (default: 10). Must be a positive number.
`rank.type`	Character specifying the similarity graph (`K`-nearest neighbor graph, `K`-NN; `K`-minimum spanning tree, `K`-MST; `K`-minimum-distance non-bipartite matching, `K`-MDP) and the rank definition (graph-induced ranks, Rg; overall ranks, Ro; see Details). Possible options are all combinations of the aforementioned options: `"RgNN"` (default), `"RoNN"`, `"RgMST"`, `"RoMST"`, `"RgMDP"`, `"RoMDP"`.
`n.perm`	Number of permutations for permutation test (default: 0, asymptotic test is performed).
`dist.args`	Named list of further arguments passed to `sim.fun` (default: `NULL`).
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.
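
The interplay of `sim.fun` and `dist.args` can be illustrated with a minimal base R sketch. It assumes that the similarity function is applied to the row-bound pooled dataset and that the entries of `dist.args` are forwarded as additional arguments; how the wrapper forwards them internally is an assumption here, and the object names `pooled` and `S` are only illustrative.

```r
set.seed(1234)
X1 <- matrix(rnorm(50), ncol = 5)
X2 <- matrix(rnorm(50, mean = 0.5), ncol = 5)
pooled <- rbind(X1, X2)                      # pooled dataset, first sample on top

# Default similarity from the Usage section: negative distance matrix
sim.fun <- function(x, ...) -as.matrix(stats::dist(x, ...))

# Extra arguments for stats::dist, here the Manhattan metric
dist.args <- list(method = "manhattan")

# Assumed forwarding: sim.fun(pooled, method = "manhattan")
S <- do.call(sim.fun, c(list(pooled), dist.args))
dim(S)                                       # N x N similarity matrix, N = 20
```

With the package installed, the corresponding call would presumably be `RISE(X1, X2, dist.args = list(method = "manhattan"))`.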

Details

Zhou and Chen (2023) define the following two graph-based rank matrices R = (R_{ij})_{i,j=1}^N using sequences of similarity graphs G_l, l = 1, ..., K, based on the similarity matrix S of the pooled dataset. The graph-induced ranks are defined as

R_{ij} = \sum_{l=1}^K \boldsymbol{1}\left(\left(i,j\right)\in G_l\right).

They can be interpreted as the number of graphs that contain the edge (i,j) in the sequence of graphs. The overall ranks are defined as

R_{ij} = \text{rank}\left(S\left(Z_i, Z_j\right), G_K\right),

where \text{rank}\left(S\left(Z_i, Z_j\right), G_K\right) denotes the rank of S\left(Z_i, Z_j\right) among the values \{S\left(Z_u, Z_v\right)\}_{(u,v)\in G_K} if (i,j)\in G_K and zero otherwise. The overall rank can be interpreted as the rank of an edge's similarity among all edges of the graph G_K. Both rank definitions depend on the choice of the parameter K that defines the length of the graph sequence. For the test, the symmetrized rank matrix 1/2(R+R^T) is used, which is also denoted by R for convenience.
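
Both rank definitions can be sketched in base R for the K-NN case. The sketch assumes that G_l is the directed l-nearest-neighbor graph, so that the graphs are nested and the l-th nearest neighbor of an observation receives graph-induced rank K - l + 1. It also assumes that larger similarities receive larger overall ranks and that ties are resolved by average ranks. These choices are illustrative and need not match the GraphRankTest internals.

```r
set.seed(1)
Z <- matrix(rnorm(40), ncol = 2)      # 20 pooled observations (toy data)
N <- nrow(Z)
K <- 3
S <- -as.matrix(stats::dist(Z))       # similarity = negative Euclidean distance
diag(S) <- -Inf                       # never pick an observation as its own neighbor

Rg <- matrix(0, N, N)                 # graph-induced ranks
in_GK <- matrix(FALSE, N, N)          # edge indicator of the K-NN graph G_K
for (i in seq_len(N)) {
  nn <- order(S[i, ], decreasing = TRUE)[seq_len(K)]  # K most similar points
  Rg[i, nn] <- K:1                    # l-th neighbor lies in G_l, ..., G_K
  in_GK[i, nn] <- TRUE
}

Ro <- matrix(0, N, N)                 # overall ranks, zero outside G_K
Ro[in_GK] <- rank(S[in_GK])           # ties (e.g. mutual neighbors) get average ranks

# Symmetrized versions used by the test
Rg_sym <- (Rg + t(Rg)) / 2
Ro_sym <- (Ro + t(Ro)) / 2
```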

For the test statistic, the within-sample rank sums of the first and second sample are defined as

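U_{X1} = \sum_{i,j=1}^{n_1} R_{ij}, U_{X2} = \sum_{i,j=n_1 + 1}^{N} R_{ij},

where the pooled observations are ordered such that the first n_1 belong to X1 and the remaining N - n_1 belong to X2.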

Using these, the rank in similarity graph edge-count two-sample test (RISE) statistic is defined as

T_R = (U_{X1} - \mu_{X1}, U_{X2} - \mu_{X2})\Sigma^{-1}(U_{X1} - \mu_{X1}, U_{X2} - \mu_{X2})^T,

where \mu_{X1} = \mathbb{E}(U_{X1}), \mu_{X2} = \mathbb{E}(U_{X2}), and \Sigma = \mathbb{C}\text{ov}((U_{X1}, U_{X2})^T) can be calculated explicitly under the permutation null hypothesis.
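
The construction of T_R can be sketched in base R as follows. Instead of the explicit formulas for the mean and covariance, this sketch estimates \mu_{X1}, \mu_{X2}, and \Sigma by Monte Carlo over random label permutations. This is a stand-in chosen for brevity, not the computation used by the package. The rank matrix uses graph-induced K-NN ranks under the assumption of nested l-NN graphs, and the helper `rank_sums` is hypothetical.

```r
set.seed(42)
n1 <- 15; n2 <- 15; N <- n1 + n2; K <- 3
Z <- rbind(matrix(rnorm(n1 * 2), ncol = 2),
           matrix(rnorm(n2 * 2, mean = 1), ncol = 2))

# Graph-induced K-NN ranks (assumed nested l-NN graphs), then symmetrize
S <- -as.matrix(stats::dist(Z))
diag(S) <- -Inf
R <- matrix(0, N, N)
for (i in seq_len(N)) {
  R[i, order(S[i, ], decreasing = TRUE)[seq_len(K)]] <- K:1
}
R <- (R + t(R)) / 2

# Within-sample rank sums (U_X1, U_X2) for a labelling (TRUE = first sample)
rank_sums <- function(in_first) {
  c(sum(R[in_first, in_first]), sum(R[!in_first, !in_first]))
}

labels <- rep(c(TRUE, FALSE), c(n1, n2))
U_obs <- rank_sums(labels)

# Monte Carlo stand-in for the explicit permutation mean and covariance
U_perm <- t(replicate(2000, rank_sums(sample(labels))))
mu <- colMeans(U_perm)
Sigma <- stats::cov(U_perm)

# Quadratic form (U - mu)^T Sigma^{-1} (U - mu)
d <- U_obs - mu
T_R <- drop(t(d) %*% solve(Sigma, d))
T_R
```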

For small samples, the exact permutation null distribution can be used for testing. For large samples and under several assumptions on the similarity graphs, the asymptotic \chi^2_2-distribution of T_R can be used for testing.

High values of the test statistic indicate dissimilarity of the datasets. Thus, the null hypothesis of equal distributions is rejected for large values.

For n.perm = 0, an asymptotic test using the asymptotic \chi^2_2 approximation of the null distribution is performed. For n.perm > 0, a permutation test is performed.
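
The two ways of obtaining a p value can be sketched in base R. The observed statistic and the permuted statistics below are hypothetical placeholders, and the permutation p value uses the plain proportion of permuted statistics at least as large as the observed one. The exact convention used by the package may differ.

```r
T_R <- 7.3                                   # hypothetical observed statistic

# Asymptotic test: upper tail of the chi-square distribution with 2 df
p_asym <- stats::pchisq(T_R, df = 2, lower.tail = FALSE)

# For 2 df the upper tail has the closed form exp(-t / 2)
isTRUE(all.equal(p_asym, exp(-T_R / 2)))

# Permutation test: compare against statistics recomputed under permuted labels
set.seed(1)
T_perm <- stats::rchisq(999, df = 2)         # placeholder for permuted statistics
p_perm <- mean(T_perm >= T_R)
```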

This implementation is a wrapper around the function RISE from the GraphRankTest package that modifies the input and output of that function to match the other functions provided in this package. For more details, see the documentation of RISE in GraphRankTest.

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`parameter`	Degrees of freedom for asymptotic test
`p.value`	Asymptotic or permutation p value
`alternative`	The alternative hypothesis
`method`	Description of the test
`data.name`	The dataset names

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
No	Yes	No	No

Note

Because this method cannot handle missing data, any missing values are removed automatically and a warning is issued.

References

Zhou, D. and Chen, H. (2023). A new ranking scheme for modern data and its application to two-sample hypothesis testing. In Proceedings of the 36th Annual Conference on Learning Theory (COLT 2023), PMLR, pp. 3615–3668.

Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163–298. doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform RISE
if(requireNamespace("GraphRankTest", quietly = TRUE)) {
  # Using 10-NNG and graph-induced ranks
  RISE(X1, X2)
  # Using 10-NNG and overall ranks
  RISE(X1, X2, rank.type = "RoNN")
  # Using 5-MST and graph-induced ranks
  RISE(X1, X2, K = 5, rank.type = "RgMST")
}

DataSimilarity documentation built on May 15, 2026, 9:07 a.m.