DiProPerm: Direction-Projection-Permutation (DiProPerm) Test

View source: R/DiProPerm.R


Description

Performs the Direction-Projection-Permutation (DiProPerm) two-sample test for high-dimensional data (Wei et al., 2016).

Usage

DiProPerm(X1, X2, n.perm = 0, dipro.fun = dwdProj, stat.fun = MD, 
            direction = "two.sided", seed = 42)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

n.perm

Number of permutations for permutation test (default: 0, no permutation test performed)

dipro.fun

Function performing the direction and projection step using a linear classifier. Implemented options are dwdProj (default, distance weighted discrimination, DWD) and svmProj (support vector machine). Must take the two datasets as input and output the calculated scores for the pooled sample.

stat.fun

Function that calculates a univariate two-sample statistic from two vectors. Implemented options are MD (default, mean difference, recommended for detecting mean differences), tStat (t test statistic) and AUC (area under the receiver operating characteristic curve). Must take two numeric vectors as input and output the two-sample statistic as a numeric scalar.

direction

Character indicating for which values of the univariate test statistic the test should reject the null hypothesis. Possible options are "two.sided" (reject both for low and high values, appropriate for MD and tStat), "greater" (reject for high values, appropriate for AUC), or "smaller" (reject for low values).

seed

Random seed (default: 42)

Details

The DiProPerm test works by first combining the datasets into a pooled dataset and creating a target variable that records the dataset membership of each observation. A binary linear classifier is then trained to predict these labels, and the normal vector of the separating hyperplane is calculated. The data from both samples are projected onto this normal vector, which gives a scalar score for each observation. On these projection scores, a univariate two-sample statistic is calculated. The permutation null distribution of this statistic is obtained by permuting the dataset labels and repeating the whole procedure with the permuted labels.
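The direction-projection-permutation steps above can be sketched in a few lines of base R. This is a simplified illustration only: it uses the difference of the two sample mean vectors as a stand-in for the normal vector of a fitted classifier (DiProPerm itself fits DWD or an SVM), and all variable names are illustrative.

```r
set.seed(1)
X1 <- matrix(rnorm(200), ncol = 5)
X2 <- matrix(rnorm(200, mean = 1), ncol = 5)

# Direction: difference of column means, normalized
# (a stand-in for the normal vector of a fitted linear classifier)
w <- colMeans(X2) - colMeans(X1)
w <- w / sqrt(sum(w^2))

# Projection: a scalar score for each observation
s1 <- X1 %*% w
s2 <- X2 %*% w

# Univariate statistic: mean difference of the projection scores
stat.obs <- mean(s2) - mean(s1)

# Permutation: shuffle labels, repeat direction + projection + statistic
pooled <- rbind(X1, X2)
n1 <- nrow(X1)
perm.stats <- replicate(199, {
  idx <- sample(nrow(pooled))
  P1 <- pooled[idx[seq_len(n1)], , drop = FALSE]
  P2 <- pooled[idx[-seq_len(n1)], , drop = FALSE]
  v <- colMeans(P2) - colMeans(P1)
  v <- v / sqrt(sum(v^2))
  mean(P2 %*% v) - mean(P1 %*% v)
})

# Two-sided permutation p value (observed value included)
p.value <- mean(c(abs(perm.stats), abs(stat.obs)) >= abs(stat.obs))
```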

At the moment, distance weighted discrimination (DWD) and support vector machines (SVM) are implemented as binary linear classifiers.

The DWD implementation genDWD in the DWDLargeR package is used, with the penalty parameter C calculated by penaltyParameter using its recommended default values. More details on the algorithm can be found in Lam et al. (2018).

For the SVM, the implementation svm in the e1071 package is used with default parameters.

Other classifiers can be used by supplying a suitable function for dipro.fun.
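As an illustration of the documented contract for dipro.fun (two datasets in, projection scores for the pooled sample out), a hypothetical projection function based on logistic regression could look as follows. The name logRegProj is not part of the package, and logistic regression can be unstable when the dimension approaches the sample size (perfect separation), so this is a sketch rather than a replacement for DWD or the SVM.

```r
# Hypothetical custom dipro.fun: logistic regression via stats::glm.
# Caveat: with p close to or above n, the fit may be degenerate
# (perfect separation); purely illustrative.
logRegProj <- function(X1, X2) {
  pooled <- data.frame(rbind(as.matrix(X1), as.matrix(X2)))
  y <- rep(0:1, c(nrow(X1), nrow(X2)))
  fit <- glm(y ~ ., data = pooled, family = binomial())
  w <- coef(fit)[-1]                   # normal vector (drop intercept)
  w <- w / sqrt(sum(w^2))              # normalize the direction
  as.numeric(as.matrix(pooled) %*% w)  # scores for the pooled sample
}

# Hypothetical usage, same interface as dwdProj/svmProj:
# DiProPerm(X1, X2, n.perm = 100, dipro.fun = logRegProj)
```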

For the univariate test statistic, implemented options are the mean difference, the t statistic and the AUC. Other suitable statistics can be used by supplying an appropriate function for stat.fun.
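For example, a custom statistic matching the documented contract for stat.fun (two numeric vectors in, a numeric scalar out) could compare medians instead of means. The name medDiff is hypothetical, and the argument order (first sample's scores first) is an assumption for illustration.

```r
# Hypothetical custom stat.fun: difference in medians.
# Assumes s1 holds the projection scores of the first sample
# and s2 those of the second sample.
medDiff <- function(s1, s2) {
  median(s2) - median(s1)
}

# Hypothetical usage:
# DiProPerm(X1, X2, n.perm = 100, stat.fun = medDiff, direction = "two.sided")
```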

Whether high or low values of the test statistic indicate dissimilarity of the datasets depends on the chosen univariate statistic. This is reflected by the direction argument, which modifies the behavior of the test so that the null hypothesis is rejected for the appropriate values.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

Permutation p value

alternative

The alternative hypothesis

method

Description of the test

data.name

The dataset names

Applicability

Target variable?  Numeric?  Categorical?  K-sample?
No                Yes       No            No

References

Lam, X. Y., Marron, J. S., Sun, D., & Toh, K.-C. (2018). Fast Algorithms for Large-Scale Generalized Distance Weighted Discrimination. Journal of Computational and Graphical Statistics, 27(2), 368-379. doi:10.1080/10618600.2017.1366915

Wei, S., Lee, C., Wichers, L., & Marron, J. S. (2016). Direction-Projection-Permutation for High-Dimensional Hypothesis Tests. Journal of Computational and Graphical Statistics, 25(2), 549-569. doi:10.1080/10618600.2015.1027773

Stolte, M., Kappenberg, F., Rahnenführer, J., & Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

stat.fun, dipro.fun

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
# Perform DiProPerm test 
# Note: For real applications, n.perm should be set considerably higher than 10
# Low values for n.perm chosen for demonstration due to runtime

if(requireNamespace("DWDLargeR", quietly = TRUE)) {
  DiProPerm(X1, X2, n.perm = 10)
  DiProPerm(X1, X2, n.perm = 10, stat.fun = tStat)
  if(requireNamespace("pROC", quietly = TRUE)) {
    DiProPerm(X1, X2, n.perm = 10, stat.fun = AUC, direction = "greater")
  }
}

if(requireNamespace("e1071", quietly = TRUE)) {
  DiProPerm(X1, X2, n.perm = 10, dipro.fun = svmProj)
  DiProPerm(X1, X2, n.perm = 10, dipro.fun = svmProj, stat.fun = tStat)
  if(requireNamespace("pROC", quietly = TRUE)) {
    DiProPerm(X1, X2, n.perm = 10, dipro.fun = svmProj, stat.fun = AUC, direction = "greater")
  }
}


DataSimilarity documentation built on April 3, 2025, 9:39 p.m.