hhg.test: Heller-Heller-Gorfine Tests of Independence and Equality of Distributions


View source: R/HHG.R

Description

These functions perform Heller-Heller-Gorfine (HHG) tests. Implemented are tests of independence between two random vectors (x and y) and tests of equality of 2 or more multivariate distributions.

Usage

hhg.test(Dx, Dy, ties = T, w.sum = 0, w.max = 2, nr.perm = 10000, 
  is.sequential = F, seq.total.nr.tests = 1, 
  seq.alpha.hyp = NULL, seq.alpha0 = NULL, seq.beta0 = NULL, seq.eps = NULL, 
  nr.threads = 0, tables.wanted = F, perm.stats.wanted = F)
  
hhg.test.k.sample(Dx, y, w.sum = 0, w.max = 2, nr.perm = 10000, 
  is.sequential = F, seq.total.nr.tests = 1, 
  seq.alpha.hyp = NULL, seq.alpha0 = NULL, seq.beta0 = NULL, seq.eps = NULL, 
  nr.threads = 0, tables.wanted = F, perm.stats.wanted = F)

hhg.test.2.sample(Dx, y, w.sum = 0, w.max = 2, nr.perm = 10000, 
  is.sequential = F, seq.total.nr.tests = 1, 
  seq.alpha.hyp = NULL, seq.alpha0 = NULL, seq.beta0 = NULL, seq.eps = NULL, 
  nr.threads = 0, tables.wanted = F, perm.stats.wanted = F)

Arguments

Dx

a symmetric matrix of doubles, where element [i, j] is a norm-based distance between the i'th and j'th x samples.

Dy

same as Dx, but for distances between y's (the user may choose any norm when computing Dx, Dy).

y

a numeric or factor vector, whose values or levels are in (0, 1, ..., K - 1), for performing K-sample tests (including K = 2).

ties

a boolean specifying whether ties in Dx and/or Dy exist and are to be properly handled (requires more computation).

w.sum

minimum expected frequency taken into account when computing the sum.chisq statistic (must be non-negative; the contribution of tables containing cells with smaller expected frequencies is truncated to zero).

w.max

minimum expected frequency taken into account when computing the max.chisq statistic (must be non-negative; the contribution of tables containing cells with smaller expected frequencies is truncated to zero).

nr.perm

number of permutations from which a p-value is to be estimated (must be non-negative). Can be specified as zero if only the observed statistics are wanted, without p-values. The actual number of permutations used may be slightly larger when using multiple processing cores. A Wald sequential probability ratio test is optionally implemented, which may push the p-value to 1 and stop permuting if it becomes clear that it is going to be high. See Details below.

is.sequential

boolean flag specifying whether Wald's sequential test is desired (see Details), otherwise a simple Monte-Carlo computation of nr.perm permutations is performed. When this argument is TRUE, either seq.total.nr.tests or (seq.alpha.hyp, seq.alpha0, seq.beta0, seq.eps) must be supplied by the user.

seq.total.nr.tests

the total number of hypotheses in the family of hypotheses simultaneously tested. When this optional argument is supplied, it is used to derive default values for the parameters of the Wald sequential test. The default derivation is done assuming a nominal 0.05 FDR level, and sets:

seq.alpha.hyp = 0.05 / max(1, log(seq.total.nr.tests)), seq.alpha0 = 0.05, seq.beta0 = min(0.01, 0.05 / seq.total.nr.tests), seq.eps = 0.01. Alternatively, one can specify their own values for these parameters using the following arguments.

seq.alpha.hyp

the nominal test size for this single test within the multiple testing procedure.

seq.alpha0

the nominal test size for testing the side null hypothesis of p-value > seq.alpha.hyp.

seq.beta0

one minus the power for testing the side null hypothesis of p-value > seq.alpha.hyp.

seq.eps

approximation margin around seq.alpha.hyp that defines the p-value regions for the side null p > seq.alpha.hyp * (1 + seq.eps) and side alternative p < seq.alpha.hyp * (1 - seq.eps).

nr.threads

number of processing cores to use for p-value permutation. If left as zero, will try to use all available cores.

tables.wanted

boolean flag determining whether to output detailed local 2x2 contingency tables.

perm.stats.wanted

boolean flag determining whether to output statistics values computed for all permutations (representing null distributions).
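The default derivation used when only seq.total.nr.tests is supplied (see the argument description above) can be sketched in plain R. Note that seq.defaults is a hypothetical helper written for illustration, not a function exported by the package:

```r
## Sketch of the documented defaults for the Wald sequential test,
## assuming the nominal 0.05 FDR level stated above.
## 'seq.defaults' is a hypothetical helper, not a package function.
seq.defaults <- function(seq.total.nr.tests) {
  list(
    seq.alpha.hyp = 0.05 / max(1, log(seq.total.nr.tests)),
    seq.alpha0    = 0.05,
    seq.beta0     = min(0.01, 0.05 / seq.total.nr.tests),
    seq.eps       = 0.01
  )
}

seq.defaults(100)  # seq.alpha.hyp ~ 0.0109, seq.beta0 = 5e-04
```

Note that for a small family (log(seq.total.nr.tests) < 1, i.e. fewer than 3 tests), the max(1, ...) term keeps seq.alpha.hyp at the nominal 0.05.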

Details

The HHG test (Heller et al., 2013) is a powerful nonparametric test for association (or, alternatively, independence) between two random vectors (say, x and y) of arbitrary dimensions. It is consistent against virtually any form of dependence, and has been shown to offer significantly more power than alternative approaches for both simple and, more importantly, complex high-dimensional dependence. The test relies on norm-based distance metrics in x and (separately) in y. The choice of metric for either variable is up to the user (e.g. Euclidean, Mahalanobis, Manhattan, or whatever is appropriate for the data). The general implementation in hhg.test takes the distance matrices computed on an observed sample, Dx and Dy, and starts from there.

hhg.test.k.sample and hhg.test.2.sample are optimized special cases of the general test, where y is a partition of the samples in x into K or 2 groups, respectively.

When enabled by is.sequential, Wald's sequential test is implemented as suggested by Fay et al. (2007) in order to reduce the O(nr.perm * n^2 * log(n)) computational complexity of the permutation test to a more manageable size. Especially when faced with high multiplicity, say M simultaneous tests, the necessary number of iterations may be quite large. For example, if it is desired to control the family-wise error rate (FWER) at level alpha using the Bonferroni correction, one needs a p-value of alpha / M to establish significance. This seems to suggest that the minimum number of permutations required is nr.perm = M / alpha. However, if it becomes clear after a smaller number of permutations that the null cannot be rejected, no additional permutations are needed, and the p-value can be conservatively estimated as 1. Often, only a handful of hypotheses in a family are expected to be non-null. In this case the number of permutations for testing all hypotheses using Wald's procedure is expected to be much lower than the full M^2 / alpha.

The target significance level of the sequential test is specified in the argument seq.alpha.hyp. It depends on the number of hypotheses M, and the type of multiplicity correction wanted. For the Bonferroni correction, the threshold is alpha / M. For the less conservative procedure of Benjamini & Hochberg (1995), it is M1 * q / M, where q is the desired false discovery rate (FDR), and M1 is the (unknown) number of true non-null hypotheses. Although M1 is unknown, the investigator can sometimes estimate it conservatively (e.g., if at most 0.02 of the hypotheses are expected to be non-null, set M1 = 0.02 * M).
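The two thresholds discussed above can be computed directly. All the numbers below (M, alpha, q, and the guessed non-null fraction) are illustrative assumptions, not package defaults:

```r
## Illustrative choice of seq.alpha.hyp for M simultaneous tests.
M     <- 1000   # number of hypotheses tested simultaneously (assumed)
alpha <- 0.05   # desired FWER level for the Bonferroni correction
q     <- 0.05   # desired FDR level for Benjamini-Hochberg

## Bonferroni: significance threshold alpha / M
alpha.bonf <- alpha / M

## Benjamini-Hochberg: M1 * q / M, with M1 guessed conservatively
## (here: assume at most 2% of hypotheses are truly non-null)
M1       <- 0.02 * M
alpha.bh <- M1 * q / M

c(bonferroni = alpha.bonf, bh = alpha.bh)  # 5e-05 and 1e-03
```

Either value could then be passed as seq.alpha.hyp when calling the test functions with is.sequential = TRUE.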

Value

Four statistics described in the original HHG paper are returned:

sum.chisq - sum of Pearson chi-squared statistics from the 2x2 contingency tables considered.

sum.lr - sum of likelihood ratio ("G statistic") values from the 2x2 tables.

max.chisq - maximum Pearson chi-squared statistic from any of the 2x2 tables.

max.lr - maximum G statistic from any of the 2x2 tables.

If nr.perm > 0, then estimated permutation p-values are returned as:

perm.pval.hhg.sc, perm.pval.hhg.sl, perm.pval.hhg.mc, perm.pval.hhg.ml

In order to give information that may help localize where in the support of the distributions of x and y there is departure from independence, if tables.wanted is true, the 2x2 tables themselves are provided in:

extras.hhg.tbls

This is an n^2 by 4 matrix, whose columns are A11, A12, A21, A22 as denoted in the original HHG paper. Row r of the matrix corresponds to S_ij in the same paper, where i = 1 + floor((r - 1) / n), and j = 1 + ((r - 1) %% n). Since S_ij is never computed for i == j, rows (0:(n - 1)) * n + (1:n) contain NAs on purpose. The only other case where NAs occur is the 2- and K-sample tests, where only one table is given for any x-tied samples (the other tables at indices with the same x value are redundant).
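The row-to-pair mapping described above can be checked with a few lines of base R; r.to.ij is a hypothetical helper written only for this illustration:

```r
## Recover the pair (i, j) of S_ij from a row index r of extras.hhg.tbls,
## using the formulas given above. 'r.to.ij' is a hypothetical helper.
r.to.ij <- function(r, n) {
  c(i = 1 + floor((r - 1) / n), j = 1 + ((r - 1) %% n))
}

n <- 5
r.to.ij(7, n)            # i = 2, j = 2: a diagonal entry, so this row is NA

## Rows that are NA by construction (i == j):
(0:(n - 1)) * n + (1:n)  # 1  7 13 19 25
```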

Finally, as a means of estimating the null distributions of the computed statistics, if perm.stats.wanted is true, the statistics computed for every permutation of the data performed during testing are output as:

extras.perm.stats

A data.frame with one variable per statistic and one row per permutation.

n - The sample size

y - Group sizes for hhg.test.2.sample and hhg.test.k.sample tests.

stat.type - string holding the type of test used: 'hhg.test', 'hhg.test.2.sample' or 'hhg.test.k.sample'.

Note

The computational complexity of the test is O(n^2 * log(n)), where n is the number of samples. Thus, when the sample size is large, computing the test for many permutations may take a long time.

P-values are reproducible when setting set.seed(seed_value) before performing the permutation test (including when computation is multithreaded). This feature is currently implemented only in hhg.test, not in hhg.test.k.sample or hhg.test.2.sample.

Author(s)

Shachar Kaufman, based in part on an earlier version by Ruth Heller and Yair Heller.

References

Heller, R., Heller, Y., & Gorfine, M. (2013). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2), 503-510.

Fay, M., Kim., H., & Hachey, M. (2007). On using truncated sequential probability ratio test boundaries for Monte Carlo implementation of hypothesis tests. Journal of Computational and Graphical Statistics, 16(4), 946-967.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.

Examples

## Not run: 
## 1. The test of independence

## 1.1. A non-null univariate example

## Generate some data from the Circle example
n = 50
X = hhg.example.datagen(n, 'Circle')
plot(X[1,], X[2,])

## Compute distance matrices, on which the HHG test will be based
Dx = as.matrix(dist((X[1,]), diag = TRUE, upper = TRUE))
Dy = as.matrix(dist((X[2,]), diag = TRUE, upper = TRUE))

## Compute HHG statistics, and p-values using 1000 random permutations
set.seed(1) #set the seed for the random permutations
hhg = hhg.test(Dx, Dy, nr.perm = 1000)

## Print the statistics and their permutation p-values

hhg

## 1.2. A null univariate example

n = 50
X = hhg.example.datagen(n, '4indclouds') 

Dx = as.matrix(dist((X[1,]), diag = TRUE, upper = TRUE))
Dy = as.matrix(dist((X[2,]), diag = TRUE, upper = TRUE))

set.seed(1) #set the seed for the random permutations
hhg = hhg.test(Dx, Dy, nr.perm = 1000)

hhg

## 1.3. A multivariate example
library(MASS)

n = 50
p = 5
x = t(mvrnorm(n, rep(0, p), diag(1, p)))
y = log(x ^ 2)
Dx = as.matrix(dist((t(x)), diag = TRUE, upper = TRUE))
Dy = as.matrix(dist((t(y)), diag = TRUE, upper = TRUE))

set.seed(1) #set the seed for the random permutations
hhg = hhg.test(Dx, Dy, nr.perm = 1000)

hhg

## 2. The k-sample test

n = 50
D = hhg.example.datagen(n, 'FourClassUniv')
Dx = as.matrix(dist(D$x, diag = TRUE, upper = TRUE))

hhg = hhg.test.k.sample(Dx, D$y, nr.perm = 1000)

hhg

## End(Not run)

HHG documentation built on May 15, 2021, 9:06 a.m.