# eqdist.etest: Multisample E-statistic (Energy) Test of Equal Distributions

Package: mariarizzo/energy (E-Statistics: Multivariate Inference via the Energy of Data)

## Description

Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.

## Usage

    eqdist.etest(x, sizes, distance = FALSE,
        method = c("original", "discoB", "discoF"), R)
    eqdist.e(x, sizes, distance = FALSE,
        method = c("original", "discoB", "discoF"))
    ksample.e(x, sizes, distance = FALSE,
        method = c("original", "discoB", "discoF"), ix = 1:sum(sizes))

## Arguments

| Argument | Description |
|---|---|
| x | data matrix of pooled sample |
| sizes | vector of sample sizes |
| distance | logical: if TRUE, first argument is a distance matrix |
| method | use original (default) or distance components (discoB, discoF) |
| R | number of bootstrap replicates |
| ix | a permutation of the row indices of x |

## Details

The k-sample multivariate E-test of equal distributions is performed. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or the corresponding distance matrix. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.
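The layout described above can be illustrated with a small base R sketch. The sample names `s1`, `s2`, `s3` and the simulated data are assumptions made only for illustration; the point is how `x` and `sizes` fit together.

```r
# Three hypothetical samples in d = 2 dimensions (simulated for illustration)
set.seed(1)
s1 <- matrix(rnorm(10 * 2), nrow = 10)            # sample 1: 10 observations
s2 <- matrix(rnorm(15 * 2, mean = 1), nrow = 15)  # sample 2: 15 observations
s3 <- matrix(rnorm(12 * 2), nrow = 12)            # sample 3: 12 observations

# Pooled data matrix: the samples are stacked row-wise, in order
x <- rbind(s1, s2, s3)
sizes <- c(nrow(s1), nrow(s2), nrow(s3))

# Row indices of each sample, recovered from sizes alone
idx <- split(seq_len(sum(sizes)), rep(seq_along(sizes), sizes))

# Alternative input: the Euclidean distance matrix of the pooled sample,
# which would be passed together with distance = TRUE
d <- dist(x)
```

Here rows 1 to 10 of `x` form the first sample, rows 11 to 25 the second, and rows 26 to 37 the third.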

The test is implemented by nonparametric bootstrap, an approximate permutation test with R replicates.

The function eqdist.e returns the test statistic only; it simply passes the arguments through to eqdist.etest with R = 0.

The statistic returned is the k-sample multivariate E-statistic for testing equal distributions, computed from the pooled data matrix x or from the distance matrix x of the original data, with the samples laid out by sizes as described above.

The two-sample E-statistic proposed by Szekely and Rizzo (2004) is the e-distance $e(S_i, S_j)$, defined for two samples $S_i$, $S_j$ of sizes $n_i$, $n_j$ by

$$e(S_i, S_j) = \frac{n_i n_j}{n_i + n_j} \left[ 2M_{ij} - M_{ii} - M_{jj} \right],$$

where

$$M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \| X_{ip} - X_{jq} \|,$$

$\| \cdot \|$ denotes the Euclidean norm, and $X_{ip}$ denotes the $p$-th observation in the $i$-th sample.
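As a rough check on the definition, the e-distance can be computed directly in base R. This is a minimal sketch, not the package's implementation; `edist2` is a hypothetical helper name, and the weight is taken to be n_i n_j/(n_i + n_j), the form quoted later in this section.

```r
# e-distance between two samples whose rows are observations.
# edist2 is a hypothetical helper, not a function of the energy package.
edist2 <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  n1 <- nrow(a)
  n2 <- nrow(b)
  D <- as.matrix(dist(rbind(a, b)))  # all pairwise Euclidean distances
  i <- seq_len(n1)                   # rows of the first sample
  j <- n1 + seq_len(n2)              # rows of the second sample
  M12 <- mean(D[i, j])               # mean between-sample distance
  M11 <- mean(D[i, i])               # mean within-sample distances; the zero
  M22 <- mean(D[j, j])               # diagonal is included in the average
  (n1 * n2) / (n1 + n2) * (2 * M12 - M11 - M22)
}

# Tiny univariate example: samples {0, 1} and {2, 3}
a <- matrix(c(0, 1), ncol = 1)
b <- matrix(c(2, 3), ncol = 1)
e_ab <- edist2(a, b)   # M12 = 2, M11 = M22 = 0.5, weight = 1
e_aa <- edist2(a, a)   # identical samples
```

In the small example the between-sample mean distance is 2 and both within-sample means are 0.5, so the bracket is 3 and the weight is 1; comparing a sample with itself gives 0.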

The original (default method) k-sample E-statistic is defined by summing the pairwise e-distances over all k(k-1)/2 pairs of samples:

$$E = \sum_{i<j} e(S_i, S_j).$$

Large values of $E$ are significant.
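The k-sample statistic and an upper-tail permutation test can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: `estat` is a hypothetical helper, the data are simulated, and the p-value formula (1 + count)/(R + 1) is a common convention assumed here.

```r
# k-sample E-statistic from a full pooled distance matrix D and sizes.
# ix reorders the rows, playing the role of a permutation of the pooled sample.
# estat is a hypothetical helper, not a function of the energy package.
estat <- function(D, sizes, ix = seq_len(sum(sizes))) {
  D <- D[ix, ix]
  g <- rep(seq_along(sizes), sizes)   # sample label of each row
  k <- length(sizes)
  M <- matrix(0, k, k)                # M[i, j]: mean distance, sample i vs j
  for (i in 1:k) for (j in 1:k) M[i, j] <- mean(D[g == i, g == j])
  E <- 0
  for (i in 1:(k - 1)) for (j in (i + 1):k) {
    w <- sizes[i] * sizes[j] / (sizes[i] + sizes[j])
    E <- E + w * (2 * M[i, j] - M[i, i] - M[j, j])
  }
  E
}

set.seed(42)
x <- rbind(matrix(rnorm(40), ncol = 2),            # sample 1, n = 20
           matrix(rnorm(40, mean = 2), ncol = 2))  # sample 2, n = 20, shifted
sizes <- c(20, 20)
D <- as.matrix(dist(x))

E_obs <- estat(D, sizes)
R <- 199
E_perm <- replicate(R, estat(D, sizes, ix = sample(sum(sizes))))

# Upper-tail p-value: large E is evidence against equal distributions
p_value <- (1 + sum(E_perm >= E_obs)) / (R + 1)
```

Because only the sample labels matter, each replicate reuses the same distance matrix and permutes its rows and columns through `ix`, which mirrors the role of the `ix` argument of `ksample.e`.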

The discoB method computes the between-sample disco statistic. For a one-way analysis, it is related to the original statistic as follows. In the above equation, the weights $n_i n_j/(n_i + n_j)$ are replaced with

$$\frac{n_i + n_j}{2N} \cdot \frac{n_i n_j}{n_i + n_j} = \frac{n_i n_j}{2N},$$

where $N$ is the total number of observations: $N = n_1 + \cdots + n_k$.
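The effect of the reweighting can be seen numerically. The sketch below uses hypothetical sample sizes and only base R; it compares the two sets of pairwise weights and confirms that their ratio is (n_i + n_j)/(2N).

```r
sizes <- c(10, 15, 25)               # hypothetical sample sizes
N <- sum(sizes)                      # total number of observations

pairs <- t(combn(length(sizes), 2))  # all index pairs with i < j
ni <- sizes[pairs[, 1]]
nj <- sizes[pairs[, 2]]

w_original <- ni * nj / (ni + nj)    # weights of the original statistic
w_discoB   <- ni * nj / (2 * N)      # weights of the discoB statistic

ratio <- w_discoB / w_original       # equals (ni + nj) / (2 * N)
```

With exactly two samples, N = n_1 + n_2, so the ratio is 1/2 and, under the weights stated above, the discoB statistic is half the original statistic.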

The discoF method is based on the disco F ratio, while the discoB method is based on the between sample component.

See also the disco and disco.between functions.

## Value

A list with class htest containing

| Component | Description |
|---|---|
| method | description of test |
| statistic | observed value of the test statistic |
| p.value | approximate p-value of the test |
| data.name | description of data |

eqdist.e returns the test statistic only.
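The components of the returned object can be read like those of any htest list. The following sketch assumes the energy package is installed and is skipped otherwise; the iris call is the one used in this page's examples.

```r
if (requireNamespace("energy", quietly = TRUE)) {
  set.seed(1234)
  res <- energy::eqdist.etest(iris[, 1:4], sizes = c(50, 50, 50), R = 199)

  class(res)       # class of the returned object
  res$method       # description of the test
  res$statistic    # observed value of the test statistic
  res$p.value      # approximate p-value from the R replicates
  res$data.name    # description of the data

  # Statistic only, without replicates
  e_only <- energy::eqdist.e(iris[, 1:4], sizes = c(50, 50, 50))
}
```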

## Note

The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object.
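Since the original k-sample statistic is defined as the sum of the pairwise e-distances, summing the dist object from edist should reproduce it. The sketch below assumes the energy package is installed and that edist is called with its default method; it is skipped otherwise.

```r
if (requireNamespace("energy", quietly = TRUE)) {
  x <- as.matrix(iris[, 1:4])
  sizes <- c(50, 50, 50)

  ed <- energy::edist(x, sizes = sizes)      # pairwise e-distances (dist object)
  E_sum <- sum(ed)                           # sum over the k(k-1)/2 pairs
  E_stat <- energy::eqdist.e(x, sizes = sizes)

  all.equal(E_sum, as.numeric(E_stat))       # expected to agree, per the definition
}
```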

## Author(s)

Maria L. Rizzo mrizzo @ bgsu.edu and Gabor J. Szekely

## References

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Rizzo, M. L. and Szekely, G. J. (2010) DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics, Vol. 4, No. 2, 1034-1055. http://dx.doi.org/10.1214/09-AOAS245

Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

## See Also

ksample.e, edist, disco, disco.between, energy.hclust.

## Examples

    library(energy)
    data(iris)

    ## test if the 3 varieties of iris data (d=4) have equal distributions
    eqdist.etest(iris[,1:4], c(50,50,50), R = 199)

    ## example that uses method="disco"
    x <- matrix(rnorm(100), nrow=20)
    y <- matrix(rnorm(100), nrow=20)
    X <- rbind(x, y)
    d <- dist(X)

    # should match edist default statistic
    set.seed(1234)
    eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)

    # comparison with edist (sizes must cover all 40 rows of d)
    edist(d, sizes=c(20, 20), distance=TRUE)

    # for comparison
    g <- as.factor(rep(1:2, c(20, 20)))
    set.seed(1234)
    disco(d, factors=g, distance=TRUE, R=199)

    # should match statistic in edist method="discoB", above
    set.seed(1234)
    disco.between(d, factors=g, distance=TRUE, R=199)