hhg.univariate.ind.stat: The independence test statistics for all partition sizes


View source: R/HHG_univariate.R

Description

These statistics are used in the omnibus distribution-free test of independence between two univariate random variables, as described in Heller et al. (2016).

Usage

hhg.univariate.ind.stat(x, y, variant = 'ADP', aggregation.type = 'sum',
  score.type = 'LikelihoodRatio', mmax = max(floor(sqrt(length(x))/2), 2),
  mmin = 2, w.sum = 0, w.max = 2, nr.atoms = nr_bins_equipartition(length(x)))

Arguments

x

a numeric vector with observed X values (tied observations are broken at random).

y

a numeric vector with observed Y values (tied observations are broken at random).

variant

a character string specifying the partition type, must be one of "ADP" (default), "DDP", "ADP-ML", "ADP-EQP", or "ADP-EQP-ML".

aggregation.type

a character string specifying the aggregation type, must be one of "sum" (default), "max", or "both".

score.type

a character string specifying the score type, must be one of "LikelihoodRatio" (default), "Pearson", or "both".

mmax

The maximum partition size of the ranked observations. The default is half the square root of the number of observations, rounded down, and never less than 2.

mmin

The minimum partition size of the ranked observations. The default is 2.

w.sum

The minimum number of observations in a partition, only relevant for type="Independence", aggregation.type="sum" and score.type="Pearson". The default value is 0.

w.max

The minimum number of observations in a partition, only relevant for type="Independence", aggregation.type="max" and score.type="Pearson". The default value is 2.

nr.atoms

For "ADP-EQP" and "ADP-EQP-ML" type tests, sets the number of possible split points in the data. The default value is the minimum of n and 60 + 0.5*sqrt(n).
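
The two size-dependent defaults can be checked by hand. The following is a minimal sketch in base R: default_mmax copies the expression from the Usage signature, while nr_atoms_bound_sketch is a hypothetical helper that evaluates the unrounded bound quoted above. How nr_bins_equipartition rounds that bound is not stated here, so no rounding is applied.

```r
# Default mmax, copied from the Usage signature.
default_mmax <- function(n) max(floor(sqrt(n) / 2), 2)

# Hypothetical helper (not part of HHG): the unrounded bound min(n, 60 + 0.5 * sqrt(n)).
# The package's rounding rule is not documented here, so none is applied.
nr_atoms_bound_sketch <- function(n) min(n, 60 + 0.5 * sqrt(n))

for (n in c(35, 100, 1000)) {
  cat(sprintf("n = %4d: default mmax = %2d, nr.atoms bound = %.2f\n",
              n, default_mmax(n), nr_atoms_bound_sketch(n)))
}
```

For n = 1000 the unrounded bound is about 75.8, which is consistent with the nr.atoms = 76 mentioned in the Examples section.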

Details

For each partition size m = mmin,…,mmax, the function computes the scores in each of the partitions (according to the score type) and aggregates all scores according to the aggregation type (see details in Heller et al., 2016). If the score type is one of "LikelihoodRatio" or "Pearson", and the aggregation type is one of "sum" or "max", then the computed statistic will be in statistic; otherwise the computed statistics will be in the appropriate subset of sum.chisq, sum.lr, max.chisq, and max.lr. Note that if the variant is "ADP", all partition sizes are computed together in O(N^4), so the score computational complexity is O(N^4). For "DDP" and mmax>4, the score computational complexity is O(N^4)*(mmax-mmin+1).

For the 'sum' aggregation type (default), the test statistic is the sum of log likelihood (or Pearson chi-squared) scores of all partitions of size m X m of the data, normalized by the number of partitions and the data size (thus being an estimator of the mutual information). For the 'max' aggregation type, the test statistic is the maximum log likelihood (or Pearson chi-squared) score achieved by a partition of the data of size m, normalized by the data size. For variant type "ADP-ML", the statistics calculated include not only the sum over m X m tables (symmetric tables, with the same number of cells on each axis), but also asymmetric tables (i.e. m X l tables).
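
To make the aggregation concrete, here is a brute-force sketch in base R for the smallest case, m = 2. It is not the package's algorithm: it enumerates every 2 X 2 partition of the ranked data, scores each table, and then averages the scores (divided by the sample size) or takes their maximum. The function names are hypothetical, the likelihood ratio score uses the conventional G statistic with a factor of 2, and the maximum is returned unnormalized; the package's exact scaling may differ, so the values need not match hhg.univariate.ind.stat.

```r
# Likelihood ratio (G) score of a contingency table; empty cells contribute 0.
lr_score <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  nonzero <- tab > 0
  2 * sum(tab[nonzero] * log(tab[nonzero] / expected[nonzero]))
}

# Pearson chi-squared score of a contingency table.
pearson_score <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Brute force over all 2 x 2 partitions of the ranked data (m = 2 only).
adp_2x2_sketch <- function(x, y, score = lr_score) {
  n <- length(x)
  rx <- rank(x, ties.method = "random")
  ry <- rank(y, ties.method = "random")
  scores <- matrix(0, n - 1, n - 1)
  for (i in seq_len(n - 1)) {        # split the x ranks into 1..i and (i+1)..n
    for (j in seq_len(n - 1)) {      # split the y ranks into 1..j and (j+1)..n
      tab <- matrix(c(sum(rx <= i & ry <= j), sum(rx > i & ry <= j),
                      sum(rx <= i & ry > j),  sum(rx > i & ry > j)), nrow = 2)
      scores[i, j] <- score(tab)
    }
  }
  # 'sum': average score over all partitions, divided by the sample size.
  # 'max': largest score over all partitions (left unnormalized in this sketch).
  list(sum = mean(scores) / n, max = max(scores))
}

set.seed(1)
x <- runif(30)
y <- (x - 0.5)^2 + rnorm(30, sd = 0.02)
adp_2x2_sketch(x, y)
adp_2x2_sketch(x, y, score = pearson_score)
```

The double loop visits (n-1)^2 tables, which is affordable only for m = 2 and small n; larger m requires the dynamic-programming approach of the package.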

Variant types "ADP-EQP" and "ADP-EQP-ML" are computationally efficient versions of "ADP" and "ADP-ML". EQP type variants reduce calculation time by summing over a subset of partitions, where a split between cells may be performed only every n/nr.atoms observations. This allows for a complexity of O(nr.atoms^4). These variants are only available with aggregation.type='sum'.
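
The restriction on split points can be pictured with a small base R sketch. It assumes that the ranks are grouped into nr.atoms consecutive atoms of near-equal size and that a split may fall only between atoms; eqp_atoms_sketch is a hypothetical helper, and the package's actual assignment of observations to atoms may differ.

```r
# Hypothetical helper (not part of HHG): assign each rank 1..n to one of
# nr_atoms consecutive, near-equal-sized atoms.
eqp_atoms_sketch <- function(n, nr_atoms) ceiling(seq_len(n) * nr_atoms / n)

n <- 1000
nr_atoms <- 100
atom_of_rank <- eqp_atoms_sketch(n, nr_atoms)

# A split is allowed only after the last rank of an atom.
candidate_splits <- which(diff(atom_of_rank) != 0)

length(candidate_splits)   # nr_atoms - 1 allowed split positions
n - 1                      # split positions available without the EQP restriction
```

Reducing the candidate positions from n - 1 to nr.atoms - 1 is what replaces N by nr.atoms in the complexity bound.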

For large data (n>100), it is recommended to use Fast.independence.test, which is an optimized version of the hhg.univariate.ind.stat and hhg.univariate.ind.combined.test tests.

Value

Returns a UnivariateStatistic class object, with the following entries:

statistic

The value of the computed statistic if the score type is one of "LikelihoodRatio" or "Pearson", and the aggregation type is one of "sum" or "max". In that case it is the corresponding one of sum.chisq, sum.lr, max.chisq, and max.lr.

sum.chisq

A vector of size mmax-mmin+1, where the m-mmin+1 entry is the average over all Pearson chi-squared statistics from all the m X m contingency tables considered, divided by the total number of observations.

sum.lr

A vector of size mmax-mmin+1, where the m-mmin+1 entry is the average over all likelihood ratio statistics from all the m X m contingency tables considered, divided by the total number of observations.

max.chisq

A vector of size mmax-mmin+1, where the m-mmin+1 entry is the maximum over all Pearson chi-squared statistics from all the m X m contingency tables considered.

max.lr

A vector of size mmax-mmin+1, where the m-mmin+1 entry is the maximum over all likelihood ratio statistics from all the m X m contingency tables considered.

type

"Independence"

stat.type

"Independence-Stat"

size

The sample size

score.type

The input score.type.

aggregation.type

The input aggregation.type.

mmin

The input mmin.

mmax

The input mmax.

additional

A vector with the input w.sum and w.max.

nr.atoms

The input nr.atoms.
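
Because entry m-mmin+1 of each vector corresponds to partition size m, it can help to label the entries before reading them. The sketch below uses a hypothetical helper, label_by_partition_size, which is not part of HHG. The call to hhg.univariate.ind.stat runs only if the package is installed, and it assumes that requesting "both" score and aggregation types fills the four vectors as described above.

```r
# Hypothetical helper (not part of HHG): name each entry by its partition size m.
label_by_partition_size <- function(stat_vector, mmin) {
  m_values <- seq(mmin, length.out = length(stat_vector))
  stats::setNames(stat_vector, paste0("m=", m_values))
}

if (requireNamespace("HHG", quietly = TRUE)) {
  set.seed(1)
  x <- runif(30)
  y <- (x - 0.5)^2 + rnorm(30, sd = 0.02)
  res <- HHG::hhg.univariate.ind.stat(x, y, score.type = "both",
                                      aggregation.type = "both", mmax = 4)
  # With mmin = 2 and mmax = 4, each vector should hold entries for m = 2, 3, 4.
  print(label_by_partition_size(res$sum.lr, res$mmin))
  print(label_by_partition_size(res$max.chisq, res$mmin))
}
```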

Author(s)

Barak Brill and Shachar Kaufman.

References

Heller, R., Heller, Y., Kaufman, S., Brill, B., & Gorfine, M. (2016). Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables. JMLR 17(29):1-54.

Brill, B. (2016). Scalable Non-Parametric Tests of Independence (master's thesis).

http://primage.tau.ac.il/libraries/theses/exeng/free/2899741.pdf

Examples

## Not run: 
N = 35
data = hhg.example.datagen(N, 'Parabola')
X = data[1,]
Y = data[2,]
plot(X,Y)


#I) Computing test statistics with default parameters (ADP statistic):

hhg.univariate.ADP.Likelihood.result = hhg.univariate.ind.stat(X,Y)

hhg.univariate.ADP.Likelihood.result

#II) Computing test statistics with summation over Data Derived Partitions (DDP),
#using Pearson scores and partition sizes up to 5:

hhg.univariate.DDP.Pearson.result = hhg.univariate.ind.stat(X,Y,variant = 'DDP',
  score.type = 'Pearson', mmax = 5)
hhg.univariate.DDP.Pearson.result

#III) Computing test statistics, for all M X L tables:
hhg.univariate.ADP.ML.Likelihood.result = hhg.univariate.ind.stat(X,Y,
variant='ADP-ML', mmax = 5)

hhg.univariate.ADP.ML.Likelihood.result

#IV) Computing test statistics using efficient variants (for large data sets):
#Note: for independence testing with n>100, Fast.independence.test is suggested
#rather than hhg.univariate.ind.stat.

N_Large = 1000
data_Large = hhg.example.datagen(N_Large, 'W')
X_Large = data_Large[1,]
Y_Large = data_Large[2,]
plot(X_Large,Y_Large)

hhg.univariate.ADP.EQP.Likelihood.result = hhg.univariate.ind.stat(X_Large, Y_Large,
variant='ADP-EQP', mmax = 20)

hhg.univariate.ADP.EQP.Likelihood.result

#Note that only nr.atoms=76 atoms are used, so only 75 possible cell split locations
#are taken into consideration when computing the sum over all possible log likelihood scores.
#This can be changed using the nr.atoms argument:

hhg.univariate.ADP.EQP.Likelihood.result = hhg.univariate.ind.stat(X_Large,Y_Large,
variant='ADP-EQP',mmax = 20, nr.atoms =100)

hhg.univariate.ADP.EQP.Likelihood.result

#V) Computing the efficient sum over all M X L tables:

hhg.univariate.ADP.EQP.ML.Likelihood.result = hhg.univariate.ind.stat(X_Large,Y_Large,
variant='ADP-EQP-ML',mmax = 5)

hhg.univariate.ADP.EQP.ML.Likelihood.result

## End(Not run)

HHG documentation built on Nov. 17, 2017, 7:07 a.m.