These statistics are used in the omnibus distribution-free test of independence between two univariate random variables, as described in Heller et al. (2016).
hhg.univariate.ind.stat(x, y, variant = 'ADP', aggregation.type = 'sum',
  score.type = 'LikelihoodRatio', mmax = max(floor(sqrt(length(x))/2), 2),
  mmin = 2, w.sum = 0, w.max = 2, nr.atoms = nr_bins_equipartition(length(x)))
x |
a numeric vector with observed |
y |
a numeric vector with observed |
variant |
a character string specifying the partition type, must be one of |
aggregation.type |
a character string specifying the aggregation type, must be one of |
score.type |
a character string specifying the score type, must be one of |
mmax |
The maximum partition size of the ranked observations. The default is half the square root of the number of observations, rounded down, and at least 2. |
mmin |
The minimum partition size of the ranked observations. The default is 2. |
w.sum |
The minimum number of observations in a partition, only relevant for |
w.max |
The minimum number of observations in a partition, only relevant for |
nr.atoms |
For |
For each partition size m = mmin,…,mmax, the function computes the score of each partition (according to the score type) and aggregates all scores according to the aggregation type (see details in Heller et al., 2016). If the score type is one of "LikelihoodRatio" or "Pearson", and the aggregation type is one of "sum" or "max", then the computed statistic is returned in statistic; otherwise the computed statistics are returned in the appropriate subset of sum.chisq, sum.lr, max.chisq, and max.lr. Note that if the variant is "ADP", all partition sizes are computed together, so the computational complexity of the scores is O(N^4). For "DDP" and mmax>4, the computational complexity of the scores is O(N^4)*(mmax-mmin+1).
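The per-partition score can be illustrated with a short base R sketch. It is not the package's implementation: the helper name partition_scores and the example table are hypothetical, and the likelihood ratio score is assumed here to be the sum of observed * log(observed / expected) over the cells of the table (without the factor 2 of the classical G statistic).

```r
# Illustrative sketch only, not the HHG implementation.
# Likelihood ratio and Pearson scores for a single contingency table,
# i.e. for one partition of the ranked observations.
partition_scores <- function(tab) {
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n  # counts under independence
  nonzero <- tab > 0                                 # 0 * log(0) is treated as 0
  lr <- sum(tab[nonzero] * log(tab[nonzero] / expected[nonzero]))
  pearson <- sum((tab - expected)^2 / expected)
  c(lr = lr, pearson = pearson)
}

# Hypothetical 2 x 2 table of cell counts for 20 observations
tab <- matrix(c(8, 2, 2, 8), nrow = 2)
scores <- partition_scores(tab)
scores
```

Both scores are zero when the observed counts equal the counts expected under independence, and grow as the table departs from independence.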
For the 'sum' aggregation type (default), the test statistic is the sum of log likelihood (or Pearson chi-square) scores of all partitions of size m X m of the data, normalized by the number of partitions and the data size (thus being an estimator of the mutual information). For the 'max' aggregation type, the test statistic is the maximum log likelihood (or Pearson chi-square) score achieved by a partition of the data of size m, normalized by the data size. For variant type "ADP-ML", the statistics calculated include not only the sum over m X m tables (symmetric tables, with the same number of cells on each axis), but also asymmetric tables (i.e., m X l tables).
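The two aggregation types can be sketched by brute force for m = 2, where every partition is defined by one split on the ranks of x and one split on the ranks of y. This is a minimal base R illustration under the normalization described above; the function names lr_score and aggregate_2x2 are hypothetical, and the package's own enumeration and normalization may differ in detail.

```r
# Illustrative brute-force sketch for m = 2; not the HHG implementation.
lr_score <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  nonzero <- tab > 0                       # 0 * log(0) is treated as 0
  sum(tab[nonzero] * log(tab[nonzero] / expected[nonzero]))
}

aggregate_2x2 <- function(x, y) {
  n <- length(x)
  rx <- rank(x)
  ry <- rank(y)
  scores <- matrix(0, n - 1, n - 1)
  for (i in 1:(n - 1)) {                   # split between ranks i and i + 1 of x
    for (j in 1:(n - 1)) {                 # split between ranks j and j + 1 of y
      tab <- table(factor(rx <= i, levels = c(TRUE, FALSE)),
                   factor(ry <= j, levels = c(TRUE, FALSE)))
      scores[i, j] <- lr_score(tab)
    }
  }
  # 'sum': average over all partitions, divided by the sample size
  # 'max': largest single score, divided by the sample size
  c(sum = mean(scores) / n, max = max(scores) / n)
}

set.seed(1)
x <- rnorm(30)
y <- x^2 + rnorm(30, sd = 0.1)
aggregate_2x2(x, y)

# For a perfectly monotone relationship the 'max' statistic reaches log(2),
# the largest mutual information a 2 x 2 table can carry.
aggregate_2x2(1:10, 1:10)["max"]
```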
Variant types "ADP-EQP" and "ADP-EQP-ML" are the computationally efficient versions of "ADP" and "ADP-ML". EQP-type variants reduce calculation time by summing over a subset of partitions, in which a split between cells may be placed only every n/nr.atoms observations. This gives a complexity of O(nr.atoms^4). These variants are only available for aggregation.type=='sum'.
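The effect of nr.atoms on the set of candidate splits can be sketched as follows. The helper eqp_split_ranks is hypothetical and the rounding rule (floor) is an assumption; the package may place atom boundaries slightly differently. The point of the sketch is only the count: nr.atoms atoms leave nr.atoms - 1 admissible split locations, compared with n - 1 when every gap between consecutive ranks is allowed.

```r
# Illustrative sketch only: ranks after which a cell boundary may be placed
# when the n ranked observations are grouped into nr.atoms atoms.
eqp_split_ranks <- function(n, nr.atoms) {
  unique(floor(seq_len(nr.atoms - 1) * n / nr.atoms))
}

n <- 1000
splits <- eqp_split_ranks(n, nr.atoms = 76)
length(splits)   # admissible split locations with 76 atoms
n - 1            # split locations when every gap between ranks is allowed
```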
For large data (n>100), it is recommended to use Fast.independence.test, which is an optimized version of the hhg.univariate.ind.stat and hhg.univariate.ind.combined.test tests.
Returns a UnivariateStatistic class object, with the following entries:
statistic |
The value of the computed statistic if the score type is one of |
sum.chisq |
A vector of size mmax-mmin+1, where the m-mmin+1 entry is the average over all Pearson chi-squared statistics from all the m X m contingency tables considered, divided by the total number of observations. |
sum.lr |
A vector of size mmax-mmin+1, where the m-mmin+1 entry is the average over all LikelihoodRatio statistics from all the m X m contingency tables considered, divided by the total number of observations. |
max.chisq |
A vector of size mmax-mmin+1, where the m-mmin+1 entry is the maximum over all Pearson chi-squared statistics from all the m X m contingency tables considered. |
max.lr |
A vector of size mmax-mmin+1, where the m-mmin+1 entry is the maximum over all LikelihoodRatio statistics from all the m X m contingency tables considered. |
type |
"Independence" |
stat.type |
"Independence-Stat" |
size |
The sample size |
score.type |
The input |
aggregation.type |
The input |
mmin |
The input |
mmax |
The input |
additional |
A vector with the input |
nr.atoms |
The input |
Barak Brill and Shachar Kaufman.
Heller, R., Heller, Y., Kaufman, S., Brill, B., & Gorfine, M. (2016). Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables. JMLR 17(29):1-54. https://www.jmlr.org/papers/volume17/14-441/14-441.pdf
Brill, B. (2016). Scalable Non-Parametric Tests of Independence (master's thesis). http://primage.tau.ac.il/libraries/theses/exeng/free/2899741.pdf
## Not run:
N = 35
data = hhg.example.datagen(N, 'Parabola')
X = data[1,]
Y = data[2,]
plot(X,Y)
#I) Computing test statistics, with default parameters (ADP statistic):
hhg.univariate.ADP.Likelihood.result = hhg.univariate.ind.stat(X,Y)
hhg.univariate.ADP.Likelihood.result
#II) Computing test statistics, with summation over Data Derived Partitions (DDP),
#using Pearson scores, and partition sizes up to 5:
hhg.univariate.DDP.Pearson.result = hhg.univariate.ind.stat(X,Y,variant = 'DDP',
score.type = 'Pearson', mmax = 5)
hhg.univariate.DDP.Pearson.result
#III) Computing test statistics, for all M X L tables:
hhg.univariate.ADP.ML.Likelihood.result = hhg.univariate.ind.stat(X,Y,
variant='ADP-ML', mmax = 5)
hhg.univariate.ADP.ML.Likelihood.result
#IV) Computing test statistics, using efficient variants (for large data sets):
#Note: for independence testing with n>100, Fast.independence.test is suggested
#rather than hhg.univariate.ind.stat.
N_Large = 1000
data_Large = hhg.example.datagen(N_Large, 'W')
X_Large = data_Large[1,]
Y_Large = data_Large[2,]
plot(X_Large,Y_Large)
hhg.univariate.ADP.EQP.Likelihood.result = hhg.univariate.ind.stat(X_Large
,Y_Large,variant='ADP-EQP', mmax = 20)
hhg.univariate.ADP.EQP.Likelihood.result
#Note that only nr.atoms=76 atoms are used: only 75 possible cell split locations are
#taken into consideration when computing the sum over all possible log likelihood scores.
#This can be changed using the nr.atoms argument:
hhg.univariate.ADP.EQP.Likelihood.result = hhg.univariate.ind.stat(X_Large,Y_Large,
variant='ADP-EQP',mmax = 20, nr.atoms =100)
hhg.univariate.ADP.EQP.Likelihood.result
#V) Computing the efficient sum over all MXL tables:
hhg.univariate.ADP.EQP.ML.Likelihood.result = hhg.univariate.ind.stat(X_Large,Y_Large,
variant='ADP-EQP-ML',mmax = 5)
hhg.univariate.ADP.EQP.ML.Likelihood.result
## End(Not run)