testHSIC | R Documentation |
testHSIC
allows to test independence among all input-output pairs (Xi, Y)
after a preliminary sensitivity analysis based on HSIC indices. testHSIC
takes an object of class sensiHSIC
(produced by a prior call to the function sensiHSIC
that estimates HSIC indices) and it returns the estimated p-values after testing independence among all input-output pairs. For each input-output pair, having access to the p-value helps the user decide whether the null hypothesis H0
: "Xi
and Y
are independent" must be accepted or rejected. If the kernels selected in sensiHSIC
are all characteristic, H0
can be rewritten "HSIC(Xi, Y)=0
" and this paves the way to several test procedures.
Depending on the sample size and the chosen test statistic (either a U-statistic or a V-statistic), there are up to four different methods to test H0
. The asymptotic test is recommended when the sample size n
is around a few hundreds (or more). When n
is smaller, a permutation-based test must be considered instead. As a general rule, permutation-based tests can always be applied but a much heavier computational load is to be expected. However, if HSIC indices were initially estimated with V-statistics, the Gamma test is a parametric method that offers an enticing tradeoff.
testHSIC(sensi, test.method = "Asymptotic", B = 3000,
seq.options = list(criterion = "screening", alpha = 0.05,
Bstart = 200, Bfinal = 5000, Bbatch = 100,
Bconv = 200, graph = TRUE) )
## S3 method for class 'testHSIC'
print(x, ...)
## S3 method for class 'testHSIC'
plot(x, ylim = c(0, 1), err, ...)
sensi |
An object of class |
test.method |
A string specifying the numerical procedure used to estimate the p-values of the HSIC-based independence tests. Available procedure include
|
B |
Number of random permutations carried out on the output samples before the non-parametric estimation of p-values. Only relevant if |
seq.options |
A list of options guiding the sequential procedure.
Only relevant if
|
x |
An object of class |
ylim |
A vector of two values specifying the y-coordinate plotting limits. |
err |
A scalar value (between |
... |
Additional options. |
For a given input-output pair of variables, the Hilbert-Schmidt independence criterion (HSIC) is a dissimilarity measure between the joint bivariate distribution and the product of marginal distributions. Dissimilarity between those two distributions is measured through the squared norm of the distance between their respective embeddings in a reproducing kernel Hilbert space (RKHS) that directly depends on the selected input kernel Ki
and the selected output kernel KY
.
It must always be kept in mind that this criterion allows to detect independence within the pair (Xi, Y)
provided that the two kernels are characteristic.
If both kernels are characteristic, H0
: "Xi
and Y
are independent" is equivalent to H0
: "HSIC(Xi, Y)=0
" and any estimator of HSIC(Xi, Y)
emerges as a relevant test statistic.
If they are not, testing H0
: "HSIC(Xi, Y)=0
" is no longer sufficient for testing H0
: "Xi
and Y
are independent".
The reader is referred to Fukumizu et al. (2004) for the mathematical definition of a characteristic kernel and to Sriperumbur et al. (2010) for an overview of the major related results.
Responsability for kernel selection is left to the user while calling the function sensiHSIC
. Let us simply recall that:
The Gaussian kernel, the exponential kernel, the Matern 3/2
kernel and the Matern 5/2
kernel (all defined on R^2
) are characteristic. They remain characteristic when they are restricted to a compact domain D
within R^2
.
The transformed versions of the four abovementioned kernels (all defined on [0,1]^2
) are characteristic.
All Sobolev kernels (defined on [0,1]^2
) are characteristic.
The categorical kernel (defined on any discrete probability space) is characteristic.
The test statistic for the pair (Xi, Y)
is either the U-statistic or the V-statistic associated to HSIC(Xi, Y)
.
If a V-statistic was used in sensiHSIC
, four different test methods can be considered.
The asymptotic test can be used if the sample size n
is large enough (at least a hundred of samples). The asymptotic distribution of the test statistic is approximated by a Gamma distribution whose parameters are estimated with the method of moments. See Gretton et al. (2007) for more details about how to estimate the first two moments of the asymptotic Gamma distribution.
The permutation-based test is more expensive in terms of computational cost but it can be used whatever the sample size n
is. The initial output samples (stored in the object of class sensiHSIC
) are randomly permuted B
times and the test statistic is recomputed as many times. This allows to simulate B
observations of the test statistic under H0
and to estimate the p-value in a non-parametric way. See Meynaoui (2019) for more details on how to correctly estimate the p-value in order to preserve the expected level of the test.
The sequential permutation-based test is a goal-oriented variant of the previous test. The main idea is to reduce the computational cost by stopping permutations as soon as the estimation of the p-value has sufficiently converged so that it can be compared to a reference threshold or be given a final ranking. See El Amri and Marrel (2022) for more details on how to implement this sequential approach for the three stopping criteria (namely "ranking"
, "screening"
or "both"
).
The Gamma test is a parametric alternative to permutation-based tests when n
is not large enough to resort to the asymptotic test. The permutation-based test reveals the test statistic under H0
follows a unimodal distribution having significant positive skewness. Thus, it seems quite natural to estimate the p-value with a Gamma distribution, especially in view of the fact that the asymptotic distribution is properly approximated by this parametric family. See El Amri and Marrel (2021) for more details on how to estimate the parameters of the Gamma distribution with the method of moments. In particular, the first two moments of the test statistic under H0
are computed thanks to the formulas that were initially provided in Kazi-Aoual et al. (1995).
If a U-statistic was used in sensiHSIC
, the estimated value of HSIC(Xi,Y)
may be negative.
The asymptotic test can no longer be conducted with a Gamma distribution (whose support is limited to [0,+\infty[
). It is replaced by a Pearson III distribution (which is a left-shifted Gamma distribution).
The permutation-based test and the sequential permutation-based test can be applied directly.
The Gamma test has no longer any theoretical justification.
In Marrel and Chabridon (2021), HSIC indices were adapted to target sensitivity analysis (thus becoming T-HSIC indices) and to conditional sensitivity analysis (thus becoming C-HSIC indices). Tests of independence can still be useful after estimating T-HSIC indices or C-HSIC indices.
For T-HSIC indices, the null hypothesis is H0
: "Xi
and w(Y)
are independent" where w
is the weight function selected in target
and passed to the function sensiHSIC
. Everything works just as for basic HSIC indices (apart from the fact that w
is applied on the original output variable Y
). Available test methods include "Asymptotic"
, "Permutation"
, "Seq_Permutation"
and "Gamma"
(for V-statistics only).
For C-HSIC indices, the null hypothesis is H0
: "Xi
and Y
are independent if the event described in cond
occurs". In this specific context, testing conditional independence is only relevant if the weight function is an indicator function. For this reason, if conditional independence has to be tested, the user must select type="indicTh"
in cond
while calling the function sensiHSIC
. Let us recall that only V-statistic estimators can be used for C-HSIC indices. As a result, available test methods include "Asymptotic"
, "Permutation"
, "Seq_Permutation"
and "Gamma"
.
testHSIC
returns a list of class "testHSIC"
. It contains test.method
, B
(for the permutation-based test), seq.options
(for the sequential permutation-based test) and the following objects:
call |
The matched call. |
pval |
The estimated p-values after testing independence for all input-output pairs. |
prop |
A vector of two strings.
|
family |
Only if |
param |
Only if |
Hperm |
Only if |
paths |
Only if |
Sebastien Da Veiga, Amandine Marrel, Anouar Meynaoui, Reda El Amri and Gabriel Sarazin.
El Amri, M. R. and Marrel, A. (2022), Optimized HSIC-based tests for sensitivity analysis: application to thermalhydraulic simulation of accidental scenario on nuclear reactor, Quality and Reliability Engineering International, 38(3), 1386-1403.
El Amri, M. R. and Marrel, A. (2021), More powerful HSIC-based independence tests, extension to space-filling designs and functional data. https://cea.hal.science/cea-03406956/
Fukumizu, K., Bach, F. R. and Jordan, M. I. (2004), Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces, Journal of Machine Learning Research, 5(Jan), 73-99.
Gretton, A., Fukumizu, K., Teo, C., Song, L., Scholkopf, B. and Smola, A. (2007), A kernel statistical test of independence, Advances in Neural Information Processing Systems, 20.
Kazi-Aoual, F., Hitier, S., Sabatier, R. and Lebreton, J. D. (1995), Refined approximations to permutation tests for multivariate inference, Computational Statistics & Data Analysis, 20(6), 643-656.
Marrel, A. and Chabridon, V. (2021), Statistical developments for target and conditional sensitivity analysis: application on safety studies for nuclear reactor, Reliability Engineering & System Safety, 214, 107711.
Meynaoui, A. (2019), New developments around dependence measures for sensitivity analysis: application to severe accident studies for generation IV reactors (Doctoral dissertation, INSA de Toulouse).
Sriperumbudur, B., Fukumizu, K. and Lanckriet, G. (2010), On the relation between universality, characteristic kernels and RKHS embedding of measures, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 773-780). JMLR Workshop and Conference Proceedings.
sensiHSIC, weightTSA
# Test case: the Ishigami function.
n <- 20 # very few input-output samples
p <- 3 # nb of input variables
########################################
### PRELIMINARY SENSITIVITY ANALYSIS ###
########################################
X <- matrix(runif(n*p), n, p)
sensi <- sensiHSIC(model=ishigami.fun, X)
print(sensi)
plot(sensi)
title("GSA for the Ishigami function")
#############################
### TESTS OF INDEPENDENCE ###
#############################
test.asymp <- testHSIC(sensi)
test.perm <- testHSIC(sensi, test.method="Permutation")
test.seq.screening <- testHSIC(sensi, test.method="Seq_Permutation")
test.seq.ranking <- testHSIC(sensi, test.method="Seq_Permutation",
seq.options=list(criterion="ranking"))
test.seq.both <- testHSIC(sensi, test.method="Seq_Permutation",
seq.options=list(criterion="both"))
test.gamma <- testHSIC(sensi, test.method="Gamma")
# comparison of p-values
res <- rbind( t(as.matrix(test.asymp$pval)), t(as.matrix(test.perm$pval)),
t(as.matrix(test.seq.screening$pval)), t(as.matrix(test.seq.ranking$pval)),
t(as.matrix(test.seq.both$pval)), t(as.matrix(test.gamma$pval)) )
rownames(res) <- c("asymp", "perm", "seq_perm_screening",
"seq_perm_ranking", "seq_perm_both", "gamma")
res
# Conclusion: n is too small for the asymptotic test.
# Take n=200 and all four test methods will provide very close p-values.
#####################
### VISUALIZATION ###
#####################
# simulated values of HSIC indices under H0 (random permutations)
Hperm <- t(unname(test.perm$Hperm))
for(i in 1:p){
# histogram of the test statistic under H0 (random permutations)
title <- paste0("Histogram of S", i, " = HSIC(X", i, ",Y)")
hist(Hperm[,i], probability=TRUE,
nclass=70, main=title, xlab="", ylab="", col="cyan")
# asymptotic Gamma distribution
shape.asymp <- test.asymp$param[i, "shape"]
scale.asymp <- test.asymp$param[i, "scale"]
xx <- seq(0, max(Hperm[,i]), length.out=200)
dens.asymp <- dgamma(xx, shape=shape.asymp, scale=scale.asymp)
lines(xx, dens.asymp, lwd=2, col="darkorchid")
# finite-sample Gamma distribution
shape.perm <- test.gamma$param[i, "shape"]
scale.perm <- test.gamma$param[i, "scale"]
dens.perm <- dgamma(xx, shape=shape.perm, scale=scale.perm)
lines(xx, dens.perm, lwd=2, col="blue")
all.cap <- c("Asymptotic Gamma distribution", "Finite-sample Gamma distribution")
all.col <- c("darkorchid", "blue")
legend("topright", legend=all.cap, col=all.col, lwd=2, y.intersp=1.3)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.