sensiHSIC: Sensitivity Indices based on the Hilbert-Schmidt Independence...

View source: R/sensiHSIC.R

sensiHSICR Documentation

Sensitivity Indices based on the Hilbert-Schmidt Independence Criterion (HSIC)

Description

sensiHSIC allows to conduct global sensitivity analysis (GSA) in many different contexts thanks to several sensitivity measures based on the Hilbert-Schmidt independence criterion (HSIC). The so-called HSIC sensitivity indices depend on the kernels which are affected to the input variables Xi as well as on the kernel which is affected to the output object Y. For each random entity, a reproducing kernel Hilbert space (RKHS) is associated to the chosen kernel and allows to represent probability distributions in an appropriate function space. The influence of Xi on Y is then measured through the distance between the joint probability distribution (true impact of Xi on Y in the numerical model) and the product of marginal distributions (no impact of Xi on Y) after embedding those distributions into a bivariate RKHS. Such a GSA approach has three main advantages:

  • The input variables Xi may be correlated.

  • Any kind of mathematical object is supported (provided that a kernel function is available).

  • Accurate estimation is possible even in presence of very few data (no more than a hundred of input-output samples).

In sensiHSIC, each input variable Xi is expected to be scalar (either discrete or continous). On the contrary, a much wider collection of mathematical objects are supported for the output variable Y. In particular, Y may be:

  • A scalar output (either discrete or continous). If so, one single kernel family is selected among the kernel collection.

  • A low-dimensional vector output. If so, a kernel is selected for each output variable and the final output kernel is built by tensorization.

  • A high-dimensional vector output or a functional output. In this case, the output data must be seen as time series observations. Three different methods are proposed.

    1. A preliminary dimension reduction may be performed. In order to achieve this, a principal component analysis (PCA) based on the empirical covariance matrix helps identify the first terms of the Kharunen-Loeve expansion. The final output kernel is then built in the reduced subspace where the functional data are projected.

    2. The dynamic time warping (DTW) algorithm may be combined with a translation-invariant kernel. The resulting DTW-based output kernel is well-adapted to measure similarity between two given time series.

    3. The global alignment kernel (GAK) may be directly applied on the functional data. Unlike the DTW kernel, it was specifically designed to deal with time series and to measure the impact of one input variable on the shape of the output curve.

Many variants of the original HSIC indices are now available in sensiHSIC.

  • Normalized HSIC indices (R2-HSIC)
    The original HSIC indices defined in Gretton et al. (2005) may be hard to interpret because they do not admit a universal upper bound. A first step to overcome this difficulty was enabled by Da Veiga (2015) with the definition of the R2-HSIC indices. The resulting sensitivity indices can no longer be greater than 1.

  • Target HSIC indices (T-HSIC)
    They were thought by Marrel and Chabridon (2021) to meet the needs of target sensitivity analysis (TSA). The idea is to measure the impact of each input variable Xi on a specific part of the output distribution (for example the upper tail). To achieve this, a weight function w is applied on Y before computing HSIC indices.

  • Conditional HSIC indices (C-HSIC)
    They were thought by Marrel and Chabridon (2021) to meet the needs of conditional sensitivity analysis (CSA). The idea is to measure the impact of each input variable Xi on Y when a specific event occurs. This conditioning event is detected on the output variable Y and its amplitude is taken into account thanks to a weight function w.

  • HSIC-ANOVA indices
    To improve the interpretability of HSIC indices, Da Veiga (2021) came up with an ANOVA-like decomposition that allows to establish a strict separation of main effects and interaction effects in the HSIC paradigm. The first-order and total-order HSIC-ANOVA indices are then defined just as the first-order and total-order Sobol' indices. However, this framework only holds if two major assumptions are verified: the input variables Xi must be mutually independent and all input kernels must belong to the very restrained class of ANOVA kernels.

As most sensitivity measures, HSIC indices allow to rank the input variables Xi according to the influence they have on the output variable Y. They can also be used for a screening purpose, that is to distinguish between truly influential input variables and non-influential input variables. The user who is interested in this topic is invited to consult the documentation of the function testHSIC.

Usage

sensiHSIC(model = NULL, X, target = NULL, cond = NULL, 
          kernelX = "rbf", paramX = NA,
          kernelY = "rbf", paramY = NA,
          estimator.type = "V-stat",
          nboot = 0, conf = 0.95,
          anova = list(obj = "no", is.uniform = TRUE),
          sensi = NULL, 
          save.GM = list(KX = TRUE, KY = TRUE), ...)
          
## S3 method for class 'sensiHSIC'
tell(x, y = NULL, ...)

## S3 method for class 'sensiHSIC'
print(x, ...)

## S3 method for class 'sensiHSIC'
plot(x, ylim = c(0, 1), ...)

Arguments

model

A function, or a statistical model with a predict method. It defines the input-output model that needs to be studied.

X

A n-by-p matrix containing all input samples. It comprises n joint observations of the p input variables.

  • If the user is only wanting to estimate HSIC indices or R2-HSIC indices, the input variables can be correlated.

  • If the user is also wanting to estimate HSIC-ANOVA indices, the input variables have to be mutually independent.

target

A list of options to perform TSA. At least, target must contain an option named "c". For other options, there exist default assignments.

  • type is a string specifying the weight function. Available choices include "indicTh", "zeroTh", "logistic" and "exp1side". Default value is "exp1side".

    • "indicTh" and "zeroTh" only depend on a threshold parameter.

    • "logistic" and "exp1side" depend on both a threshold parameter and a smoothness parameter.

  • c is a scalar value specifying the threshold parameter.

  • upper is a boolean indicating whether the target region is located above (TRUE) or below (FALSE) the threshold parameter c. Only relevant when type is "indicTh", "zeroTh" or "exp1side". Default value is TRUE.

  • param is a scalar value specifying the smoothness parameter. Only relevant when type is "logistic" or "exp1side". Default value is 1.

cond

A list of options to perform CSA. At least, cond must contain an option named "c". For other options, there exist default assignments.

  • type is a string specifying the weight function. Available choices include "indicTh", "zeroTh", "logistic" and "exp1side". Default value is "exp1side".

    • "indicTh" and "zeroTh" only depend on a threshold parameter.

    • "logistic" and "exp1side" depend on both a threshold parameter and a smoothness parameter.

  • c is a scalar value specifying the threshold parameter.

  • upper is a boolean indicating whether the conditioning region is located above (TRUE) or below (FALSE) the threshold parameter c. Only relevant when type is "indicTh", "zeroTh" or "exp1side". Default value is TRUE.

  • param is a scalar value specifying the smoothness parameter. Only relevant if type is "logistic" or "exp1side". Default value is 1.

kernelX

A string or a vector of p strings that specifies how to choose input kernels.

  • If only one string is provided, the associated kernel is affected to all inputs.

  • For dimension-wise kernel selection, a vector of p strings must be provided.

For each input variable, available choices include "categ" (categorical kernel), "dcov" (covariance kernel of the fractional Brownian motion), "invmultiquad" (inverse multiquadratic kernel), "laplace" (exponential kernel), "linear" (dot-product kernel), "matern3" (Matern 3/2 kernel), "matern5" (Matern 5/2 kernel), "raquad" (rationale quadratic kernel), "rbf" (Gaussian kernel), "sobolev1" (Sobolev kernel with smoothness parameter r=1) and "sobolev2" (Sobolev kernel with smoothness parameter r=2).

In addition, let us assume that all input variables are uniformly distributed on [0,1]. Under this assumption, the kernels "laplace", "matern3", "matern5" and "rbf" can be easily transformed into ANOVA kernels. The resulting kernels are respectively called "laplace_anova", "matern3_anova", "matern5_anova" and "rbf_anova".

  • One-parameter kernels: "categ", "dcov", "invmultiquad", "laplace", "laplace_anova", "matern3", "matern3_anova", "matern5", "matern5_anova", "raquad", "rbf" and "rbf_anova".

  • Parameter-free kernels: "linear", "sobolev1" and "sobolev2".

paramX

A scalar value or a vector of p values with input kernel parameters.

  • If paramX=NA, input kernel parameters are computed automatically with rules of thumb.

  • If paramX is a scalar value, it is affected to all input kernels.

  • For dimension-wise kernel parametrization, a vector of p values must be provided. If kernelX combines one-parameter kernels and parameter-free kernels, NA must be specified for parameter-free kernels.

kernelY

A string, a vector of q strings or a list of options that specifies how to construct the output kernel. Regardless of its mathematical nature, the model output must be envisioned as a q-dimensional random vector.

To deal with a scalar output or a low-dimensional vector output, it is advised to select one kernel per output dimension and to tensorize all selected kernels. In this case, kernelY must be a string or a vector of q strings.

  • If only one string is provided, the associated kernel is repeated q times.

  • For dimension-wise kernel selection, a vector of q strings must be provided.

Have a look at kernelX for an exhaustive list of available kernels.

To deal with a high-dimensional vector output or a functional output, it is advised to reduce dimension or to use a dedicated kernel. In this case, kernelY must be specified as a list of options. At least, kernelY must contain an option named "method". For other options, there exist default assignments.

  • method is a string indicating the strategy used to construct the output kernel. Available choices include "PCA" (dimension reduction through principal component analysis), "DTW" (dynamic type warping) and "GAK" (global alignment kernel).

  1. If method="PCA", the following options may also be specified:

    • data.centering is a boolean indicating whether the input samples must be centered before performing the preliminary PCA. Default value is TRUE.

    • data.scaling is a boolean indicating whether the input samples must be scaled before performing the preliminary PCA. Default value is TRUE.

    • fam is a string specifying the input kernel which is applied on principal components. Available choices only include "dcov", "invmultiquad", "laplace", "linear", "matern3", "matern5", "raquad" and "rbf". Default value is "rbf".

    • expl.var is a scalar value (between 0 and 1) specifying the expected percentage of output variance that must be explained by PCA. Default value is 0.95.

    • PC is the expected number of principal components in PCA. Default value is NA.

    • combi is a string indicating how the final output kernel is built in the reduced subspace. Available options include "sum" or "prod". The chosen kernel in fam is applied on all principal components before summation (if "sum") or tensorization (if "prod").

    • position is a string indicating whether weights have to be involved in the construction of the final output kernel in the reduced subspace. Available choices include "nowhere" (no weights), "intern" (weights applied on principal components) or "extern" (weights applied on kernels). Default value is "intern".

    Remark: expl.var and PC are conflicting options. Only one of both needs to be specified and the other one must be set to NA. If both are specified, expl.var is prioritized. If both are set to NA, expl.var is then set to its default value.

  2. If method="DTW", the following option may also be specified:

    • fam is a string specifying the translation-invariant kernel which is combined with DTW. Available choices only include "invmultiquad", "laplace", "matern3", "matern5", "raquad" and "rbf". Default value is "rbf".

  3. If method="GAK", there is no other option to specify.

paramY

A scalar value or a vector of values with output kernel parameters.

  • If paramY=NA, output kernel parameters are computed automatically with rules of thumb.

In other cases, paramY must be specified in agreement with kernelY.

Case 1: kernelY is a string or a vector of q strings.

paramY must be a scalar value or a vector of q values with output kernel parameters.

  • If paramY is a scalar value, it is affected to all output kernels.

  • For dimension-wise kernel parametrization, a vector of q values must be provided. If kernelY combines one parameter kernels and parameter-free kernels, NA must be specified for parameter-free kernels.

Case 2: kernelY is a list of options with method="PCA".

paramY must be set to NA because the parameters involved in the final output kernel are computed automatically. Their optimal tuning depends on the reduced subspace given by PCA.

Case 3: kernelY is a list of options with method="DTW".

paramY must be set to NA.

Case 4: kernelY is a list of options with method="GAK".

paramY must be a vector of 2 values. If the user only wants to specify one parameter, the other one must be set to NA. The two parameters correspond to the arguments sigma and window.size in the function gak from the package dtwclust. However, automatical computation (specified by paramY=NA) is strongly advised for this kind of output kernel.

estimator.type

A string specifying the kind of estimator used for HSIC indices. Available choices include "U-stat" (U-stastics) and "V-stat" (V-statistics). U-statistics are unbiased estimators. V-statistics are biased estimators but they become unbiased asymptotically. In the specific case of HSIC indices, V-statistics are non-negative estimators and they offer more flexibility for further test procedures (see testHSIC). Both kinds of estimators can be computed with complexity O(n^2) where n denotes the sample size.

nboot

Number of bootstrap replicates.

conf

A scalar value (between 0 and 1) specifying the level of confidence intervals.

anova

A list of parameters to achieve an ANOVA-like decomposition based on HSIC indices. At least, anova must contain an option named "obj". For other options, there exist default assignments.

  • obj is a string specifying which kinds of HSIC-ANOVA indices are expected. Available choices include "no" (anova is disabled), "FO" (first-order only), "TO" (total-order only) and "both" (first-order and total-order).

  • is.uniform is a boolean indicating whether the samples stored in X come from random variables that are uniformly distributed on [0,1]. Let us recall that HSIC-ANOVA indices can only be computed by means of ANOVA kernels. Among available kernels, only "laplace_anova", "matern3_anova", "matern5_anova", "rbf_anova", "sobolev1" and "sobole2" verify this constraint (provided that all input variables are uniformly distributed on [0,1]).

    • If is.uniform=TRUE, it is checked that each input data stored in X actually lies in [0,1]. If this condition is not verified, an error is returned.

    • If is.uniform=FALSE, non-parametric rescaling (based on empirical distribution functions) is operated.

sensi

An object of class "sensiHSIC" resulting from a prior call to the hereby function. If an object of class "sensiHSIC" is indeed provided, the following happens:

  • If sensi contains an object named "KX", it is extracted from sensi and the input Gram matrices (required to estimate HSIC indices) are not built from X, kernelX and paramX.

  • If sensi contains an object named "KY", it is extracted from sensi and the output Gram matrix (required to estimate HSIC indices) is not built from model, kernelY and paramY.

save.GM

A list of parameters indicating whether Gram matrices have to be saved. The list save.GM must contain options named "KX" and "KY".

  • KX is a boolean indicating whether the input Gram matrices have to be saved.

  • KY is a boolean indicating whether the output Gram matrix has to be saved.

x

An object of class "sensiHSIC" storing the state of the sensitivity study (parameters, data, estimates).

y

A n-by-q matrix containing all output samples. It comprises n observations of the q output variables.

ylim

A vector of two values specifying the y-coordinate plotting limits.

...

Any other arguments for model which are passed unchanged each time model is called.

Details

Let (Xi,Y) be an input-output pair. The kernels assigned to Xi and Y are respectively denoted by Ki and KY.

For many global sensitivity measures, the influence of Xi on Y is measured in the light of the probabilistic dependence that exists within the input-output pair (Xi,Y). For this, a dissimilarity measure is applied between the joint probability distribution (true impact of Xi and Y in the numerical model) and the product of marginal distributions (no impact of Xi on Y). For instance, Borgonovo's sensitivity measure is built upon the total variation distance between those two probability distributions. See Borgonovo and Plischke (2016) for further details.

The HSIC-based sensitivity measure can be understood in this way since the index HSIC(Xi,Y) results from the application of the Hilbert-Schmidt independence criterion (HSIC) on the pair (Xi,Y). This criterion is nothing but a special kind of dissimilarity measure between the joint probability distribution and the product of marginal distributions. This dissimilarity measure is called the maximum mean discrepancy (MMD) and its definition relies on the selected kernels Ki and KY. According to the theory of reproducing kernels, every kernel K is related to a reproducing kernel Hilbert space (RKHS).Then, if K is affected to a random variable Z, any probability distribution describing the random behavior of Z may be represented within the induced RKHS. In this setup, the dissimilarity between the joint probability distribution and the product of marginal distributions is then measured through the squared norm of their images into the bivariate RKHS. The user is referred to Gretton et al. (2006) for additional details on the mathematical construction of HSIC indices.

In practice, it may be difficult to understand how HSIC(Xi,Y) measures dependence within (Xi,Y). An alternative definition relies on the concept of feature map. Let us recall that the value taken by a kernel function can always be seen as the scalar product of two feature functions lying in a feature space. Definition 1 in Gretton et al. (2005) introduces HSIC(Xi,Y) as the Hilbert-Schmidt norm of a covariance-like operator between random features. For this reason, having access to the input and output feature maps may help identify the dependence patterns captured by HSIC(Xi,Y).

Kernels must be chosen very carefully. There exists a wide variety of kernels but only a few f them meet the needs of GSA. As HSIC(Xi,Y) is supposed to be a dependence measure, it must be equal to 0 if and only if Xi and Y are independent. A sufficient condition to enable this equivalence is to take two characteristic kernels. The reader is referred to Fukumizu et al. (2004) for the mathematical definition of a characteristic kernel and to Sriperumbur et al. (2010) for an overview of the major related results. In particular:

  • The Gaussian kernel, the Laplace kernel, the Matern 3/2 kernel and the Matern 5/2 kernel (all defined on R^2) are characteristic.

  • The transformed versions of the four abovementioned kernels (all defined on [0,1]^2) are characteristic.

  • All Sobolev kernels (defined on [0,1]^2) are characteristic.

  • The categorical kernel (defined on any discrete probability space) is characteristic.

Lemma 1 in Gretton et al. (2005) provides a third way of defining HSIC(Xi,Y). Since the associated formula is only based on three expectation terms, the corresponding estimation procedures are very simple and they do not ask for a large amount of input-output samples to be accurate. Two kinds of estimators may be used for HSIC(Xi,Y): the V-statistic estimator (which is non negative, biased and asymptotically unbiased) or the U-statistic estimator (unbiased). For both estimators, the computational complexity is O(n^2) where n is the sample size.

The user must always keep in mind the key steps leading to the estimation of HSIC(Xi,Y):

  • Input samples are simulated and the corresponding output samples are computed with the numerical model.

  • An input kernel Ki and an output kernel KY are selected.

  • In case of target sensitivity analysis: output samples are transformed by means of a weight function w.

  • The input and output Gram matrices are constructed.

  • In case of conditional sensitivity analysis: conditioning weights are computed by means of a weight function w.

  • The final estimate is computed. It depends on the selected estimator type (either a U-statistic or a V-statistic).

Kernel functions for random variables

All what follows is written for a scalar output Y but the same is true for any scalar input Xi.

Let D denote the support of the output probability distribution. A kernel is a symmetric and positive definite function defined on the domain D. Different kernel families are available in sensiHSIC.

  • To deal with continuous probability distributions on R, one can use:

    • The covariance kernel of the fractional Browian motion ("dcov"), the inverse multiquadratic kernel ("invmultiquad"), the exponential kernel ("laplace"), the dot-product kernel ("linear"), the Matern 3/2 kernel ("matern3"), the Matern 5/2 kernel ("matern5"), the rationale quadratic kernel ("raquad") and the Gaussian kernel ("rbf").

  • To deal with continuous probability distributions on [0,1], one can use:

    • Any of the abovementioned kernel (restricted to [0,1]).

    • The transformed exponential kernel ("laplace_anova"), the transformed Matern 3/2 kernel ("matern3_anova"), the transformed Matern 5/2 kernel ("matern5_anova"), the transformed Gaussian kernel ("rbf_anova"), the Sobolev kernel with smoothness parameter r=1 ("sobolev1") and the Sobolev kernel with smoothness parameter r=2 ("sobolev2").

  • To deal with any discrete probability distribution, the categorical kernel ("categ") must be used.

Two kinds of kernels must be distinguished:

  • Parameter-free kernels: the dot-product kernel ("linear"), the Sobolev kernel with smoothness parameter r=1 ("sobolev1") and the Sobolev kernel with smoothness parameter r=2 ("sobolev2").

  • One-parameter kernels: the categorical kernel ("categ"), the covariance kernel of the fractional Brownian motion kernel ("dcov"), the inverse multiquadratic kernel ("invmultiquad"), the exponential kernel ("laplace"), the transformed exponential kernel ("laplace_anova"), the Matern 3/2 kernel ("matern3"), the transformed Matern 3/2 kernel ("matern3_anova"), the Matern 5/2 kernel ("matern5"), the transformed Matern 5/2 kernel ("matern5_anova"), the rationale quadratic kernel ("raquad"), the Gaussian kernel ("rbf") and the transformed Gaussian kernel ("rbf_anova").

A major issue related to one-parameter kernels is how to set the parameter. It mainly depends on the role played by the parameter in the kernel expression.

  • For translation-invariant kernels and their ANOVA variants (that is all one-parameter kernels except "categ" and "dcov"), the parameter may be interpreted as a correlation length (or a scale parameter). The rule of thumb is to compute the empirical standard deviation of the provided samples.

  • For the covariance kernel of the fractional Brownian motion ("dcov"), the parameter is an exponent. Default value is 1.

  • For the categorical kernel ("categ"), the parameter has no physical sense. It is just a kind of binary encoding.

    • 0 means the user wants to use the basic categorical kernel.

    • 1 means the user wants to use the weighted variant of the categorical kernel.

How to deal with a low-dimensional vector output?

Let us assume that the output vector Y is composed of q random variables Y1,...,Yq.

A kernel Kj is affected to each output variable Yj and this leads to embed the j-th output probability distribution in a RKHS denoted by Hj. Then, the tensorization of H1,...,Hq allows to build the final RKHS, that is the RKHS where the q-variate output probability distribution describing the overall random behavior of Y will be embedded. In this situation:

  • The final output kernel is the tensor product of all output kernels.

  • The final output Gram matrix is the Hadamard product of all output Gram matrices.

Once the final output Gram matrix is built, HSIC indices can be estimated, just as in the case of a scalar output.

How to deal with a high-dimensional vector output or a functional output?

In sensiHSIC, three different methods are proposed in order to compute HSIC-based sensitivity indices in presence of functional outputs.

Dimension reduction

This approach was initially proposed by Da Veiga (2015). The key idea is to approximate the random functional output by the first terms of its Kharunen-Loeve expansion. This can be achived with a principal component analysis (PCA) that is carried out on the empirical covariance matrix.

  • The eigenvectors (or principal directions) allow to approximate the (deterministic) functional terms involved in the Kharunen-Loeve decomposition.

  • The eigenvalues allow to determine how many principal directions are sufficient in order to accurately represent the random function by means of its truncated Kharunen-Loeve expansion. The key idea behind dimension reduction is to keep as few principal directions as possible while preserving a prescribed level of explained variance.

The principal components are the coordinates of the functional output in the low-dimensional subspace resulting from PCA. There are computed for all output samples (time series observations). See Le Maitre and Knio (2010) for more detailed explanations.

The last step consists in constructing a kernel in the reduced subspace. One single kernel family is selected and affected to all principal directions. Moreover, all kernel parameters are computed automatically (with appropriate rules of thumb). Then, several strategies may be considered.

  • The initial method described in Da Veiga (2015) is based on a direct tensorization. One can also decide to sum kernels.

  • This approach was improved by El Amri and Marrel (2021). For each principal direction, a weight coefficient (equal the ratio between the eigenvalue and the sum of all selected eigenvalues) is computed.

    • The principal components are multiplied by their respective weight coefficients before summing kernels or tensorizing kernels.

    • The kernels can also be directly applied on the principal components before being linearly combined according to the weight coefficients.

In sensiHSIC, all these strategies correspond to the following specifications in kernelY:

  • Direct tensorization: kernelY=list(method="PCA", combi="prod", position="nowhere")

  • Direct sum: kernelY=list(method="PCA", combi="sum", position="nowhere")

  • Rescaled tensorization: kernelY=list(method="PCA", combi="prod", position="intern")

  • Rescaled sum: kernelY=list(method="PCA", combi="sum", position="intern")

  • Weighted linear combination: kernelY=list(method="PCA", combi="sum", position="extern")

Dynamic Time Warping (DTW)

The DTW algorithm developed by Sakoe and Chiba (1978) can be combined with a translation-invariant kernel in order to create a kernel function for times series. The resulting DTW-based output kernel is well-adapted to measure similarity between two given time series.

Suitable translation-invariant kernels include the inverse multiquadratic kernel ("invmultiquad"), the exponential kernel ("laplace"), the Matern 3/2 kernel ("matern3"), the Matern 5/2 kernel ("matern5"), the rationale quadratic kernel ("raquad") and the Gaussian kernel ("rbf").

The user is warned against the fact that DTW-based kernels are not positive definite functions. As a consequence, many theoretical properties do not hold anymore for HSIC indices.

For faster computations, sensiHSIC is using the function dtw_dismat from the package incDTW.

Global Alignment Kernel (GAK)

Unlike DTW-based kernels, the GAK is a positive definite function. This time-series kernel was originally introduced in Cuturi et al. (2007) and further investigated in Cuturi (2011). It was used to compute HSIC indices on a simplified compartmental epidemiological model in Da Veiga (2021).

For faster computations, sensiHSIC is using the function gak from the package dtwclust.

In sensiHSIC, two GAK-related parameters may be tuned by the user with paramY. They exactly correspond to the arguments sigma and window.size in the function gak.

About normalized HSIC indices (R2-HSIC)

No doubt interpretability is the major drawback of HSIC indices. This shortcoming led Da Veiga (2021) to introduce a normalized version of HSIC(Xi,Y). The so-called R2-HSIC index is thus defined as the ratio between HSIC(Xi,Y) and the square root of a normalizing constant equal to HSIC(Xi,Xi)*HSIC(Y,Y).

This normalized sensitivity measure is inspired from the distance correlation measure proposed by Szekely et al. (2007) and the resulting sensitivity indices are easier to interpret since they all fall in the interval [0,1].

About target HSIC indices (T-HSIC)

T-HSIC indices were designed by Marrel and Chabridon (2021) for TSA. They are only defined for a scalar output. Vector and functional outputs are not supported. The main idea of TSA is to measure the influence of each input variable Xi on a modified version of Y. To do so, a preliminary mathematical transform w (called the weight function) is applied on Y. The collection of HSIC indices is then estimated with respect to w(Y). Here are two examples of situations where TSA is particularly relevant:

  • How to measure the impact of Xi on the upper values taken by Y (for example the values above a given threshold T)?

    • To answer this question, one may take w(Y)=Y*1_{Y>T} (zero-thresholding).
      This can be specified in sensiHSIC with target=list(c=T, type="zeroTh", upper=TRUE).

  • How to measure the influence of Xi on the occurrence of the event {Y>T}?

    • To answer this question, one may take w(Y)=1_{Y<T} (indicator-thresholding).
      This can be specified in sensiHSIC with target=list(c=T, type="indicTh", upper=FALSE).

In Marrel and Chabridon (2021), the two situations described above are referred to as "hard thresholding". To avoid using discontinuous weight functions, "smooth thresholding" may be used instead.

  • Spagnol et al. (2019): logistic transformation on both sides of the threshold T.

  • Marrel and Chabridon (2021): exponential transformation above or below the threshold T.

These two smooth relaxation functions depend on a tuning parameter that helps control smoothness. For further details, the user is invited to consult the documentation of the function weightTSA.

Remarks:

  • When type="indicTh" (indicator-thesholding), w(Y) becomes a binary random variable. Accordingly, the output kernel selected in kernelY must be the categorical kernel.

  • In the spirit of R2-HSIC indices, T-HSIC indices can be normalized. The associated normalizing constant is equal to the square root of HSIC(Xi,Xi)*HSIC(w(Y),w(Y)).

  • T-HSIC indices can be very naturally combined with the HSIC-ANOVA decomposition proposed by Da Veiga (2021). As a consequence, the arguments target and anova in sensiHSIC can be enabled simultaneously. Compared with basic HSIC indices, there are three main differences: the input variables must be mutually independent, ANOVA kernels must be used for all input variables and the output of interest is w(Y).

  • T-HSIC indices can be very naturally combined with the tests of independence proposed in testHSIC. In this context, the null hypothesis is H0: "Xi and w(Y) are independent".

About conditional HSIC indices (C-HSIC)

C-HSIC indices were designed by Marrel and Chabridon (2021) for CSA. They are only defined for a scalar output. Vector and functional outputs are not supported. The idea is to measure the impact of each input variable Xi on Y when a specific event occurs. This conditioning event is defined on Y thanks to a weight function w. In order to compute the conditioning weights, w is applied on the output samples and an empirical normalization is carried out (so that the overall sum of conditioning weights is equal to 1). The conditioning weights are then combined with the simulated Gram matrices in order to estimate C-HSIC indices. All formulas can be found in Marrel and Chabridon (2021). Here is an exemple of a situation where CSA is particularly relevant:

  • Let us imagine that the event {Y>T} coincides with a system failure.
    How to measure the influence of Xi on Y when failure occurs?

    • To answer this question, one may take w(Y) = 1_{Y>T} (indicator-thresholding).
      This can be specified in sensiHSIC with cond=list(c=T, type="indicTh", upper=TRUE).

The three other weight functions proposed for TSA (namely "zeroTh", "logistic" and "exp1side") can also be used but the role they play is less intuitive to understand. See Marrel and Chabridon (2021) for better explanations.

Remarks:

  • Unlike what is pointed out for TSA, when type="thresholding", the output of interest Y remains a continuous random variable. The categorical kernel is thus inappropriate. A continuous kernel must be used instead.

  • In the spirit of R2-HSIC indices, C-HSIC indices can be normalized. The associated normalizing constant is equal to the square root of C-HSIC(Xi,Xi)*C-HSIC(Y,Y).

  • Only V-statistics are supported to estimate C-HSIC indices. The reason is because the normalized version of C-HSIC indices cannot always be estimated with U-statistics. In particular, the estimates of C-HSIC(Xi,Xi)*C-HSIC(Y,Y) may be negative.

  • C-HSIC indices cannot be combined with the HSIC-ANOVA decomposition proposed in Da Veiga (2021). In fact, the conditioning operation is feared to introduce statistical dependence among input variables, which forbids using HSIC-ANOVA indices. As a consequence, the arguments cond and anova in sensiHSIC cannot be enabled simultaneously.

  • C-HSIC indices can harly be combined with the tests of inpendence proposed in testHSIC. This is only possible if type="indicTh". In this context, the null hypothesis is H0: "Xi and Y are independent if the event described in cond occurs".

About HSIC-ANOVA indices

In comparison with HSIC indices, R2-HSIC indices are easier to interpret. However, in terms of interpretability, Sobol' indices remain much more convenient since they can be understood as shares of the total output variance. Such an interpretation is made possible by the Hoeffding decomposition, also known as ANOVA decomposition.

It was proved in Da Veiga (2021) that an ANOVA-like decomposition can be achived for HSIC indices under certain conditions:

  • The input variables must be mutually independent (which was not required to compute all other kinds of HSIC indices).

  • ANOVA kernels must be assigned to all input variables.

This ANOVA setup allows to establish a strict separation between main effects and interaction effects in the HSIC sense. The first-order and total-order HSIC-ANOVA indices are then defined in the same fashion than first-order and total-order Sobol' indices. It is worth noting that the HSIC-ANOVA normalizing constant is equal to HSIC(X,Y) and is thus different from the one used for R2-HSIC indices.

For a given probability measure P, an ANOVA kernel K is a kernel that can rewritten 1+k where k is an orthogonal kernel with respect to P. Among the well-known parametric families of probability distributions and kernel functions, there are very few examples of orthogonal kernels. One example is given by Sobolev kernels when there are matched with the uniform probability measure on [0,1]. See Wahba et al. (1995) for further details on Sobolev kernels.

Moreover, several strategies to construct orthogonal kernels from non-orthogonal kernels are recalled in Da Veiga (2021). One of them consists in translating the feature map so that the resulting kernel becomes centered at the prescribed probability measure P. This can be done analytically for some basic kernels (Gaussian, exponential, Matern 3/2 and Matern 5/2) when P is the uniform measure on [0,1]. See Section 9 in Ginsbourger et al. (2016) for the corresponding formulas.

In sensiHSIC, ANOVA kernels are only available for the uniform probability measure on [0,1]. This includes the Sobolev kernel with parameter r=1 ("sobolev1"), the Sobolev kernel with parameter r=2 ("sobolev2"), the transformed Gaussian kernel ("rbf_anova"), the transformed exponential kernel ("laplace_anova"), the transformed Matern 3/2 kernel ("matern3_anova") and the transformed Matern 5/2 kernel ("matern5_anova").

As explained above, the HSIC-ANOVA indices can only be computed if all input variables are uniformly distributed on [0,1]. Because of this limitation, a preliminary reformulation is needed if the GSA problem includes other kinds of input probability distributions. The probability integral transform (PIT) must be applied on each input variable Xi. In addition, all quantile functions must be encapsulated in the numerical model, which may lead to reconsider the way model is specified. In sensiHSIC, if check=TRUE is selected in anova, it is checked that all input samples lie in [0,1]. If this is not the case, a non-parametric rescaling (based on empirical distribution functions) is operated.

HSIC-ANOVA indices can be used for TSA. The only difference with GSA is the use of a weight function w. On the contrary, CSA cannot be conducted with HSIC-ANOVA indices. Indeed, the conditioning operation is feared to introduce statistical independence among the input variables, which prevents using the HSIC-ANOVA approach.

Value

sensiHSIC returns a list of class "sensiHSIC". It contains all the input arguments detailed before, except sensi which is not kept. It must be noted that some of them might have been altered, corrected or completed.

kernelX

A vector of p strings with input kernels.

paramX

A vector of p values with input kernel parameters. For each one-parameter kernel, a real number is returned. It is either the original value (if correct), a corrected value (if not) or the default value (computed from a rule of thumb when NA is specified). For each parameter-free kernel, NA is returned.

kernelY

A vector of q strings or a list of options that specifies how the output kernel was constructed. In the case where kernelY is a list of options with method="PCA", kernelY contains additional information resulting from PCA.

  • If kernelY initally contained an option named "expl.var", kernelY now also contains an option named "PC" that provides the associated number of principal components.

  • If kernelY initially contained an option named "PC", kernelY now also contains an option named "expl.var" that provides the associated percentage of output variance that is explained by PCA.

  • If kernelY initally contained an option named "position" that was set to "intern" or "extern", kernelY now contains an option named "ratios" that provides the weights used to combine kernels in the reduced subspace given by PCA.

paramY

A vector of values with output kernel parameters.

Case 1: kernelY is a list of q strings.

paramY is a vector of q values. For each one-parameter kernel, a real number is returned. It is either the original value (if correct), a corrected value or the default value (computed with a rule of thumb if NA was initially specified). For each parameter-free kernel, NA is returned.

Case 2: kernelY is a list of options with method="PCA".

paramY is a vector of PC values. For this method, let us recall that all kernels belong to the same family which is specified by an option named "fam" within kernelY. For each dimension in the reduced subspace, the kernel parameter is computed (with a rule of thumb) from the corresponding principal component. If the kernel in fam is parameter-free, paramY is a vector where NA is repeated PC times.

Case 3: kernelY is a list of options with method="DTW".

paramY remains equal to NA.

Case 4: kernelY is a list of options with method="GAK".

paramY is a vector of 2 values. For each parameter, the returned value is either the original value (if correct), a corrected value or the default value (computed with a rule of thumb if NA was initially specified).

More importantly, the list of class "sensiHSIC" contains all expected results (output samples, sensitivity measures and conditioning weights).

call

The matched call.

y

A n-row matrix containing all output samples. The i-th row in y is obtained from the i-th row in X after computing the model response. If target is passed to sensiHSIC, output samples in y are obtained after applying consecutively model and the specified weight function.

HSICXY

The estimated HSIC indices.

S

The estimated R2-HSIC indices (also called normalized HSIC indices).

weights

Only if cond is passed to sensiHSIC.
A vector of n values containing all conditioning weights. In the CSA context, the conditioning factor is defined by w(Y)/E[w(Y)]. See Marrel and Chabridon (2021) for further explanations.

Depending on what is specified in anova, the list of class "sensiHSIC" may also contain the following objects:

FO

The estimated first-order HSIC-ANOVA indices.

TO

The estimated total-order HSIC-ANOVA indices.

TO.num

The estimated numerators of total-order HSIC-ANOVA indices.

denom

The estimated common denominator of HSIC-ANOVA indices.

Author(s)

Sebastien Da Veiga, Amandine Marrel, Anouar Meynaoui, Reda El Amri and Gabriel Sarazin.

References

Borgonovo, E. and Plischke, E. (2016), Sensitivity analysis: a review of recent advances, European Journal of Operational Research, 248(3), 869-887.

Cuturi, M., Vert, J. P., Birkenes, O. and Matsui, T. (2007), A kernel for time series based on global alignments, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 (Vol. 2, pp. II-413), IEEE.

Cuturi, M. (2011), Fast global alignment kernels, Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 929-936).

Da Veiga, S. (2015), Global sensitivity analysis with dependence measures, Journal of Statistical Computation and Simulation, 85(7), 1283-1305.

Da Veiga, S. (2021). Kernel-based ANOVA decomposition and Shapley effects: application to global sensitivity analysis, arXiv preprint arXiv:2101.05487.

El Amri, M. R. and Marrel, A. (2021), More powerful HSIC-based independence tests, extension to space-filling designs and functional data. https:/cea.hal.science/cea-03406956/

Fukumizu, K., Bach, F. R. and Jordan, M. I. (2004), Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces, Journal of Machine Learning Research, 5(Jan), 73-99.

Ginsbourger, D., Roustant, O., Schuhmacher, D., Durrande, N. and Lenz, N. (2016), On ANOVA decompositions of kernels and Gaussian random field paths, Monte Carlo and Quasi-Monte Carlo Methods (pp. 315-330), Springer, Cham.

Gretton, A., Bousquet, O., Smola, A., and Scholkopf, B. (2005), Measuring statistical dependence with Hilbert-Schmidt norms, International Conference on Algorithmic Learning Theory (pp. 63-77), Springer, Berlin, Heidelberg.

Gretton, A., Borgwardt, K., Rasch, M., Scholkopf, B. and Smola, A. (2006), A kernel method for the two-sample-problem, Advances in Neural Information Processing Systems, 19.

Le Maitre, O. and Knio, O. M. (2010), Spectral methods for uncertainty quantification with applications to computational fluid dynamics, Springer Science & Business Media.

Marrel, A. and Chabridon, V. (2021), Statistical developments for target and conditional sensitivity analysis: application on safety studies for nuclear reactor, Reliability Engineering & System Safety, 214, 107711.

Sakoe, H. and Chiba, S. (1978), Dynamic programming algorithm optimization for spoken word recognition, IEEE International Conference on Acoustics, Speech and Signal, 26(1), 43-49.

Spagnol, A., Riche, R. L. and Veiga, S. D. (2019), Global sensitivity analysis for optimization with variable selection, SIAM/ASA Journal on Uncertainty Quantification, 7(2), 417-443.

Sriperumbudur, B., Fukumizu, K. and Lanckriet, G. (2010), On the relation between universality, characteristic kernels and RKHS embedding of measures, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 773-780). JMLR Workshop and Conference Proceedings.

Szekely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007), Measuring and testing dependence by correlation of distances, The Anals of Statistics, 35(6), 2769-2794.

Wahba, G., Wang, Y., Gu, C., Klein, R. and Klein, B. (1995), Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy: the 1994 Neyman Memorial Lecture, The Annals of Statistics, 23(6), 1865-1895.

See Also

testHSIC, weightTSA

Examples

 

############################
### HSIC indices for GSA ###
############################

# Test case 1: the Friedman function
# --> 5 input variables

### GSA with a given model ###

n <- 800
p <- 5
X <- matrix(runif(n*p), n, p)

kernelX <- c("rbf", "rbf", "laplace", "laplace", "sobolev1")
paramX <- c(0.2, 0.3, 0.4, NA, NA)

# kernel for X1: Gaussian kernel with given parameter 0.2
# kernel for X2: Gaussian kernel with given parameter 0.3
# kernel for X3: exponential kernel with given parameter 0.4
# kernel for X4: exponential kernel with automatic computation of the parameter
# kernel for X5: Sobolev kernel (r=1) with no parameter

kernelY <- "raquad"
paramY <- NA 

sensi <- sensiHSIC(model=friedman.fun, X,
                   kernelX=kernelX, paramX=paramX, 
                   kernelY=kernelY, paramY=paramY)

print(sensi)
plot(sensi)
title("GSA for the Friedman function")

### GSA with given data ###

Y <- friedman.fun(X)
sensi <- sensiHSIC(model=NULL, X,
                   kernelX=kernelX, paramX=paramX, 
                   kernelY=kernelY, paramY=paramY)
tell(sensi, y=Y)

print(sensi)

### GSA from a prior object of class "sensiHSIC" ###

new.sensi <- sensiHSIC(model=friedman.fun, X,
                       kernelX=kernelX, paramX=paramX, 
                       kernelY=kernelY, paramY=paramY,
                       estimator.type="U-stat", 
                       sensi=sensi,
                       save.GM=list(KX=FALSE, KY=FALSE))

print(new.sensi)

# U-statistics are computed without rebuilding all Gram matrices.
# Those Gram matrices are not saved a second time.

##################################
### HSIC-ANOVA indices for GSA ###
##################################

# Test case 2: the Matyas function with Gaussian input variables
# --> 3 input variables (including 1 dummy variable)

n <- 10^3
p <- 2

X <- matrix(rnorm(n*p), n, p)

# The Sobolev kernel (with r=1) is used to achieve the HSIC-ANOVA decomposition.
# Both first-order and total-order HSIC-ANOVA indices are expected.

### AUTOMATIC RESCALING ###

kernelX <- "sobolev1"
anova <- list(obj="both", is.uniform=FALSE)

sensi.A <- sensiHSIC(model=matyas.fun, X, kernelX=kernelX, anova=anova)

print(sensi.A)
plot(sensi.A)
title("GSA for the Matyas function")

### PROBLEM REFORMULATION ###

U <- matrix(runif(n*p), n, p)
new.matyas.fun <- function(U){ matyas.fun(qnorm(U)) }

kernelX <- "sobolev1"
anova <- list(obj="both", is.uniform=TRUE)

sensi.B <- sensiHSIC(model=new.matyas.fun, U, kernelX=kernelX, anova=anova)

print(sensi.B)

####################################
### T-HSIC indices for target SA ###
####################################

# Test case 3: the Sobol function
# --> 8 input variables

n <- 10^3
p <- 8

X <- matrix(runif(n*p), n, p)

kernelY <- "categ"
target <- list(c=0.4, type="indicTh")

sensi <- sensiHSIC(model=sobol.fun, X, kernelY=kernelY, target=target)

print(sensi)
plot(sensi)
title("TSA for the Sobol function")

#########################################
### C-HSIC indices for conditional SA ###
#########################################

# Test case 3: the Sobol function
# --> 8 input variables

n <- 10^3
p <- 8

X <- matrix(runif(n*p), n, p)

cond <- list(c=0.2, type="exp1side", upper=FALSE)

sensi <- sensiHSIC(model=sobol.fun, X, cond=cond)

print(sensi)
plot(sensi)
title("CSA for the Sobol function")

##########################################
### How to deal with discrete outputs? ###
##########################################

# Test case 4: classification of the Ishigami output
# --> 3 input variables
# --> 3 categories

classif <- function(X){
  
  Ytemp <- ishigami.fun(X) 
  Y <- rep(NA, n)
  Y[Ytemp<0] <- 0
  Y[Ytemp>=0 & Ytemp<10] <- 1                
  Y[Ytemp>=10] <- 2  
  
  return(Y)
  
}

###

n <- 10^3
p <- 3

X <- matrix(runif(n*p, -pi, pi), n, p)

kernelY <- "categ"
paramY <- 0

sensi <- sensiHSIC(model=classif, X, kernelY=kernelY, paramY=paramY)
print(sensi)
plot(sensi)
title("GSA for the classified Ishigami function")

############################################
### How to deal with functional outputs? ###
############################################

# Test case 5: the arctangent temporal function
# --> 3 input variables (including 1 dummy variable)

n <- 500
p <- 3

X <- matrix(runif(n*p,-7,7), n, p)

### with a preliminary dimension reduction by PCA ###

kernelY <- list(method="PCA", 
                data.centering=TRUE, data.scaling=TRUE,
                fam="rbf", expl.var=0.95, combi="sum", position="extern")

sensi <- sensiHSIC(model=atantemp.fun, X, kernelY=kernelY)

print(sensi)
plot(sensi)
title("PCA-based GSA for the arctangent temporal function")

### with a kernel based on dynamic time warping ###

kernelY <- list(method="DTW", fam="rbf")

sensi <- sensiHSIC(model=atantemp.fun, X, kernelY=kernelY)

print(sensi)
plot(sensi)
title("DTW-based GSA for the arctangent temporal function")



### with the global alignment kernel ###

kernelY <- list(method="GAK")

sensi <- sensiHSIC(model=atantemp.fun, X, kernelY=kernelY)

print(sensi)
plot(sensi)
title("GAK-based GSA for the arctangent temporal function")

  

sensitivity documentation built on Sept. 11, 2024, 9:09 p.m.