
pme_knn {sensitivity}    R Documentation

Data-given proportional marginal effects estimation via nearest-neighbors procedure

Description

pme_knn computes the proportional marginal effects (PME) of Herin et al. (2024) via a nearest-neighbour estimation procedure. Parallelized computations are possible to accelerate the estimation process. It can be used with categorical inputs (which are transformed with one-hot encoding before computing the nearest neighbours), dependent inputs and multiple outputs. For large sample sizes, the nearest-neighbour algorithm can be significantly accelerated by using approximate nearest-neighbour search.
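
As an illustration of the categorical-input support, a hypothetical minimal call could look as follows (the data and variable names are purely illustrative; the factor column is one-hot encoded internally before the nearest-neighbour search):

n <- 500
X <- data.frame(X1 = rnorm(n),
                X2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
y <- X$X1 + (X$X2 == "b") + rnorm(n, 0, 0.1)
x <- pme_knn(model = NULL, X = X, n.knn = 3)
tell(x, y)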

Usage

pme_knn(model=NULL, X, method = "knn", tol = NULL, marg = T, n.knn = 2, 
          n.limit = 2000, noise = F, rescale = F, nboot = NULL, 
          boot.level = 0.8, conf=0.95, parl=NULL, ...)
## S3 method for class 'pme_knn'
tell(x, y, ...)
## S3 method for class 'pme_knn'
print(x, ...)
## S3 method for class 'pme_knn'
plot(x, ylim = c(0,1), ...)
## S3 method for class 'pme_knn'
ggplot(data, mapping = aes(), ylim = c(0, 1), ...,
          environment = parent.frame())

Arguments

model

a function defining the model to analyze, taking X as an argument.

X

a matrix or data frame containing the observed inputs.

method

the algorithm to be used for estimation, either "rank" or "knn", see details. Default is method="knn".

tol

tolerance under which an input is considered a zero input. See details.

marg

whether to choose the closed Sobol' (FALSE) or total Sobol' (TRUE) indices as value functions.

n.knn

the number of nearest neighbours used for estimation.

n.limit

sample size limit above which approximate nearest neighbour search is activated.

noise

a logical which is TRUE if the model or the output sample is noisy. See details.

rescale

a logical indicating if continuous inputs must be rescaled before distance computations. If TRUE, continuous inputs are first whitened with the ZCA-cor whitening procedure (cf. whiten() function in package whitening). If the inputs are independent, this first step will have a very limited impact. Then, the resulting whitened inputs are individually modified via a copula transform such that each input has the same scale.
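
A minimal sketch of this rescaling idea (a conceptual illustration, not the package's internal code):

library(whitening)
Xw <- whiten(as.matrix(X), method = "ZCA-cor")              # decorrelation step
Xr <- apply(Xw, 2, function(u) rank(u) / (length(u) + 1))   # copula transform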

nboot

the number of bootstrap resamples for the bootstrap estimate of confidence intervals. See details.

boot.level

a numeric between 0 and 1 for the proportion of the bootstrap sample size.

conf

the confidence level of the bootstrap confidence intervals.

parl

number of cores on which to parallelize the computation. If NULL, then no parallelization is done.

x

the object returned by pme_knn.

data

the object returned by pme_knn.

y

a numeric univariate vector containing the observed outputs.

ylim

the y-coordinate limits for plotting.

mapping

Default list of aesthetic mappings to use for plot. If not specified, must be supplied in each layer added to the plot.

environment

[Deprecated] Used prior to tidy evaluation.

...

additional arguments to be passed to model, or to the methods, such as graphical parameters (see par).

Details

For method="rank", the estimator is defined in Gamboa et al. (2022) following Chatterjee (2021). For first-order indices it is based on an input ranking (same algorithm as in sobolrank), while for higher orders it uses an approximate heuristic solution of the traveling salesman problem applied to the input sample distances (cf. TSP() function in package TSP). For method="knn", ranking and TSP are replaced by a nearest-neighbour search, as proposed in Broto et al. (2020) and in Azadkia & Chatterjee (2021) for a similar coefficient.
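
As a conceptual illustration of the nearest-neighbour step (a rough sketch, not the package internals; the helper name knn_closed_index is hypothetical), a closed Sobol' index for a subset S of columns can be approximated by averaging the outputs over each point's nearest neighbours:

library(RANN)
knn_closed_index <- function(X, y, S, n.knn = 3) {
  # n.knn nearest neighbours of each point within the columns S of X
  nn <- RANN::nn2(X[, S, drop = FALSE], k = n.knn)
  # conditional mean of y, estimated by the neighbourhood average
  cond_mean <- rowMeans(matrix(y[nn$nn.idx], nrow = nrow(nn$nn.idx)))
  # Var(E(Y | X_S)) / Var(Y)
  var(cond_mean) / var(y)
}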

The computation is done using the subset procedure defined in Broto, Bachoc and Depecker (2020): the closed Sobol' indices are first computed for all possible sub-models, and the proportional values are then computed recursively, as detailed in Feldman (2005), using an extension to non-strictly positive games (Herin et al., 2024).
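
The subset structure can be pictured with combn (illustrative only; the recursive proportional-value computation of Feldman (2005) is omitted):

d <- 3
subsets <- unlist(lapply(1:d, function(k) combn(d, k, simplify = FALSE)),
                  recursive = FALSE)
# 2^d - 1 sub-models: {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}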

Since bootstrap resampling creates ties, which are not accounted for in the algorithm, confidence intervals are instead obtained by sampling without replacement: a proportion boot.level of the total sample size is drawn uniformly.
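
Schematically, each resample is drawn as follows (a sketch of the idea, not the internal code):

n <- length(y)
idx <- sample.int(n, size = floor(0.8 * n), replace = FALSE)  # boot.level = 0.8
# the PME are then re-estimated on X[idx, ] and y[idx], nboot times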

If the outputs are noisy, the argument noise can be used: it only has an impact on the estimation of one specific sensitivity index, namely Var(E(Y|X_1,\ldots,X_p))/Var(Y). If there is no noise this index is equal to 1, while in the presence of noise it must be estimated.

The distance used for subsets with mixed inputs (continuous and categorical) is the Euclidean distance, computed after one-hot encoding of the categorical inputs.
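
One-hot encoding can be pictured with model.matrix (illustrative only):

z <- factor(c("a", "b", "c", "a"))
model.matrix(~ z - 1)   # three 0/1 columns, one per level of z
# Euclidean distances are then computed on the encoded columns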

If the number of cores passed to the parl argument exceeds what the machine provides, the number of cores defaults to the available cores minus one.
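
This safeguard is conceptually equivalent to (sketch):

library(parallel)
parl <- min(parl, parallel::detectCores() - 1)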

If marg = TRUE (default), the value function chosen to compute the proportional values is the total Sobol' index (the dual of the underlying cooperative game). If marg = FALSE, the closed Sobol' indices are used instead. The two choices may lead to different results.
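
Both value functions can be compared on the same data (sketch; x_tot and x_clos are illustrative names):

x_tot  <- pme_knn(model = NULL, X = X, n.knn = 3, marg = TRUE)   # total Sobol'
x_clos <- pme_knn(model = NULL, X = X, n.knn = 3, marg = FALSE)  # closed Sobol'
tell(x_tot, y); tell(x_clos, y)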

Zero inputs are defined by the tol argument. If it is NULL, then inputs with:

S^T_{\{i\}} = 0

are considered zero inputs in the detection of spurious variables. If it is provided, zero inputs are detected when:

S^T_{\{i\}} \leq \textrm{tol}
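
For instance, a small positive tolerance can be supplied directly (hypothetical value):

x <- pme_knn(model = NULL, X = X, n.knn = 3, tol = 1e-3)
tell(x, y)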

Value

pme_knn returns a list of class "pme_knn":

call

the matched call.

PME

the estimations of the PME indices.

VE

the estimations of the closed Sobol' indices for all possible sub-models.

indices

list of all subsets corresponding to the structure of VE.

method

which estimation method has been used.

conf_int

a matrix containing the estimates, bias and confidence intervals by bootstrap (if nboot > 0).

X

the observed covariates.

y

the observed outcomes.

n.knn

value of the n.knn argument.

rescale

whether the design matrix has been rescaled.

n.limit

value of the n.limit argument.

boot.level

value of the boot.level argument.

noise

whether the PME must sum up to one or not.

boot

logical, whether bootstrap confidence interval estimates have been performed.

nboot

value of the nboot argument.

parl

value of the parl argument.

conf

value of the conf argument.

marg

value of the marg argument.

tol

value of the tol argument.

Author(s)

Marouane Il Idrissi, Margot Herin

References

Azadkia, M. and Chatterjee, S., 2021, A simple measure of conditional dependence, Annals of Statistics, 49(6):3070-3102.

Chatterjee, S., 2021, A new coefficient of correlation, Journal of the American Statistical Association, 116:2009-2022.

Gamboa, F., Gremaud, P., Klein, T., & Lagnoux, A., 2022, Global Sensitivity Analysis: a novel generation of mighty estimators based on rank statistics, Bernoulli 28: 2345-2374.

Broto B., Bachoc F. and Depecker M. (2020) Variance Reduction for Estimation of Shapley Effects and Adaptation to Unknown Input Distribution. SIAM/ASA Journal on Uncertainty Quantification, 8(2).

M. Herin, M. Il Idrissi, V. Chabridon and B. Iooss, Proportional marginal effects for sensitivity analysis with correlated inputs, Proceedings of the 10th International Conference on Sensitivity Analysis of Model Output (SAMO 2022), pp. 42-43, Tallahassee, Florida, March 2022.

M. Herin, M. Il Idrissi, V. Chabridon and B. Iooss, Proportional marginal effects for global sensitivity analysis, SIAM/ASA Journal on Uncertainty Quantification, 12:667-692, 2024.

M. Il Idrissi, V. Chabridon and B. Iooss (2021). Developments and applications of Shapley effects to reliability-oriented sensitivity analysis with correlated inputs. Environmental Modelling & Software, 143, 105115.

B. Iooss, V. Chabridon and V. Thouvenot, Variance-based importance measures for machine learning model interpretability, Congrès Lambda Mu 23, Saclay, France, October 10-13, 2022. https://hal.science/hal-03741384

Feldman, B. (2005), Relative Importance and Value, SSRN Electronic Journal.

See Also

sobolrank, shapleysobol_knn, shapleyPermEx, shapleySubsetMc, lmg, pmvd

Examples

  
  
library(parallel)
library(doParallel)
library(foreach)
library(gtools)
library(boot)
library(RANN)

###########################################################
# Linear Model with Gaussian correlated inputs

library(mvtnorm)

set.seed(1234)
n <- 1000
beta<-c(1,-1,0.5)
sigma<-matrix(c(1,0,0,
                0,1,-0.8,
                0,-0.8,1),
              nrow=3,
              ncol=3)

X <-rmvnorm(n, rep(0,3), sigma)
colnames(X)<-c("X1","X2", "X3")


y <- X%*%beta + rnorm(n,0,2)

# Without Bootstrap confidence intervals
x<-pme_knn(model=NULL, X=X,
            n.knn=3,
            noise=TRUE)
tell(x,y)
print(x)
plot(x)

# With Bootstrap confidence intervals
x<-pme_knn(model=NULL, X=X,
            nboot=10, 
            n.knn=3,
            noise=TRUE,
            boot.level=0.7, 
            conf=0.95)
tell(x,y)
print(x)
plot(x)

#####################################################
# Test case: the Ishigami function
# Example with given data and the use of approximate nearest neighbour search
n <- 5000
X <- data.frame(matrix(-pi+2*pi*runif(3 * n), nrow = n))
Y <- ishigami.fun(X)
x <- pme_knn(model = NULL, X = X,  method = "knn", n.knn = 5, 
                       n.limit = 2000)
tell(x,Y)
plot(x)

library(ggplot2) ; ggplot(x)

######################################################
# Test case : Linear model (3 Gaussian inputs including 2 dependent) with scaling
# See Iooss and Prieur (2019)
library(mvtnorm) # Multivariate Gaussian variables
library(whitening) # For scaling
modlin <- function(X) apply(X,1,sum)
d <- 3
n <- 10000
mu <- rep(0,d)
sig <- c(1,1,2)
ro <- 0.9
Cormat <- matrix(c(1,0,0,0,1,ro,0,ro,1),d,d)
Covmat <- ( sig %*% t(sig) ) * Cormat
Xall <- function(n) mvtnorm::rmvnorm(n,mu,Covmat)
X <- Xall(n)
x <- pme_knn(model = modlin, X = X, method = "knn", n.knn = 5, 
                       rescale = TRUE, n.limit = 2000)
print(x)
plot(x)

