View source: R/sobolshap_knn.R
sobolshap_knn | R Documentation |
WARNING: DEPRECATED function: use shapleysobol_knn
instead.
sobolshap_knn
implements the estimation of several sensitivity indices using
only N model evaluations via ranking (following Gamboa et al. (2020) and Chatterjee (2019))
or nearest neighbour search (Broto et al. (2020) and Azadkia & Chatterjee (2020)).
It can be used with categorical inputs (which are transformed with one-hot encoding),
dependent inputs and multiple outputs. Sensitivity indices of any group of inputs can be computed,
which means that in particular first-order/total Sobol indices and Shapley effects are accessible.
For large sample sizes, the nearest neightbour algorithm can be significantly accelerated
by using approximate nearest neighbour search. It is also possible to estimate Shapley effects
with the random permutation approach of Castro et al.(2009), where all the terms are obtained
with ranking or nearest neighbours.
sobolshap_knn(model = NULL, X, id.cat = NULL, U = NULL, method = "knn",
n.knn = 2, return.shap = FALSE, randperm = FALSE, n.perm = 1e4,
rescale = FALSE, n.limit = 2000, noise = FALSE, ...)
## S3 method for class 'sobolshap_knn'
tell(x, y = NULL, ...)
## S3 method for class 'sobolshap_knn'
extract(x, ...)
## S3 method for class 'sobolshap_knn'
print(x, ...)
## S3 method for class 'sobolshap_knn'
plot(x, ylim = c(0, 1), type.multout = "lines", ...)
## S3 method for class 'sobolshap_knn'
ggplot(data, mapping = aes(), ylim = c(0, 1),
type.multout = "lines", ..., environment = parent.frame())
model |
a function, or a model with a |
X |
a random sample of the inputs. |
id.cat |
a vector with the indices of the categorical inputs. |
U |
an integer equal to 0 (total Sobol indices) or 1 (first-order Sobol indices) or a list of vector indices defining the subsets of inputs whose sensitivity indices must be computed or a matrix of 0s and 1s where each row encodes a subset of inputs whose sensitivity indices must be computed (see examples) or NULL (all possible subsets). |
method |
the algorithm to be used for estimation, either "rank" or "knn", see details. |
n.knn |
the number of nearest neighbours used for estimation if |
return.shap |
a logical indicating if Shapley effects must be estimated,
can only be TRUE if |
randperm |
a logical indicating if random permutations are used to estimate Shapley effects,
only if |
n.perm |
the number of random permutations used for estimation if |
rescale |
a logical indicating if continuous inputs must be rescaled before distance computations.
If TRUE, continuous inputs are first whitened with the ZCA-cor whitening procedure
(cf. whiten() function in package |
n.limit |
the sample size limit above which approximate nearest neighbour search is activated,
only used if |
noise |
a logical which is TRUE if the model or the output sample is noisy, see details. |
x |
a list of class |
data |
a list of class |
y |
a vector of model responses. |
ylim |
y-coordinate plotting limits. |
type.multout |
the plotting method in the case of multiple outputs, either "points" or "lines", see examples. |
mapping |
Default list of aesthetic mappings to use for plot. If not specified, must be supplied in each layer added to the plot. |
environment |
[Deprecated] Used prior to tidy evaluation. |
... |
any other arguments for |
For method="rank"
, the estimator is defined in Gamboa et al. (2020)
following Chatterjee (2019). For first-order indices it is based on an input ranking
(same algorithm as in sobolrank
) while for higher orders,
it uses an approximate heuristic solution of the traveling salesman problem
applied to the input sample distances (cf. TSP() function in package TSP
).
For method="knn"
, ranking and TSP are replaced by a nearest neighbour search
as proposed in Broto et al. (2020) and in Azadkia & Chatterjee (2020) for a similar coefficient.
The algorithm is the same as in shapleySubsetMc
but with an optimized implementation.
In particular, the distance used for subsets with mixed inputs (continuous and categorical)
are the same but here the additional one-hot encoding of categorical variables makes it possible to
work only with Euclidean distances. Furthermore, a fast approximate nearest neighbour search is also
available, which is strongly recommended for large sample sizes. The main difference
with shapleySubsetMc
is that here we use the entire N sample to compute all indices,
while in shapleySubsetMc
the user can specify a total cost Ntot which performs
a specific allocation of sample sizes to the estimation of each index.
In addition, the weights
option is not available here yet.
If the outputs are noisy, the argument noise
can be used: it only has an impact on the
estimation of one specific sensitivity index, namely Var(E(Y|X1,\ldots,Xp))/Var(Y)
.
If there is no noise this index is equal to 1, while in the presence of noise it must be estimated.
When randperm=TRUE
, Shapley effects are no longer estimated by computing all the possible
subsets of variables but only on subsets obtained with random permutations as proposed in Castro et al.(2009).
This is useful for problems with a large number of inputs, since the number of subsets increases exponentially
with dimension.
The extract
method is useful if in a first step the Shapley effects have been computed
and thus sensitivity indices for all possible subsets are available.
The resulting sobolshap_knn
object can be post-treated by extract
to get first-order and total Sobol indices very easily.
sobolshap_knn
returns a list of class "sobolshap_knn"
, containing all
the input arguments detailed before, plus the following components:
call |
the matched call. |
X |
a |
y |
a vector of model responses. |
U |
the subsets of inputs for which sensitivity indices have been computed. |
S |
the estimations of the Sobol sensitivity indices (see details). |
Shap |
the estimations of Shapley effects, if return.shap was set to TRUE. |
order |
0 (total indices), 1 (first-order indices) or NULL. Used for plotting defaults. |
Sebastien Da Veiga
Azadkia M., Chatterjee S., 2021), A simple measure of conditional dependence, Ann. Statist. 49(6):3070-3102.
Broto B., Bachoc F., Depecker M. (2020), Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution, SIAM/ASA Journal of Uncertainty Quantification, 8:693-716.
Castro J., Gomez D, Tejada J. (2009). Polynomial calculation of the Shapley value based on sampling. Computers & Operations Research, 36(5):1726-1730.
Chatterjee, S., 2021, A new coefficient of correlation, Journal of the American Statistical Association, 116:2009-2022.
Gamboa, F., Gremaud, P., Klein, T., & Lagnoux, A., 2022, Global Sensitivity Analysis: a novel generation of mighty estimators based on rank statistics, Bernoulli 28: 2345-2374.
sobolrank, shapleysobol_knn, shapleySubsetMc
# Test case: the non-monotonic Sobol g-function
# Example with a call to a numerical model
# First compute first-order indices with ranking
n <- 1000
X <- data.frame(matrix(runif(8 * n), nrow = n))
x <- sobolshap_knn(model = sobol.fun, X = X, U = 1, method = "rank")
print(x)
library(ggplot2)
ggplot(x)
# We can use the output sample generated for this estimation to compute
# total indices without additional calls to the model
x2 <- sobolshap_knn(model = NULL, X = X, U = 0, method = "knn", n.knn = 5)
tell(x2,x$y)
ggplot(x2)
# Test case: the Ishigami function
# Example with given data and the use of approximate nearest neighbour search
library(RANN)
n <- 5000
X <- data.frame(matrix(-pi+2*pi*runif(3 * n), nrow = n))
Y <- ishigami.fun(X)
x <- sobolshap_knn(model = NULL, X = X, U = NULL, method = "knn", n.knn = 5,
return.shap = TRUE, n.limit = 2000)
tell(x,Y)
library(ggplot2)
ggplot(x)
# We can also extract first-order and total Sobol indices
x1 <- extract(x)
print(x1)
# Test case : Linear model (3 Gaussian inputs including 2 dependent) with scaling
# See Iooss and Prieur (2019)
library(mvtnorm) # Multivariate Gaussian variables
library(whitening) # For scaling
modlin <- function(X) apply(X,1,sum)
d <- 3
n <- 10000
mu <- rep(0,d)
sig <- c(1,1,2)
ro <- 0.9
Cormat <- matrix(c(1,0,0,0,1,ro,0,ro,1),d,d)
Covmat <- ( sig %*% t(sig) ) * Cormat
Xall <- function(n) mvtnorm::rmvnorm(n,mu,Covmat)
X <- Xall(n)
x <- sobolshap_knn(model = modlin, X = X, U = NULL, method = "knn", n.knn = 5,
return.shap = TRUE, rescale = TRUE, n.limit = 2000)
print(x)
# Test case: functional toy fct 'Arctangent temporal function'
n <- 3000
X <- data.frame(matrix(runif(2*n,-7,7), nrow = n))
Y <- atantemp.fun(X)
x <- sobolshap_knn(model = NULL, X = X, U = NULL, method = "knn", n.knn = 5,
return.shap = TRUE, n.limit = 2000)
tell(x,Y)
library(ggplot2)
library(reshape2)
ggplot(x, type.multout="lines")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.