| vimp_anova | R Documentation | 
Compute estimates of, and confidence intervals for, nonparametric ANOVA-based
intrinsic variable importance. This is a wrapper function for cv_vim,
with type = "anova". This type has limited functionality compared to other
types; in particular, null hypothesis tests are not possible using
type = "anova". If you want to do null hypothesis testing on an equivalent
population parameter, use vimp_rsquared instead.
vimp_anova(
  Y = NULL,
  X = NULL,
  cross_fitted_f1 = NULL,
  cross_fitted_f2 = NULL,
  indx = 1,
  V = 10,
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  na.rm = FALSE,
  cross_fitting_folds = NULL,
  stratified = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_weights = rep(1, length(Y)),
  scale = "logit",
  ipc_est_type = "aipw",
  scale_est = TRUE,
  cross_fitted_se = TRUE,
  ...
)
| Y | the outcome. | 
| X | the covariates. If  | 
| cross_fitted_f1 | the predicted values on validation data from a
flexible estimation technique regressing Y on X in the training data. Provided as
either (a) a vector, where each element is
the predicted value when that observation is part of the validation fold;
or (b) a list of length V, where each element in the list is a set of predictions on the
corresponding validation data fold.
If sample-splitting is requested, then these must be estimated specially; see Details. However,
the resulting vector should be the same length as  | 
| cross_fitted_f2 | the predicted values on validation data from a
flexible estimation technique regressing either (a) the fitted values in
 | 
| indx | the indices of the covariate(s) to calculate variable importance for; defaults to 1. | 
| V | the number of folds for cross-fitting; defaults to 10. If
 | 
| run_regression | if outcome Y and covariates X are passed to
 | 
| SL.library | a character vector of learners to pass to
 | 
| alpha | the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval. | 
| delta | the value of the  | 
| na.rm | should we remove NAs in the outcome and fitted values
in computation? (defaults to  | 
| cross_fitting_folds | the folds for cross-fitting. Only used if
 | 
| stratified | if run_regression = TRUE, then should the generated folds be stratified based on the outcome? (This helps to ensure class balance across cross-validation folds.) | 
| C | the indicator of coarsening (1 denotes observed, 0 denotes unobserved). | 
| Z | either (i) NULL (the default, in which case the argument
 | 
| ipc_weights | weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]). | 
| scale | should CIs be computed on the original ("identity") scale or on another scale? (The other options are "log" and "logit".) | 
| ipc_est_type | the type of procedure used for coarsened-at-random
settings; options are "ipw" (for inverse probability weighting) or
"aipw" (for augmented inverse probability weighting).
Only used if  | 
| scale_est | should the point estimate be scaled to be greater than or equal to 0?
Defaults to  | 
| cross_fitted_se | should we use cross-fitting to estimate the standard
errors ( | 
| ... | other arguments to the estimation tool, see "See also". | 
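The ipc_weights argument above is described as already inverted, that is, one over the estimated probability of being observed. The following is a minimal sketch of one way such weights could be built, assuming a logistic regression for the observation probability; the simulated data and the names z, obs, and dat are hypothetical and are not part of the package.

```r
# Illustrative only: simulate a coarsening indicator and invert the
# estimated observation probabilities (all names here are hypothetical).
set.seed(2024)
n <- 200
z <- stats::rnorm(n)                          # a fully observed variable
p_obs <- stats::plogis(1 + 0.5 * z)           # true observation probability
dat <- data.frame(obs = stats::rbinom(n, 1, p_obs), z = z)
# estimate the probability of being observed given z
fit <- stats::glm(obs ~ z, data = dat, family = stats::binomial())
# inverse probability weights, already inverted as the argument expects
ipc_weights <- 1 / stats::fitted(fit)
summary(ipc_weights)
```

In this sketch, dat$obs plays the role of the coarsening indicator C (1 denotes observed, 0 denotes unobserved).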
We define the population ANOVA
parameter for the group of features (or single feature) s by
\psi_{0,s} := E_0\{f_0(X) - f_{0,s}(X)\}^2/var_0(Y),
where f_0 is the population conditional mean using all features,
f_{0,s} is the population conditional mean using the features with
index not in s, and E_0 and var_0 denote expectation and
variance under the true data-generating distribution, respectively.
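As a worked illustration of this definition, consider a toy model (an assumption made here, not taken from the package) in which X1 and X2 are independent standard normal and Y = X1 + X2 + noise with standard normal noise. With s = 2, f_0(X) = X1 + X2 and f_{0,s}(X) = X1, so the numerator is E(X2^2) = 1, var_0(Y) = 3, and \psi_{0,s} = 1/3. The sketch below approximates that value by Monte Carlo.

```r
# Toy model (an assumption for illustration): X1, X2 independent N(0, 1),
# Y = X1 + X2 + noise, with noise ~ N(0, 1).
set.seed(1234)
n <- 1e5
x1 <- stats::rnorm(n)
x2 <- stats::rnorm(n)
y <- x1 + x2 + stats::rnorm(n)
# population conditional means, known in closed form for this toy model
f_full <- x1 + x2      # f_0(X) = E(Y | X1, X2)
f_reduced <- x1        # f_{0,s}(X) = E(Y | X1) when s = 2
# Monte Carlo approximation of psi_{0,s}; the exact value is 1 / 3
psi_mc <- mean((f_full - f_reduced)^2) / stats::var(y)
psi_mc
```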
Cross-fitted ANOVA estimates are computed by first
splitting the data into K folds; then using each fold in turn as a
hold-out set, constructing estimators f_{n,k} and f_{n,k,s} of
f_0 and f_{0,s}, respectively, on the training data and estimator
E_{n,k} of E_0 using the test data; and finally, computing
\psi_{n,s} := K^{-1}\sum_{k=1}^K E_{n,k}\{f_{n,k}(X) - f_{n,k,s}(X)\}^2/var_n(Y),
where var_n is the empirical variance.
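The displayed formula can be sketched by hand as follows. This is not the package's internal code: it assumes linear models (stats::lm) in place of a flexible learner, uses simulated data, and computes only the expression shown above, without any bias correction or standard error that the package may add. The names dat, folds, and fold_means are hypothetical.

```r
# Hand-rolled sketch of the cross-fitted formula; NOT the package internals.
set.seed(4747)
n <- 500
x <- data.frame(x1 = stats::runif(n, -1, 1), x2 = stats::runif(n, -1, 1))
y <- x$x1 + 2 * x$x2 + stats::rnorm(n, 0, 0.5)
dat <- cbind(y = y, x)
K <- 5
# randomly assign each observation to one of K folds
folds <- sample(rep(seq_len(K), length.out = n))
fold_means <- numeric(K)
for (k in seq_len(K)) {
  train <- dat[folds != k, ]
  test <- dat[folds == k, ]
  # f_{n,k}: regression on all features, fit on the training data
  fit_full <- stats::lm(y ~ x1 + x2, data = train)
  # f_{n,k,s}: regression omitting the feature of interest (s = 2, i.e. x2)
  fit_reduced <- stats::lm(y ~ x1, data = train)
  diff_k <- stats::predict(fit_full, newdata = test) -
    stats::predict(fit_reduced, newdata = test)
  # E_{n,k}: empirical mean of the squared difference over the held-out fold
  fold_means[k] <- mean(diff_k^2)
}
# average over folds and scale by the empirical variance of the outcome
psi_n <- mean(fold_means) / stats::var(y)
psi_n
```

For this simulated model the population value is (4/3) / (1/3 + 4/3 + 1/4), roughly 0.70, so psi_n should land in that neighborhood.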
See the paper by Williamson, Gilbert, Simon, and Carone for more
details on the mathematics behind this function.
An object of classes vim and vim_anova.
See Details for more information.
SuperLearner for specific usage of the
SuperLearner function and package.
# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2
# generate Y ~ Normal (smooth, 1)
y <- smooth + stats::rnorm(n, 0, 1)
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
library("vimp") # provides vimp_anova
learners <- c("SL.glm", "SL.mean")
# estimate (with a small number of folds, for illustration only)
est <- vimp_anova(y, x, indx = 2,
                  alpha = 0.05, run_regression = TRUE,
                  SL.library = learners, V = 2, cvControl = list(V = 2))