# cv_vim: Nonparametric Intrinsic Variable Importance Estimates and Inference (in vimp: Perform Inference on Algorithm-Agnostic Variable Importance)

## Description

Compute estimates and confidence intervals using cross-fitting for nonparametric intrinsic variable importance based on the population-level contrast between the oracle predictiveness using the feature(s) of interest versus not.

## Usage

```r
cv_vim(
  Y = NULL,
  X = NULL,
  cross_fitted_f1 = NULL,
  cross_fitted_f2 = NULL,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  V = length(unique(cross_fitting_folds)),
  sample_splitting = TRUE,
  sample_splitting_folds = NULL,
  cross_fitting_folds = NULL,
  stratified = FALSE,
  type = "r_squared",
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  scale = "identity",
  na.rm = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_weights = rep(1, length(Y)),
  ipc_est_type = "aipw",
  scale_est = TRUE,
  cross_fitted_se = TRUE,
  bootstrap = FALSE,
  b = 1000,
  ...
)
```

## Arguments

- `Y`: the outcome.
- `X`: the covariates.
- `cross_fitted_f1`: the predicted values on validation data from a flexible estimation technique regressing `Y` on `X` in the training data; a list of length `V`, where each element is the set of predictions on the corresponding validation fold. If sample-splitting is requested, these must be estimated specially; see Details.
- `cross_fitted_f2`: the predicted values on validation data from a flexible estimation technique regressing either (a) the fitted values in `cross_fitted_f1`, or (b) `Y`, on `X` withholding the columns in `indx`; a list of length `V`, where each element is the set of predictions on the corresponding validation fold. If sample-splitting is requested, these must be estimated specially; see Details.
- `f1`: the fitted values from a flexible estimation technique regressing `Y` on `X`. If sample-splitting is requested, these must be estimated specially; see Details. Not used if `cross_fitted_se = TRUE`.
- `f2`: the fitted values from a flexible estimation technique regressing either (a) `f1` or (b) `Y` on `X` withholding the columns in `indx`. If sample-splitting is requested, these must be estimated specially; see Details. Not used if `cross_fitted_se = TRUE`.
- `indx`: the indices of the covariate(s) to calculate variable importance for; defaults to 1.
- `V`: the number of folds for cross-fitting; defaults to 5. If `sample_splitting = TRUE`, a special type of `V`-fold cross-fitting is done; see Details for a more detailed explanation.
- `sample_splitting`: should sample-splitting be used to estimate the full and reduced predictiveness? Defaults to `TRUE`, since inferences made using `sample_splitting = FALSE` will be invalid for variables with truly zero importance.
- `sample_splitting_folds`: the folds to use for sample-splitting; if supplied, these should result in balance within the cross-fitting folds. A vector of length `2 * V`. Only used if `run_regression = FALSE` and `sample_splitting = TRUE`.
- `cross_fitting_folds`: the folds for cross-fitting. Only used if `run_regression = FALSE`.
- `stratified`: if `run_regression = TRUE`, should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-fitting folds)?
- `type`: the type of parameter (e.g., ANOVA-based is `"anova"`).
- `run_regression`: if the outcome `Y` and covariates `X` are passed to `cv_vim` and `run_regression = TRUE`, then Super Learner will be used; otherwise, variable importance will be computed using the supplied fitted values.
- `SL.library`: a character vector of learners to pass to `SuperLearner`, if `f1` and `f2` are `Y` and `X`, respectively. Defaults to `SL.glmnet`, `SL.xgboost`, and `SL.mean`.
- `alpha`: the level at which to compute the confidence interval. Defaults to 0.05, corresponding to a 95% confidence interval.
- `delta`: the value of the δ-null (i.e., testing whether importance < δ); defaults to 0.
- `scale`: should CIs be computed on the original (`"identity"`, the default) or logit (`"logit"`) scale?
- `na.rm`: should `NA`s in the outcome and fitted values be removed before computation? Defaults to `FALSE`.
- `C`: the indicator of coarsening (1 denotes observed, 0 denotes unobserved).
- `Z`: either (i) `NULL` (the default, in which case `C` must be all ones), or (ii) a character vector specifying the variable(s) among `Y` and `X` thought to play a role in the coarsening mechanism.
- `ipc_weights`: weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., `ipc_weights = 1 / [estimated probability weights]`).
- `ipc_est_type`: the type of procedure used in coarsened-at-random settings; options are `"ipw"` (inverse probability weighting) or `"aipw"` (augmented inverse probability weighting). Only used if `C` is not all equal to 1.
- `scale_est`: should the point estimate be scaled to be greater than 0? Defaults to `TRUE`.
- `cross_fitted_se`: should cross-fitting be used to estimate the standard errors (`TRUE`, the default) or not (`FALSE`)?
- `bootstrap`: should bootstrap-based standard error estimates be computed? Defaults to `FALSE`; currently may only be used if `sample_splitting = FALSE` and `cross_fitted_se = FALSE`.
- `b`: the number of bootstrap replicates; only used if `bootstrap = TRUE` and `sample_splitting = FALSE`.
- `...`: other arguments to the estimation tool; see "See Also".

## Details

We define the population variable importance measure (VIM) for the group of features (or single feature) s with respect to the predictiveness measure V by

ψ_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),

where f_0 is the population predictiveness maximizing function, f_{0,s} is the population predictiveness maximizing function that is only allowed to access the features with index not in s, and P_0 is the true data-generating distribution.

Cross-fitted VIM estimates are computed differently if sample-splitting is requested versus if it is not. We recommend using sample-splitting in most cases, since only in this case will inferences be valid if the variable(s) of interest have truly zero population importance. The purpose of cross-fitting is to estimate f_0 and f_{0,s} on independent data from estimating P_0; this can result in improved performance, especially when using flexible learning algorithms. The purpose of sample-splitting is to estimate f_0 and f_{0,s} on independent data; this allows valid inference under the null hypothesis of zero importance.

Without sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into K folds; then, using each fold in turn as a hold-out set, constructing estimators f_{n,k} and f_{n,k,s} of f_0 and f_{0,s}, respectively, on the training data and an estimator P_{n,k} of P_0 using the test data; and finally, computing

ψ_{n,s} := K^{-1} ∑_{k=1}^K {V(f_{n,k}, P_{n,k}) - V(f_{n,k,s}, P_{n,k})}.
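The fold machinery behind this estimator can be sketched in a few lines of base R. This is illustrative only (`cv_vim` generates its folds internally when `run_regression = TRUE`); the sample size `n` and fold count `K` below are hypothetical:

```r
# Illustrative K-fold assignment for cross-fitting
n <- 100  # hypothetical sample size
K <- 5    # hypothetical number of cross-fitting folds
set.seed(1)
# randomly assign each observation to one of K folds;
# each fold is then used in turn as the hold-out set
folds <- sample(rep(seq_len(K), length.out = n))
table(folds)  # exactly n / K = 20 observations per fold here
```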

With sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into 2K folds. These folds are further divided into 2 groups of folds. Then, for each fold k in the first group, an estimator f_{n,k} of f_0 is constructed using all data besides the kth fold in the group (i.e., (2K - 1)/(2K) of the data) and an estimator P_{n,k} of P_0 is constructed using the held-out data (i.e., 1/(2K) of the data); we then compute

v_{n,k} = V(f_{n,k},P_{n,k}).

Similarly, for each fold k in the second group, an estimator f_{n,k,s} of f_{0,s} is constructed using all data besides the kth fold in the group (i.e., (2K - 1)/(2K) of the data) and an estimator P_{n,k} of P_0 is constructed using the held-out data (i.e., 1/(2K) of the data); we then compute

v_{n,k,s} = V(f_{n,k,s},P_{n,k}).

Finally,

ψ_{n,s} := K^{-1} ∑_{k=1}^K (v_{n,k} - v_{n,k,s}).
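Numerically, this final step is just an average of fold-wise predictiveness contrasts. A minimal sketch with hypothetical fold-level values (made up for illustration, not output from vimp):

```r
# Hypothetical fold-level predictiveness estimates for K = 5 folds:
# v_full[k] plays the role of v_{n,k} (full regression, first fold group),
# v_reduced[k] plays the role of v_{n,k,s} (reduced regression, second group)
v_full <- c(0.62, 0.58, 0.65, 0.60, 0.61)
v_reduced <- c(0.40, 0.43, 0.38, 0.45, 0.41)
# psi_{n,s} = K^{-1} * sum over k of (v_{n,k} - v_{n,k,s})
psi_hat <- mean(v_full - v_reduced)
psi_hat  # 0.198
```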

See the paper by Williamson, Gilbert, Simon, and Carone for more details on the mathematics behind the cv_vim function, and the validity of the confidence intervals.

In the interest of transparency, we return most of the calculations within the `vim` object. This results in a list containing:

- `s`: the column(s) to calculate variable importance for
- `SL.library`: the library of learners passed to `SuperLearner`
- `full_fit`: the fitted values of the chosen method fit to the full data (a list, for train and test data)
- `red_fit`: the fitted values of the chosen method fit to the reduced data (a list, for train and test data)
- `est`: the estimated variable importance
- `naive`: the naive estimator of variable importance
- `eif`: the estimated efficient influence function
- `eif_full`: the estimated efficient influence function for the full regression
- `eif_reduced`: the estimated efficient influence function for the reduced regression
- `se`: the standard error for the estimated variable importance
- `ci`: the (1 − α) × 100% confidence interval for the variable importance estimate
- `test`: a decision to either reject (`TRUE`) or not reject (`FALSE`) the null hypothesis, based on a conservative test
- `p_value`: a p-value based on the same test as `test`
- `full_mod`: the object returned by the estimation procedure for the full data regression (if applicable)
- `red_mod`: the object returned by the estimation procedure for the reduced data regression (if applicable)
- `alpha`: the level, for confidence interval calculation
- `sample_splitting_folds`: the folds used for hypothesis testing
- `cross_fitting_folds`: the folds used for cross-fitting
- `y`: the outcome
- `ipc_weights`: the weights
- `mat`: a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value

## Value

An object of class `vim`. See Details for more information.

## See Also

`SuperLearner` for specific usage of the `SuperLearner` function and package.

## Examples
```r
n <- 100
p <- 2
# generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))

# apply the function to the x's
smooth <- (x[, 1] / 5)^2 * (x[, 1] + 7) / 5 + (x[, 2] / 3)^2

# generate Y ~ Normal(smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))

# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm")

# -----------------------------------------
# using Super Learner (with a small number of folds, for illustration only)
# -----------------------------------------
set.seed(4747)
est <- cv_vim(Y = y, X = x, indx = 2, V = 2,
              type = "r_squared", run_regression = TRUE,
              SL.library = learners, cvControl = list(V = 2), alpha = 0.05)

# ------------------------------------------
# doing things by hand, and plugging them in
# (with a small number of folds, for illustration only)
# ------------------------------------------
# set up the folds
indx <- 2
V <- 2
Y <- matrix(y)
set.seed(4747)
# Note that the CV.SuperLearner should be run with an outer layer
# of 2*V folds (for V-fold cross-fitted importance)
full_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
  Y = Y, X = x, SL.library = learners, cvControl = list(V = 2 * V),
  innerCvControl = list(list(V = V))
))
# use the same cross-fitting folds for the reduced regression
reduced_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
  Y = Y, X = x[, -indx, drop = FALSE], SL.library = learners,
  cvControl = SuperLearner::SuperLearner.CV.control(
    V = 2 * V, validRows = full_cv_fit$folds
  ),
  innerCvControl = list(list(V = V))
))
# extract the predictions on split portions of the data,
# for hypothesis testing
cross_fitting_folds <- get_cv_sl_folds(full_cv_fit$folds)
set.seed(1234)
sample_splitting_folds <- make_folds(unique(cross_fitting_folds), V = 2)
full_cv_preds <- extract_sampled_split_predictions(
  full_cv_fit, sample_splitting_folds = sample_splitting_folds, full = TRUE
)
reduced_cv_preds <- extract_sampled_split_predictions(
  reduced_cv_fit, sample_splitting_folds = sample_splitting_folds, full = FALSE
)
set.seed(5678)
est <- cv_vim(Y = y, cross_fitted_f1 = full_cv_preds,
              cross_fitted_f2 = reduced_cv_preds, indx = 2, delta = 0,
              V = V, type = "r_squared",
              cross_fitting_folds = cross_fitting_folds,
              sample_splitting_folds = sample_splitting_folds,
              run_regression = FALSE, alpha = 0.05, na.rm = TRUE)
```