Compute estimates of, and confidence intervals for, nonparametric intrinsic variable importance using cross-fitting. Importance is based on the population-level contrast between the oracle predictiveness with and without the feature(s) of interest.
cv_vim(
Y = NULL,
X = NULL,
cross_fitted_f1 = NULL,
cross_fitted_f2 = NULL,
f1 = NULL,
f2 = NULL,
indx = 1,
V = length(unique(cross_fitting_folds)),
sample_splitting = TRUE,
sample_splitting_folds = NULL,
cross_fitting_folds = NULL,
stratified = FALSE,
type = "r_squared",
run_regression = TRUE,
SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
alpha = 0.05,
delta = 0,
scale = "identity",
na.rm = FALSE,
C = rep(1, length(Y)),
Z = NULL,
ipc_weights = rep(1, length(Y)),
ipc_est_type = "aipw",
scale_est = TRUE,
cross_fitted_se = TRUE,
bootstrap = FALSE,
b = 1000,
...
)

Y 
the outcome. 
X 
the covariates. 
cross_fitted_f1 
the predicted values on validation data from a flexible estimation technique regressing Y on X in the training data; a list of length V, where each object is a set of predictions on the validation data. If sample-splitting is requested, then these must be estimated specially; see Details. 
cross_fitted_f2 
the predicted values on validation data from a
flexible estimation technique regressing either (a) the fitted values in

f1 
the fitted values from a flexible estimation technique
regressing Y on X. If sample-splitting is requested, then these must be
estimated specially; see Details. If 
f2 
the fitted values from a flexible estimation technique
regressing either (a) 
indx 
the indices of the covariate(s) to calculate variable importance for; defaults to 1. 
V 
the number of folds for cross-fitting, defaults to 5. If

sample_splitting 
should we use sample-splitting to estimate the full and
reduced predictiveness? Defaults to 
sample_splitting_folds 
the folds to use for sample-splitting; if entered,
these should result in balance within the cross-fitting folds. Only used
if 
cross_fitting_folds 
the folds for cross-fitting. Only used if

stratified 
if run_regression = TRUE, then should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-fitting folds)? 
type 
the type of parameter (e.g., ANOVA-based is 
run_regression 
if outcome Y and covariates X are passed to

SL.library 
a character vector of learners to pass to

alpha 
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval. 
delta 
the value of the δ-null (i.e., testing if importance < δ); defaults to 0. 
scale 
should CIs be computed on original ("identity", default) or logit ("logit") scale? 
na.rm 
should we remove NA's in the outcome and fitted values in
computation? (defaults to 
C 
the indicator of coarsening (1 denotes observed, 0 denotes unobserved). 
Z 
either (i) NULL (the default, in which case the argument

ipc_weights 
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]). 
ipc_est_type 
the type of procedure used for coarsened-at-random
settings; options are "ipw" (for inverse probability weighting) or
"aipw" (for augmented inverse probability weighting).
Only used if 
scale_est 
should the point estimate be scaled to be greater than 0?
Defaults to 
cross_fitted_se 
should we use cross-fitting to estimate the standard
errors ( 
bootstrap 
should bootstrap-based standard error estimates be computed?
Defaults to 
b 
the number of bootstrap replicates (only used if 
... 
other arguments to the estimation tool, see "See also". 
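To make the alpha and scale arguments more concrete, here is a minimal base-R sketch of a Wald-type interval, assuming a point estimate and standard error are already in hand. The function wald_ci and the numbers passed to it are hypothetical, and the logit-scale interval uses a standard delta-method transformation that may differ in detail from what cv_vim does internally.

```r
# Hypothetical helper: Wald-type (1 - alpha) x 100% confidence interval
wald_ci <- function(est, se, alpha = 0.05, scale = c("identity", "logit")) {
  scale <- match.arg(scale)
  z <- qnorm(1 - alpha / 2)
  if (scale == "identity") {
    return(c(lower = est - z * se, upper = est + z * se))
  }
  # delta method: the SE of logit(est) is se / (est * (1 - est)),
  # which requires 0 < est < 1
  logit_est <- qlogis(est)
  logit_se <- se / (est * (1 - est))
  c(lower = plogis(logit_est - z * logit_se),
    upper = plogis(logit_est + z * logit_se))
}

# illustrative values only
ci_identity <- wald_ci(est = 0.3, se = 0.05, alpha = 0.05)
ci_logit <- wald_ci(est = 0.3, se = 0.05, alpha = 0.05, scale = "logit")
```

The logit-scale interval always stays inside (0, 1), which can be convenient for measures such as R-squared that are bounded.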
We define the population variable importance measure (VIM) for the group of features (or single feature) s with respect to the predictiveness measure V by
ψ_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),
where f_0 is the population predictiveness-maximizing function, f_{0,s} is the population predictiveness-maximizing function that is only allowed to access the features with index not in s, and P_0 is the true data-generating distribution.
Cross-fitted VIM estimates are computed differently if sample-splitting is requested versus if it is not. We recommend using sample-splitting in most cases, since only in this case will inferences be valid if the variable(s) of interest have truly zero population importance. The purpose of cross-fitting is to estimate f_0 and f_{0,s} on independent data from estimating P_0; this can result in improved performance, especially when using flexible learning algorithms. The purpose of sample-splitting is to estimate f_0 and f_{0,s} on independent data; this allows valid inference under the null hypothesis of zero importance.
Without sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into K folds; then using each fold in turn as a hold-out set, constructing estimators f_{n,k} and f_{n,k,s} of f_0 and f_{0,s}, respectively, on the training data and estimator P_{n,k} of P_0 using the test data; and finally, computing
ψ_{n,s} := K^{-1} ∑_{k=1}^K \{V(f_{n,k}, P_{n,k}) - V(f_{n,k,s}, P_{n,k})\}.
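As a rough illustration of the estimator without sample-splitting (not the package's internal code), the following base-R sketch uses ordinary least squares via lm() as a stand-in for a flexible learner and R-squared as the predictiveness measure V. The helpers r_squared and cf_vim_no_split, and the simulated data, are hypothetical.

```r
# Hypothetical helper: R-squared predictiveness evaluated on held-out data
r_squared <- function(y, preds) {
  1 - mean((y - preds)^2) / mean((y - mean(y))^2)
}

# Hypothetical helper: cross-fitted VIM without sample-splitting,
# with lm() standing in for a flexible learner
cf_vim_no_split <- function(dat, s, K = 5) {
  folds <- sample(rep(seq_len(K), length.out = nrow(dat)))
  contrasts <- vapply(seq_len(K), function(k) {
    train <- dat[folds != k, , drop = FALSE]
    test <- dat[folds == k, , drop = FALSE]
    # f_{n,k}: fit on all features; f_{n,k,s}: fit without the features in s
    full_fit <- lm(y ~ ., data = train)
    reduced_fit <- lm(
      y ~ ., data = train[, setdiff(names(train), s), drop = FALSE]
    )
    # V(f_{n,k}, P_{n,k}) - V(f_{n,k,s}, P_{n,k}) on the held-out fold
    r_squared(test$y, predict(full_fit, newdata = test)) -
      r_squared(test$y, predict(reduced_fit, newdata = test))
  }, numeric(1))
  mean(contrasts)  # K^{-1} times the sum over folds
}

# simulated data in which x1 matters and x2 does not
set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n)
psi_x1 <- cf_vim_no_split(dat, s = "x1", K = 5)
psi_x2 <- cf_vim_no_split(dat, s = "x2", K = 5)
```

In this simulation the estimate for x1 should be clearly positive, while the estimate for x2 should be close to zero.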
With sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into 2K folds. These folds are further divided into 2 groups of folds. Then, for each fold k in the first group, estimator f_{n,k} of f_0 is constructed using all data besides the kth fold in the group (i.e., (2K - 1)/(2K) of the data) and estimator P_{n,k} of P_0 is constructed using the held-out data (i.e., 1/(2K) of the data); we then compute
v_{n,k} = V(f_{n,k},P_{n,k}).
Similarly, for each fold k in the second group, estimator f_{n,k,s} of f_{0,s} is constructed using all data besides the kth fold in the group (i.e., (2K - 1)/(2K) of the data) and estimator P_{n,k} of P_0 is constructed using the held-out data (i.e., 1/(2K) of the data); we then compute
v_{n,k,s} = V(f_{n,k,s},P_{n,k}).
Finally,
ψ_{n,s} := K^{-1} ∑_{k=1}^K \{v_{n,k} - v_{n,k,s}\}.
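The sample-splitting variant can be sketched in the same spirit. This is a base-R illustration under stated assumptions, not the package's implementation: lm() stands in for a flexible learner, R-squared is the predictiveness measure, each estimator is trained on all data outside the held-out fold (matching the (2K - 1)/(2K) fraction stated above), and the helpers r_squared and cf_vim_split are hypothetical.

```r
# Hypothetical helper: R-squared predictiveness evaluated on held-out data
r_squared <- function(y, preds) {
  1 - mean((y - preds)^2) / mean((y - mean(y))^2)
}

# Hypothetical helper: cross-fitted VIM with sample-splitting
cf_vim_split <- function(dat, s, K = 2) {
  # split the data into 2K folds
  folds <- sample(rep(seq_len(2 * K), length.out = nrow(dat)))
  # divide the folds into two groups: one for the full fit, one for the reduced
  full_ids <- sample(seq_len(2 * K), K)
  reduced_ids <- setdiff(seq_len(2 * K), full_ids)
  # fit on everything outside fold k, evaluate predictiveness on fold k
  holdout_r2 <- function(k, cols) {
    train <- dat[folds != k, cols, drop = FALSE]
    test <- dat[folds == k, cols, drop = FALSE]
    fit <- lm(y ~ ., data = train)
    r_squared(test$y, predict(fit, newdata = test))
  }
  # v_{n,k}: full-model predictiveness on folds in the first group
  v_full <- vapply(full_ids, holdout_r2, numeric(1), cols = names(dat))
  # v_{n,k,s}: reduced-model predictiveness on folds in the second group
  v_reduced <- vapply(reduced_ids, holdout_r2, numeric(1),
                      cols = setdiff(names(dat), s))
  mean(v_full - v_reduced)  # K^{-1} times the sum of the contrasts
}

# simulated data in which x1 matters and x2 does not
set.seed(2)
n <- 600
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n)
psi_split_x1 <- cf_vim_split(dat, s = "x1", K = 2)
psi_split_x2 <- cf_vim_split(dat, s = "x2", K = 2)
```

Because the full and reduced predictiveness are evaluated on disjoint folds, the estimate for an unimportant feature can fall on either side of zero.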
See the paper by Williamson, Gilbert, Simon, and Carone for more details on the mathematics behind the cv_vim function, and the validity of the confidence intervals.
In the interest of transparency, we return most of the calculations within the vim object. This results in a list including:
the column(s) to calculate variable importance for
the library of learners passed to SuperLearner
the fitted values of the chosen method fit to the full data (a list, for train and test data)
the fitted values of the chosen method fit to the reduced data (a list, for train and test data)
the estimated variable importance
the naive estimator of variable importance
the estimated efficient influence function
the estimated efficient influence function for the full regression
the estimated efficient influence function for the reduced regression
the standard error for the estimated variable importance
the (1 - α) × 100% confidence interval for the variable importance estimate
a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
a p-value based on the same test as test
the object returned by the estimation procedure for the full data regression (if applicable)
the object returned by the estimation procedure for the reduced data regression (if applicable)
the level, for confidence interval calculation
the folds used for hypothesis testing
the folds used for cross-fitting
the outcome
the weights
a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value
An object of class vim. See Details for more information.
SuperLearner for specific usage of the SuperLearner function and package.
# cv_vim and the helper functions used below come from the vimp package
library("vimp")
n <- 100
p <- 2
# generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[, 1] / 5)^2 * (x[, 1] + 7) / 5 + (x[, 2] / 3)^2
# generate Y ~ Normal (smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm")
#
# using Super Learner (with a small number of folds, for illustration only)
#
set.seed(4747)
est <- cv_vim(Y = y, X = x, indx = 2, V = 2,
              type = "r_squared", run_regression = TRUE,
              SL.library = learners, cvControl = list(V = 2), alpha = 0.05)
#
# doing things by hand, and plugging them in
# (with a small number of folds, for illustration only)
#
# set up the folds
indx <- 2
V <- 2
Y <- matrix(y)
set.seed(4747)
# Note that the CV.SuperLearner should be run with an outer layer
# of 2*V folds (for V-fold cross-fitted importance)
full_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
  Y = Y, X = x, SL.library = learners, cvControl = list(V = 2 * V),
  innerCvControl = list(list(V = V))
))
# use the same cross-fitting folds for reduced
reduced_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
  Y = Y, X = x[, indx, drop = FALSE], SL.library = learners,
  cvControl = SuperLearner::SuperLearner.CV.control(
    V = 2 * V, validRows = full_cv_fit$folds
  ),
  innerCvControl = list(list(V = V))
))
# extract the predictions on split portions of the data,
# for hypothesis testing
cross_fitting_folds <- get_cv_sl_folds(full_cv_fit$folds)
set.seed(1234)
sample_splitting_folds <- make_folds(unique(cross_fitting_folds), V = 2)
full_cv_preds <- extract_sampled_split_predictions(
  full_cv_fit, sample_splitting_folds = sample_splitting_folds, full = TRUE
)
reduced_cv_preds <- extract_sampled_split_predictions(
  reduced_cv_fit, sample_splitting_folds = sample_splitting_folds, full = FALSE
)
set.seed(5678)
est <- cv_vim(Y = y, cross_fitted_f1 = full_cv_preds,
              cross_fitted_f2 = reduced_cv_preds, indx = 2, delta = 0, V = V,
              type = "r_squared",
              cross_fitting_folds = cross_fitting_folds,
              sample_splitting_folds = sample_splitting_folds,
              run_regression = FALSE, alpha = 0.05, na.rm = TRUE)
