vim  R Documentation 
Compute estimates of and confidence intervals for nonparametric intrinsic variable importance, based on the population-level contrast between the oracle predictiveness with and without the feature(s) of interest.
vim(
  Y = NULL,
  X = NULL,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  type = "r_squared",
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  scale = "identity",
  na.rm = FALSE,
  sample_splitting = TRUE,
  sample_splitting_folds = NULL,
  final_point_estimate = "split",
  stratified = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_scale = "identity",
  ipc_weights = rep(1, length(Y)),
  ipc_est_type = "aipw",
  scale_est = TRUE,
  nuisance_estimators_full = NULL,
  nuisance_estimators_reduced = NULL,
  exposure_name = NULL,
  bootstrap = FALSE,
  b = 1000,
  boot_interval_type = "perc",
  ...
)
Y 
the outcome. 
X 
the covariates. If 
f1 
the fitted values from a flexible estimation technique
regressing Y on X. A vector of the same length as 
f2 
the fitted values from a flexible estimation technique
regressing either (a) 
indx 
the indices of the covariate(s) to calculate variable importance for; defaults to 1. 
type 
the type of importance to compute; defaults to "r_squared".
run_regression 
if outcome Y and covariates X are passed to

SL.library 
a character vector of learners to pass to

alpha 
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval. 
delta 
the value of the δ-null (i.e., testing if importance < δ); defaults to 0.
scale 
should CIs be computed on original ("identity") or another scale? (options are "log" and "logit") 
na.rm 
should we remove NAs in the outcome and fitted values
in computation? (defaults to FALSE)
sample_splitting 
should we use sample-splitting to estimate the full and
reduced predictiveness? Defaults to TRUE.
sample_splitting_folds 
the folds used for sample-splitting;
these identify the observations that should be used to evaluate
predictiveness based on the full and reduced sets of covariates, respectively.
Only used if 
final_point_estimate 
if sample splitting is used, should the final point estimates
be based on only the sample-split folds used for inference (
stratified 
if run_regression = TRUE, then should the generated folds be stratified based on the outcome? (This helps to ensure class balance across cross-validation folds.)
C 
the indicator of coarsening (1 denotes observed, 0 denotes unobserved). 
Z 
either (i) NULL (the default, in which case the argument

ipc_scale 
the scale on which the inverse probability weight correction should be applied (if any). Defaults to "identity"; other options are "log" and "logit".
ipc_weights 
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).
ipc_est_type 
the type of procedure used for coarsened-at-random
settings; options are "ipw" (for inverse probability weighting) or
"aipw" (for augmented inverse probability weighting).
Only used if 
scale_est 
should the point estimate be scaled to be greater than or equal to 0?
Defaults to TRUE.
nuisance_estimators_full 
(only used if 
nuisance_estimators_reduced 
(only used if 
exposure_name 
(only used if 
bootstrap 
should bootstrap-based standard error estimates be computed?
Defaults to FALSE.
b 
the number of bootstrap replicates (only used if 
boot_interval_type 
the type of bootstrap interval (one of 
... 
other arguments to the estimation tool, see "See also". 
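The ipc_weights argument above is described as already inverted, i.e., one over the estimated probability of being observed. The following is a minimal base-R sketch of how such weights might be constructed, assuming a single hypothetical always-observed variable z and a logistic regression for the observation probability. The simulated data and the choice of glm are illustrative assumptions, not part of this function.

```r
# Hypothetical setup: z is always observed; c_obs is the coarsening indicator
# (1 denotes observed, 0 denotes unobserved), simulated for illustration only.
set.seed(1)
n <- 200
z <- rnorm(n)
c_obs <- rbinom(n, size = 1, prob = plogis(1 + 0.5 * z))

# Estimate P(C = 1 | Z) with logistic regression (an assumed, simple choice)
pi_hat <- fitted(glm(c_obs ~ z, family = binomial()))

# Invert the estimated probabilities to get weights in the form described above
ipc_weights <- 1 / pi_hat
```

Weights built this way could then be supplied together with C = c_obs; any flexible estimator of the observation probability could replace glm.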
We define the population variable importance measure (VIM) for the group of features (or single feature) s with respect to the predictiveness measure V by
ψ_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),
where f_0 is the population predictiveness-maximizing function, f_{0,s} is the population predictiveness-maximizing function that is only allowed to access the features with index not in s, and P_0 is the true data-generating distribution. VIM estimates are obtained by constructing estimators f_n and f_{n,s} of f_0 and f_{0,s}, respectively; obtaining an estimator P_n of P_0; and finally, setting ψ_{n,s} := V(f_n, P_n) - V(f_{n,s}, P_n).
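To make the plug-in definition concrete, here is a minimal base-R sketch that computes the difference between full and reduced predictiveness with V taken to be R-squared. It assumes simulated data and uses lm as a stand-in for a flexible estimator, evaluated in-sample. It omits sample-splitting, cross-fitting, and influence-function-based standard errors, so it illustrates the formula only and is not how this function computes its estimates.

```r
# Simulated data (illustrative assumption): two features, x1 matters more
set.seed(20)
n <- 500
dat <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
dat$y <- 1 + 2 * dat$x1 + 0.5 * dat$x2 + rnorm(n)

# V(f, P_n): empirical R-squared of a vector of fitted values
r_squared <- function(fitted_vals, y) {
  1 - mean((y - fitted_vals)^2) / mean((y - mean(y))^2)
}

# f_n uses all features; f_{n,s} may not use the feature in s = {x1}
full_fit <- lm(y ~ x1 + x2, data = dat)
reduced_fit <- lm(y ~ x2, data = dat)

# Plug-in estimate: V(f_n, P_n) - V(f_{n,s}, P_n)
v_full <- r_squared(fitted(full_fit), dat$y)
v_reduced <- r_squared(fitted(reduced_fit), dat$y)
psi_n <- v_full - v_reduced
```

Because the two linear models are nested and evaluated in-sample, psi_n is nonnegative here by construction; with flexible learners and held-out evaluation that need not hold.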
In the interest of transparency, we return most of the calculations within the vim object. This results in a list including:
the column(s) to calculate variable importance for
the library of learners passed to SuperLearner
the type of risk-based variable importance measured
the fitted values of the chosen method fit to the full data
the fitted values of the chosen method fit to the reduced data
the estimated variable importance
the naive estimator of variable importance (only used if type = "anova")
the estimated efficient influence function
the estimated efficient influence function for the full regression
the estimated efficient influence function for the reduced regression
the standard error for the estimated variable importance
the (1 - α) × 100% confidence interval for the variable importance estimate
a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
a p-value based on the same test as test
the object returned by the estimation procedure for the full data regression (if applicable)
the object returned by the estimation procedure for the reduced data regression (if applicable)
the level, for confidence interval calculation
the folds used for sample-splitting (used for hypothesis testing)
the outcome
the weights
a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value
An object of classes vim and the type of risk-based measure.
See Details for more information.
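As a rough illustration of how a point estimate, standard error, level alpha, and null value delta relate to an interval and a testing decision, the sketch below builds a generic Wald-type interval and one-sided test on the identity scale. The numbers are hypothetical, and this generic construction is an assumption for illustration; it is not necessarily the procedure used to produce the returned interval, decision, and p-value.

```r
# Hypothetical inputs, chosen for illustration only
est <- 0.12    # estimated variable importance
se <- 0.04     # its standard error
alpha <- 0.05  # level; gives a 95% interval
delta <- 0     # null value for the test

# (1 - alpha) x 100% Wald-type interval on the identity scale
z_crit <- qnorm(1 - alpha / 2)
ci <- c(lower = est - z_crit * se, upper = est + z_crit * se)

# One-sided test of H0: importance <= delta against H1: importance > delta
p_value <- pnorm((est - delta) / se, lower.tail = FALSE)
reject <- p_value < alpha
```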
SuperLearner for specific usage of the SuperLearner function and package.
# load the packages; vim() and get_cv_sl_folds() come from vimp
library("vimp")
library("SuperLearner")

# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -1, 1)))

# apply the function to the x's
f <- function(x) 0.5 + 0.3*x[1] + 0.2*x[2]
smooth <- apply(x, 1, function(z) f(z))

# generate Y ~ Bernoulli (smooth)
y <- matrix(rbinom(n, size = 1, prob = smooth))

# set up a library for SuperLearner; note simple library for speed
learners <- c("SL.glm")

# using Y and X; use class-balanced folds
est_1 <- vim(y, x, indx = 2, type = "accuracy",
             alpha = 0.05, run_regression = TRUE,
             SL.library = learners, cvControl = list(V = 2),
             stratified = TRUE)

# using pre-computed fitted values
set.seed(4747)
V <- 2
full_fit <- SuperLearner::CV.SuperLearner(Y = y, X = x,
                                          SL.library = learners,
                                          cvControl = list(V = 2),
                                          innerCvControl = list(list(V = V)))
full_fitted <- SuperLearner::predict.SuperLearner(full_fit)$pred

# fit the data with only X1 (drop column 2, the feature of interest)
reduced_fit <- SuperLearner::CV.SuperLearner(Y = full_fitted,
                                             X = x[, -2, drop = FALSE],
                                             SL.library = learners,
                                             cvControl = list(V = 2, validRows = full_fit$folds),
                                             innerCvControl = list(list(V = V)))
reduced_fitted <- SuperLearner::predict.SuperLearner(reduced_fit)$pred

est_2 <- vim(Y = y, f1 = full_fitted, f2 = reduced_fitted,
             indx = 2, run_regression = FALSE, alpha = 0.05,
             stratified = TRUE, type = "accuracy",
             sample_splitting_folds = get_cv_sl_folds(full_fit$folds))