cv_vim | R Documentation |
Compute estimates and confidence intervals using cross-fitting for nonparametric intrinsic variable importance based on the population-level contrast between the oracle predictiveness using the feature(s) of interest versus not.
cv_vim(
Y = NULL,
X = NULL,
cross_fitted_f1 = NULL,
cross_fitted_f2 = NULL,
f1 = NULL,
f2 = NULL,
indx = 1,
V = ifelse(is.null(cross_fitting_folds), 5, length(unique(cross_fitting_folds))),
sample_splitting = TRUE,
final_point_estimate = "split",
sample_splitting_folds = NULL,
cross_fitting_folds = NULL,
stratified = FALSE,
type = "r_squared",
run_regression = TRUE,
SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
alpha = 0.05,
delta = 0,
scale = "identity",
na.rm = FALSE,
C = rep(1, length(Y)),
Z = NULL,
ipc_scale = "identity",
ipc_weights = rep(1, length(Y)),
ipc_est_type = "aipw",
scale_est = TRUE,
nuisance_estimators_full = NULL,
nuisance_estimators_reduced = NULL,
exposure_name = NULL,
cross_fitted_se = TRUE,
bootstrap = FALSE,
b = 1000,
boot_interval_type = "perc",
clustered = FALSE,
cluster_id = rep(NA, length(Y)),
...
)
Y |
the outcome. |
X |
the covariates. If |
cross_fitted_f1 |
the predicted values on validation data from a
flexible estimation technique regressing Y on X in the training data. Provided as
either (a) a vector, where each element is
the predicted value when that observation is part of the validation fold;
or (b) a list of length V, where each element in the list is a set of predictions on the
corresponding validation data fold.
If sample-splitting is requested, then these must be estimated specially; see Details. However,
the resulting vector should be the same length as |
cross_fitted_f2 |
the predicted values on validation data from a
flexible estimation technique regressing either (a) the fitted values in
|
f1 |
the fitted values from a flexible estimation technique
regressing Y on X. If sample-splitting is requested, then these must be
estimated specially; see Details. If |
f2 |
the fitted values from a flexible estimation technique
regressing either (a) |
indx |
the indices of the covariate(s) to calculate variable importance for; defaults to 1. |
V |
the number of folds for cross-fitting, defaults to 5. If
|
sample_splitting |
should we use sample-splitting to estimate the full and
reduced predictiveness? Defaults to |
final_point_estimate |
if sample splitting is used, should the final point estimates
be based on only the sample-split folds used for inference ( |
sample_splitting_folds |
the folds used for sample-splitting;
these identify the observations that should be used to evaluate
predictiveness based on the full and reduced sets of covariates, respectively.
Only used if |
cross_fitting_folds |
the folds for cross-fitting. Only used if
|
stratified |
if run_regression = TRUE, then should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-validation folds) |
type |
the type of importance to compute; defaults to
|
run_regression |
if outcome Y and covariates X are passed to
|
SL.library |
a character vector of learners to pass to
|
alpha |
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval. |
delta |
the value of the |
scale |
should CIs be computed on original ("identity") or another scale? (options are "log" and "logit") |
na.rm |
should we remove NAs in the outcome and fitted values
in computation? (defaults to |
C |
the indicator of coarsening (1 denotes observed, 0 denotes unobserved). |
Z |
either (i) NULL (the default, in which case the argument
|
ipc_scale |
what scale should the inverse probability weight correction be applied on (if any)? Defaults to "identity". (other options are "log" and "logit") |
ipc_weights |
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]). |
ipc_est_type |
the type of procedure used for coarsened-at-random
settings; options are "ipw" (for inverse probability weighting) or
"aipw" (for augmented inverse probability weighting).
Only used if |
scale_est |
should the point estimate be scaled to be greater than or equal to 0?
Defaults to |
nuisance_estimators_full |
(only used if |
nuisance_estimators_reduced |
(only used if |
exposure_name |
(only used if |
cross_fitted_se |
should we use cross-fitting to estimate the standard
errors ( |
bootstrap |
should bootstrap-based standard error estimates be computed?
Defaults to |
b |
the number of bootstrap replicates (only used if |
boot_interval_type |
the type of bootstrap interval (one of |
clustered |
should the bootstrap resamples be performed on clusters
rather than individual observations? Defaults to |
cluster_id |
vector of the same length as |
... |
other arguments to the estimation tool, see "See also". |
We define the population variable importance measure (VIM) for the
group of features (or single feature) s
with respect to the
predictiveness measure V
by
\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),
where f_0
is
the population predictiveness maximizing function, f_{0,s}
is the
population predictiveness maximizing function that is only allowed to access
the features with index not in s
, and P_0
is the true
data-generating distribution.
Cross-fitted VIM estimates are computed differently if sample-splitting
is requested versus if it is not. We recommend using sample-splitting
in most cases, since only in this case will inferences be valid if
the variable(s) of interest have truly zero population importance.
The purpose of cross-fitting is to estimate f_0
and f_{0,s}
on independent data from estimating P_0
; this can result in improved
performance, especially when using flexible learning algorithms. The purpose
of sample-splitting is to estimate f_0
and f_{0,s}
on independent
data; this allows valid inference under the null hypothesis of zero importance.
Without sample-splitting, cross-fitted VIM estimates are obtained by first
splitting the data into K
folds; then using each fold in turn as a
hold-out set, constructing estimators f_{n,k}
and f_{n,k,s}
of
f_0
and f_{0,s}
, respectively on the training data and estimator
P_{n,k}
of P_0
using the test data; and finally, computing
\psi_{n,s} := K^{(-1)}\sum_{k=1}^K \{V(f_{n,k},P_{n,k}) - V(f_{n,k,s}, P_{n,k})\}.
With sample-splitting, cross-fitted VIM estimates are obtained by first
splitting the data into 2K
folds. These folds are further divided
into 2 groups of folds. Then, for each fold k
in the first group,
estimator f_{n,k}
of f_0
is constructed using all data besides
the kth fold in the group (i.e., (2K - 1)/(2K)
of the data) and
estimator P_{n,k}
of P_0
is constructed using the held-out data
(i.e., 1/2K
of the data); then, computing
v_{n,k} = V(f_{n,k},P_{n,k}).
Similarly, for each fold k
in the second group,
estimator f_{n,k,s}
of f_{0,s}
is constructed using all data
besides the kth fold in the group (i.e., (2K - 1)/(2K)
of the data)
and estimator P_{n,k}
of P_0
is constructed using the held-out
data (i.e., 1/2K
of the data); then, computing
v_{n,k,s} = V(f_{n,k,s},P_{n,k}).
Finally,
\psi_{n,s} := K^{(-1)}\sum_{k=1}^K \{v_{n,k} - v_{n,k,s}\}.
See the paper by Williamson, Gilbert, Simon, and Carone for more
details on the mathematics behind the cv_vim
function, and the
validity of the confidence intervals.
In the interest of transparency, we return most of the calculations
within the vim
object. This results in a list including:
the column(s) to calculate variable importance for
the library of learners passed to SuperLearner
the fitted values of the chosen method fit to the full data (a list, for train and test data)
the fitted values of the chosen method fit to the reduced data (a list, for train and test data)
the estimated variable importance
the naive estimator of variable importance
the estimated efficient influence function
the estimated efficient influence function for the full regression
the estimated efficient influence function for the reduced regression
the standard error for the estimated variable importance
the (1-\alpha) \times 100
% confidence interval for the variable importance estimate
a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
a p-value based on the same test as test
the object returned by the estimation procedure for the full data regression (if applicable)
the object returned by the estimation procedure for the reduced data regression (if applicable)
the level, for confidence interval calculation
the folds used for hypothesis testing
the folds used for cross-fitting
the outcome
the weights
the cluster IDs
a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value
An object of class vim
. See Details for more information.
SuperLearner
for specific usage of the
SuperLearner
function and package.
n <- 100
p <- 2
# generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2
# generate Y ~ Normal (smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm")
# -----------------------------------------
# using Super Learner (with a small number of folds, for illustration only)
# -----------------------------------------
set.seed(4747)
est <- cv_vim(Y = y, X = x, indx = 2, V = 2,
type = "r_squared", run_regression = TRUE,
SL.library = learners, cvControl = list(V = 2), alpha = 0.05)
# ------------------------------------------
# doing things by hand, and plugging them in
# (with a small number of folds, for illustration only)
# ------------------------------------------
# set up the folds
indx <- 2
V <- 2
Y <- matrix(y)
set.seed(4747)
# Note that the CV.SuperLearner should be run with an outer layer
# of 2*V folds (for V-fold cross-fitted importance)
full_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
Y = Y, X = x, SL.library = learners, cvControl = list(V = 2 * V),
innerCvControl = list(list(V = V))
))
full_cv_preds <- full_cv_fit$SL.predict
# use the same cross-fitting folds for reduced
reduced_cv_fit <- suppressWarnings(SuperLearner::CV.SuperLearner(
Y = Y, X = x[, -indx, drop = FALSE], SL.library = learners,
cvControl = SuperLearner::SuperLearner.CV.control(
V = 2 * V, validRows = full_cv_fit$folds
),
innerCvControl = list(list(V = V))
))
reduced_cv_preds <- reduced_cv_fit$SL.predict
# for hypothesis testing
cross_fitting_folds <- get_cv_sl_folds(full_cv_fit$folds)
set.seed(1234)
sample_splitting_folds <- make_folds(unique(cross_fitting_folds), V = 2)
set.seed(5678)
est <- cv_vim(Y = y, cross_fitted_f1 = full_cv_preds,
cross_fitted_f2 = reduced_cv_preds, indx = 2, delta = 0, V = V, type = "r_squared",
cross_fitting_folds = cross_fitting_folds,
sample_splitting_folds = sample_splitting_folds,
run_regression = FALSE, alpha = 0.05, na.rm = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.