Variable importance estimation using causal inference (targeted learning)


varimpact returns variable importance statistics ordered by statistical significance, using a combination of a data-adaptive target parameter and cross-validated TMLE (CV-TMLE).


varimpact(
  Y,
  data,
  A_names = colnames(data),
  V = 2L,
  Q.library = c("SL.glm", "SL.mean"),
  g.library = c("SL.glm", "SL.mean"),
  family = "binomial",
  minYs = 15L,
  minCell = 0L,
  adjust_cutoff = 10L,
  corthres = 0.8,
  impute = "median",
  miss.cut = 0.5,
  bins_numeric = 10L,
  quantile_probs_factor = c(0.1, 0.9),
  quantile_probs_numeric = quantile_probs_factor,
  verbose = FALSE,
  verbose_tmle = FALSE,
  verbose_reduction = FALSE,
  parallel = TRUE,
  digits = 4L
)



Y: Outcome of interest (numeric vector).


data: Data frame of predictor variables of interest, for which the function returns variable importance measures (VIMs). (Possibly a matrix.)


A_names: Names of the variables for which we want to estimate importance; a subset of the columns of data.


V: Number of cross-validation folds.


Q.library: Library used by SuperLearner for the model of the outcome versus the predictors.


g.library: Library used by SuperLearner for the model of the predictor variable of interest versus the other predictors.


family: Outcome family, either 'binomial' or 'gaussian'.


minYs: Minimum number of observations with the event. If there are fewer than minYs, the VIM for that variable is skipped.


minCell: Cut-off for including a category of A in the analysis. It is the minimum cell count in a 2x2 table of the indicator of that level versus the outcome, evaluated separately in the training and validation samples.


adjust_cutoff: Maximum number of adjustment variables during TMLE. If there are more than this cutoff, varimpact will attempt to reduce the dimension to that number (using HOPACH hierarchical clustering). Must not be more than 15 due to HOPACH constraints. Set to NULL to disable any dimension reduction.


corthres: Cut-off for the correlation with the explanatory variable, used when deciding whether to include an adjustment variable.


impute: Type of missing value imputation to conduct. One of "zero", "median" (default), or "knn". Note: "knn" results in the covariate data being centered and scaled.


miss.cut: Eliminates explanatory (X) variables whose proportion of missing observations exceeds this cut-off.


bins_numeric: Number of bins when discretizing numeric variables.


quantile_probs_factor: Quantiles used to check if factors have sufficient variation.


quantile_probs_numeric: Quantiles used to check if numerics have sufficient variation.


verbose: Boolean. If TRUE, the method will display more detailed output.


verbose_tmle: Boolean. If TRUE, will display even more detail on the TMLE estimation process.


verbose_reduction: Boolean. If TRUE, will display more detail during the variable reduction step (clustering).


parallel: Use parallel processing if a backend is registered; enabled by default.


digits: Number of digits to round the value labels.


The function performs the following steps.

  1. Drops variables that are missing in more than a proportion miss.cut of observations (tuneable).

  2. Separates the covariates into factors and continuous (ordered) variables.

  3. Drops variables whose distribution is uneven, e.g., all one value (tuneable), separately for factors and numeric variables.

  4. Makes a dummy variable basis for factors, naming the dummies so that they can be traced back to the original factor variables later.

  5. Makes a new ordered variable of integers mapped to intervals defined by deciles for the ordered numeric variables (automatically makes fewer categories if the original variable has < 10 values).

  6. Creates an associated list of the number of unique values, and the values themselves, for each variable, for use in the variable importance step.

  7. Makes a missing-covariate basis for both factors and ordered variables.

  8. For each variable, after assigning it as A, uses an optimal histogram function to combine values using the distribution of A | Y=1, to avoid very small cell sizes in the distribution of Y vs. A (tuneable).

  9. Uses HOPACH* to cluster the variables in the associated confounder/missingness basis for W, reducing them to the specified number of adjustment variables (see adjust_cutoff).

  10. Finds the minimum and maximum estimates of E(Ya) with respect to a, after looping through all values of A (as processed by the histogram step).

  11. Returns the estimate of E(Ya(max) - Ya(min)) with its SE, using CV-TMLE.

*HOPACH is "Hierarchical Ordered Partitioning and Collapsing Hybrid"
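
A few of the steps above can be illustrated in base R. The following is a minimal sketch, not varimpact's internal code: the toy data, the helper to_decile_bins(), and the variable names are all hypothetical, and the final contrast is a naive, unadjusted analogue of steps 10 and 11 rather than a CV-TMLE estimate.

```r
# Illustrative sketch only: base R, not varimpact's internal code.
set.seed(1)
n <- 200
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy$c[1:160] <- NA                      # column c is 80% missing
y <- rbinom(n, 1, plogis(0.8 * toy$a))  # binary outcome driven by a

# Step 1: drop columns whose proportion of missing values exceeds miss.cut.
miss.cut <- 0.5
kept <- toy[, colMeans(is.na(toy)) <= miss.cut, drop = FALSE]

# Step 5: map a numeric column to integer bins defined by its deciles.
# unique() removes duplicated break points, so a variable with few
# distinct values ends up with fewer bins.
to_decile_bins <- function(x, bins = 10L) {
  probs <- seq(0, 1, length.out = bins + 1L)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}
a_binned <- to_decile_bins(kept$a)

# Steps 10-11, naive analogue: mean outcome within each bin of A, then the
# contrast between the highest and lowest bin means. This is unadjusted and
# not cross-validated, unlike the CV-TMLE estimate that varimpact reports.
bin_means <- tapply(y, a_binned, mean)
naive_contrast <- max(bin_means) - min(bin_means)
```

Because the minimum and maximum are chosen after looking at the data, a contrast computed this way on a single sample tends to be optimistic; this is one motivation for the cross-validated approach described in step 11.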


Results object.


Alan E. Hubbard and Chris J. Kennedy, University of California, Berkeley


See Also

exportLatex, print.varimpact


# Create test dataset.
N <- 100
num_normal <- 5
X <- as.data.frame(matrix(rnorm(N * num_normal), N, num_normal))
Y <- rbinom(N, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] + .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))
# Add some missing data to X so we can test imputation.
for (i in 1:10) X[sample(nrow(X), 1), sample(ncol(X), 1)] <- NA

# Basic example

vim <- varimpact(Y = Y, data = X[, 1:3])

# Explicitly select median imputation.
## Not run: 
vim <- varimpact(Y = Y, data = X[, 1:3], impute = "median")

## End(Not run)

# Multicore parallel example.
## Not run: 
# Setup multicore parallelization.
future::plan("multisession", workers = 2)

vim <- varimpact(Y = Y, data = X[, 1:3])

## End(Not run)

# Cluster parallel example.
## Not run: 
cl <- parallel::makeCluster(2L)
future::plan("cluster", workers = cl)
vim <- varimpact(Y = Y, data = X[, 1:3])
# Shut down the worker processes once finished.
parallel::stopCluster(cl)

## End(Not run)

# mlbench BreastCancer example.
## Not run: 
data(BreastCancer, package="mlbench")
data <- BreastCancer

set.seed(1, "L'Ecuyer-CMRG")
# Reduce to a dataset of 100 observations to speed up testing.
data <- data[sample(nrow(data), 100), ]
# Create a numeric outcome variable.
data$Y <- as.numeric(data$Class == "malignant")
# Use multicore parallelization to speed up processing.
future::plan("multisession", workers = 2)
vim <- varimpact(Y = data$Y, data = subset(data, select=-c(Y, Class, Id)))

## End(Not run)

