varImpact: Variable importance estimation using causal inference (targeted learning)

varimpact {varImpact}    R Documentation

Variable importance estimation using causal inference (targeted learning)

Description

varimpact returns variable importance statistics ordered by statistical significance, using a combination of data-adaptive target parameter estimation and targeted maximum likelihood estimation (TMLE).

Usage

varimpact(
  Y,
  data,
  A_names = colnames(data),
  V = 2L,
  Q.library = c("SL.glm", "SL.mean"),
  g.library = c("SL.glm", "SL.mean"),
  family = "binomial",
  minYs = 15L,
  minCell = 0L,
  adjust_cutoff = 10L,
  corthres = 0.8,
  impute = "median",
  miss.cut = 0.5,
  bins_numeric = 10L,
  quantile_probs_factor = c(0.1, 0.9),
  quantile_probs_numeric = quantile_probs_factor,
  verbose = FALSE,
  verbose_tmle = FALSE,
  verbose_reduction = FALSE,
  parallel = TRUE,
  digits = 4L
)

Arguments

Y

outcome of interest (numeric vector)

data

Data frame of predictor variables of interest, for which the function returns variable importance measures (VIMs). A matrix may also work but is untested.

A_names

Names of the variables for which we want to estimate importance, a subset of the data argument.

V

Number of cross-validation folds.

Q.library

library used by SuperLearner for model of outcome versus predictors

g.library

library used by SuperLearner for model of predictor variable of interest versus other predictors

family

Outcome family: 'binomial' or 'gaussian'.

minYs

Minimum number of observations with the event; if the observed count is below minYs, the VIM for that variable is skipped.

minCell

Cut-off for including a category of A in the analysis: the minimum cell count in a 2x2 table of the indicator for that level versus the outcome, computed separately in the training and validation samples.

adjust_cutoff

Maximum number of adjustment variables during TMLE. If there are more adjustment variables than this cutoff, varimpact will attempt to reduce the dimensionality to this number using HOPACH hierarchical clustering. Must be at most 15 due to HOPACH constraints. Set to NULL to disable dimension reduction.

corthres

Correlation cut-off with the explanatory variable of interest; adjustment variables correlated above this threshold are excluded.

impute

Type of missing value imputation to conduct. One of: "zero", "median" (the default, per the Usage signature), "knn". Note: "knn" results in the covariate data being centered and scaled.
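For intuition, median imputation can be sketched in a few lines of base R. This is only an illustration of the idea; the package's actual implementation (which also supports "zero" and "knn") may differ in details such as factor handling.

```r
# Illustrative median imputation (impute = "median"); not varimpact's code.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      # Replace missing values with the column median of observed values.
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 2))
impute_median(df)  # NAs replaced by column medians: a[2] = 2, b[1] = 2
```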

miss.cut

Eliminates explanatory (X) variables whose proportion of missing observations exceeds miss.cut.

bins_numeric

Number of bins when discretizing numeric variables.

quantile_probs_factor

Quantiles used to check if factors have sufficient variation.

quantile_probs_numeric

Quantiles used to check if numerics have sufficient variation.
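As a rough sketch of the variation check these quantile arguments presumably control: a variable lacks sufficient variation when its lower and upper quantiles coincide, i.e. when nearly all observations share one value. This is an illustration under that assumption, not the package's exact logic.

```r
# Illustrative sketch (not varimpact's exact code): flag a numeric variable
# as lacking variation when its 0.1 and 0.9 quantiles are identical.
lacks_variation <- function(x, probs = c(0.1, 0.9)) {
  qs <- quantile(x, probs = probs, na.rm = TRUE)
  qs[[1]] == qs[[2]]
}

lacks_variation(c(rep(0, 95), rep(1, 5)))  # TRUE: 95% of values are 0
lacks_variation(rnorm(100))                # FALSE: plenty of variation
```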

verbose

Boolean - if TRUE the method will display more detailed output.

verbose_tmle

Boolean - if TRUE, will display even more detail on the TMLE estimation process.

verbose_reduction

Boolean - if TRUE, will display more detail during variable reduction step (clustering).

parallel

Use parallel processing if a backend is registered; enabled by default.

digits

Number of digits to round the value labels.

Details

The function performs the following steps:

  1. Drops variables missing more than miss.cut proportion of the time (tuneable).

  2. Separates the covariates into factors and continuous (ordered) variables.

  3. Drops variables whose distribution is too uneven - e.g., nearly all one value - checked separately for factors and numeric variables via quantile_probs_factor and quantile_probs_numeric (tuneable).

  4. Creates a dummy variable basis for factors, naming the dummies so they remain traceable to the original factor variables.

  5. Maps each ordered numeric variable to a new ordered variable of integers corresponding to decile intervals, automatically using fewer categories if the original variable has fewer than 10 unique values.

  6. Creates an associated list of the number of unique values, and the values themselves, for each variable, for use in the variable importance step.

  7. Creates a missingness indicator basis for both factor and ordered variables.

  8. For each variable, after assigning it as the treatment A, uses an optimal histogram function to combine values based on the distribution of A | Y = 1, avoiding very small cell sizes in the distribution of Y versus A (tuneable).

  9. Uses HOPACH* to cluster the variables in the associated confounder/missingness basis W, reducing it to the specified maximum number of adjustment variables.

  10. Finds the minimum and maximum estimates of E(Ya) with respect to a, after looping through all values of A (as processed by the histogram step).

  11. Returns an estimate of E(Ya(max) - Ya(min)) with its standard error, using CV-TMLE.

*HOPACH stands for "Hierarchical Ordered Partitioning and Collapsing Hybrid".
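The decile-based discretization in step 5 can be sketched in plain R. This is an illustration under assumed behavior; varimpact's internal implementation may differ in edge-case handling.

```r
# Illustrative sketch of decile binning (step 5); not varimpact's exact code.
discretize_deciles <- function(x, bins = 10L) {
  n_unique <- length(unique(x[!is.na(x)]))
  # Use fewer categories if the variable has fewer unique values than bins.
  bins <- min(bins, n_unique)
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1),
                            na.rm = TRUE))
  # Map each value to the integer index of its quantile interval.
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

set.seed(1)
table(discretize_deciles(rnorm(100)))  # roughly 10 observations per bin
```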

Value

Results object of class "varimpact"; see the Examples for accessing components such as results_all.

Authors

Alan E. Hubbard and Chris J. Kennedy, University of California, Berkeley

References

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), 289-300.

Gruber, S., & van der Laan, M. J. (2012). tmle: An R Package for Targeted Maximum Likelihood Estimation. Journal of Statistical Software, 51(i13).

Hubbard, A. E., Kherad-Pajouh, S., & van der Laan, M. J. (2016). Statistical Inference for Data Adaptive Target Parameters. The international journal of biostatistics, 12(1), 3-19.

Hubbard, A., Munoz, I. D., Decker, A., Holcomb, J. B., Schreiber, M. A., Bulger, E. M., ... & Rahbar, M. H. (2013). Time-Dependent Prediction and Evaluation of Variable Importance Using SuperLearning in High Dimensional Clinical Data. The journal of trauma and acute care surgery, 75(1 0 1), S53.

Hubbard, A. E., & van der Laan, M. J. (2016). Mining with inference: data-adaptive target parameters (pp. 439-452). In P. Buhlmann et al. (Ed.), Handbook of Big Data. CRC Press, Taylor & Francis Group, LLC: Boca Raton, FL.

van der Laan, M. J. (2006). Statistical inference for variable importance. The International Journal of Biostatistics, 2(1).

van der Laan, M. J., & Pollard, K. S. (2003). A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference, 117(2), 275-303.

van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).

van der Laan, M. J., & Rose, S. (2011). Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.

See Also

exportLatex, print.varimpact

Examples

####################################
# Create test dataset.
set.seed(1)
N <- 100
num_normal <- 5
X <- as.data.frame(matrix(rnorm(N * num_normal), N, num_normal))
Y <- rbinom(N, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] + .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))
# Add some missing data to X so we can test imputation.
for (i in 1:10) X[sample(nrow(X), 1), sample(ncol(X), 1)] <- NA

####################################
# Basic example

vim <- varimpact(Y = Y, data = X[, 1:3])
vim
vim$results_all
exportLatex(vim)

# Explicitly specify median imputation (the default) rather than knn.
## Not run: 
vim <- varimpact(Y = Y, data = X[, 1:3], impute = "median")

## End(Not run)

####################################
# Multicore parallel example.
## Not run: 
# Setup multicore parallelization.
library(future)
plan("multisession", workers = 2)

vim <- varimpact(Y = Y, data = X[, 1:3])

## End(Not run)

####################################
# Cluster parallel example.
## Not run: 
cl <- parallel::makeCluster(2L)
future::plan("cluster", workers = cl)
vim <- varimpact(Y = Y, data = X[, 1:3])
parallel::stopCluster(cl)

## End(Not run)

####################################
# mlbench BreastCancer example.
## Not run: 
data(BreastCancer, package="mlbench")
data <- BreastCancer

set.seed(1, "L'Ecuyer-CMRG")
# Reduce to a dataset of 100 observations to speed up testing.
data <- data[sample(nrow(data), 100), ]
# Create a numeric outcome variable.
data$Y <- as.numeric(data$Class == "malignant")
# Use multicore parallelization to speed up processing.
future::plan("multisession", workers = 2)
vim <- varimpact(Y = data$Y, data = subset(data, select=-c(Y, Class, Id)))

## End(Not run)


ck37/varImpact documentation built on June 26, 2022, 4:02 a.m.