varimpact | R Documentation
varimpact
Returns variable importance statistics ordered by statistical significance, using a combination of data-adaptive target parameter estimation and cross-validated targeted maximum likelihood estimation (CV-TMLE).
varimpact(
  Y,
  data,
  A_names = colnames(data),
  V = 2L,
  Q.library = c("SL.glm", "SL.mean"),
  g.library = c("SL.glm", "SL.mean"),
  family = "binomial",
  minYs = 15L,
  minCell = 0L,
  adjust_cutoff = 10L,
  corthres = 0.8,
  impute = "median",
  miss.cut = 0.5,
  bins_numeric = 10L,
  quantile_probs_factor = c(0.1, 0.9),
  quantile_probs_numeric = quantile_probs_factor,
  verbose = FALSE,
  verbose_tmle = FALSE,
  verbose_reduction = FALSE,
  parallel = TRUE,
  digits = 4L
)
Y |
Outcome of interest (numeric vector).
data |
Data frame (or possibly matrix) of predictor variables of interest, for which the function returns variable importance measures (VIMs).
A_names |
Names of the variables for which we want to estimate importance, a subset of the data argument. |
V |
Number of cross-validation folds. |
Q.library |
Library of algorithms used by SuperLearner to model the outcome as a function of the predictors.
g.library |
Library of algorithms used by SuperLearner to model each predictor variable of interest as a function of the other predictors.
family |
Outcome family: 'binomial' or 'gaussian'.
minYs |
Minimum number of observations with the event; if the count is below minYs, the VIM is skipped.
minCell |
Cut-off for including a category of A in the analysis: the minimum cell count in the 2x2 table of the indicator for that level versus the outcome, computed separately in the training and validation samples.
adjust_cutoff |
Maximum number of adjustment variables used during TMLE. If there are more than this cutoff, varimpact will attempt to reduce the dimension to that number (using HOPACH hierarchical clustering). Must not exceed 15 due to HOPACH constraints. Set to NULL to disable dimension reduction.
corthres |
Cut-off correlation with the explanatory variable for inclusion of an adjustment variable.
impute |
Type of missing value imputation to conduct. One of: "zero", "median" (default), "knn". Note: knn causes the covariate data to be centered and scaled.
miss.cut |
Eliminates explanatory (X) variables with a proportion of missing observations greater than this cut-off.
bins_numeric |
Number of bins used when discretizing numeric variables.
quantile_probs_factor |
Quantiles used to check if factors have sufficient variation. |
quantile_probs_numeric |
Quantiles used to check if numerics have sufficient variation. |
verbose |
Boolean - if TRUE the method will display more detailed output. |
verbose_tmle |
Boolean - if TRUE, will display even more detail on the TMLE estimation process. |
verbose_reduction |
Boolean - if TRUE, will display more detail during variable reduction step (clustering). |
parallel |
Use parallel processing if a backend is registered; enabled by default. |
digits |
Number of digits to round the value labels. |
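As a rough illustration of the quantile_probs_factor / quantile_probs_numeric screens: a variable can be treated as having sufficient variation when its lower and upper quantiles differ. The sketch below shows that idea only; has_variation is a hypothetical helper, not a varimpact internal.

```r
# Hypothetical quantile-based variation screen (illustration only, not
# varimpact's internal code): a variable passes when its lower and upper
# quantiles differ, i.e. it is not (nearly) constant.
has_variation <- function(x, probs = c(0.1, 0.9)) {
  q <- quantile(x, probs = probs, na.rm = TRUE)
  unname(q[1] != q[2])
}

has_variation(rep(5, 100))                  # FALSE: constant variable
has_variation(seq(0, 1, length.out = 100))  # TRUE: varies across quantiles
```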
The function performs the following steps:
Drops variables that are missing more than miss.cut proportion of the time (tunable).
Separates the covariates into factors and continuous (ordered) variables.
Drops variables whose distribution is too uneven - e.g., nearly all one value (tunable) - separately for factor and numeric variables.
Creates a dummy variable basis for the factors, naming the dummies so they can be traced back to the original factor variables later.
Creates a new ordered variable of integers mapped to intervals defined by deciles for each ordered numeric variable; automatically uses fewer categories if the original variable has fewer than 10 unique values.
Creates an associated list of the number of unique values, and the values themselves, for each variable, for use in the variable importance step.
Creates a missing-data indicator basis for both factor and ordered variables.
For each variable, after assigning it as A, uses an optimal histogram function to combine values based on the distribution of A | Y = 1, so as to avoid very small cell sizes in the distribution of Y versus A (tunable).
Uses HOPACH* to cluster the variables of the associated confounder/missingness basis W, respecting the specified maximum number of adjustment variables.
Finds the minimum and maximum estimates of E(Ya) with respect to a, after looping through all values of A* (as processed by the histogram step).
Returns an estimate of E(Ya(max) - Ya(min)) with its standard error, using CV-TMLE.
*HOPACH stands for "Hierarchical Ordered Partitioning and Collapsing Hybrid".
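The factor and numeric preprocessing steps above can be sketched roughly as follows. This is an illustrative outline of the ideas, not varimpact's internal implementation.

```r
# Illustrative sketch of the preprocessing described above (not
# varimpact internals).

# 1. Dummy variable basis for a factor, with column names traceable to
#    the original factor variable.
fac <- factor(c("a", "b", "c", "a"))
dummies <- model.matrix(~ fac - 1)   # one indicator column per level
colnames(dummies)                    # names prefixed by the factor name

# 2. Decile binning of a numeric variable into ordered integer codes,
#    using fewer bins when there are fewer than 10 unique values.
x <- c(1:20, NA)
n_bins <- min(10L, length(unique(na.omit(x))))
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1),
                          na.rm = TRUE))
x_binned <- as.integer(cut(x, breaks = breaks, include.lowest = TRUE))

# 3. Missing-data indicator basis.
miss_basis <- is.na(x) * 1L
```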
Results object of class "varimpact".
Alan E. Hubbard and Chris J. Kennedy, University of California, Berkeley
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), 289-300.
Gruber, S., & van der Laan, M. J. (2012). tmle: An R Package for Targeted Maximum Likelihood Estimation. Journal of Statistical Software, 51(i13).
Hubbard, A. E., Kherad-Pajouh, S., & van der Laan, M. J. (2016). Statistical Inference for Data Adaptive Target Parameters. The international journal of biostatistics, 12(1), 3-19.
Hubbard, A., Munoz, I. D., Decker, A., Holcomb, J. B., Schreiber, M. A., Bulger, E. M., ... & Rahbar, M. H. (2013). Time-Dependent Prediction and Evaluation of Variable Importance Using SuperLearning in High Dimensional Clinical Data. The journal of trauma and acute care surgery, 75(1 0 1), S53.
Hubbard, A. E., & van der Laan, M. J. (2016). Mining with inference: data-adaptive target parameters (pp. 439-452). In P. Buhlmann et al. (Ed.), Handbook of Big Data. CRC Press, Taylor & Francis Group, LLC: Boca Raton, FL.
van der Laan, M. J. (2006). Statistical inference for variable importance. The International Journal of Biostatistics, 2(1).
van der Laan, M. J., & Pollard, K. S. (2003). A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference, 117(2), 275-303.
van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).
van der Laan, M. J., & Rose, S. (2011). Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.
exportLatex, print.varimpact
####################################
# Create test dataset.
set.seed(1)
N <- 100
num_normal <- 5
X <- as.data.frame(matrix(rnorm(N * num_normal), N, num_normal))
Y <- rbinom(N, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] +
                         .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))
# Add some missing data to X so we can test imputation.
for (i in 1:10) X[sample(nrow(X), 1), sample(ncol(X), 1)] <- NA

####################################
# Basic example
vim <- varimpact(Y = Y, data = X[, 1:3])
vim
vim$results_all
exportLatex(vim)

# Explicitly impute by median (the default) rather than knn.
## Not run: 
vim <- varimpact(Y = Y, data = X[, 1:3], impute = "median")
## End(Not run)

####################################
# Multicore parallel example.
## Not run: 
# Setup multicore parallelization.
library(future)
plan("multisession", workers = 2)
vim <- varimpact(Y = Y, data = X[, 1:3])
## End(Not run)

####################################
# Cluster parallel example.
## Not run: 
cl <- parallel::makeCluster(2L)
plan(cluster, workers = cl)
vim <- varimpact(Y = Y, data = X[, 1:3])
parallel::stopCluster(cl)
## End(Not run)

####################################
# mlbench BreastCancer example.
## Not run: 
data(BreastCancer, package = "mlbench")
data <- BreastCancer
set.seed(1, "L'Ecuyer-CMRG")
# Reduce to a dataset of 100 observations to speed up testing.
# Create a numeric outcome variable.
data$Y <- as.numeric(data$Class == "malignant")
# Use multicore parallelization to speed up processing.
# ("multiprocess" is deprecated in the future package; use "multisession".)
future::plan("multisession", workers = 2)
vim <- varimpact(Y = data$Y, data = subset(data, select = -c(Y, Class, Id)))
## End(Not run)