CVtreeMLE: Fit ensemble decision trees to a vector of exposures and use targeted maximum likelihood estimation to determine the average treatment effect in each leaf of the best-fitting tree

View source: R/CVtreeMLE.R

CVtreeMLE    R Documentation

Fit ensemble decision trees to a vector of exposures and use targeted maximum likelihood estimation to determine the average treatment effect in each leaf of the best-fitting tree

Description

Fit ensemble decision trees to a mixed exposure while controlling for covariates using iterative backfitting of two Super Learners. If partitioning nodes are identified, these partitions are used as a rule-based exposure. The CV-TMLE framework is used to create training and estimation samples: trees are fit on the training folds, and the average treatment effect (ATE) of the rule-based exposure is estimated in the validation folds. Any type of mixed exposure (continuous, binary, multinomial) is accepted. The ATE for multiple mixture components (interactions) is given, as well as marginal effects if data-adaptively identified.

Usage

CVtreeMLE(
  w,
  a,
  y,
  data,
  w_stack = NULL,
  aw_stack = NULL,
  a_stack = NULL,
  n_folds,
  seed = 6442,
  family,
  parallel = TRUE,
  parallel_cv = TRUE,
  parallel_type = "multi_session",
  num_cores = 2,
  h_aw_trunc_lvl = 50,
  pooled_rule_type = "average",
  min_max = "min",
  region = NULL,
  min_obs = 25
)

Arguments

w

A character vector indicating which variables in the data to use as baseline covariates.

a

A character vector indicating which variables in the data to use as exposures.

y

A character indicating which variable in the data to use as the outcome.

data

Data frame of (W,A,Y) variables of interest.

w_stack

Stack of estimators used in the Super Learner during the iterative backfitting for Y|W; this should be an sl3 stack. If not provided, utils_create_sls is used to create default estimators used in the ensemble.

aw_stack

Stack of estimators used in the Super Learner for the Q and g mechanisms. If not provided, utils_create_sls is used to create default estimators used in the ensemble.

a_stack

Stack of estimators used in the Super Learner during the iterative backfitting for Y|A; this should be an sl3 stack. If not provided, utils_create_sls is used to create default decision tree estimators used in the ensemble.

n_folds

Number of cross-validation folds.

seed

Seed number for reproducibility of results; defaults to 6442.

family

Family ('binomial' or 'continuous').

parallel

Use parallel processing if a backend is registered; enabled by default.

parallel_cv

If TRUE, parallelize the CV procedure rather than the Super Learner model fitting.

parallel_type

If parallel is TRUE, the type of parallelization to use: "multi_session" (the default) or "multicore".

num_cores

If using parallel, the number of cores to parallelize over.

h_aw_trunc_lvl

Level at which to truncate the clever covariate in order to control variance; the default is 50.

pooled_rule_type

Either "average" or "union": how to construct the pooled rule across folds. The average rule takes the average of the fold-specific cutpoints and returns an average rule with lower and upper bounds for each cutpoint. The union rule creates a new rule covering the space that contains all the rules found across the folds and is therefore more conservative.
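As a hypothetical illustration of the two pooling strategies described above (the cutpoint values and the (lower, upper) pair representation are invented for this sketch, not the package's internal format):

```python
# Fold-specific rules for one exposure, each as an invented (lower, upper) cutpoint pair.
fold_rules = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.85)]

# "average": average the fold-specific cutpoints
avg_lower = sum(lo for lo, _ in fold_rules) / len(fold_rules)
avg_upper = sum(hi for _, hi in fold_rules) / len(fold_rules)
average_rule = (round(avg_lower, 6), round(avg_upper, 6))

# "union": the region containing every fold-specific rule (more conservative)
union_rule = (min(lo for lo, _ in fold_rules), max(hi for _, hi in fold_rules))

print(average_rule)  # (0.15, 0.85)
print(union_rule)    # (0.1, 0.9)
```

The union rule is wider than any single fold's rule, which is why the documentation describes it as the more conservative choice.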

min_max

Which oracle region to target: the one that minimizes ("min") or maximizes ("max") the outcome.

region

If a predetermined region is of interest, specify it here, e.g. "A < 0.02".

min_obs

Minimum number of observations to have in a region.
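A hypothetical sketch of how the region and min_obs arguments interact: a pre-specified rule such as "A1 < 0.02" defines a region, which is only usable if it contains at least min_obs observations (the values below are invented for illustration):

```python
# Invented exposure values for illustration
a1 = [0.01, 0.05, 0.015, 0.3, 0.002, 0.018]
min_obs = 3

# Apply the rule "A1 < 0.02" as a membership indicator
in_region = [v < 0.02 for v in a1]
n_in_region = sum(in_region)

# The region is only acceptable if it contains enough observations
region_ok = n_in_region >= min_obs
print(n_in_region, region_ok)  # 4 True
```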

Details

The function performs the following steps:

  1. Imputes missing values with the mean and creates dummy indicator variables for imputed variables.

  2. Separates covariates into factors and continuous (ordered) variables.

  3. Creates a variable indicating the fold number assigned to each observation.

  4. Fits the iterative backfitting algorithm to the mixed exposure: ensemble decision trees are applied to the mixed exposure and an unrestricted Super Learner to the covariates. The two models are refit, each offset by its complement, until there is virtually no difference between the model fits. Partition nodes found for the mixture are extracted. This is done on each training fold.

  5. Fits the same iterative backfitting algorithm to each individual mixture component, applying ensemble decision trees to the component and an unrestricted Super Learner to the covariates, and extracts the partition nodes found for each component. This is done on each training fold.

  6. Estimates nuisance parameters (Q and g estimates) for the mixture interaction rules.

  7. Estimates nuisance parameters (Q and g estimates) for the marginal rules.

  8. Estimates the Q outcome mechanism over all the marginal rules, so that targeted ATEs can later be computed from user input for different marginal combinations based on data-adaptively identified thresholds.

  9. Uses the mixture rules and data in a TMLE fluctuation step to target the ATE for each rule across all the folds, and calculates the proportion of folds in which the rule is found.

  10. Uses the marginal rules and data in a TMLE fluctuation step to target the ATE for each rule across all the folds, and calculates the proportion of folds in which the rule is found.

  11. Calculates V-fold-specific TMLE estimates of the rules.

  12. For the mixture rules, calculates a union rule: the rule that covers all the observations captured across the folds by the fold-specific rules for the respective variable set.

  13. For the marginal rules, calculates a union rule: the rule that covers all the observations captured across the folds by the fold-specific rules for the respective variable set.
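The backfitting idea in steps 4-5 can be sketched as follows. This is an illustrative toy, not the package's implementation: two simple linear fits stand in for the decision-tree and Super Learner ensembles, and each is repeatedly refit to the outcome offset by its complement until the fits stabilize.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)                            # stand-in exposure
w = rng.normal(size=n)                            # stand-in covariate
y = 2.0 * a - 1.5 * w + rng.normal(scale=0.1, size=n)

f_a = np.zeros(n)   # component fit on the exposure (trees in the real algorithm)
g_w = np.zeros(n)   # component fit on the covariates (Super Learner in the real algorithm)
for _ in range(20):
    # Refit each component to the outcome offset by its complement
    coef_a = np.polyfit(a, y - g_w, 1)
    f_a = np.polyval(coef_a, a)
    coef_w = np.polyfit(w, y - f_a, 1)
    g_w = np.polyval(coef_w, w)

# After convergence the residual is close to the simulated noise level
residual_sd = float(np.std(y - (f_a + g_w)))
```

Once the exposure-side fit has stabilized, its structure (in CVtreeMLE, the tree's partition nodes) is extracted as the candidate rule.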

Value

Object of class CVtreeMLE, containing a list of table results for: marginal ATEs, mixture ATEs, RMSE of marginal model fits, RMSE of mixture model fits, marginal rules, and mixture rules.

  • Model RMSEs: Root mean square error for marginal and interaction models in the iterative backfitting procedure

  • Pooled TMLE Marginal Results: Data frame of pooled ATE results, estimated with TMLE, for the thresholds identified for each mixture component found.

  • V-Specific Marg Results: A list of the v-fold marginal results. These are grouped by variable and direction of the ATE.

  • Pooled TMLE Mixture Results: Data frame of pooled TMLE Mixture Results

  • V-Specific Mix Results: A list of the v-fold mixture results. These are grouped by variable and direction of the ATE.

  • Pooled Marginal Refs: A data frame of the reference categories determined in each of the marginal results.

  • Marginal Rules: A data frame of the marginal rules, with details on the folds in which each rule was found and its RMSE.

  • Mixture Rules: A data frame of the mixture rules, with details on the folds in which each rule was found and its RMSE.

Authors

David McCoy, University of California, Berkeley

References

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), 289-300.

Gruber, S., & van der Laan, M. J. (2012). tmle: An R Package for Targeted Maximum Likelihood Estimation. Journal of Statistical Software, 51(i13).

Hubbard, A. E., Kherad-Pajouh, S., & van der Laan, M. J. (2016). Statistical Inference for Data Adaptive Target Parameters. The international journal of biostatistics, 12(1), 3-19.

Hubbard, A., Munoz, I. D., Decker, A., Holcomb, J. B., Schreiber, M. A., Bulger, E. M., ... & Rahbar, M. H. (2013). Time-Dependent Prediction and Evaluation of Variable Importance Using SuperLearning in High Dimensional Clinical Data. The journal of trauma and acute care surgery, 75(1 0 1), S53.

Hubbard, A. E., & van der Laan, M. J. (2016). Mining with inference: data-adaptive target parameters (pp. 439-452). In P. Buhlmann et al. (Ed.), Handbook of Big Data. CRC Press, Taylor & Francis Group, LLC: Boca Raton, FL.

van der Laan, M. J. (2006). Statistical inference for variable importance. The International Journal of Biostatistics, 2(1).

van der Laan, M. J., & Pollard, K. S. (2003). A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference, 117(2), 275-303.

van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).

van der Laan, M. J., & Rose, S. (2011). Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.

Examples

# Simulate two exposures (A1, A2) and two covariates (W1, W2)
n <- 800
p <- 4
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- c("A1", "A2", "W1", "W2")

# Binary outcome depending non-linearly on A1, A2, and W2
y_prob <- plogis(3 * sin(x[, 1]) + sin(x[, 2]) + sin(x[, 4]))
Y <- rbinom(n = n, size = 1, prob = y_prob)
data <- as.data.frame(cbind(x, Y))

CVtreeMLE_fit <- CVtreeMLE(
  data = data,
  w = c("W1", "W2"),
  a = c("A1", "A2"),
  y = "Y",
  family = "binomial",
  parallel = FALSE,
  n_folds = 2
)


blind-contours/CVtreeMLE documentation built on June 22, 2024, 8:53 p.m.