```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/vignette_"
)
```
Researchers are often interested in understanding which set of thresholds in a mixed exposure is safest. That is, if a policy were implemented to reduce an exposure or set of exposures below some limits, which intervention would have the largest risk-reducing effect?
The problem, of course, is that this set of thresholds is not known beforehand. It is therefore necessary both to find the set of thresholds that, controlling for other exposures and covariates, would result in the lowest expected outcome, and to estimate the effect of intervening on that set.
This calls for data-adaptive target parameters estimated with cross-validation: in one part of the data we find the minimizing region, and in another we estimate the effect of a policy intervention for that region.
Data-adaptive target parameters constitute a flexible framework for estimating the effect of a parameter whose definition is itself learned from the data. There is a substantial literature on the dangers of defining parameters data-adaptively, and the common approach for obtaining valid inference for a data-adaptively defined parameter is sample splitting: the researcher splits the full data into a training set used to define the parameter and an estimation sample in which estimates are derived for the parameter identified in training. This approach can be costly when new samples must be collected for the estimation data, or loses power when the available data are simply split in two. Our proposed approach, a decision tree applied to mixtures, preserves the data-adaptive part of the sample-splitting algorithm but defines the target as an average of the data-adaptive parameter estimates across the estimation samples of K-fold cross-validation. In this way, one can still use the power of the entire dataset.
The CVtreeMLE package implements decision tree algorithms for finding thresholds in exposures that maximize or minimize the expected outcome while flexibly adjusting for covariates. Each node in the decision tree is represented as a binary exposure rule for which targeted minimum loss-based estimates (TMLE) are derived: the difference between the counterfactual mean outcome if all individuals were exposed according to the rule and the mean outcome under the observed exposure distribution. We call this estimate the average regional exposure effect (ARE).
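To fix notation (ours, not necessarily the package's internal notation), let $R$ denote the exposure region identified by the tree, $d_R$ the hypothetical intervention that places everyone's exposures inside $R$, and $Y_{d_R}$ the corresponding counterfactual outcome. The ARE described above can then be written as

$$\text{ARE} = E\big[Y_{d_R}\big] - E[Y],$$

the difference between the counterfactual mean outcome if all individuals were exposed according to the rule and the mean outcome under the observed exposure distribution.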
Two types of results are given for the region found to minimize or maximize the expected outcome: k-fold-specific results, which report the ARE and its variance estimate for the region found in each fold, and a pooled ARE, which averages across folds to gain power (reduce variance). The pooled ARE estimates the oracle target parameter, the region that maximizes or minimizes the expected outcome. The region found can of course change across folds, so if the estimated regions are inconsistent, the pooled estimate amalgamates different regions. However, if there is strong signal for a particular region, the same exposures will be selected in each fold and the pooled parameter gains power by leveraging the k-fold-specific estimates.
We report both so that researchers can examine how consistent the rules are across sample splits, which adds context to the pooled result.
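As a rough illustration of the relationship between the two summaries, the sketch below averages hypothetical fold-specific ARE estimates. The numbers and the simple variance formula are purely illustrative assumptions; CVtreeMLE itself pools estimates and variances through the CV-TMLE machinery.

```r
# Hypothetical fold-specific ARE estimates and standard errors (illustration only)
fold_are <- c(0.12, 0.09, 0.15, 0.11, 0.13)
fold_se  <- c(0.06, 0.07, 0.05, 0.06, 0.06)

# Naive pooled estimate: the simple average of the fold-specific estimates
pooled_are <- mean(fold_are)

# Naive pooled standard error, treating folds as independent splits of equal size;
# the actual pooled CV-TMLE variance comes from the pooled influence curve
pooled_se <- sqrt(mean(fold_se^2) / length(fold_se))

c(pooled_are = pooled_are, pooled_se = pooled_se)
```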
For a technical presentation, the interested reader is invited to consult @mccoy2022CVtreeMLE or the earlier work of @vdl2011targeted, @vdl2018targeted, and @vdl2022targeted. For more background on data-adaptive target parameters see @Hubbard2016 and chapter 9 in @vdl2018targeted.
Briefly, the package proceeds as follows. Given the provided data, it first partitions the data into folds based on the n_folds parameter. In each training fold, a custom decision tree greedily partitions the exposure space; at each candidate split it uses g-computation to estimate the expected outcome in that region, controlling for covariates, and it retains the region that maximizes or minimizes the expected outcome. Then, for the region found in each fold, the corresponding estimation fold is used to estimate the effect of forcing everyone into that region compared to the observed outcome under the observed exposure distribution; this is the ARE. The fold-specific estimates are also pooled across folds. The whole procedure is carried out with CV-TMLE, and the output is the set of thresholds and the accompanying ARE for the region with the maximum or minimum expected outcome. A schematic of this cross-validated procedure is sketched below.
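The stub code below is a minimal sketch of that loop, assuming toy data and two placeholder helpers, `find_region()` and `estimate_are()`, which stand in for the package's internal g-computation tree search and TMLE step (neither is a CVtreeMLE function):

```r
# Schematic sketch of the cross-validated procedure (illustration only).
# `find_region()` and `estimate_are()` are trivial stand-ins for the package's
# internals: the real tree search uses g-computation and the real effect
# estimate uses TMLE. The toy data frame exists only so the sketch runs.
toy_data <- data.frame(a1 = rnorm(100), a2 = rnorm(100), w = rnorm(100), y = rnorm(100))
find_region  <- function(train) list(rule = "a1 > 0.5 & a2 < 1.2")    # placeholder rule
estimate_are <- function(est, region) c(are = NA_real_, se = NA_real_) # placeholder estimate

n_folds <- 5
folds <- sample(rep(seq_len(n_folds), length.out = nrow(toy_data)))

fold_sketch <- lapply(seq_len(n_folds), function(k) {
  train <- toy_data[folds != k, ]  # training split: find the best region here
  est   <- toy_data[folds == k, ]  # estimation split: estimate the ARE for that region
  region <- find_region(train)
  estimate_are(est, region)
})
# Fold-specific estimates are then pooled across folds (the CV-TMLE step)
```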
To start, let's load the packages we'll need and set a seed for reproducibility. We will use a real-world data example to show the functionality of CVtreeMLE.
```r
library(data.table)
library(CVtreeMLE)
library(sl3)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(purrr)

seed <- 5454433
set.seed(seed)
```
Here we will try to replicate the analyses by Gibson et al. (https://ehjournal.biomedcentral.com/articles/10.1186/s12940-019-0515-1) and Mitro et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4858394/), who used NHANES 2001-2002 data to investigate the effects of 18 persistent organic pollutant (POP) exposures on telomere length.
Because reduced telomere length is associated with greater biological aging and higher morbidity, we will try to identify the region in the POP space that would maximize telomere length. That is, we want to find the region of exposure that, if we regulated those chemicals accordingly, would result in the greatest increase in telomere length.
Let's first load the data and investigate the variables:
data("NHANES_eurocim") exposures <- c("LBX074LA", # PCB74 Lipid Adj (ng/g) "LBX099LA", # PCB99 Lipid Adj (ng/g) "LBX118LA", # PCB118 Lipid Adj (ng/g) "LBX138LA", # PCB138 Lipid Adj (ng/g) "LBX153LA", # PCB153 Lipid Adj (ng/g) "LBX170LA", # PCB170 Lipid Adj (ng/g) "LBX180LA", # PCB180 Lipid Adj (ng/g) "LBX187LA", # PCB187 Lipid Adj (ng/g) "LBX194LA", # PCB194 Lipid Adj (ng/g) "LBXD03LA", # 1,2,3,6,7,8-hxcdd Lipid Adj (pg/g) "LBXD05LA", # 1,2,3,4,6,7,8-hpcdd Lipid Adj (pg/g) "LBXD07LA", # 1,2,3,4,6,7,8,9-ocdd Lipid Adj (pg/g) "LBXF03LA", # 2,3,4,7,8-pncdf Lipid Adj (pg/g) "LBXF04LA", # 1,2,3,4,7,8-hxcdf Lipid Adj (pg/g) "LBXF05LA", # 1,2,3,6,7,8-hxcdf Lipid Adj (pg/g) "LBXF08LA", # 1,2,3,4,6,7,8-hxcdf Lipid Adj (pg/g) "LBXHXCLA", # 3,3',4,4',5,5'-hxcb Lipid Adj (pg/g) "LBXPCBLA") # 3,3',4,4',5-pcnb Lipid Adj (pg/g) NHANES_eurocim <- NHANES_eurocim[complete.cases(NHANES_eurocim[, exposures]), ] outcome <- "TELOMEAN" covariates <- c("LBXWBCSI", # White blood cell count (SI) "LBXLYPCT", # Lymphocyte percent (%) "LBXMOPCT", # Monocyte percent (%) "LBXEOPCT", # Eosinophils percent (%) "LBXBAPCT", # Basophils percent (%) "LBXNEPCT", # Segmented neutrophils percent (%) "male", # Sex "age_cent", # Age at Screening, centered "race_cat", # race "bmi_cat3", # Body Mass Index (kg/m**2) "ln_lbxcot", # Cotinine (ng/mL), log-transformed "edu_cat") # Education Level - Adults 20+
To improve consistency in the region we identify as maximizing, it is best to keep only one exposure from each set of highly correlated exposures; we do that here:
```r
# Calculate the correlation matrix for the exposures
cor_matrix <- cor(NHANES_eurocim[, exposures], use = "complete.obs")

# Set a threshold for high correlation
threshold <- 0.8

# Find pairs of highly correlated exposures
highly_correlated_pairs <- which(abs(cor_matrix) > threshold & lower.tri(cor_matrix),
                                 arr.ind = TRUE)

# Initiate a vector to keep track of exposures to remove
exposures_to_remove <- c()

# Loop through the highly correlated pairs and decide which exposure to remove
for (pair in seq_len(nrow(highly_correlated_pairs))) {
  row <- highly_correlated_pairs[pair, "row"]
  col <- highly_correlated_pairs[pair, "col"]
  if (!(colnames(cor_matrix)[row] %in% exposures_to_remove)) {
    exposures_to_remove <- c(exposures_to_remove, colnames(cor_matrix)[row])
  }
}

# Keep only uncorrelated exposures
exposures_to_keep <- setdiff(exposures, exposures_to_remove)
```
```r
# Round the exposure values to one decimal place (coarsens the exposure grid)
NHANES_eurocim[, exposures_to_keep] <- apply(NHANES_eurocim[, exposures_to_keep],
                                             2, round, 1)

nhanes_results <- CVtreeMLE(
  data = NHANES_eurocim, # data frame
  w = covariates,
  a = exposures_to_keep,
  y = outcome,
  n_folds = 5,
  seed = 34421,
  parallel_cv = TRUE,
  parallel = TRUE,
  family = "continuous",
  num_cores = 8,
  min_max = "max",
  min_obs = 20,
  p_val_thresh = 1
)
```
Let's first look at the results for each fold:
```r
k_fold_results <- nhanes_results$`V-Specific Mix Results`

k_fold_results %>%
  kableExtra::kbl(caption = "K-fold Results") %>%
  kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria")
```
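For a quick visual check of how consistent the fold-specific estimates are, a small plot such as the sketch below can help. The column names `ARE`, `Lower_CI`, and `Upper_CI` are placeholders, not guaranteed output names; inspect `colnames(k_fold_results)` for the names your version of the package actually returns.

```r
# Fold-specific point estimates with confidence intervals (column names below
# are placeholders; check colnames(k_fold_results) for the actual names)
k_fold_results %>%
  dplyr::mutate(fold = dplyr::row_number()) %>%
  ggplot2::ggplot(ggplot2::aes(x = factor(fold), y = ARE)) +
  ggplot2::geom_point() +
  ggplot2::geom_errorbar(ggplot2::aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
  ggplot2::labs(x = "Fold", y = "Fold-specific ARE")
```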
The pooled result is most interpretable when the fold-specific regions involve the same variables, because it is then an average of estimates for regions defined on those variables:
```r
pooled_mixture_results <- nhanes_results$`Oracle Region Results`

pooled_mixture_results %>%
  kableExtra::kbl(caption = "Oracle Mixture Results") %>%
  kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria")
```
We can also examine the pooled estimates for each exposure set, along with the rules averaged within each set:
```r
region_specific_pooling <- nhanes_results$`Pooled TMLE Mixture Results`

region_specific_pooling %>%
  kableExtra::kbl(caption = "Region Specific Mixture Results") %>%
  kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria")
```
Because we identify thresholds in the mixture space data-adaptively, using the training data as we rotate through the folds, CVtreeMLE will give more consistent results (when a signal actually exists) with higher values of n_folds. We therefore recommend 10-fold CV when possible for more consistent and interpretable results.
CVtreeMLE uses ensemble machine learning, which is computationally demanding. The utils_create_sls.R function creates lean yet non-parametric ensemble learners for each parameter: for example, glm, elastic net, random forest, and xgboost learners are created for the nuisance parameters, and decision trees of various depths are created for the tree-fitting Super Learner. Users are also welcome to pass their own stacks of learners to CVtreeMLE if they feel the default estimators are insufficient given the complexity of the data.
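For example, a custom sl3 stack might look like the sketch below. The learners shown are standard sl3 learners; check the CVtreeMLE documentation for the exact argument name(s) used to pass custom stacks, since that detail is not shown here.

```r
# A minimal custom sl3 stack (illustration only); see ?CVtreeMLE for the exact
# argument(s) that accept user-supplied learner stacks
custom_stack <- sl3::Stack$new(
  sl3::Lrnr_glm$new(),                   # main-terms GLM
  sl3::Lrnr_glmnet$new(),                # elastic net
  sl3::Lrnr_ranger$new(num.trees = 500), # random forest
  sl3::Lrnr_xgboost$new(nrounds = 100)   # gradient boosting
)
```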
Additionally, to help with computational time, CVtreeMLE uses the future package for sequential and parallel processing. The default behavior is to parallelize across the folds (parallel_cv = TRUE). The default parallelization type is multisession, because we expect most users to be working in RStudio, where multicore is not supported. A user should therefore expect multiple cores to be in use, the number equal to num_cores, each with high CPU usage because the data and models are copied to each worker; this differs from multicore, where the data are not copied. As we saw above, in this example a dataset with 500 observations and 7 exposures took about 5 minutes with 5 folds. Thus, with the 10-fold CV we recommend on a standard environmental epidemiology dataset of 500-1000 observations, CVtreeMLE should run easily on a modern local machine.