knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/vignette_"
)

Introduction

Researchers are often interested in understanding which set of thresholds in a mixed exposure is safest. That is, if a policy were implemented to reduce an exposure or set of exposures below certain limits, which intervention would reduce the expected outcome the most?

The problem, of course, is that this set of thresholds is not known beforehand. Therefore, it is necessary both to find the set of thresholds that, controlling for other exposures and covariates, would result in the lowest expected outcome, and to estimate the effect of intervening on that set.

This calls for data-adaptive target parameters estimated with cross-validation: in one part of the data we find the minimizing region, and in the other we estimate the effect of a policy intervention on that region.

Data Adaptive Target Parameters

Data-adaptive target parameters provide a flexible framework for estimating the effect of a parameter whose definition is itself learned from the data. There is a sizable literature on the dangers of defining parameters data-adaptively, and the standard approach for obtaining valid inference is sample splitting: the researcher splits the full data into a training set used to define the parameter and an estimation set in which estimates are derived for the parameter identified in training. This approach can be costly when new samples must be collected for the estimation data, or it loses power when the available data are simply split in two. Our proposed approach, a decision tree applied to mixtures, preserves the data-adaptive part of the sample-splitting algorithm but defines the target as an average of the data-adaptive parameter estimates across the estimation samples of K-fold cross-validation. In this way, one can still use the power of the entire dataset. The CVtreeMLE package implements decision tree algorithms that find thresholds in the exposures which maximize or minimize the expected outcome while flexibly adjusting for covariates. Each node in the decision tree defines a binary exposure rule, for which a targeted minimum loss-based estimate (TMLE) is derived of the counterfactual mean outcome difference: the expected outcome if all individuals were exposed according to the rule, compared to the observed outcome under the observed exposure distribution. We call this estimate the ARE, the average regional exposure effect.
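As a rough sketch in informal notation (ours, not drawn verbatim from the references cited below), with exposures $A$, covariates $W$, outcome $Y$, and a region $R$ of the exposure space, the ARE can be written as

$$\text{ARE}(R) = \mathbb{E}_W\big\{\mathbb{E}[Y \mid A \in R, W]\big\} - \mathbb{E}[Y],$$

the covariate-standardized expected outcome if everyone's exposure were set inside the region, minus the mean outcome under the observed exposure distribution.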

Output

Two types of results are given for the region found to minimize or maximize the expected outcome. The k-fold-specific results report an ARE and variance estimate for each fold-specific region. The pooled ARE averages across the folds to gain power (reduce variance) and estimates the oracle target parameter, the region that maximizes or minimizes the expected outcome. The identified region can change across folds, so if the region is estimated inconsistently the pooled estimate is an amalgamation of different regions. However, if there is strong signal for a particular region, the same exposures will define the region in every fold and the pooled parameter gains power by leveraging the k-fold-specific estimates.
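In the same informal notation, letting $\hat{R}_k$ denote the region learned from the training data of fold $k$ (with its ARE estimated on fold $k$'s estimation sample), the pooled parameter is simply the average across the $K$ folds:

$$\text{ARE}_{\text{pooled}} = \frac{1}{K}\sum_{k=1}^{K} \text{ARE}(\hat{R}_k).$$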

We report both so that researchers can see how consistent the rules are across sample splits, which adds context to the pooled result.

For a technical presentation, the interested reader is invited to consult @mccoy2022CVtreeMLE or the earlier work of @vdl2011targeted, @vdl2018targeted, and @vdl2022targeted. For more background on data-adaptive target parameters see @Hubbard2016 and chapter 9 in @vdl2018targeted.

Mechanisms

Briefly, the package first partitions the provided data into folds based on the n_folds argument. In each training fold, a custom decision tree greedily partitions the exposure space; at each candidate partition it uses g-computation to estimate the expected outcome in that region while controlling for covariates, and it keeps the region that maximizes or minimizes the expected outcome. Then, for the region identified in each fold, we estimate the effect of forcing everyone into that region compared to the observed outcome under the observed exposure distribution; this is the ARE. This is done for each fold, and the fold-specific estimates are also pooled across folds. The whole procedure is carried out with CV-TMLE. The output is the set of thresholds, and the accompanying ARE, for the region with the maximum or minimum expected outcome.
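To make the g-computation step concrete, here is a minimal sketch (not the package's internal code, which uses Super Learner rather than a single glm) of how the covariate-adjusted expected outcome for one candidate region could be estimated; all names in the usage example are placeholders:

# Minimal g-computation sketch for one candidate region (illustration only)
g_comp_region <- function(data, outcome, covariates, in_region) {
  # in_region: logical vector flagging observations inside the candidate region
  data$in_region <- as.numeric(in_region)
  fit <- glm(
    reformulate(c("in_region", covariates), response = outcome),
    data = data, family = gaussian()
  )
  # Predict for everyone as if they were in the region, then average:
  # a plug-in estimate of E_W{ E[Y | A in region, W] }
  mean(predict(fit, newdata = transform(data, in_region = 1), type = "response"))
}

# Hypothetical usage (my_data, y, w1, w2, a1, a2 are placeholder names):
# g_comp_region(my_data, "y", c("w1", "w2"),
#               my_data$a1 <= 1.5 & my_data$a2 <= 2.0)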

Working Example

To start, let's load the packages we'll need and set a seed for reproducibility. We will use a real-world data example to show the functionality of CVtreeMLE.

library(data.table)
library(CVtreeMLE)
library(sl3)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(purrr)

seed <- 5454433
set.seed(seed)

NHANES Data

Here we will try to replicate the analyses of Gibson et al. (https://ehjournal.biomedcentral.com/articles/10.1186/s12940-019-0515-1) and Mitro et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4858394/), who used the NHANES 2001-2002 data to investigate the effects of 18 persistent organic pollutant (POP) exposures on telomere length.

Because reduced telomere length is associated with accelerated biological aging and higher morbidity, we will try to identify the region in the POP exposure space that maximizes telomere length. That is, we want to find the region that, if we regulated those chemicals accordingly, would result in the largest increase in telomere length.

Let's first load the data and investigate the variables:


data("NHANES_eurocim")

exposures <- c("LBX074LA", # PCB74 Lipid Adj (ng/g)
               "LBX099LA", # PCB99 Lipid Adj (ng/g)
               "LBX118LA", # PCB118 Lipid Adj (ng/g)
               "LBX138LA", # PCB138 Lipid Adj (ng/g)
               "LBX153LA", # PCB153 Lipid Adj (ng/g)
               "LBX170LA", # PCB170 Lipid Adj (ng/g)
               "LBX180LA", # PCB180 Lipid Adj (ng/g)
               "LBX187LA", # PCB187 Lipid Adj (ng/g)
               "LBX194LA", # PCB194 Lipid Adj (ng/g)
               "LBXD03LA", # 1,2,3,6,7,8-hxcdd Lipid Adj (pg/g)
               "LBXD05LA", # 1,2,3,4,6,7,8-hpcdd Lipid Adj (pg/g)
               "LBXD07LA", # 1,2,3,4,6,7,8,9-ocdd Lipid Adj (pg/g)
               "LBXF03LA", # 2,3,4,7,8-pncdf Lipid Adj (pg/g)
               "LBXF04LA", # 1,2,3,4,7,8-hxcdf Lipid Adj (pg/g)
               "LBXF05LA", # 1,2,3,6,7,8-hxcdf Lipid Adj (pg/g)
               "LBXF08LA", # 1,2,3,4,6,7,8-hxcdf Lipid Adj (pg/g)
               "LBXHXCLA", # 3,3',4,4',5,5'-hxcb Lipid Adj (pg/g)
               "LBXPCBLA") # 3,3',4,4',5-pcnb Lipid Adj (pg/g)

NHANES_eurocim <- NHANES_eurocim[complete.cases(NHANES_eurocim[, exposures]), ]

outcome <- "TELOMEAN"

covariates <- c("LBXWBCSI", # White blood cell count (SI)
                "LBXLYPCT", # Lymphocyte percent (%)
                "LBXMOPCT", # Monocyte percent (%)
                "LBXEOPCT", # Eosinophils percent (%)
                "LBXBAPCT", # Basophils percent (%)
                "LBXNEPCT", # Segmented neutrophils percent (%)
                "male", # Sex
                "age_cent", # Age at Screening, centered
                "race_cat", # race
                "bmi_cat3", # Body Mass Index (kg/m**2)
                "ln_lbxcot", # Cotinine (ng/mL), log-transformed
                "edu_cat") # Education Level - Adults 20+

To improve the consistency of the maximizing region identified across folds, it is best to keep only one exposure from each set of highly correlated exposures. We do that here:

# Calculate the correlation matrix for the exposures
cor_matrix <- cor(NHANES_eurocim[, exposures], use = "complete.obs")

# Set a threshold for high correlation
threshold <- 0.8

# Find pairs of highly correlated exposures
highly_correlated_pairs <- which(abs(cor_matrix) > threshold & lower.tri(cor_matrix), arr.ind = TRUE)

# Initialize a vector to keep track of exposures to remove
exposures_to_remove <- c()

# Loop through the highly correlated pairs and flag one exposure from each pair for removal
for (pair in seq_len(nrow(highly_correlated_pairs))) {
  row <- highly_correlated_pairs[pair, "row"]
  col <- highly_correlated_pairs[pair, "col"]

  if (!(colnames(cor_matrix)[row] %in% exposures_to_remove)) {
    exposures_to_remove <- c(exposures_to_remove, colnames(cor_matrix)[row])
  }
}

# Keep only uncorrelated exposures
exposures_to_keep <- setdiff(exposures, exposures_to_remove)

Run CVtreeMLE

# Round the exposures to one decimal place to reduce the number of unique values (candidate split points)
NHANES_eurocim[, exposures_to_keep] <- apply(NHANES_eurocim[, exposures_to_keep], 2, round, 1)

nhanes_results <- CVtreeMLE(
  data = NHANES_eurocim, ## dataframe
  w = covariates,
  a = exposures_to_keep,
  y = outcome,
  n_folds = 5,
  seed = 34421,
  parallel_cv = TRUE,
  parallel = TRUE,
  family = "continuous",
  num_cores = 8,
  min_max = "max",
  min_obs = 20,
  p_val_thresh = 1
)

Investigating Results

Let's first look at the results for each fold:

k_fold_results <- nhanes_results$`V-Specific Mix Results`

k_fold_results %>%
  kableExtra::kbl(caption = "K-fold Results") %>%
  kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria")

The pooled result is most interpretable when the same exposures define the region across folds, because it averages the fold-specific estimates over regions built on the same variables:

pooled_mixture_results <- nhanes_results$`Oracle Region Results`

pooled_mixture_results %>%
  kableExtra::kbl(caption = "Oracle Mixture Results") %>%
  kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria")

We can also examine the pooled estimates by exposure set, along with the average thresholds (rules) within each set:

region_specific_pooling <- nhanes_results$`Pooled TMLE Mixture Results`

region_specific_pooling %>%
  kableExtra::kbl(caption = "Region Specific Mixture Results") %>%
  kableExtra::kable_classic(full_width = FALSE, html_font = "Cambria")

Estimate Stability

Because we identify thresholds in the mixture space data-adaptively, using the training data as we rotate through the folds, CVtreeMLE gives more consistent results (when a signal actually exists) at higher n_folds values. We therefore recommend 10-fold CV when possible for more consistent and interpretable results.

Runtime Performance Guidelines

CVtreeMLE uses ensemble machine learning, which is computationally demanding. The utils_create_sls.R helper creates lean yet nonparametric ensemble learners for each parameter: glm, elastic net, random forest, and xgboost learners for the nuisance parameters, and decision trees of varying depth for the tree-fitting Super Learner. Users are welcome to pass their own stacks of learners to CVtreeMLE if they feel the default estimators are insufficient for the complexity of their data. To help with computation time, CVtreeMLE uses the future package for sequential and parallel processing. The default is to parallelize across the folds (parallel_cv = TRUE) with a multisession plan, because we expect most users to be working in RStudio, where multicore is not supported. A user should therefore expect num_cores cores to be in use, each with high CPU usage, because the data and models are copied to each worker (unlike multicore, where the data are not copied). As we saw above, this example, a dataset with 500 observations and 7 exposures, took about 5 minutes with 5 folds. Thus, with the recommended 10-fold CV on a standard environmental epidemiology dataset of 500-1000 observations, CVtreeMLE should run easily on a modern local machine.
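For instance, a custom sl3 learner stack might look like the sketch below. This is only an illustration of how such a stack can be built with sl3; the specific learners and tuning values are arbitrary, and the CVtreeMLE arguments that accept custom stacks are documented in ?CVtreeMLE rather than shown here.

library(sl3)

# A small custom Super Learner stack for the nuisance regressions (illustrative)
custom_stack <- Stack$new(
  Lrnr_glm$new(),                    # main-terms GLM
  Lrnr_glmnet$new(),                 # elastic net
  Lrnr_ranger$new(num.trees = 250),  # random forest
  Lrnr_xgboost$new(nrounds = 100)    # gradient boosting
)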

References


