library("knitr")
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The vibr package contains functions to estimate variable importance based on a stochastic intervention that increases each continuous predictor by a small amount, or increases the probability of a binary predictor by the same small amount. The underlying methodology is based on papers by Ivan Diaz and Mark van der Laan, with some improvements suggested by Haneuse and Rotnitzky. There are also functions to help diagnose potential identifiability issues, to ensure, as much as possible, that variable importance is grounded in the context of a small change in exposure that has good support in the data. We start with a quick example using simulated data for a continuous health outcome and a small number of well-water contaminants. The package supports doubly robust estimation (targeted maximum likelihood and augmented inverse probability weighting) as well as modern techniques to reduce bias and speed up convergence (in sample size) of these methods (double cross-fitting and repeated double cross-fitting). The underlying statistical models draw on sl3, an R package with a unified framework for machine learning and ensemble machine learning based around the super learner algorithm (stacking).
Because sl3 is not available via CRAN (so you cannot install it with install.packages), you have to install it via install_github, which is available in either the remotes or devtools package. CRAN does not allow packages with dependencies outside of CRAN, so vibr must be installed the same way unless and until sl3 is released on CRAN.
#devtools::install_github("tlverse/sl3", build_vignettes = TRUE, dependencies = TRUE) # may need to separately install the "R.rsp" package
devtools::install_github("alexpkeil1/vibr", build_vignettes = TRUE, dependencies = TRUE) # this will install the current version of sl3 if it is not already installed
library("vibr") library("qgcomp")
data(metals, package="qgcomp")
XYlist = data.frame(metals[,1:23], Y=metals$y+10) # adding a constant to Y purely for illustration
# split into training and validation data
spldat <- qgcomp::split_data(XYlist)
trdat <- spldat$traindata
vdat <- spldat$validdata
Y_learners = .default_continuous_learners()
Xbinary_learners = list(sl3::Lrnr_glm$new(name="logit"))
Xdensity_learners = .default_density_learners(n_bins=c(5))
X = trdat[,c(1:5,23)] # selecting a subset to make the example run faster
Y = trdat$Y
# basic variable importance using targeted maximum likelihood
vi <- varimp(X=X, Y=Y, delta=0.1,
             Y_learners = Y_learners,
             Xdensity_learners=Xdensity_learners[1:2],
             Xbinary_learners=Xbinary_learners,
             estimator="TMLE" # targeted maximum likelihood (AIPW, GCOMP, IPW, and cross-fit versions of these are all available)
             )
vi
The basic idea behind variable importance is that one wants to rank predictors based on how strongly they predict the target (e.g. a health outcome), conditional on other predictors. Several variable importance measures are algorithm specific (e.g. random forest node-purity-based importance, linear model coefficients) and are not generally applicable. Others are generally applicable: for example, we could compare the change in the predicted outcome under a population-wide change from exposed to unexposed for some binary predictor, but such importance measures often map back to contrasts that simply are not well supported by the data.
The stochastic intervention approach assesses variable importance via small changes to each continuous predictor, or to the propensity score of each binary predictor, which keeps the assessment in a region of the predictor space that we expect to have good support from the data. Thus, this approach focuses on a measure of variable importance that is closely tied to the population distribution of predictors at hand. The output looks much like output from a set of linear model coefficients, but it represents a generalization of those coefficients: a standard linear model coefficient represents the expected change in the mean target value for a one unit change in the predictor, whereas the importance estimates here represent the expected change in the mean target value under an increase of delta in each predictor (or in the propensity score of a binary predictor).
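As a rough, informal check of that analogy (this is a sketch, not part of the package's workflow, and it reuses the X, Y, and delta values created above), one can compare delta times a main-terms linear model coefficient with the corresponding row of the varimp output; under an approximately linear predictor-outcome relationship the two should be of similar magnitude.

# informal sketch: compare delta * (linear model coefficient) with the varimp estimate
lmfit <- lm(Y ~ ., data = data.frame(X, Y = Y))
coef(lmfit)["arsenic"] * 0.1 # roughly comparable to the arsenic row of the vi output above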
Flexible modeling of the target is farmed out to the sl3 package, but some defaults are provided in the vibr package. The vibr package allows assessment of variable importance via inverse probability weighting (which requires additional, possibly flexible, models for each predictor, conditional on all other predictors), g-computation (which requires a model for the target), targeted maximum likelihood estimation, and augmented inverse probability weighting (where the latter two require models for both the target and the predictors).
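As a sketch (reusing the objects created in the example above, and illustrative rather than a recommended default), the estimator argument is all that needs to change to switch among these approaches:

# inverse probability weighting: relies on the predictor (generalized propensity) models
vi_ipw <- varimp(X=X, Y=Y, delta=0.1,
                 Y_learners = Y_learners,
                 Xdensity_learners=Xdensity_learners[1:2],
                 Xbinary_learners=Xbinary_learners,
                 estimator="IPW")
# g-computation: relies on the outcome model
vi_gcomp <- varimp(X=X, Y=Y, delta=0.1,
                   Y_learners = Y_learners,
                   Xdensity_learners=Xdensity_learners[1:2],
                   Xbinary_learners=Xbinary_learners,
                   estimator="GCOMP")
vi_ipw
vi_gcomp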
Regions near zero on this plot reflect observations of predictors (or "shifted" predictors, which are the values of the predictors after applying the hypothetical stochastic intervention) that have low estimated probability under the predictor models. This plot also demonstrates, informally, how "far" the estimated distribution of the observed predictors is from the estimated distribution of the shifted predictors. One may see whether shifted predictors have much lower density, which could signal poor support for the stochastic intervention. Large differences between the (sorted) distributions in these plots should prompt caution due to the potential for influential observations, but they are not, in themselves, diagnostics of influential observations. Contrast the difference between these two plots, which use different magnitudes of the stochastic intervention.
vibr::plotshift_dens(vi, Acol=1, delta = 0.1)
vibr::plotshift_dens(vi, Acol=1, delta = 0.01)
We can also look at the stochastic interventions in terms of changes to the original exposure data and look for evidence of extrapolation in bivariate settings. For example, if a proposed value of delta pushes the shifted predictors far away from the observed data in a bivariate plot, then the value of delta should probably be reduced. Note that bivariate plots paint a highly optimistic picture of identification, since more extrapolation will occur in higher dimensions. This diagnostic plot is meant as a rough reality check on the size of delta.
vibr::plotshift_scatter(vi, Acol=2, Bcol=3, delta = 0.1)
vibr::plotshift_scatter(vi, Acol=2, Bcol=3, delta = 0.01)
We can further plot the estimated density (horizontal axis) against the ratio of the estimated density under a stochastic intervention (intervention exposure = observed exposure - delta) to the estimated density of the observed data. This gives an indication of influential observations (high weights) and allows one to cross-check against the estimated density in the observed data. Contrast the two (very different) interventions below, where the small intervention (delta = 0.001) implies that only a few observations would have a different probability density estimate for chloride. The second plot indicates that the intervention is relatively small (which will be well supported, but may also not be very informative in terms of clinically relevant shifts in predictors).
vibr::plotshift_wt(vi, Acol=5, delta = 0.1)
vibr::plotshift_wt(vi, Acol=5, delta = 0.001)
Rather than jumping straight into analysis, as we did above, it is worth checking identifiability prior to fitting models. This can be done using precheck_identification:
suppressWarnings(ident <- precheck_identification(X=X, delta=0.01,
    Xdensity_learners=Xdensity_learners[c(1,2,3)],
    Xbinary_learners=Xbinary_learners,
    threshold=10, scale_continuous = TRUE))
ident
This analysis tells us that about 1% of weights for barium and chloride exceed 10, with some observations having weights as high as 16.7. This is not necessarily worrisome, but it is useful to keep in mind when considering how influential some observations may be on the final results. Much higher weights would be more worrisome.
In contrast, a delta of 0.1 would have 11% of observations with a weight for chloride > 10. This suggests that delta = 0.1 is too large.
The weight distribution for mage35 (a binary variable) appears to have a few extreme estimates. We will get to that in a bit.
suppressWarnings(ident2 <- precheck_identification(X=X, delta=0.1,
    Xdensity_learners=Xdensity_learners[c(1,2,3)],
    Xbinary_learners=Xbinary_learners,
    threshold=10, scale_continuous = TRUE))
ident2
A delta of 0.02 seems reasonable: not so large that the weights become extreme, but also not as small as we could go. This (informally) supports a balance between clinical relevance and identifiability. Note that the output of this function also lets us know that some of the learners have failed; those can be dropped in subsequent analyses.
suppressWarnings(ident3 <- precheck_identification(X=X, delta=0.02,
    Xdensity_learners=Xdensity_learners[c(1,2,3)],
    Xbinary_learners=Xbinary_learners,
    threshold=10, scale_continuous = TRUE))
ident3
It's not a bad idea to check different learner sets to assess possible sensitivity of results. Here, we specify a different list of learners manually; the learner sets are just R lists of sl3 learners, and you can learn how to specify them from the sl3 documentation. Super learner is automatically used with each of these learner libraries, and there is (at this time) no user control over how that works. If you do not wish to use super learner, you can specify a single learner (e.g. Xbinary_learners above just uses a main-term logistic model to estimate propensity scores). Since the mage35 weights looked bad, we'll try using super learner to estimate the weights, including some simpler models ("mean", an intercept-only model, and stepwise regression). The density learners include one learner that first uses principal component analysis to reduce dimensionality and then estimates the density based on that reduced set (referred to in the sl3 documentation as a 'pipeline', so that the 'learner' is a combination of PCA + Gaussian regression). Such an approach may be useful for high-dimensional variable sets. As you can see, the weights for mage35 are better behaved (we don't expect these to have a mean of 1.0, as stabilized weights often do).
Xbl = list(
    Lrnr_mean$new(name="mean"),
    Lrnr_glm$new(name="logit"),
    Lrnr_stepwise$new()
)
Xdl = list(
    Lrnr_density_gaussian$new(name="ols_dens"),
    Pipeline$new(Lrnr_pca$new(name="pca"), Lrnr_density_gaussian$new(name="dens_gaussian_glm")),
    Lrnr_density_semiparametric$new(name="dens_sp_unadj", mean_learner = Lrnr_mean$new(), var_learner = NULL)
)
suppressWarnings(ident3b <- precheck_identification(X=X, delta=0.02,
    Xdensity_learners=Xdl, Xbinary_learners=Xbl,
    threshold=10, scale_continuous = TRUE))
ident3b
Now we can fit the model with our selected value of delta after adding some learners for the outcome regression.
Yl = list(
    Lrnr_mean$new(name="mean"),
    Lrnr_glm$new(name="ols"),
    Pipeline$new(Lrnr_pca$new(name="pca"), Lrnr_glm$new(name="ols", family=gaussian())),
    Pipeline$new(Lrnr_screener_importance$new(name="rfimpscreen", learner=Lrnr_randomForest$new()),
                 Lrnr_glm$new(name="OLS", family=gaussian()))
)
# basic variable importance using augmented inverse probability weighting
vi2 <- varimp(X=X, Y=Y, delta=0.02,
              Y_learners = Yl,
              Xdensity_learners=Xdl,
              Xbinary_learners=Xbl,
              estimator="AIPW"
              )
vi2
This result gives roughly the same statistical inference (and importance ranking) as the TMLE result given above, despite using a different intervention size and estimator. The interpretation is that calcium is the most important predictor of the outcome among this set, both in terms of effect size (Rank(|estimate|): ranking by absolute effect size) and in terms of statistical evidence (Rank(|z|): ranking by Z-score).
Had we used the same intervention, TMLE and AIPW would often give very similar answers, so the choice between these approaches is driven mainly by investigator preference rather than objective criteria.
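As a sketch of that comparison (reusing the Yl, Xdl, and Xbl learner lists and the delta of 0.02 from above; illustrative rather than part of the recommended workflow), one can simply refit with estimator="TMLE" and compare the output with vi2:

# refit with TMLE at the same delta to compare against the AIPW result in vi2
vi2_tmle <- varimp(X=X, Y=Y, delta=0.02,
                   Y_learners = Yl,
                   Xdensity_learners=Xdl,
                   Xbinary_learners=Xbl,
                   estimator="TMLE")
vi2_tmle
vi2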
Sometimes it is useful to apply sample splitting when using highly flexible machine learning algorithms, such as random forest. Here we use 5-fold double cross-fitting (K-fold, where K is an odd number > 2). This scheme splits the data into 5 equally sized partitions, and, for each "fold", generalized propensity models (predictor models) are fit in two other folds and an outcome regression model is fit in the remaining folds. The fitted models are then used to generate predicted outcomes and generalized propensity scores in the main fold. This process is repeated for each of the folds, so that each predicted outcome/propensity score is estimated from data that were not used to fit the model. The whole process can be averaged over multiple different 5-way partitionings, which reduces noise due to the random selection of partitions (here it is done over only a single partitioning, to reduce computation time, via foldrepeats = 1). Note that inference for this relatively simple problem is similar to simply using the original TMLE estimator, but cross-fitting may give estimators with better performance in some scenarios (e.g. when estimators of the predictor or outcome models do not meet "Donsker" conditions, which are required for asymptotic variance estimators to be valid). Good general guidance is to simulate, as closely as possible, the applied data example of interest and study whether the proposed estimators yield valid inference (approximately unbiased, with approximately nominal confidence interval coverage). Cross-fitting is generally safe, but it will tend to result in larger standard errors (due to sample splitting; compare standard error estimates from the vi2 object with those from the vi3 object) and can be extremely time-intensive.
# basic variable importance from cross-fit TMLE
# library(future)
# future::plan("multisession")
set.seed(12345)
vi3 <- varimp(X=X, Y=Y, delta=0.02,
              Y_learners = c(Yl, sl3::Lrnr_randomForest$new()), # added random forest from the sl3/randomForest packages
              Xdensity_learners=Xdl,
              Xbinary_learners=Xbl,
              estimator="TMLEX",
              xfitfolds = 5,
              foldrepeats = 1
              )
vi3
Note that many combinations of xfitfolds and foldrepeats are possible. Generally, as either of these increases, the computational demands will also increase. The code block above (where vi3 is estimated) includes commented-out code that, if uncommented, would allow parallel processing via the "future" package to reduce computation time.
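For example (a sketch only, reusing the objects defined above; the choice of backend and the number of repeats here are arbitrary illustrations, not recommendations), parallel processing and repeated partitioning can be combined like this:

# enable parallel processing via the future package, then average over repeated partitionings
library(future)
future::plan("multisession")
set.seed(54321)
vi3r <- varimp(X=X, Y=Y, delta=0.02,
               Y_learners = Yl,
               Xdensity_learners=Xdl,
               Xbinary_learners=Xbl,
               estimator="TMLEX",
               xfitfolds = 5,
               foldrepeats = 3 # average estimates over 3 random 5-way partitionings
               )
vi3r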
The big difference between variable importance and causal effects is that variable importance is a purely statistical concept, whereas a causal effect combines a statistical result with causal identification conditions. From a strictly causal perspective, the framework of vibr is not (by default) suited for causal estimation, because no consideration is given to causal relationships among the variables. For estimators that use generalized propensity scores (IPW, TMLE, AIPW), all exposures are modeled conditional on all other exposures. This may be appropriate in some very limited cases, but in general this approach will be inappropriate (causally speaking) for at least some of the predictors. One way to craft a more thoughtful causal approach using vibr is to consider exposures one at a time and select control variables that are appropriate for each variable for which a causal effect is to be estimated. Here, we use the "W" argument to varimp to declare the control variables (often referred to as confounders in epidemiology) separately from the primary exposure of interest, arsenic. Effectively, this prevents fitting generalized propensity score models for the variables in W, so it can be much more efficient if you only want a single causal quantity.
In this particular implementation, there is no difference in the modeling approach for arsenic itself, because all of the variables in W were previously in X. In general, approaches to causal effect estimation should be guided by a causal model such as that provided by a directed acyclic graph.
# now focus only on arsenic
suppressWarnings(ident4 <- precheck_identification(X=X, delta=0.2,
    Xdensity_learners=Xdl, Xbinary_learners=Xbl,
    threshold=10, scale_continuous = TRUE))
ident4
# 0.2 or lower seems reasonable for arsenic
set.seed(12345)
newX = X[, "arsenic", drop=FALSE] # key: X must still be a data frame
W = X[,-which(names(X)=="arsenic")]
suppressWarnings(vi3 <- varimp(X=newX, W=W, Y=Y, delta=0.15,
    Y_learners = Yl,
    Xdensity_learners=Xdl,
    Xbinary_learners=Xbl,
    estimator="TMLE"
    ))
vi3
Many more examples from the vibr package can be seen at https://github.com/alexpkeil1/vibr.