knitr::opts_chunk$set(
  cache = TRUE,
  collapse = TRUE,
  comment = "#>",
  fig.align = "left"
)

Warning: This vignette is still under construction!

As it turns out, the PDP-based variable importance (VI) measure discussed in @greenwell-simple-2018 (and provided by method = "pdp") can also be used to quantify the strength of potential interaction effects. Let $i\left(x_i, x_j\right)$ $\left(i \ne j\right)$ be the standard deviation of the joint partial dependence values $\bar{f}_{ij}\left(x_{ii'}, x_{jj'}\right)$ for $i' = 1, 2, \dots, k_i$ and $j' = 1, 2, \dots, k_j$. Essentially, a weak interaction effect of $x_i$ and $x_j$ on $Y$ would suggest that $i\left(x_i, x_j\right)$ has little variation when either $x_i$ or $x_j$ is held constant while the other varies.

Let $\boldsymbol{z}_s = \left(x_i, x_j\right)$, $i \neq j$, be any two predictors in the feature space $\boldsymbol{x}$. Construct the partial dependence function $\bar{f}_s\left(x_i, x_j\right)$ and compute $i\left(x_i\right)$ for each unique value of $x_j$, denoted $i\left(x_i | x_j\right)$, and take the standard deviation of the resulting importance scores. The same can be done for $x_j$ and the results averaged together. Large values (relative to each other) would be indicative of possible interaction effects.
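To make the recipe concrete, below is a minimal, purely illustrative sketch of the computation just described (in practice this is handled for you by vint(), shown later). The helper pd_interaction() and its input pd are hypothetical; pd is assumed to be a matrix of joint partial dependence values whose rows are indexed by the unique values of $x_i$ and whose columns by the unique values of $x_j$.

# Illustrative only: interaction statistic from a matrix `pd` of joint partial
# dependence values (rows = unique values of x_i, columns = unique values of x_j)
pd_interaction <- function(pd) {
  imp.i.given.j <- apply(pd, MARGIN = 2, FUN = sd)  # i(x_i | x_j), one value per unique x_j
  imp.j.given.i <- apply(pd, MARGIN = 1, FUN = sd)  # i(x_j | x_i), one value per unique x_i
  mean(c(sd(imp.i.given.j), sd(imp.j.given.i)))     # average the two standard deviations
}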

Prerequisites

# Load required packages
library(dplyr)           # for data wrangling
library(gbm)             # for fitting generalized boosted models
library(ggplot2)         # for general visualization
library(gridExtra)       # for arranging multiple plots side by side
library(lattice)         # for general visualization
library(NeuralNetTools)  # for various tools for neural networks
library(nnet)            # for fitting neural networks w/ a single hidden layer
library(vip)             # for visualizing feature importance

Friedman's regression problem

To illustrate, we'll use one of the regression problems described in @multivariate-friedman-1991 and @bagging-breiman-1996. The feature space consists of ten independent $\mathcal{U}\left(0, 1\right)$ random variables; however, only five out of these ten actually appear in the true model. The response is related to the features according to the formula $$ Y = 10 \sin\left(\pi x_1 x_2\right) + 20 \left(x_3 - 0.5\right) ^ 2 + 10 x_4 + 5 x_5 + \epsilon, $$ where $\epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$. Using the R package nnet [@venables-modern-2002], we fit a NN with one hidden layer containing eight units and a weight decay of 0.01 (these parameters were chosen using 5-fold cross-validation) to 500 observations simulated from the above model with $\sigma = 1$. The cross-validated $R^2$ value was 0.94.

VIPs for the fitted network are displayed in Figure 1. Notice how the Garson and Olden algorithms incorrectly label some of the features not in the true model as "important". For example, the Garson algorithm incorrectly labels $x_8$ (which is not included in the true model) as more important than $x_5$ (which is in the true model). Similarly, Olden's method incorrectly labels $x_{10}$ as being more important than $x_2$. Our method, on the other hand, clearly labels all five of the predictors in the true model as the most important features in the fitted NN.

# Simulate data
trn <- gen_friedman(500, seed = 101)

# Fit a neural network to simulated data
set.seed(103)
nn <- nnet(y ~ ., data = trn, size = 8, linout = TRUE, decay = 0.01,
           maxit = 1000, trace = FALSE)
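
The size and decay values used above were chosen via 5-fold cross-validation. As a purely illustrative sketch, such a search could be carried out with the caret package (not loaded in the prerequisites above); the tuning grid and seed below are arbitrary.

# Illustrative only: 5-fold cross-validated grid search over size and decay
# (assumes the caret package is installed; the grid and seed are arbitrary)
library(caret)
set.seed(102)
nn.cv <- train(
  y ~ ., data = trn,
  method = "nnet",
  linout = TRUE, trace = FALSE, maxit = 1000,
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(size = 2:10, decay = c(0.001, 0.01, 0.1))
)
nn.cv$bestTune  # cross-validated choice of size and decay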

# VIP: partial dependence algorithm
p1 <- vip(nn, method = "pdp", feature_names = paste0("x.", 1:10)) +
  labs(x = "", y = "Importance", title = "PDP method") +
  theme_light()

# VIP: Garson's algorithm
nn_garson <- garson(nn, bar_plot = FALSE) %>%
  tibble::rownames_to_column("Variable") %>%
  select(Variable, Importance = rel_imp)
p2 <- ggplot(nn_garson, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col() +
  labs(x = "", y = "Importance", title = "Garson's method") +
  coord_flip() +
  theme_light()

# VIP: Olden's algorithm
nn_olden <- olden(nn, bar_plot = FALSE) %>%
  tibble::rownames_to_column("Variable") %>%
  select(Variable, Importance = importance)
p3 <- ggplot(nn_olden, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col() +
  labs(x = "", y = "Importance", title = "Olden's method") +
  coord_flip() +
  theme_light()

# Display plots side by side
grid.arrange(p1, p2, p3, ncol = 3)

We also constructed the partial dependence functions for all pairwise interactions and computed the interaction statistic discussed in @greenwell-simple-2018. The top ten interaction statistics are displayed in Figure 2. There is a clear indication of an interaction effect between features $x_1$ and $x_2$, the only interaction effect present in the true model.

#
# The following parallel backend will only work on UNIX-like systems
# 

# Compute interaction statistics in parallel
library(doParallel)  # any parallel backend supported by the foreach package should work
registerDoParallel(cores = 4)  # use four cores
int <- vint(nn, feature_names = paste0("x.", 1:10), parallel = TRUE)

ggplot(int[1:10, ], aes(reorder(Variables, -Interaction), Interaction)) +
  geom_col(width = 0.75) +
  labs(x = "", y = "Interaction strength") +
  theme_light() +  # apply the complete theme first so it does not override the tweak below
  theme(axis.text.x = element_text(angle = 45, vjust = 1))

In fact, since we know the distributions of the predictors in the true model, we can work out the true partial dependence functions. For example, for the pairs $\left(x_1, x_2\right)$ and $\left(x_1, x_4\right)$, we have $$ f\left(x_1, x_2\right) = 10 \sin \left(\pi x_1 x_2\right) + 55 / 6, $$ and $$ f\left(x_1, x_4\right) = \frac{5 \left[\pi x_1 \left(12 x_4 + 5\right) - 12 \cos \left(\pi x_1\right) + 12\right]}{6 \pi x_1}. $$ Next, we simulated the standard deviation of $f\left(x_1, x_2\right)$ for a wide range of fixed values of $x_2$; this is what $i\left(x_1 | x_2\right)$ is trying to estimate. The results from doing this for both predictors in each pair are displayed in Figure 3. The top row of Figure 3 illustrates that the importance of $x_1$ (i.e., the strength of its relationship to the predicted outcome) heavily depends on the value of $x_2$ and vice versa (i.e., an interaction effect between $x_1$ and $x_2$). The bottom row, on the other hand, shows that the importance of $x_1$ does not depend on the value of $x_4$ and vice versa (i.e., no interaction effect between $x_1$ and $x_4$).
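These expressions follow from averaging the true regression function over the remaining (independent, standard uniform) features. As a quick check of the second one, $$ E\left[20\left(x_3 - 0.5\right)^2\right] + E\left[5 x_5\right] = \frac{5}{3} + \frac{5}{2} = \frac{25}{6} \quad \text{and} \quad \int_0^1 10 \sin\left(\pi x_1 x_2\right) dx_2 = \frac{10\left[1 - \cos\left(\pi x_1\right)\right]}{\pi x_1}, $$ and adding $10 x_4$ and combining everything over the common denominator $6 \pi x_1$ recovers $f\left(x_1, x_4\right)$ above (the constant $55/6$ in the first expression arises the same way, as $5/3 + 5 + 5/2$).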

# Simulation function: given a pair of predictors and the true joint partial
# dependence function for that pair, plot the partial dependence surface and
# approximate the conditional importance of each predictor given the other
sim_fun <- function(pred.var, pd.fun, xlabs = c("", ""), ylabs = c("", "")) {
  # Joint partial dependence surface over a 100 x 100 grid
  x <- y <- seq(from = 0, to = 1, length = 100)
  xy <- expand.grid(x, y)
  z <- apply(xy, MARGIN = 1, FUN = function(x) {
    pd.fun(x[1L], x[2L])
  })
  res <- as.data.frame(cbind(xy, z))
  names(res) <- c(pred.var, "yhat")
  form <- as.formula(paste("yhat ~", paste(pred.var, collapse = " * ")))
  p1 <- levelplot(form, data = res, col.regions = viridis::viridis,
                  xlab = xlabs[1], ylab = xlabs[2])
  # Monte Carlo approximation of the SD of the partial dependence over one
  # predictor while the other is held fixed
  approxVar.x <- function(x = 0.5, n = 100000) {  # second predictor varies, first fixed at x
    y <- runif(n, min = 0, max = 1)
    sd(pd.fun(x, y))
  }
  approxVar.y <- function(y = 0.5, n = 100000) {  # first predictor varies, second fixed at y
    x <- runif(n, min = 0, max = 1)
    sd(pd.fun(x, y))
  }
  x <- seq(from = 0, to = 1, length = 100)
  y1 <- sapply(x, approxVar.x)
  y2 <- sapply(x, approxVar.y)
  p2 <- xyplot(SD ~ x, data = data.frame(x = x, SD = y1), type = "l",
               lwd = 1, col = "black",
               ylim = c(min(y1, na.rm = TRUE) - 1, max(y1, na.rm = TRUE) + 1),
               xlab = xlabs[1], ylab = ylabs[1])
  p3 <- xyplot(SD ~ x, data = data.frame(x = x, SD = y2), type = "l",
               lwd = 1, col = "black",
               ylim = c(min(y2, na.rm = TRUE) - 1, max(y2, na.rm = TRUE) + 1),
               xlab = xlabs[2], ylab = ylabs[2])
  list(p1, p2, p3)
}

# Run simulations: sim1 uses the true partial dependence function for the
# interacting pair (x1, x2); sim2 uses the one for the additive pair (x1, x4)
sim1 <- sim_fun(
  pred.var = c("x.1", "x.2"), 
  xlabs = c(expression(x[1]), expression(x[2])),
  ylabs = c(expression(imp ~ (x[2]*" | "*x[1])), 
            expression(imp ~ (x[1]*" | "*x[2]))),
  pd.fun = function(x1, x2) {
    10 * sin(pi * x1 * x2) + 55 / 6
  })
sim2 <- sim_fun(
  pred.var = c("x.1", "x.4"), 
  xlabs = c(expression(x[1]), expression(x[4])),
  ylabs = c(expression(imp ~ (x[4]*" | "*x[1])), 
            expression(imp ~ (x[1]*" | "*x[4]))),
  pd.fun = function(x1, x4) {
    5 * (pi * x1 * (12 * x4 + 5) - 12 * cos(pi * x1) + 12) / (6 * pi * x1)
  })

# Display plots side by side
grid.arrange(
  sim1[[1]], sim1[[2]], sim1[[3]], 
  sim2[[1]], sim2[[2]], sim2[[3]], 
  nrow = 2
)

Friedman's $H$-statistic

An alternative measure for the strength of interaction effects is known as Friedman's $H$-statistic [@friedman-2008-predictive]. Coincidentally, this method is also based on the estimated partial dependence functions of the corresponding predictors, but uses a different approach.
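
Roughly speaking, if $\bar{f}_i$, $\bar{f}_j$, and $\bar{f}_{ij}$ denote the (mean-centered) partial dependence functions evaluated at the $n$ training observations, the pairwise statistic from @friedman-2008-predictive can be written as $$ H_{ij}^2 = \frac{\sum_{k = 1}^n \left[\bar{f}_{ij}\left(x_{ik}, x_{jk}\right) - \bar{f}_i\left(x_{ik}\right) - \bar{f}_j\left(x_{jk}\right)\right]^2}{\sum_{k = 1}^n \bar{f}_{ij}^2\left(x_{ik}, x_{jk}\right)}, $$ so it measures the fraction of the variance of the joint partial dependence function not captured by the two additive main-effect components; values near zero indicate little or no interaction.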

For comparison, we fit a GBM to the Friedman regression data from the previous section. The parameters were chosen using 5-fold cross-validation. We used the R package gbm [@gbm-pkg], which has built-in support for computing Friedman's $H$-statistic for any combination of predictors. The results are displayed in Figure 4. To our surprise, the $H$-statistic did not seem to catch the true interaction between $x_1$ and $x_2$. Instead, the $H$-statistic ranked the pairs $\left(x_8, x_9\right)$ and $\left(x_7, x_{10}\right)$ as having the strongest interaction effects, even though these predictors do not appear in the true model. Our VI-based interaction statistic, on the other hand, clearly suggests the pair $\left(x_1, x_2\right)$ as having the strongest interaction effect.

# Fit a GBM to the simulated Friedman data
set.seed(937)
trn.gbm <- gbm(y ~ ., data = trn, distribution = "gaussian", n.trees = 25000,
               shrinkage = 0.01, interaction.depth = 2, bag.fraction = 1,
               train.fraction = 0.8, cv.folds = 5, verbose = FALSE)
best.iter <- gbm.perf(trn.gbm, method = "cv", plot.it = FALSE)

# Friedman's H-statistic for every pairwise combination of features
combns <- t(combn(paste0("x.", 1:10), m = 2))
int.h <- numeric(nrow(combns))
for (i in 1:nrow(combns)) {
  # print(paste("iter", i, "of", nrow(combns)))  # optional progress message
  int.h[i] <- interact.gbm(trn.gbm, data = trn, i.var = combns[i, ],
                           n.trees = best.iter)
}
int.h <- data.frame(x = paste0(combns[, 1L], "*", combns[, 2L]), y = int.h)
int.h <- int.h[order(int.h$y, decreasing = TRUE), ]

# VI-based interaction statistic
int.i <- vint(
  object = trn.gbm,                    # fitted model object
  feature_names = paste0("x.", 1:10),  # features for which to compute pairwise interaction statistics
  n.trees = best.iter,                 # needed if object is of class "gbm"
  parallel = TRUE
)

# Plot Friedman's H-statistics
p1 <- ggplot(int.h[1:10, ], aes(reorder(x, y), y)) +
  geom_col(width = 0.75) +
  labs(x = "", y = "Interaction strength", title = "Friedman's H-statistic") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1)) +
  coord_flip()

# Plot PDP-based interaction statistics
p2 <- ggplot(int.i[1:10, ], aes(reorder(Variables, Interaction), Interaction)) +
  geom_col(width = 0.75) +
  labs(x = "", y = "Interaction strength", title = "Partial dependence") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1)) +
  coord_flip()

# Display plots side by side
grid.arrange(p1, p2, ncol = 2)

References


