Assessing Variable Importance for Predictive Models of Arbitrary Type
In datarobot: 'DataRobot' Predictive Modeling API

Linear regression provides the historical and conceptual basis for much of the practical art of predictive modeling. Key advantages of linear regression models are that they are both easy to fit to data and easy to interpret and explain to end users. Unfortunately, these models often predict poorly relative to newer approaches from the machine learning literature like support vector machines, gradient boosting machines, or random forests. A significant price paid for the sometimes dramatically better performance of these alternative models is a greater difficulty of interpretation and explanation. To address one aspect of this problem, this vignette considers the problem of assessing variable importance for a prediction model of arbitrary type, adopting the well-known random permutation-based approach, and extending it to consensus-based measures computed from results for a large collection of models. In particular, this vignette uses the datarobot R package, which allows R users to interact with the DataRobot modeling engine, a massively parallel architecture for simultaneously fitting many models to a single dataset. These models cover a wide range of types, including constrained and unconstrained linear regression models, machine learning models like the ones listed above, and ensemble models that combine the predictions of the most successful of these individual models in several different ways. To demonstrate the practical utility of the ideas presented here, this approach is applied to a simulation-based dataset generated using the mlbench.friedman1 function from the R package mlbench, where it is possible to judge the reasonableness of the results relative to a "correct answer." For a more detailed introduction to the DataRobot package and its capabilities, refer to the companion vignette, "Introduction to the DataRobot R Package."

1. Introduction

To help understand the results obtained from complex machine learning models like random forests or gradient boosting machines, a number of model-specific variable importance measures have been developed. As @gromping09 notes, however, there is no general consensus on the "best" way to compute - or even define - the concept of variable importance. (For an illuminating list of possible approaches, consult the help file for the varImp function in the caret package [@caret]: this list does not cover all methods that have been proposed, or even all methods available in R, but it clearly illustrates the range of methods available.) For the application of primary interest here - i.e., understanding the aggregate behavior of a collection of different type models like those generated by the DataRobot modeling engine - a "good" variable importance measure should exhibit these characteristics:

it should be applicable to any model type;
it should be sensible to combine or compare results across different model types.

One strategy that meets both of these criteria is that suggested by @friedman01, applying a random permutation to each covariate and thus effectively removing it from the model, and then examining the impact of this change on the predictive performance of the resulting model. The basic idea is that if we remove an important covariate from consideration, the best achievable model performance should suffer significantly, while if we remove an unimportant covariate, we should see little or no change in performance.

2. Permutation-based significance measures

Ideally, we would like to implement the strategy just described by applying many random permutations and combining the results, but this ideal approach is computationally expensive. In his application to gradient boosting machines, @friedman01 could "borrow strength" by combining the random covariate permutations with the random subsampling inherent in the model fitting procedure. In characterizing arbitrary models, however, we cannot exploit their internal structure, but we can borrow strength by combining results across the different models we are fitting. Thus, the approach taken here consists of the following steps:

Run a DataRobot modeling project based on the original data and retrieve the results;
For each covariate whose influence we are interested in assessing: a. Generate a modified dataset, replacing the covariate with its random permutation, leaving all other covariates and the target variable unmodified; a. Set up and run a new modeling project based on this modified dataset; a. Retrieve the project information and compute the performance differences between the models in this project and the same models in the original project.
Characterize each covariate, in one of the following three ways: a. Select an individual model of particular interest (e.g., the model that performs best in the original project) and compute its performance degradation; a. Compute the average performance degradation over all project models; a. Compute a performance-weighted average of the degradation, like that defined in Section 5.

The following sections apply the procedure just described to a simulation-based example where we can make reasonable judgements about the relative importance of the covariates.

3. A simulation-based dataset

The basis for the example considered here is the simulation function mlbench.friedman1 from the R package mlbench [@leisch10]. This function simulates 10 independent, uniformly distributed random variables on the unit interval, $x_1$ through $x_{10}$, and generates the response variable $y$ defined by the first five of these covariates:

$$ y = 10 \sin (\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5. $$

The user of the mlbench.friedman1 simulation function specifies the number $n$ of samples to be generated, and the standard deviation $\sigma$ of a zero-mean, independent, identically distributed, Gaussian noise sequence added to the simulated $y$ values. In the example considered here, $n = 5000$, and the noise standard deviation had the default value, $\sigma = 1$. To guarantee repeatability, the function set.seed(33) was first called to initialize the random number generators; the result was saved in the CSV file Friedman1.csv.

Based on the structure of this simulator, we may classify the covariates into an influential subset ($x_1$ through $x_5$) and an irrelevant subset ($x_6$ through $x_{10}$). Also, from the model coefficients and the functions involved, we expect $x_4$ to be the most important variable, probably followed by $x_1$ and $x_2$, both comparably important, with $x_5$ probably less important. The influence of $x_3$ is somewhat more difficult to assess due to its quadratic dependence, but it seems likely that this nonlinearity will suppress this variable's influence since the total range of this term is from $0$ to $5$, the same as the $x_5$ term.

4. Results for the simulation data

To explore the utility of the random permutation-based strategy described in Section 3, eleven DataRobot projects were created. The first was based on the unmodified simulation dataset, specifying $y$ as the target variable and $x_1$ through $x_{10}$ as prediction covariates. This project minimizes root mean square prediction error (RMSE), the default fitting metric chosen by DataRobot:

friedman <- read.csv(system.file("extdata", "Friedman1.csv.gz", package = "datarobot"))
originalProject <- StartProject(friedman, "OriginalProject", target = "Y", wait = TRUE)
originalModels <- ListModels(originalProject)

As discussed in the companion vignette, "Introduction to the DataRobot R Package," the above sequence of model-building steps begins with the StartProject function, which uploads the modeling dataset Friedman1.csv, creates the project, gives it the name "OriginalProject", specifies that the response variable to be predicted by all project models is Y. Because the model-building process does require some time to complete, the parameter wait = TRUE is used to delay the call to ListModels, requesting all project model information, until the model-building process has completed.

All other projects were set up similarly, except that a single random permutation was applied to covariate $x_i$ for the $i^{th}$ project. The random permutations were generated using the R function PermuteColumn, which has three required parameters: originalFile, the name of the CSV file containing the unmodified data (here, Friedman1.csv); colName, the name of the column in this file containing the variable whose influence we wish to assess; and permutedFile, the name of the CSV file where the modified data will be stored. The R code is:

PermuteColumn <- function(originalFile, colName, permutedFile, iseed = 317) {
  set.seed(iseed)
  originalFile <- system.file("extdata", originalFile, package = "datarobot")
  dframe <- read.csv(originalFile)
  varNames <- colnames(dframe)
  colIndex <- which(varNames == colName)
  x <- dframe[ ,colIndex]
  y <- sample(x)
  outFrame <- dframe
  outFrame[ ,colIndex] <- y
  write.csv(outFrame, permutedFile, row.names=FALSE)
}

To compute the permutation-based variable importances, all project results are saved as elements of a composite list, modelList, whose first element is the original project model list and all subsequent elements are the permutation-based model lists. The R code to generate this list follows (note that because this process is building over 500 models, it takes some time to complete):

modelList <- list(n = 11)
modelList[[1]] <- originalModels
permFile <- tempfile(fileext = "permFile.csv")
for (i in 1:10) {
  varName <- paste("X",i,sep="")
  PermuteColumn("Friedman1.csv.gz", varName, permFile)
  projName <- paste("PermProject", varName, sep = "")
  permProject <- StartProject(permFile, projectName = projName, target = "Y", wait = TRUE)
  modelList[[i+1]] <- ListModels(permProject)
}

library(datarobot)
modelList <- readRDS("PermutationModelList.rds")

To obtain the desired permutation shifts, it is useful to convert the summary list generated by this code into a dataframe, with the following columns:

blueprintId, an alphanumeric character string that uniquely identifies the model structure and any preprocessing applied;
modelType, a character string describing the model structure;
expandedModel, a character string describing the model structure and any preprocessing applied;
samplePct, the fraction of the training dataset used in fitting the model;
a column giving the value of the fitting metric computed from the validation set data for the original model;
one column for each covariate in the original model, giving the validation fitting metric value for the model constructed after the random permutation was applied to that covariate.

This conversion is accomplished by the PermutationMerge function listed below, which requires the summary list, the samplePct value for the models to be matched, and a vector giving the names for the fitting metric columns in the resulting dataframe. Matching here is done on the basis of blueprintId and samplePct, which guarantees that fit comparisons are made between original and randomized fits to the same model for the same sample size, giving an "apples-to-apples" comparison of the effects of the random permutations applied to the covariates. In particular, note that DataRobot fits the same model to several different sample sizes, so it is necessary to specify both samplePct and blueprintId to uniquely identify a model.

PermutationMerge <- function(compositeList, matchPct = NULL, metricNames, matchMetric = NULL) {
  df <- as.data.frame(compositeList[[1]], simple = FALSE)
  if (is.null(matchPct)) {
    index <- seq(1, nrow(df), 1)
  } else {
      index <- which(round(df$samplePct) == matchPct)
  }
  if (is.null(matchMetric)) {
    projectMetric <- compositeList[[1]][[1]]$projectMetric
    matchMetric <- paste(projectMetric, "validation", sep = ".")
  }
  getCols <- c("modelType", "expandedModel", "samplePct", "blueprintId", matchMetric)
  outFrame <- df[index, getCols]
  keepCols <- getCols
  keepCols[5] <- metricNames[1]
  colnames(outFrame) <- keepCols
  n <- length(compositeList)
  for (i in 2:n) {
    df <- as.data.frame(compositeList[[i]], simple = FALSE)
    index <- which(df$samplePct == matchPct)
    upFrame <- df[index, c("blueprintId", matchMetric)]
    colnames(upFrame) <- c("blueprintId", metricNames[i])
    outFrame <- merge(outFrame, upFrame, by = "blueprintId")
  }
  outFrame
}

Because it gives the largest number of models to compare, the results for the 16% training sample are considered here, and the metricNames parameter required by the PermutationMerge function is defined as indicated in the following R code:

metricNames <- c("originalRMSE", paste("X", seq(1, 10, 1), "RMSE", sep = ""))
mergeFrame <- PermutationMerge(modelList, 16, metricNames)

The resulting dataframe gives the original validation set RMSE value and the permutation RMSE values for all covariates for the r nrow(mergeFrame) models fit to the 16% data sample (here, this sample consists of 800 records, large enough to give reasonable comparison results). Figure 1 shows a beanplot summary of these results, comparing the RMSE values for the r nrow(mergeFrame) common models built for each of the eleven datasets just described. This figure was generated using the R package beanplot [@kampstra08], and the "beans" or red lines in these plots represent the RMSE values for the individual models from each project. Beanplots are analogous to the more familiar boxplots, combining the boxplot range information with a nonparametric density estimate to give a more complete view of how each data subset is distributed. This additional information is particularly useful in cases like the one considered here, where these distributions differ significantly in form between the different subsets. These plots suggest that random permutations applied to the relevant variables X1 through X5 increase the RMSE values for most models, while permutations applied to the irrelevant variables X6 through X10 seem to have little impact on these values, but to be certain we need to make direct comparisons.

par(mfrow = c(1, 1))
library(beanplot)
BeanNames <- c("None", paste("X", seq(1, 10, 1), sep = ""))
beanplot(mergeFrame[, 5:15], names = BeanNames,
         xlab = "Permutation", ylab = "RMSE",
         col = c("transparent", "red", "red", "blue"),
         las = 2,
         what = c(0, 1, 0, 1))

Figure 2 presents a more direct beanplot summary of the differences in RMSE values between the random permutation-based models and the original model. Here, each beanplot corresponds to one of the covariates and the individual "beans" in these plots now represent the increase in validation set RMSE value that results when the random permutation is applied to the corresponding variable. The dataframe on which this summary is based was computed using the function ComputeDeltas:

ComputeDeltas <- function(mergeFrame, refCol, permNames, shiftNames) {
  allNames <- colnames(mergeFrame)
  refIndex <- which(allNames == refCol)
  xRef <- mergeFrame[, refIndex]
  permCols <- which(allNames %in% permNames)
  xPerm <- mergeFrame[, permCols]
  deltas <- xPerm - xRef
  colnames(deltas) <- shiftNames
  deltas$New <- xRef
  newIndex <- which(colnames(deltas) == "New")
  colnames(deltas)[newIndex] <- refCol
  deltas
}

To use this function, we need to specify: refCol, the name of the reference column with the original model RMSE values; permNames, the names of the columns containing the permutation-based RMSE values; and shiftNames, the names of the permutation-induced shift columns in the dataframe returned by the function. Here, the following code generates the desired permutation shift dataframe:

allNames <- colnames(mergeFrame)
refCol <- allNames[5]
permNames <- allNames[6:15]
shiftNames <- paste("X", seq(1, 10, 1), sep = "")
deltaFrame <- ComputeDeltas(mergeFrame, refCol, permNames, shiftNames)

par(mfrow=c(1, 1))
beanplot(deltaFrame[, 1:10],
         xlab = "Permutation", ylab = "RMSE Shift",
         col = c("transparent", "red", "red", "blue"),
         what = c(0, 1, 1, 1))
bestRow <- which.min(deltaFrame$originalRMSE)
bestModel <- mergeFrame$modelType[bestRow]
points(seq(1, 10, 1), deltaFrame[bestRow, 1:10], pch = 16, col = "limegreen", cex = 1.5)
legend("topright", col = c("limegreen", "blue"), pch = c(16, 15), cex = 1.2,
       legend = c(bestModel, "Average"))
abline(h = 0, lty = 2)

The beanplot shown in Figure 2 includes solid blue lines at the average RMSE shift for each covariate and a dashed reference line at zero RMSE shift; also, the results for the best model for this example (r bestModel) are shown as solid green circles. It is clear from these results that random permutations applied to the relevant variables X1 through X5 cause significant RMSE increases for most models, while random permutations applied to the irrelevant variables X6 through X10 have essentially no effect. An important exception is the trivial mean response model: this model consistently exhibits the worst RMSE performance in Figure 1, but because its response is independent of all covariates, it shows zero RMSE shift in response to all covariate permutations in Figure 2.

5. Numerical importance measures

While graphical summaries like those shown in Figures 1 and 2 are useful, we often want numerical importance measures. The RMSE shifts presented here provide a basis for computing these measures, which can be defined in several different ways. The simplest is to average the RMSE shifts over all models, giving a consensus-based relative importance value for each variable, corresponding to the blue lines shown in Figure 2. A potential disadvantage of this approach is that it gives equal weights to all models, including those like the mean response model for which the RMSE shifts are always zero. At the other extreme, we could take the RMSE shift in one individual model - the best predictor is an obvious choice here, corresponding to the green points in Figure 2 - but this has the potential disadvantage of being model-specific and it does not make full use of the performance shift data from all of the available models. An intermediate course is to form a performance-weighted average of the individual model results, giving greater weight to better-performing models. A specific implementation of this idea is the following performance-weighted average:

$$ \nu(x) = \frac{\sum_{i=1}^{M} \; (\mbox{RMSE}i^x - \mbox{RMSE}_i^0)/\mbox{RMSE}_i^0}{\sum{i=1}^{M} \; 1/\mbox{RMSE}_i^0}, $$

where $\mbox{RMSE}_i^x$ represents the RMSE value for the $i^{th}$ model in the collection fit from the dataset with a random permutation applied to the covariate $x$, and $\mbox{RMSE}_i^0$ is the RMSE value for this model when fit to the original dataset. For the example considered here, this approach gives the RMSE shift for the best models approximately five times greater weight than the poorest model. A simple R function to compute this performance-weighted average from the shift summary dataframes described above is varImpSummary:

varImpSummary <- function(deltaFrame, refCol, oneIndex) {
  vars <- colnames(deltaFrame)
  refIndex <- which(vars == refCol)
  refValue <- deltaFrame[, refIndex]
  wts <- 1/refValue  # Performance-weights = reciprocal fitting measure
  deltasOnly <- deltaFrame[, -refIndex]
  thisModel <- as.numeric(deltasOnly[oneIndex, ])
  avg <- apply(deltasOnly, MARGIN=2, mean)
  WtAvgFunction <- function(x, w) { sum(w * x) / sum(w) }
  wtAvg <- apply(deltasOnly, MARGIN = 2, WtAvgFunction, wts)
  varImpFrame <- data.frame(average = avg,
                            weightedAverage = wtAvg,
                            oneModel = thisModel)
  varImpFrame
}

Here, deltaFrame is the dataframe returned by the function ComputeDeltas listed above, refValue is the vector of original RMSE values ($\mbox{RMSE}_i^0$ values in the above equation), and oneIndex is the row number in deltaFrame describing the performance of a single model of particular interest. In this example, oneIndex specifies the best non-blender model in the original DataRobot project with no permutations applied to any variable.

varImp <- varImpSummary(deltaFrame, "originalRMSE", bestRow)
avg <- round(varImp$average, digits = 3)
wtAvg <- round(varImp$weightedAverage, digits = 3)
best <- round(varImp$oneModel, digits = 3)
df <- data.frame(Avg = avg, WtdAvg = wtAvg, Best = best)

The table shown below gives a summary of the three numerical variable importance measures just described for the simulation-based example considered here. Specifically, this table lists the average RMSE shift, the performance-weighted average RMSE shift defined above, and the shift associated with the best model. Based on these results, the relevant variables X1 through X5 are consistently ranked as much more important than the irrelevant variables, X6 through X10. Also, all three of the measures considered here give the same relative ranking to the relevant variables: X4 is most important, followed by X2 and X1, with similar shift values, then X5 and finally X3. Recall that this corresponds to the approximate ordering anticipated on the basis of the structure of the simulation model generating the data.

df

6. Summary

This note has described the use of random permutations to assess variable importance for predictive models of arbitrary type. The DataRobot modeling engine fits many different model types to the same dataset, building models that predict the same target variable from the same covariates, and the random permutation approach described here can be applied to all of these models, yielding variable importance assessments for any covariate and each model in the DataRobot modeling project. Since there is nothing model-specific about this random permutation strategy, the results can be compared and combined across all of these models, leading to consensus-based variable importance measures like those discussed in Section 5. These ideas were illustrated here for the simulation-based dataset described in Section 2, making it possible to compare the results with expectations based on the simulation model structure. For this example, all three of the numerical measures proposed in Section 5 gave results that were in agreement with these expectations. The companion vignette, "Using Many Models to Compare Datasets" presents the corresponding variable importance results for two real datasets.