# featureImportance: An extension for the `mlr` package

This R package was developed as part of the article "Visualizing the Feature Importance for Black Box Models", accepted at the ECML-PKDD 2018 conference track. The results of the application section of that article can be reproduced with the code provided here.
Install the development version from GitHub (using `devtools`):
install.packages("devtools") devtools::install_github("giuseppec/featureImportance")
The `featureImportance` package is an extension for the `mlr` package and allows computing the permutation feature importance in a model-agnostic manner.
The focus is on performance-based feature importance measures.
This use case computes the feature importance of a model based on a single test data set. For this purpose, we first build a model (here a random forest) on training data:
```r
library(mlr)
library(mlbench)
library(ggplot2)
library(gridExtra)
library(featureImportance)
set.seed(2018)

# Get the Boston housing data and look at it
data(BostonHousing, package = "mlbench")
str(BostonHousing)

# Create a regression task for mlr
boston.task = makeRegrTask(data = BostonHousing, target = "medv")

# Specify the machine learning algorithm with the mlr package
lrn = makeLearner("regr.randomForest", ntree = 100)

# Create indices for train and test data
n = getTaskSize(boston.task)
train.ind = sample(n, size = 0.6*n)
test.ind = setdiff(1:n, train.ind)

# Create test data using the test indices
test = getTaskData(boston.task, subset = test.ind)

# Fit the model on the train data using the train indices
mod = train(lrn, boston.task, subset = train.ind)
```
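Before measuring importances, it can be useful to record the model's baseline error on the test set, since permutation feature importance is the increase in this error after a feature's values are altered. A minimal check using plain mlr predictions (the object names below are only illustrative, not part of the package API):

```r
# Baseline test error of the fitted random forest; feature importance
# will be measured as the increase over this error.
pred = predict(mod, newdata = test)$data
baseline.mse = mean((pred$truth - pred$response)^2)
baseline.mse
```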
In general, there are two ways to compute the feature importance:

- using fixed feature values specified by `replace.ids`, or
- permuting the feature values `n.feat.perm` times.

Visualizing the feature importance using fixed feature values is analogous to partial dependence plots and has the advantage that the local feature importance is calculated for each observation in the test data at the same feature values:
```r
# Use feature values of 20 randomly chosen observations from the test data
# to plot the importance curves
obs.id = sample(1:nrow(test), 20)

# Measure feature importance on the test data
imp = featureImportance(mod, data = test, replace.ids = obs.id, local = TRUE)
summary(imp)

# Plot PI and ICI curves for the lstat feature
pi.curve = plotImportance(imp, feat = "lstat", mid = "mse", individual = FALSE, hline = TRUE)
ici.curves = plotImportance(imp, feat = "lstat", mid = "mse", individual = TRUE, hline = FALSE)
grid.arrange(pi.curve, ici.curves, nrow = 1)
```
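To make explicit what the fixed-value approach computes, here is a minimal hand-rolled sketch (not the package API; it reuses `mod`, `test` and `obs.id` from above): `lstat` is set to the value of one reference observation for all test observations, and the per-observation change in squared error is recorded. These per-observation changes correspond to points on the ICI curves, and their average to the PI curve.

```r
# Hand-rolled sketch of a single replacement step for the lstat feature
ref.value = test$lstat[obs.id[1]]   # fixed value taken from one reference observation
test.repl = test
test.repl$lstat = ref.value

orig.pred = predict(mod, newdata = test)$data
repl.pred = predict(mod, newdata = test.repl)$data

# Per-observation change in squared error (ICI idea) and its average (PI idea)
delta.se = (repl.pred$truth - repl.pred$response)^2 - (orig.pred$truth - orig.pred$response)^2
head(delta.se)
mean(delta.se)
```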
Instead of using fixed feature values, the feature importance can also be computed by permuting the feature values. Here, the PI curve and ICI curves are evaluated on different randomly selected feature values. Thus, a smoother is internally used for plotting the curve:
```r
# Measure feature importance on the test data
imp = featureImportance(mod, data = test, n.feat.perm = 20, local = TRUE)
summary(imp)

# Plot PI and ICI curves for the lstat feature
pi.curve = plotImportance(imp, feat = "lstat", mid = "mse", individual = FALSE, hline = TRUE)
ici.curves = plotImportance(imp, feat = "lstat", mid = "mse", individual = TRUE, hline = FALSE)
grid.arrange(pi.curve, ici.curves, nrow = 1)
```
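The permutation-based computation can be sketched in the same hand-rolled way (again reusing `mod` and `test`; this illustrates the idea rather than the package's implementation): shuffle the `lstat` column, recompute the test MSE, and average the increase over several permutations.

```r
# Hand-rolled sketch of permutation importance for lstat
orig.pred = predict(mod, newdata = test)$data
orig.mse = mean((orig.pred$truth - orig.pred$response)^2)

perm.increase = replicate(20, {
  test.perm = test
  test.perm$lstat = sample(test.perm$lstat)   # permute the feature values
  perm.pred = predict(mod, newdata = test.perm)$data
  mean((perm.pred$truth - perm.pred$response)^2) - orig.mse
})
mean(perm.increase)   # average increase in MSE over 20 permutations
```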
Instead of computing the feature importance of a model based on a single test data set, one can repeat this process by embedding the feature importance calculation within a resampling procedure. The resampling procedure creates multiple models using different training sets, and the corresponding test sets can be used to calculate the feature importance. For example, using 5-fold cross-validation results in 5 different models, one for each cross-validation fold.
```r
# 5-fold cross-validation: fit one model per fold and keep the fitted models
rdesc = makeResampleDesc("CV", iters = 5)
res = resample(lrn, boston.task, resampling = rdesc, models = TRUE)

# Measure the feature importance on each fold's test set
imp = featureImportance(res, data = getTaskData(boston.task), n.feat.perm = 20, local = TRUE)
summary(imp)
plotImportance(imp, feat = "lstat", mid = "mse", individual = FALSE, hline = TRUE)
```