vi_permute: Permutation-based variable importance (R Documentation)

Description

Compute permutation-based variable importance scores for the predictors in a model; for details on the algorithm, see Greenwell and Boehmke (2020).
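Informally, a feature is important if shuffling its values degrades the model's performance. The following minimal sketch illustrates the idea for a larger-is-better metric (an illustration only, not vip's actual implementation; the function and argument names here are placeholders):

# Illustration only: naive permutation importance for a larger-is-better
# metric; `pred_wrapper` must return a vector of predictions
permute_vi <- function(object, train, target, metric, pred_wrapper) {
  y <- train[[target]]
  baseline <- metric(y, pred_wrapper(object, newdata = train))
  features <- setdiff(names(train), target)
  vapply(features, FUN.VALUE = numeric(1), FUN = function(feature) {
    shuffled <- train
    shuffled[[feature]] <- sample(shuffled[[feature]])  # permute one feature
    # Importance = drop in performance relative to the baseline
    baseline - metric(y, pred_wrapper(object, newdata = shuffled))
  })
}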
Usage

vi_permute(object, ...)
## Default S3 method:
vi_permute(
object,
feature_names = NULL,
train = NULL,
target = NULL,
metric = NULL,
smaller_is_better = NULL,
type = c("difference", "ratio"),
nsim = 1,
keep = TRUE,
sample_size = NULL,
sample_frac = NULL,
reference_class = NULL,
event_level = NULL,
pred_wrapper = NULL,
verbose = FALSE,
parallel = FALSE,
parallelize_by = c("features", "repetitions"),
...
)
Arguments

object: A fitted model object (e.g., a randomForest object).

...: Additional optional arguments to be passed on to foreach (e.g., .packages or .export).

feature_names: Character string giving the names of the predictor variables (i.e., features) of interest. If NULL (the default), they will be inferred from the train and target arguments (see below). It is good practice to always specify this argument.

train: A matrix-like R object (e.g., a data frame or matrix) containing the training data. If NULL (the default), an attempt is made to extract it from the fitted model object. It is good practice to always specify this argument.

target: Either a character string giving the name (or position) of the target column in train or, if train only contains feature columns, a vector containing the target values used to train object.

metric: Either a function or character string specifying the performance metric to use in computing model performance (e.g., RMSE for regression or accuracy for binary classification). If metric is a function, it requires two arguments, truth and estimate, and should return a single, numeric value; ideally, this should be the same metric that was used to train object.

smaller_is_better: Logical indicating whether or not a smaller value of metric is better. Default is NULL; it must be specified whenever a user-supplied function is passed to metric (built-in metrics set this automatically).

type: Character string specifying how to compare the baseline and permuted performance metrics. Current options are "difference" (the default) and "ratio".

nsim: Integer specifying the number of Monte Carlo replications to perform. Default is 1. If nsim > 1, the results from each replication are averaged together (the standard deviation is also returned).

keep: Logical indicating whether or not to keep the individual permutation scores for all nsim repetitions. If TRUE (the default), they are stored in an attribute of the output; only relevant when nsim > 1.

sample_size: Integer specifying the size of the random sample to use for each Monte Carlo repetition. Default is NULL (i.e., use all of the available training data). Cannot be specified together with sample_frac; useful for reducing computation time on large data sets.

sample_frac: Proportion specifying the size of the random sample to use for each Monte Carlo repetition. Default is NULL (i.e., use all of the available training data). Cannot be specified together with sample_size; useful for reducing computation time on large data sets.

reference_class: Deprecated; use event_level instead.

event_level: String specifying which factor level of the target to treat as the "event" of interest, either "first" (the default) or "second". Only relevant for binary classification with metrics (e.g., AUROC) that treat one class as the event; ignored otherwise.

pred_wrapper: Prediction function that requires two arguments, object and newdata. Its output should match what the chosen metric expects: a numeric vector of predicted outcomes for regression, a factor of predicted class labels for metrics like accuracy, or a numeric vector of predicted class probabilities (for the event level of interest) for metrics like AUROC.

verbose: Logical indicating whether or not to print information during the construction of variable importance scores. Default is FALSE.

parallel: Logical indicating whether or not to run vi_permute() in parallel (using a backend provided by the foreach package); the backend must be registered beforehand (see the sketch following this list). Default is FALSE.

parallelize_by: Character string specifying whether to parallelize across features ("features", the default) or repetitions ("repetitions"); the latter is only useful when nsim > 1.
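As noted above, parallel = TRUE requires a registered foreach backend. A minimal sketch using the doParallel package (one of several foreach-compatible backends; the model, data, and wrapper names below are placeholders):

library(doParallel)  # provides a foreach-compatible parallel backend
cl <- parallel::makeCluster(2)  # two worker processes
registerDoParallel(cl)
# `fit`, `trn`, and `pfun` are placeholders for a fitted model, training
# data, and prediction wrapper like those in the examples below
# vis <- vi_permute(fit, train = trn, target = "y", metric = "rmse",
#                   pred_wrapper = pfun, nsim = 10, parallel = TRUE)
stopCluster(cl)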
Value

A tidy data frame (i.e., a tibble object) with two columns:

Variable - the corresponding feature name.

Importance - the associated importance, computed as the average change in performance after a random permutation (or permutations, if nsim > 1) of the feature in question.

If nsim > 1, then an additional column (StDev) containing the standard deviation of the individual permutation scores for each feature is also returned; this helps assess the stability/variation of the individual permutation importance for each feature.
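The individual scores retained with keep = TRUE can also be inspected afterwards. A sketch, assuming a fitted model fit, training data trn, and prediction wrapper pfun like those in the examples below, and assuming the attribute is named "raw_scores" (as in recent versions of vip):

vis <- vi_permute(fit, train = trn, target = "y", metric = "rmse",
                  pred_wrapper = pfun, nsim = 10, keep = TRUE)
attr(vis, which = "raw_scores")  # individual per-repetition scores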
References

Greenwell, B. M., and Boehmke, B. C. (2020). Variable importance plots: An introduction to the vip package. The R Journal, 12(1), 343-366.
Examples

## Not run:
#
# Regression example
#
library(ranger) # for fitting random forests
# Simulate data from Friedman 1 benchmark; only x1-x5 are important!
trn <- gen_friedman(500, seed = 101) # ?vip::gen_friedman
# Prediction wrapper
pfun <- function(object, newdata) {
# Needs to return vector of predictions from a ranger object; see
  # `ranger::predict.ranger` for details on making predictions
predict(object, data = newdata)$predictions
}
# Fit a (default) random forest
set.seed(0803) # for reproducibility
rfo <- ranger(y ~ ., data = trn)
# Compute permutation-based VI scores
set.seed(2021) # for reproducibility
vis <- vi(rfo, method = "permute", target = "y", metric = "rsq",
pred_wrapper = pfun, train = trn)
print(vis)
# Same as above, but using `vi_permute()` directly
set.seed(2021) # for reproducibility
vi_permute(rfo, target = "y", metric = "rsq", pred_wrapper = pfun,
train = trn)
# Plot VI scores (could also replace `vi()` with `vip()` in above example)
vip(vis, include_type = TRUE)
# Mean absolute error
mae <- function(truth, estimate) {
mean(abs(truth - estimate))
}
# Permutation-based VIP with user-defined MAE metric
set.seed(1101) # for reproducibility
vi_permute(rfo, target = "y", metric = mae, smaller_is_better = TRUE,
pred_wrapper = pfun, train = trn)
# Same as above, but using `yardstick` package instead of user-defined metric
set.seed(1101) # for reproducibility
vi_permute(rfo, target = "y", metric = yardstick::mae_vec,
smaller_is_better = TRUE, pred_wrapper = pfun, train = trn)
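# Monte Carlo repetitions can be scored on a random subsample of the training
# data to save time (a sketch with assumed settings, not from the original
# examples; uses the same objects as above)
set.seed(1217)  # for reproducibility
vi_permute(rfo, target = "y", metric = "rsq", pred_wrapper = pfun,
           train = trn, nsim = 10, sample_frac = 0.5)
# Baseline and permuted performance can also be compared as a ratio
set.seed(1218)  # for reproducibility
vi_permute(rfo, target = "y", metric = "rsq", pred_wrapper = pfun,
           train = trn, type = "ratio")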
#
# Classification (binary) example
#
library(randomForest) # another package for fitting random forests
# Complete (i.e., imputed) version of the titanic data; see `?vip::titanic_mice`
head(t1 <- titanic_mice[[1L]])
t1$pclass <- as.ordered(t1$pclass) # makes more sense as an ordered factor
# Fit another (default) random forest
set.seed(2053) # for reproducibility
(rfo2 <- randomForest(survived ~ ., data = t1))
# Define prediction wrapper for predicting class labels from a
# "randomForest" object
pfun_class <- function(object, newdata) {
# Needs to return factor of classifications
predict(object, newdata = newdata, type = "response")
}
# Sanity check
pfun_class(rfo2, newdata = head(t1))
## 1 2 3 4 5 6
## yes yes yes no yes no
## Levels: no yes
# Compute mean decrease in accuracy
set.seed(1359) # for reproducibility
vi(rfo2,
method = "permute",
train = t1,
target = "survived",
metric = "accuracy", # or pass in `yardstick::accuracy_vec` directly
# smaller_is_better = FALSE, # no need to set for built-in metrics
pred_wrapper = pfun_class,
nsim = 30 # use 30 repetitions
)
## # A tibble: 5 × 3
## Variable Importance StDev
## <chr> <dbl> <dbl>
## 1 sex 0.228 0.0110
## 2 pclass 0.0825 0.00505
## 3 age 0.0721 0.00557
## 4 sibsp 0.0346 0.00430
## 5 parch 0.0183 0.00236
# Define prediction wrapper for predicting class probabilities from a
# "randomForest" object
pfun_prob <- function(object, newdata) {
# Needs to return vector of class probabilities for event level of interest
predict(object, newdata = newdata, type = "prob")[, "yes"]
}
# Sanity check
pfun_prob(rfo2, newdata = head(t1))  # estimated P(survived=yes | x)
## 1 2 3 4 5 6
## 0.990 0.864 0.486 0.282 0.630 0.078
# Compute mean increase in Brier score
set.seed(1411) # for reproducibility
vi(rfo2,
method = "permute",
train = t1,
target = "survived",
metric = yardstick::brier_class_vec, # or pass in `"brier"` directly
smaller_is_better = TRUE, # smaller Brier scores are better; need to set when supplying a function
pred_wrapper = pfun_prob,
nsim = 30 # use 30 repetitions
)
## # A tibble: 5 × 3
## Variable Importance StDev
## <chr> <dbl> <dbl>
## 1 sex 0.210 0.00869
## 2 pclass 0.0992 0.00462
## 3 age 0.0970 0.00469
## 4 parch 0.0547 0.00273
## 5 sibsp 0.0422 0.00200
# Some metrics, like AUROC, treat one class as the "event" of interest. In
# such cases, it's important to make sure the event level (which typically
# defaults to whichever class comes first in alphabetical order) matches the
# event class that corresponds to the probabilities returned by the
# prediction wrapper. To do this, you can (and should) set the `event_level`
# argument. For instance, our prediction wrapper specified
# `survived = "yes"` as the event of interest, but this is considered the
# second event:
levels(t1$survived)
## [1] "no"  "yes"
# So, we need to specify the second class as the event of interest via the
# `event_level` argument (otherwise, we would get the negative of the results
# we were hoping for; a telltale sign the event level and prediction wrapper
# do not match)
set.seed(1413) # for reproducibility
vi(rfo2,
method = "permute",
train = t1,
target = "survived",
metric = "roc_auc",
event_level = "second", # use "yes" as class label/"event" of interest
pred_wrapper = pfun_prob,
nsim = 30 # use 30 repetitions
)
## # A tibble: 5 × 3
## Variable Importance StDev
## <chr> <dbl> <dbl>
## 1 sex 0.229 0.0137
## 2 pclass 0.0920 0.00533
## 3 age 0.0850 0.00477
## 4 sibsp 0.0283 0.00211
## 5 parch 0.0251 0.00351
## End(Not run)