Evaluating submodels with the same model object

library(utils)
library(ggplot2)
theme_set(theme_bw())

Some R packages can create predictions from models that are different than the one that was fit. For example, if a boosted tree is fit with 10 iterations of boosting, the model can usually make predictions on submodels that have less than 10 trees (all other parameters being static). This is helpful for model tuning since you can cheaply evaluate tuning parameter combinations which often results in a large speed-up in the computations.

In parsnip, there is a method called multi_predict() that can do this. It's current methods are:

library(parsnip)
methods("multi_predict")

We'll use the attrition data in rsample to illustrate:

library(tidymodels)
data(attrition, package = "modeldata")

set.seed(4595)
data_split <- initial_split(attrition, strata = "Attrition")
attrition_train <- training(data_split)
attrition_test  <- testing(data_split)

A boosted classification tree is one of the most low-maintenance approaches that we could take to these data:

# requires the xgboost package
attrition_boost <- 
  boost_tree(mode = "classification", trees = 100) %>% 
  set_engine("C5.0")

Suppose that 10-fold cross-validation was being used to tune the model over the number of trees:

set.seed(616)
folds <- vfold_cv(attrition_train)

The process would fit a model on 90% of the data and predict on the remaining 10%. Using rsample:

model_data <- analysis(folds$splits[[1]])
pred_data  <- assessment(folds$splits[[1]])

fold_1_model <-
  attrition_boost %>% 
  fit_xy(x = model_data %>% dplyr::select(-Attrition), y = model_data$Attrition)

For multi_predict(), the same semantics of predict() are used but, for this model, there is an extra argument called trees. Candidate submodel values can be passed in with trees:

fold_1_pred <- 
  multi_predict(
    fold_1_model, 
    new_data = pred_data %>% dplyr::select(-Attrition),
    trees = 1:100,
    type = "prob"
  )
fold_1_pred

The results is a tibble that has as many rows as the data being predicted (n = r nrow(pred_data)). The .pred column contains a list of tibbles and each has the predictions across the different number of trees:

fold_1_pred$.pred[[1]]

To get this into a format that is more usable, we can use tidyr::unnest() but we first add row numbers so that we can track the predictions by test sample as well as the actual classes:

fold_1_df <- 
  fold_1_pred %>% 
  bind_cols(pred_data %>% dplyr::select(Attrition)) %>% 
  add_rowindex() %>% 
  tidyr::unnest(.pred)
fold_1_df

For two samples, what do these look like over trees?

fold_1_df %>% 
  dplyr::filter(.row %in% c(1, 88)) %>% 
  ggplot(aes(x = trees, y = .pred_No, col = Attrition, group = .row)) + 
  geom_step() + 
  ylim(0:1) +
  theme(legend.position = "top")

What does performance look like over trees (using the area under the ROC curve)?

fold_1_df %>% 
  group_by(trees) %>% 
  roc_auc(truth = Attrition, .pred_No) %>% 
  ggplot(aes(x = trees, y = .estimate)) + 
  geom_step()


Try the parsnip package in your browser

Any scripts or data that you put into this service are public.

parsnip documentation built on Aug. 18, 2023, 1:07 a.m.