knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  eval = TRUE
)

In this article we explore the idea of hyperparameter tuning using a surrogate model: we have a base model that tries to minimize a loss metric (logloss in this case) on the training data, and a surrogate, or meta, model that tries to optimize the output of the base model.

The meta model's input space is the parameter space of the base model, so the goal of the meta model is to guide the search towards parameters that translate into good base model performance.

We do that by sampling many parameter combinations and asking the surrogate model which values it thinks will result in good performance of the base model, based on the base model's historical performance.
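
To make this concrete, below is a minimal sketch of such a loop. It is not the tidytune implementation: evaluate_base_model() is a hypothetical helper that trains the base model with a given parameter combination and returns its logloss, and the history data frame is assumed to hold one column per parameter plus a score column.

# Minimal sketch of a surrogate search loop (not the tidytune internals).
# `history` has one column per parameter plus a `score` column (logloss);
# `evaluate_base_model()` is a hypothetical helper returning that score.
surrogate_loop <- function(history, param_set, n_runs, n_candidates, top_n) {
  for (i in seq_len(n_runs)) {
    # Fit the meta model on the (parameters -> score) pairs seen so far
    meta <- ranger::ranger(score ~ ., data = history)
    # Sample fresh random candidates from the parameter space
    candidates <- ParamHelpers::generateRandomDesign(n_candidates, param_set)
    # Keep the candidates the meta model predicts will score best (lowest)
    predicted <- predict(meta, data = candidates)$predictions
    best <- candidates[order(predicted)[seq_len(top_n)], , drop = FALSE]
    # Evaluate the base model on those candidates and grow the history
    best$score <- vapply(seq_len(nrow(best)),
                         function(j) evaluate_base_model(best[j, ]),
                         numeric(1))
    history <- rbind(history, best)
  }
  history
}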

Prepare the data

As usual, data preparation is the first step. Nothing new here; you can skip this part if you've read the other articles.

library(recipes)
library(dplyr)
library(magrittr)
library(tidytune)
library(rsample)
library(ParamHelpers)
library(MLmetrics)
library(knitr)

data("attrition")

# Duplicate the rows to enlarge the dataset (5 x 5 = 25 copies)
attrition <- bind_rows(attrition, attrition, attrition, attrition, attrition)
attrition <- bind_rows(attrition, attrition, attrition, attrition, attrition)

attrition %<>% mutate(Attrition = ifelse(Attrition == 'Yes', 1, 0))

resamples <- rsample::vfold_cv(attrition, v = 2)

rec <- 
  recipe(attrition) %>%
  add_role(Attrition, new_role = 'outcome') %>%
  add_role(-Attrition, new_role = 'predictor') %>%
  step_novel(all_nominal(), -Attrition) %>%
  step_dummy(all_nominal(), -Attrition) %>%
  step_zv(all_predictors())
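
To see what the recipe produces, we can prep it and bake the training data; the dummy step expands each nominal predictor into indicator columns:

prepped <- prep(rec, training = attrition)
baked <- bake(prepped, new_data = attrition)
dim(baked)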

Scoring function

Now, we define our custom scoring function:

library(xgboost)

xgboost_classif_score <- 
  function(train_df, 
           target_var, 
           params, 
           eval_df, 
           ...){

  # Design matrix and label for the training fold
  X_train <- train_df %>% select(-matches(target_var)) %>% as.matrix()
  y_train <- train_df[[target_var]]
  xgb_train_data <- xgb.DMatrix(X_train, label = y_train)

  # Design matrix and label for the evaluation fold
  X_eval <- eval_df %>% select(-matches(target_var)) %>% as.matrix()
  y_eval <- eval_df[[target_var]]
  xgb_eval_data <- xgb.DMatrix(X_eval, label = y_eval)

  model <- xgb.train(params = params,
                     data = xgb_train_data,
                     watchlist = list(train = xgb_train_data, eval = xgb_eval_data),
                     objective = 'binary:logistic',
                     verbose = FALSE,
                     ...)

  preds <- predict(model, xgb_eval_data)

  list(logloss = LogLoss(preds, y_eval), 
       acc = Accuracy(ifelse(preds > 0.5, 1, 0), y_eval))

  # You can also return a single numeric score instead of a list:
  # LogLoss(preds, y_eval)
}
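
As a quick sanity check, we can call the scoring function directly on the baked data from above, outside any tuning loop. The parameter values here are arbitrary, and we evaluate on the training data purely for illustration:

xgboost_classif_score(
  train_df = baked,
  target_var = 'Attrition',
  params = list(max_depth = 3, eta = 0.05),
  eval_df = baked,
  nrounds = 50
)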

Random search

Then we perform a random search. To keep this example fast we only try a few parameter combinations (n = 3); in practice you would use many more.

set.seed(123)

# Random search example

xgboost_random_params <-
  makeParamSet(
    makeIntegerParam('max_depth', lower = 1, upper = 15),
    makeNumericParam('eta', lower = 0.01, upper = 0.1),
    makeNumericParam('gamma', lower = 0, upper = 5),
    makeIntegerParam('min_child_weight', lower = 1, upper = 100),
    makeNumericParam('subsample', lower = 0.25, upper = 0.9),
    makeNumericParam('colsample_bytree', lower = 0.25, upper = 0.9)
  )
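
Under the hood, random search draws candidates from this set. ParamHelpers lets us preview what such draws look like:

# Five random draws from the parameter space
generateRandomDesign(5, xgboost_random_params)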

# Memoise prep and bake so the recipe is not re-prepared
# for every parameter combination
prep <- memoise::memoise(prep)
bake <- memoise::memoise(bake)

results_random_search <- 
  random_search(
    resamples = resamples, 
    recipe = rec, 
    param_set = xgboost_random_params, 
    scoring_func = xgboost_classif_score, 
    nrounds = 1000,
    early_stopping_rounds = 20,
    eval_metric = 'logloss',
    n = 3
  )

Here's what our results look like so far:

summ_random_search <- 
  results_random_search %>%
  group_by_at(getParamIds(xgboost_random_params)) %>%
  summarise(logloss = mean(logloss))

summ_random_search %>%
  arrange(logloss) %>%
  head(10) %>%
  kable()

Surrogate search

Now we are in a position to fine-tune our parameters using a surrogate search. The idea is to use a meta model that maps parameter values to the performance of the xgboost classifier. Incidentally, the meta model is a ranger random forest by default: random forests are good at modeling non-linearity and interactions between variables, and require very little tuning to achieve decent performance.

Here we generate 1000 random parameter candidates for each of the 5 surrogate runs, and ask the surrogate model to pass the single top candidate through to the underlying xgboost classifier. The surrogate search therefore results in 5 calls to the classifier in total.

The goal of this approach is to spend more time around the most promising areas of the parameter space, while the random generation of candidates still allows some exploration.

set.seed(123)

results_surrogate_search <- 
  surrogate_search(
    resamples = resamples,
    recipe = rec,
    param_set = xgboost_random_params,
    n = 5,
    scoring_func = xgboost_classif_score,
    nrounds = 1000,
    early_stopping_rounds = 20,
    eval_metric = 'logloss',
    input = results_random_search,
    surrogate_target = 'logloss',
    n_candidates = 1000,
    top_n = 1,
    verbosity = TRUE
  )
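
As with the random search, we can summarise the surrogate results (which share the same output format) and compare the best logloss found by each strategy:

summ_surrogate_search <- 
  results_surrogate_search %>%
  group_by_at(getParamIds(xgboost_random_params)) %>%
  summarise(logloss = mean(logloss))

# Best score found by each strategy
min(summ_random_search$logloss)
min(summ_surrogate_search$logloss)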


# Restore the original, un-memoised recipes functions
prep <- recipes::prep
bake <- recipes::bake


