knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  eval = FALSE
)

Here we will be using the fantastic furrr package to illustrate how easy it is to run cross-validation and hyperparameter tuning in parallel.

I will use only 2-fold cross-validation for this experiment on my old dual-core laptop.

We start with the usual setup (preparing the resamples, the recipes, and our custom scoring function):

library(recipes)
library(magrittr)
library(tidytune)
library(rsample)
library(ParamHelpers)
library(MLmetrics) # for LogLoss
library(dplyr)
library(tictoc)

Prepare the recipe

data("attrition")

attrition %<>% mutate(Attrition = ifelse(Attrition == 'Yes', 1, 0))

resamples <- rsample::vfold_cv(attrition, v = 2)

rec <- 
  recipe(attrition) %>%
  add_role(Attrition, new_role = 'outcome') %>%
  add_role(-Attrition, new_role = 'predictor') %>%
  step_novel(all_nominal(), -Attrition) %>%
  step_dummy(all_nominal(), -Attrition) %>%
  step_zv(all_predictors())
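
To see what the preprocessed data looks like, you can prep the recipe on the analysis set of one fold and bake both portions. This is roughly what tidytune does for each resample; the object names below are just for illustration:

# Prep the recipe on the first fold's analysis set, then bake both portions
split1      <- resamples$splits[[1]]
prepped     <- prep(rec, training = analysis(split1))
fold1_train <- bake(prepped, new_data = analysis(split1))
fold1_eval  <- bake(prepped, new_data = assessment(split1))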

Prepare your scoring function

For the sake of this exercise, we will be forcing xgboost's internal training algorithm to use a single thread, so that any speed-up we measure comes purely from going parallel with furrr.

library(xgboost)

xgboost_classif_score <- 
  function(train_df, 
           target_var, 
           params, 
           eval_df, 
           ...){

  X_train <- train_df %>% select(-matches(target_var)) %>% as.matrix()
  y_train <- train_df[[target_var]]
  xgb_train_data <- xgb.DMatrix(X_train, label = y_train)

  X_eval <- eval_df %>% select(-matches(target_var)) %>% as.matrix()
  y_eval <- eval_df[[target_var]]
  xgb_eval_data <- xgb.DMatrix(X_eval, label = y_eval)

  model <- xgb.train(params = params,
                     data = xgb_train_data,
                     watchlist = list(train = xgb_train_data, eval = xgb_eval_data),
                     objective = 'binary:logistic',
                     verbose = FALSE,
                     nthread = 1,
                     ...)

  preds <- predict(model, xgb_eval_data)

  list(logloss = LogLoss(preds, y_eval), 
       acc = Accuracy(ifelse(preds > 0.5, 1, 0), y_eval))

  # You could also return a single numeric score instead of a list, e.g.:
  # LogLoss(preds, y_eval)
}
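
As a quick sanity check, we can call the scoring function directly on the fold baked above. The hyperparameter values here are arbitrary; tidytune will call this function for every fold and every parameter combination:

xgboost_classif_score(
  train_df   = fold1_train,
  target_var = 'Attrition',
  params     = list(eta = 0.1, max_depth = 4),
  eval_df    = fold1_eval,
  nrounds    = 50
)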

Grid search:

Below we are using a grid of 756 parameter combinations.

Sequential mode:

library(future)
plan(sequential)

set.seed(123)

xgboost_param_grid <- 
  expand.grid(eta = c(0.1, 0.01), 
              max_depth = 2:15,
              min_child_weight = c(1, 25, 50),
              subsample = c(0.5, 0.75, 1),
              colsample_bytree = c(0.5, 0.75, 1))
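
A quick check confirms the size of the grid:

nrow(xgboost_param_grid) # 2 * 14 * 3 * 3 * 3 = 756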

tic()

results_grid_search <- 
  grid_search(
    resamples = resamples, 
    recipe = rec, 
    param_grid = xgboost_param_grid, 
    scoring_func = xgboost_classif_score, 
    nrounds = 1000,
    verbosity = FALSE,
    nthread = 1
  )

toc()

And now in parallel (the only change is the plan() call):

plan(multiprocess)

set.seed(123)

tic()

results_grid_search <- 
  grid_search(
    resamples = resamples, 
    recipe = rec, 
    param_grid = xgboost_param_grid, 
    scoring_func = xgboost_classif_score, 
    nrounds = 1000,
    verbosity = FALSE,
    nthread = 1
  )

toc()

On my laptop, I get roughly a 40% reduction in run time (about 400 seconds in sequential mode versus 240 seconds in parallel mode). Not bad, when you consider that it came at the cost of changing a single line of code. That's how well designed the furrr and future packages are. Obviously we would get a bigger boost if we were doing more than 2-fold cross-validation on a laptop with more than 2 cores. But furrr has a lot more to offer, and I invite you to dig deeper (see here for an example of parallel processing with AWS EC2 instances).
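
If you have not used furrr before, the key idea is that its future_map*() functions are drop-in replacements for their purrr counterparts, with the execution strategy controlled entirely by plan(). Here is a minimal sketch, unrelated to tidytune, using an illustrative toy function:

library(furrr)

slow_square <- function(x) {
  Sys.sleep(1) # simulate an expensive computation
  x ^ 2
}

plan(sequential)
future_map_dbl(1:4, slow_square) # runs one element at a time, roughly 4 seconds

plan(multiprocess)
future_map_dbl(1:4, slow_square) # spread across available cores, roughly 2 seconds on 2 cores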


