Here we will use the fantastic furrr package to illustrate how easy it is to run cross-validation and hyperparameter tuning in parallel.
I will use only 2-fold cross-validation for this experiment on my old dual-core laptop.
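For readers new to furrr, the pattern relied on throughout can be sketched in a few lines. This is only an illustration, not code from tidytune: `slow_square` is a hypothetical stand-in for an expensive model fit, and the worker count of 2 is an assumption matching a dual-core machine.

```r
library(future)  # provides plan()
library(furrr)   # provides future_map_dbl()

# Hypothetical stand-in for an expensive model fit
slow_square <- function(x) {
  Sys.sleep(0.2)
  x^2
}

# Sequential: each element is processed one after the other
plan(sequential)
res_seq <- future_map_dbl(1:4, slow_square)

# Parallel: same call, only the plan has changed
plan(multisession, workers = 2)
res_par <- future_map_dbl(1:4, slow_square)

# Return to sequential processing, which shuts down the background workers
plan(sequential)

identical(res_seq, res_par)
```

The mapping call itself is untouched between the two runs; only the `plan()` line decides where the work is executed.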
We start with the usual setup (preparing the resamples, the recipes, and our custom scoring function):
library(recipes)
library(magrittr)
library(tidytune)
library(rsample)
library(ParamHelpers)
library(MLmetrics) # for LogLoss
library(dplyr)
library(tictoc)
data("attrition")
attrition %<>% mutate(Attrition = ifelse(Attrition == 'Yes', 1, 0))

resamples <- rsample::vfold_cv(attrition, v = 2)

rec <- recipe(attrition) %>%
  add_role(Attrition, new_role = 'outcome') %>%
  add_role(-Attrition, new_role = 'predictor') %>%
  step_novel(all_nominal(), -Attrition) %>%
  step_dummy(all_nominal(), -Attrition) %>%
  step_zv(all_predictors())
For the sake of this exercise, we force xgboost's internal algorithm to use a single thread, so that we measure only the performance boost obtained by going parallel with furrr.
library(xgboost)

xgboost_classif_score <- function(train_df, target_var, params, eval_df, ...){

  # Training features and labels
  X_train <- train_df %>% select(-matches(target_var)) %>% as.matrix()
  y_train <- train_df[[target_var]]
  xgb_train_data <- xgb.DMatrix(X_train, label = y_train)

  # Evaluation features and labels
  X_eval <- eval_df %>% select(-matches(target_var)) %>% as.matrix()
  y_eval <- eval_df[[target_var]]
  xgb_eval_data <- xgb.DMatrix(X_eval, label = y_eval)

  # nthread = 1 keeps xgboost itself single-threaded
  model <- xgb.train(params = params,
                     data = xgb_train_data,
                     watchlist = list(train = xgb_train_data, eval = xgb_eval_data),
                     objective = 'binary:logistic',
                     verbose = FALSE,
                     nthread = 1,
                     ...)

  preds <- predict(model, xgb_eval_data)

  list(logloss = LogLoss(preds, y_eval),
       acc = Accuracy(ifelse(preds > 0.5, 1, 0), y_eval))

  # You can also return a simple vector score:
  # LogLoss(preds, y_eval)
}
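To make the two scores returned by the scoring function concrete, here is a minimal base R sketch of binary log loss and accuracy. The helpers `log_loss` and `accuracy` are hypothetical illustrations, not the MLmetrics implementations, and the clipping constant `eps` is an assumption.

```r
# Binary log loss; probabilities are clipped away from 0 and 1
# so that log() stays finite
log_loss <- function(prob, truth, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

# Share of observations whose thresholded prediction matches the label
accuracy <- function(prob, truth, threshold = 0.5) {
  mean(as.integer(prob > threshold) == truth)
}

# Toy predicted probabilities and 0/1 labels
prob  <- c(0.9, 0.2, 0.6, 0.4)
truth <- c(1, 0, 0, 1)

log_loss(prob, truth)  # about 0.54
accuracy(prob, truth)  # 0.5
```

Log loss rewards well-calibrated probabilities, while accuracy only looks at which side of the 0.5 threshold each prediction falls on, which is why it can be useful to track both.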
Below we are using a grid of 756 parameter combinations.
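The figure of 756 is simply the product of the number of candidate values per parameter. A quick check in base R, using the same values as the grid in this example (the fit count assumes one model fit per combination per fold):

```r
# Candidate values for each hyperparameter
grid_values <- list(
  eta = c(0.1, 0.01),
  max_depth = 2:15,
  min_child_weight = c(1, 25, 50),
  subsample = c(0.5, 0.75, 1),
  colsample_bytree = c(0.5, 0.75, 1)
)

# 2 * 14 * 3 * 3 * 3 combinations
n_combinations <- prod(lengths(grid_values))
n_combinations                  # 756
nrow(expand.grid(grid_values))  # 756

# With 2-fold cross-validation, assuming one fit per combination per fold
n_fits <- n_combinations * 2
n_fits                          # 1512
```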
Sequential mode:
library(future)
plan(sequential)

set.seed(123)

xgboost_param_grid <- expand.grid(eta = c(0.1, 0.01),
                                  max_depth = 2:15,
                                  min_child_weight = c(1, 25, 50),
                                  subsample = c(0.5, 0.75, 1),
                                  colsample_bytree = c(0.5, 0.75, 1))

tic()
results_grid_search <- grid_search(
  resamples = resamples,
  recipe = rec,
  param_grid = xgboost_param_grid,
  scoring_func = xgboost_classif_score,
  nrounds = 1000,
  verbosity = FALSE,
  nthread = 1
)
toc()
And now in parallel:
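# Switching the plan is the only change from the sequential run.
# Note: recent versions of future deprecate `multiprocess`;
# plan(multisession) is the replacement there.
plan(multiprocess)

set.seed(123)

tic()
results_grid_search <- grid_search(
  resamples = resamples,
  recipe = rec,
  param_grid = xgboost_param_grid,
  scoring_func = xgboost_classif_score,
  nrounds = 1000,
  verbosity = FALSE,
  nthread = 1
)
toc()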
On my laptop, the parallel run cuts the run time by about 40% (roughly 400 seconds in sequential mode versus 240 seconds in parallel mode), a speedup of about 1.7x. Not bad, considering that it came at the cost of changing a single line of code. That's how well designed the furrr and future packages are. Obviously, we would get a bigger boost with more than 2-fold cross-validation on a machine with more than 2 cores. furrr has a lot more to offer, and I invite you to dig deeper (see here for an example of parallel processing with AWS EC2 instances).
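As a rough worked example, the timings quoted above can be turned into a speedup and a parallel efficiency figure. The worker count of 2 is an assumption based on the dual-core laptop, and the timings are the approximate values reported in the text.

```r
t_seq   <- 400  # approximate sequential run time, in seconds
t_par   <- 240  # approximate parallel run time, in seconds
workers <- 2    # assumed: one worker per core on a dual-core machine

speedup    <- t_seq / t_par      # about 1.67x
reduction  <- 1 - t_par / t_seq  # 0.4, i.e. 40% less run time
efficiency <- speedup / workers  # about 0.83

round(c(speedup = speedup, reduction = reduction, efficiency = efficiency), 2)
```

An efficiency below 1 is expected: starting workers, shipping data to them, and collecting results all add overhead that the sequential run does not pay.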