knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  eval = TRUE
)

This vignette showcases hyperparameter tuning of a multilayer perceptron with Keras. Concretely, we will be tuning over the number of hidden layers and the number of hidden nodes in each layer.

Here are the libraries we will be using. The call to use_session_with_seed below is there only to make the example reproducible.

library(recipes)
library(magrittr)
library(tidytune)
library(rsample)
library(dplyr)
library(keras)

use_session_with_seed(2)

Let's prepare our data using recipes. Note the step_center and step_scale steps, which standardize the numeric predictors to make the inputs neural-network friendly.

data("attrition")

attrition %<>% 
  mutate(Attrition = ifelse(Attrition == 'Yes', 1, 0)) %>%
  mutate_if(is.ordered, factor, ordered = FALSE)

resamples <- rsample::vfold_cv(attrition, v = 2)

rec <- 
  recipe(attrition) %>%
  add_role(Attrition, new_role = 'outcome') %>%
  add_role(-Attrition, new_role = 'predictor') %>% 
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_novel(all_nominal(), -Attrition) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors())
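
grid_search will prep and bake this recipe on each resample for us, but if you want to preview the data the scoring function will receive, you can bake it yourself. This is a quick, optional illustration; the new_data argument name assumes a recent version of recipes:

prepped <- prep(rec, training = attrition)
baked   <- bake(prepped, new_data = attrition)
glimpse(baked)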

We are now ready for the cool stuff. First, we write a custom wrapper that takes a params argument containing the number of hidden layers and the number of hidden nodes per layer, and outputs a Keras specification of the model. The expected format of the params argument is list(hidden_layers = n, h1 = ..., ..., hn = ...). Here we only concern ourselves with the hidden part of the network; the output layer will be added later.

keras_mlp <- function(input_shape, params, activations = NULL){
  model <- keras_model_sequential()

  # default to ReLU activations for every hidden layer
  if(is.null(activations)){
    activations <- rep('relu', params$hidden_layers)
  }

  for(l in seq_len(params$hidden_layers)){
    hidden <- params[[paste0('h', l)]]

    # only the first layer needs the input shape
    if(l == 1){
      model %>% layer_dense(units = hidden, input_shape = input_shape)
    }else{
      model %>% layer_dense(units = hidden)
    }
    model %>% layer_activation(activations[l])
  }

  model
}
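
As a quick, purely illustrative sanity check (the input shape of 30 is arbitrary), we can build the hidden part of a two-layer network and inspect it:

mlp <- keras_mlp(input_shape = 30,
                 params = list(hidden_layers = 2, h1 = 10, h2 = 5))
summary(mlp)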

Now we can write a scoring function that calls the wrapper above to fit and evaluate a model on given training and evaluation datasets. These will be passed to the scoring function during the grid search procedure, properly baked using our initial recipe.

keras_classif_score <- function(train_df, 
                                target_var, 
                                params, 
                                eval_df,
                                ...){

  # split predictors and outcome; Keras expects plain matrices
  X_train <- train_df %>% select(-matches(target_var)) %>% as.matrix()
  y_train <- train_df[[target_var]]

  X_test <- eval_df %>% select(-matches(target_var)) %>% as.matrix()
  y_test <- eval_df[[target_var]]

  # extra arguments: 'activations' is ours, everything else is forwarded
  # to fit() below (passing the raw dots would hand fit() an 'activations'
  # argument it does not know)
  args <- list(...)
  activations <- args$activations
  fit_args <- args[names(args) != 'activations']

  # hidden layers from our wrapper, plus a sigmoid output layer
  model <- keras_mlp(input_shape   = ncol(X_train),
                     params        = params,
                     activations   = activations) %>%
    layer_dense(units = 1) %>%
    layer_activation('sigmoid')

  model %>% compile( 
    optimizer = optimizer_rmsprop(),
    loss = loss_binary_crossentropy,
    metrics = metric_binary_accuracy
  )

  # monitoring 'val_loss' requires validation data, e.g. a
  # validation_split argument forwarded through the dots
  early_stopping <- callback_early_stopping(monitor = 'val_loss', patience = 2)

  do.call(fit, c(list(object = model,
                      x = X_train,
                      y = y_train,
                      epochs = 15,
                      batch_size = 64,
                      callbacks = list(early_stopping)),
                 fit_args))

  model %>% evaluate(X_test, 
                     y_test, 
                     batch_size = 32)
}
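
Before plugging the scoring function into grid_search, you could smoke-test it directly on the baked data from the illustrative recipe chunk above. This is a hypothetical call; in practice grid_search bakes the resamples for you, and validation_split is simply forwarded to fit through the dots:

keras_classif_score(train_df = baked,
                    target_var = 'Attrition',
                    params = list(hidden_layers = 1, h1 = 10),
                    eval_df = baked,
                    validation_split = 0.2)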

In the final step, we loop through a grid of candidate parameters and score the corresponding models. For the sake of the example, we limit ourselves to a small grid; the mutate and distinct calls below collapse h2 to 0 for one-layer networks, so that equivalent models are only scored once.

keras_param_grid <- 
  expand.grid(hidden_layers = 1:2, h1 = c(10:12), h2 = c(3:8)) %>%
  mutate(
    h2 = ifelse(hidden_layers == 2, h2, 0)
  ) %>%
  distinct()
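
This leaves 21 distinct combinations (3 one-layer networks plus 18 two-layer ones), which you can verify with:

nrow(keras_param_grid)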

results_grid_search <- 
  grid_search(
    resamples = resamples, 
    recipe = rec, 
    param_grid = keras_param_grid, 
    scoring_func = keras_classif_score
  )

The results can then easily be summarized:

results_grid_search %>%
  group_by(param_id, hidden_layers, h1, h2) %>%
  summarise(binary_accuracy = mean(binary_accuracy),
            loss = mean(loss)) %>%
  arrange(loss, desc(binary_accuracy))

Looking at the results above, you can probably spot a couple of parameter combinations that were "wasted", in the sense that, given the optimization path up to that point, you would not have bothered trying them.

The underlying assumption is one of univariate convexity of the loss function with respect to the parameters, especially those that control the complexity of the model. For example, if model performance decreased when going from 5 to 6 hidden nodes in the second layer, everything else being equal, there is probably no point in trying 7 hidden nodes.
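
To make the idea concrete, here is a minimal, purely illustrative sketch of a one-parameter convexity check. It is not tidytune's multi_convex_accept (whose actual signature is defined by the package); the data frame with h2 and loss columns is a hypothetical stand-in for the optimization path:

# Reject a candidate h2 value if, among the values tried so far (everything
# else fixed), the loss has already passed a minimum and started rising:
# under univariate convexity, larger values cannot improve on that minimum.
convex_accept_1d <- function(path, candidate){
  path <- path[order(path$h2), ]
  best <- which.min(path$loss)
  if(best < nrow(path) && candidate > max(path$h2)){
    return(FALSE)
  }
  TRUE
}

# Example: the loss bottomed out at h2 = 5, so h2 = 7 is rejected
path <- data.frame(h2 = c(3, 4, 5, 6), loss = c(0.40, 0.36, 0.35, 0.37))
convex_accept_1d(path, candidate = 7)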

There's an option, the accept_func argument of grid_search, that allows you to accept or reject specific parameter combinations based on the cumulative optimization path, that is, the parameter combinations seen so far and their corresponding performance. The multi_convex_accept function accepts or rejects parameters based on the assumption of univariate convexity outlined above.

Here's how we could adjust the example above to use this option and skip some of the parameter combinations, thereby reducing running time significantly.

results_grid_search2 <- 
  grid_search(
    resamples = resamples, 
    recipe = rec, 
    param_grid = keras_param_grid, 
    scoring_func = keras_classif_score,
    verbose = 0,
    accept_func = multi_convex_accept,
    score_var = 'loss'  # use loss and not accuracy
  )

To understand the impact of accept_func, let's take a look at a couple of parameter combinations:

results_grid_search2 %>%
  group_by(param_id, hidden_layers, h1, h2) %>%
  summarise(binary_accuracy = mean(binary_accuracy),
            loss = mean(loss)) %>%
  filter(
    param_id %in% c('Paramset02', 'Paramset04', 'Paramset06')
  )

As you can see above, having observed that the loss increased when going from (h1=10, h2=3) to (h1=11, h2=3), multi_convex_accept decided it was not worth trying (h1=12, h2=3).

Note that this decision is taken on a fold-by-fold basis, so the function may skip a parameter combination for one fold but not another. In that case, the summarized results above would show an NA value for that parameter combination. It is up to you to decide how to handle such cases, e.g. by ignoring NA values with na.rm = TRUE.
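
For example, the summary above can be made robust to skipped folds like this:

results_grid_search2 %>%
  group_by(param_id, hidden_layers, h1, h2) %>%
  summarise(binary_accuracy = mean(binary_accuracy, na.rm = TRUE),
            loss = mean(loss, na.rm = TRUE)) %>%
  arrange(loss, desc(binary_accuracy))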


