Automatic machine learning

Using H2O AutoML

Automatic machine learning (AutoML) is the process of automatically searching, screening, and evaluating many models for a specific dataset. AutoML can be particularly useful as an exploratory approach to identify model families and parameterizations that are most likely to succeed. You can use H2O's AutoML algorithm via the "h2o" engine in auto_ml(). agua provides several helper functions to quickly wrangle and visualize AutoML results.

Let's run an AutoML search on the concrete data.

library(tidymodels)
library(agua)
library(ggplot2)
theme_set(theme_bw())
# start the H2O server before fitting any h2o-engine models
h2o_start()

data(concrete)
set.seed(4595)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test <- testing(concrete_split)

# run for a maximum of 120 seconds
auto_spec <-
  auto_ml() %>%
  set_engine("h2o", max_runtime_secs = 120, seed = 1) %>%
  set_mode("regression")

normalized_rec <-
  recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_predictors())

auto_wflow <-
  workflow() %>%
  add_model(auto_spec) %>%
  add_recipe(normalized_rec)

auto_fit <- fit(auto_wflow, data = concrete_train)

extract_fit_parsnip(auto_fit)

Within the 120-second budget, AutoML fits as many models as it can; the exact count varies by run and hardware. The parsnip fit object returned by extract_fit_parsnip(auto_fit) shows the number of candidate models, the best-performing algorithm and its corresponding model id, and a preview of the leaderboard with cross-validation performances. The model_id column in the leaderboard is a unique model identifier on the h2o server. This can be useful when you need to predict with or extract a specific model, e.g., with predict(auto_fit, id = id) and extract_fit_engine(auto_fit, id = id). By default, these functions operate on the best-performing leader model.

# predict with the best model
predict(auto_fit, new_data = concrete_test)
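
To work with a model other than the leader, look up its identifier and pass it to the id argument. A minimal sketch, assuming rank_results() exposes model identifiers in an id column:

# predict with the second-ranked model by MAE
# (the id column name is an assumption; inspect the rank_results() output)
second_id <- rank_results(auto_fit) %>%
  filter(.metric == "mae") %>%
  arrange(rank) %>%
  slice(2) %>%
  pull(id)

predict(auto_fit, id = second_id, new_data = concrete_test)
extract_fit_engine(auto_fit, id = second_id)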

Typically, we use AutoML to get a quick sense of the range of our success metric and of the algorithms that are likely to succeed. agua provides tools to summarize these results.

# rank candidate models by their averaged cross-validation MAE
rank_results(auto_fit) %>%
  filter(.metric == "mae") %>%
  arrange(rank)

# cross-validation metrics for every candidate, without averaging
collect_metrics(auto_fit, summarize = FALSE)

# tidy() returns one row per candidate model; here we attach
# test-set predictions for each of them
tidy(auto_fit) %>%
  mutate(
    .predictions = map(.model, predict, new_data = head(concrete_test))
  )
# distribution of member importance within the stacked ensembles:
# how much each algorithm contributes to the ensemble predictions
auto_fit %>%
  extract_fit_parsnip() %>%
  member_weights() %>%
  unnest(importance) %>%
  filter(type == "scaled_importance") %>%
  ggplot() +
  geom_boxplot(aes(value, algorithm)) +
  scale_x_sqrt() +
  labs(y = NULL, x = "scaled importance", title = "Member importance in stacked ensembles")

You can also autoplot() an AutoML object, which essentially wraps the functions above to plot performance assessment and ranking. The lower the average ranking, the more likely the model type suits the data.

autoplot(auto_fit, type = "rank", metric = c("mae", "rmse")) +
  theme(legend.position = "none")
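
A minimal sketch of the other plot type, assuming the "metric" option offered by agua's autoplot() method, which shows the averaged metric values themselves rather than rankings:

# plot cross-validated metric values instead of rankings
autoplot(auto_fit, type = "metric") +
  theme(legend.position = "none")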

After this initial assessment, we might want to allow more time for AutoML to search for more candidates. Recall that we set the engine argument max_runtime_secs to 120 seconds earlier; we can increase it, or adjust max_models, to control the total runtime. H2O also provides an option to build upon an existing AutoML leaderboard and add more candidates; this can be done via refit(). The model to be re-fitted needs to have the engine argument save_data = TRUE. If you also want to add stacked ensembles, set keep_cross_validation_predictions = TRUE as well.

# not run 
auto_spec_refit <-
  auto_ml() %>%
  set_engine("h2o", 
             max_runtime_secs = 300, 
             save_data = TRUE,
             keep_cross_validation_predictions = TRUE) %>%
  set_mode("regression")

auto_wflow_refit <-
  workflow() %>%
  add_model(auto_spec_refit) %>%
  add_recipe(normalized_rec)

first_auto <- fit(auto_wflow_refit, data = concrete_train)
# fit another 60 seconds 
second_auto <- refit(first_auto, max_runtime_secs = 60)
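
Since refit() extends the existing leaderboard, the summary helpers shown earlier apply to the new fit unchanged, e.g.:

# rank all candidates on the expanded leaderboard
rank_results(second_auto) %>%
  filter(.metric == "mae") %>%
  arrange(rank)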

Important engine arguments

There are several relevant engine arguments for H2O AutoML; some of the most commonly used are:

- max_runtime_secs and max_models: adjust the total runtime and the number of candidate models.
- include_algos and exclude_algos: a character vector naming the algorithms to include in or exclude from the search.
- validation: a number between 0 and 1 specifying the proportion of the training data reserved as a validation set, used by AutoML for early stopping.

See the details section in h2o::h2o.automl() for more information.
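
For example, a sketch of a more constrained search (the algorithm names and model cap here are illustrative, not a recommendation):

# restrict the search to 20 models from tree-based algorithms
auto_spec_trees <-
  auto_ml() %>%
  set_engine("h2o",
             max_models = 20,
             include_algos = c("DRF", "GBM", "XGBoost"),
             seed = 1) %>%
  set_mode("regression")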

Limitations

One current limitation of H2O AutoML models is that they can't be used in resampling. This means you can't use them with fit_resamples(), tune_grid(), tune_bayes(), etc.
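
You can still measure test-set performance by hand, e.g. with yardstick (attached via tidymodels); a minimal sketch:

# score the leader model on the held-out test set
predict(auto_fit, new_data = concrete_test) %>%
  bind_cols(concrete_test) %>%
  metrics(truth = compressive_strength, estimate = .pred)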


