hmda.search.param: Search for Hyperparameters via Random Search

View source: R/hmda.search.param.R

hmda.search.param    R Documentation

Search for Hyperparameters via Random Search

Description

Runs an automated hyperparameter random search, extracts the leaderboard together with the detailed hyperparameters of each model, and then produces multiple summaries of the searched hyperparameter grids based on different strategies. These strategies include:

Best of Family

Selects the best model for each performance metric (avoiding duplicate model IDs).

Top 2

Extracts hyperparameter settings from the top 2 models (according to a specified ranking metric).

Top 5

Extracts hyperparameter settings from the top 5 models.

Top 10

Extracts hyperparameter settings from the top 10 models.

These summaries help identify candidate hyperparameter ranges for further manual tuning. Note that the quality of the suggested ranges depends on the extent of the random search you carry out.

Usage

hmda.search.param(
  algorithm = c("drf", "gbm"),
  sort_by = "logloss",
  x,
  y,
  training_frame = h2o.getFrame("hmda.train.hex"),
  validation_frame = NULL,
  max_models = 100,
  max_runtime_secs = 3600,
  nfolds = 10,
  seed = NULL,
  fold_column = NULL,
  weights_column = NULL,
  keep_cross_validation_predictions = TRUE,
  stopping_rounds = NULL,
  stopping_metric = "AUTO",
  stopping_tolerance = NULL,
  ...
)

Arguments

algorithm

Character vector. The algorithm(s) to include in the random search. Supported values are "drf" (Distributed Random Forest) and "gbm" (Gradient Boosting Machine). The input is case-insensitive.

sort_by

Character string specifying the metric used to rank models. For "logloss", "mean_per_class_error", "rmse", and "mse", lower values indicate better performance; for all other metrics, higher values are preferred. Default is "logloss".

x

Vector of predictor column names or indices.

y

Character string specifying the response column.

training_frame

An H2OFrame containing the training data. Default is h2o.getFrame("hmda.train.hex").

validation_frame

An H2OFrame for early stopping. Default is NULL.

max_models

Integer. Maximum number of models to build. Default is 100.

max_runtime_secs

Integer. Maximum amount of time (in seconds) that the search is allowed to run. Default is 3600.

nfolds

Integer. Number of folds for cross-validation. Default is 10.

seed

Integer. A seed for reproducibility. Default is NULL.

fold_column

Character. Column name for cross-validation fold assignment. Default is NULL.

weights_column

Character. Column name for observation weights. Default is NULL.

keep_cross_validation_predictions

Logical. Whether to keep cross-validation predictions. Default is TRUE.

stopping_rounds

Integer. Number of rounds with no improvement before early stopping. Default is NULL.

stopping_metric

Character. Metric to use for early stopping. Default is "AUTO".

stopping_tolerance

Numeric. Relative tolerance for early stopping. Default is NULL.

...

Additional arguments passed to h2o.automl().

Details

The function executes an automated hyperparameter search for the specified algorithm(s). It then extracts the leaderboard from the H2OAutoML object and retrieves detailed hyperparameter information for each model using automlModelParam() from the h2otools package. The leaderboard and hyperparameter data are merged by the model_id column, and the merged results are sorted by the sort_by metric. For "logloss", "mean_per_class_error", "rmse", and "mse", lower values are considered better; for all other metrics, higher values are preferred.
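
A minimal sketch of this ranking rule, assuming the merged leaderboard is available as an ordinary data frame lb with one column per metric (the helper name rank_leaderboard is illustrative only, not part of the package):

  # Sort a merged leaderboard by the chosen metric: ascending for
  # error-type metrics, descending for all others.
  rank_leaderboard <- function(lb, sort_by = "logloss") {
    lower_is_better <- c("logloss", "mean_per_class_error", "rmse", "mse")
    lb[order(lb[[sort_by]], decreasing = !(sort_by %in% lower_is_better)), ]
  }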

After sorting, the function applies the following strategies to summarize the hyperparameter search:

  1. Best of Family: Selects the best model for each performance metric, ensuring that no model ID appears more than once.

  2. Top 2: Gathers hyperparameter settings from the top 2 models.

  3. Top 5 and Top 10: Similarly, collects hyperparameter settings from the top 5 and top 10 models, respectively.

  4. All: Lists all of the hyperparameter settings that were tried (i.e., the full merged leaderboard).

These strategies provide different levels of granularity for analyzing the hyperparameter space and can be used for prototyping and further manual tuning.
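
The Best of Family idea can be pictured with the following sketch, assuming a leaderboard data frame lb with a model_id column and one column per performance metric (the helper best_of_family is illustrative and not the package's implementation):

  # For each metric, pick the best-ranked model whose model_id has not
  # already been selected for a previous metric.
  best_of_family <- function(lb, metrics) {
    lower_is_better <- c("logloss", "mean_per_class_error", "rmse", "mse")
    chosen <- character(0)
    picks  <- list()
    for (m in metrics) {
      ranked <- lb[order(lb[[m]], decreasing = !(m %in% lower_is_better)), ]
      ranked <- ranked[!ranked$model_id %in% chosen, ]
      picks[[m]] <- ranked[1, ]
      chosen   <- c(chosen, ranked$model_id[1])
    }
    do.call(rbind, picks)
  }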

Value

A list with the following components:

grid_search

The H2OAutoML object returned by the random search.

leaderboard

A merged data frame that combines leaderboard performance metrics with hyperparameter settings for each model. The data frame is sorted based on the specified ranking metric.

hyperparameters_best_of_family

A summary list of the best hyperparameter settings for each performance metric. This strategy selects the best model per metric while avoiding duplicate model IDs.

hyperparameters_top2

A list of hyperparameter settings from the top 2 models as ranked by the chosen metric.

hyperparameters_top5

A list of hyperparameter settings from the top 5 models.

hyperparameters_top10

A list of hyperparameter settings from the top 10 models.
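
Assuming result holds the output of hmda.search.param(), these components can be inspected like any other R list, for example:

  names(result)
  head(result$leaderboard)
  result$hyperparameters_best_of_family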

Examples

## Not run: 
  # NOTE: This example may take a long time to run on your machine

  # Initialize H2O (if not already running)
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome train/test set into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  # Run the hyperparameter search using the GBM algorithm.
  result <- hmda.search.param(algorithm = c("gbm"),
                              x = x,
                              y = y,
                              training_frame = train,
                              max_models = 100,
                              nfolds = 10,
                              stopping_metric = "AUC",
                              stopping_rounds = 3)

  # Access the hyperparameter list of the best_of_family strategy:
  result$hyperparameters_best_of_family

  # Access the hyperparameters of the top 5 models, ranked by the specified metric:
  result$hyperparameters_top5

## End(Not run)

