hmda.search.param: Search for Hyperparameters via Random Search

View source: R/hmda.search.param.R

hmda.search.param    R Documentation

Search for Hyperparameters via Random Search

Description

Runs an automated hyperparameter random search, extracts the leaderboard together with the detailed hyperparameters of each model, and then produces multiple summaries of the searched hyperparameter grids based on different strategies. These strategies include:

Best of Family

Selects the best model for each performance metric (avoiding duplicate model IDs).

Top 2

Extracts hyperparameter settings from the top 2 models (according to a specified ranking metric).

Top 5

Extracts hyperparameter settings from the top 5 models.

Top 10

Extracts hyperparameter settings from the top 10 models.

These summaries help identify candidate hyperparameter ranges for further manual tuning. Note that the quality of the suggested ranges depends on the extent of the random search you carry out.

Usage

hmda.search.param(
  algorithm = c("drf", "gbm"),
  sort_by = "logloss",
  x,
  y,
  training_frame = h2o.getFrame("hmda.train.hex"),
  validation_frame = NULL,
  max_models = 100,
  max_runtime_secs = 3600,
  nfolds = 10,
  seed = NULL,
  fold_column = NULL,
  weights_column = NULL,
  keep_cross_validation_predictions = TRUE,
  stopping_rounds = NULL,
  stopping_metric = "AUTO",
  stopping_tolerance = NULL,
  ...
)

Arguments

algorithm

Character vector. The algorithm(s) to include in the random search. Supported values are "drf" (Distributed Random Forest) and "gbm" (Gradient Boosting Machine). The input is case-insensitive.

sort_by

Character string specifying the metric used to rank models. For "logloss", "mean_per_class_error", "rmse", and "mse", lower values indicate better performance; for all other metrics, higher values are preferred. Default is "logloss".

x

Vector of predictor column names or indices.

y

Character string specifying the response column.

training_frame

An H2OFrame containing the training data. Default is h2o.getFrame("hmda.train.hex").

validation_frame

An H2OFrame for early stopping. Default is NULL.

max_models

Integer. Maximum number of models to build. Default is 100.

max_runtime_secs

Integer. Maximum amount of time (in seconds) that the search is allowed to run. Default is 3600.

nfolds

Integer. Number of folds for cross-validation. Default is 10.

seed

Integer. A seed for reproducibility. Default is NULL.

fold_column

Character. Column name for cross-validation fold assignment. Default is NULL.

weights_column

Character. Column name for observation weights. Default is NULL.

keep_cross_validation_predictions

Logical. Whether to keep cross-validation predictions. Default is TRUE.

stopping_rounds

Integer. Number of rounds with no improvement before early stopping. Default is NULL.

stopping_metric

Character. Metric to use for early stopping. Default is "AUTO".

stopping_tolerance

Numeric. Relative tolerance for early stopping. Default is NULL.

...

Additional arguments passed to h2o.automl().

Details

The function executes an automated hyperparameter search for the specified algorithm(s). It then extracts the leaderboard from the H2OAutoML object and retrieves detailed hyperparameter information for each model using automlModelParam() from the h2otools package. The leaderboard and hyperparameter data are merged by the model_id column, and the merged results are sorted by the sort_by metric. For "logloss", "mean_per_class_error", "rmse", and "mse", lower values are considered better; for all other metrics, higher values are preferred.
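
A minimal sketch of this ranking rule, assuming the merged leaderboard is available as an ordinary data frame lb with one column per metric (the helper name rank_leaderboard is illustrative only, not part of the package):

  # Sort a merged leaderboard by the chosen metric: ascending for
  # error-type metrics, descending for all others.
  rank_leaderboard <- function(lb, sort_by = "logloss") {
    lower_is_better <- c("logloss", "mean_per_class_error", "rmse", "mse")
    lb[order(lb[[sort_by]], decreasing = !(sort_by %in% lower_is_better)), ]
  }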

After sorting, the function applies the following strategies to summarize the hyperparameter search:

  1. Best of Family: Selects the best model for each performance metric, ensuring that no model ID appears more than once.

  2. Top 2: Gathers hyperparameter settings from the top 2 models.

  3. Top 5 and Top 10: Similarly, collects hyperparameter settings from the top 5 and top 10 models, respectively.

  4. All: Lists all of the hyperparameter settings that were tried (i.e., the full merged leaderboard).

These strategies provide different levels of granularity for analyzing the hyperparameter space and can be used for prototyping and further manual tuning.
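
The Best of Family idea can be pictured with the following sketch, assuming a leaderboard data frame lb with a model_id column and one column per performance metric (the helper best_of_family is illustrative and not the package's implementation):

  # For each metric, pick the best-ranked model whose model_id has not
  # already been selected for a previous metric.
  best_of_family <- function(lb, metrics) {
    lower_is_better <- c("logloss", "mean_per_class_error", "rmse", "mse")
    chosen <- character(0)
    picks  <- list()
    for (m in metrics) {
      ranked <- lb[order(lb[[m]], decreasing = !(m %in% lower_is_better)), ]
      ranked <- ranked[!ranked$model_id %in% chosen, ]
      picks[[m]] <- ranked[1, ]
      chosen   <- c(chosen, ranked$model_id[1])
    }
    do.call(rbind, picks)
  }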

Value

A list with the following components:

grid_search

The H2OAutoML object returned by the random search.

leaderboard

A merged data frame that combines leaderboard performance metrics with hyperparameter settings for each model. The data frame is sorted based on the specified ranking metric.

hyperparameters_best_of_family

A summary list of the best hyperparameter settings for each performance metric. This strategy selects the best model per metric while avoiding duplicate model IDs.

hyperparameters_top2

A list of hyperparameter settings from the top 2 models as ranked by the chosen metric.

hyperparameters_top5

A list of hyperparameter settings from the top 5 models.

hyperparameters_top10

A list of hyperparameter settings from the top 10 models.
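
Assuming result holds the output of hmda.search.param(), these components can be inspected like any other R list, for example:

  names(result)
  head(result$leaderboard)
  result$hyperparameters_best_of_family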

Examples

## Not run: 
  # NOTE: This example may take a long time to run on your machine

  # Initialize H2O (if not already running)
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome train/test set into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  # Run the hyperparameter search using the GBM algorithm.
  result <- hmda.search.param(algorithm = c("gbm"),
                              x = x,
                              y = y,
                              training_frame = train,
                              max_models = 100,
                              nfolds = 10,
                              stopping_metric = "AUC",
                              stopping_rounds = 3)

  # Access the hyperparameter list of the best_of_family strategy:
  result$hyperparameters_best_of_family

  # Access the hyperparameters of the top 5 models, ranked by the specified metric:
  result$hyperparameters_top5

## End(Not run)

