hmda.grid: Tune Hyperparameter Grid for HMDA Framework

Description

Generates a hyperparameter grid for a single tree-based algorithm (either "drf" or "gbm") by running a grid search. The function validates its inputs, generates a grid ID automatically if none is provided, and optionally saves the grid to a recovery directory. The resulting grid object contains all trained models and can be used for further analysis. For scientific computing, saving the grid is highly recommended so that the training does not have to be re-run later.

Usage

hmda.grid(
  algorithm = c("drf", "gbm"),
  grid_id = NULL,
  x,
  y,
  training_frame = h2o.getFrame("hmda.train.hex"),
  validation_frame = NULL,
  hyper_params = list(),
  nfolds = 10,
  seed = NULL,
  keep_cross_validation_predictions = TRUE,
  recovery_dir = NULL,
  sort_by = "logloss",
  ...
)

Arguments

algorithm

Character. The algorithm to tune. Supported values are "drf" (Distributed Random Forest) and "gbm" (Gradient Boosting Machine). Exactly one algorithm must be specified; the value is case-insensitive.

grid_id

Character. Optional identifier for the grid search. If NULL, an automatic grid_id is generated using the algorithm name and the current time.

x

Vector. Predictor column names or indices.

y

Character. The response column name or index.

training_frame

An H2OFrame containing the training data. Default is h2o.getFrame("hmda.train.hex").

validation_frame

An H2OFrame for early stopping. Default is NULL.

hyper_params

List. A named list of hyperparameter vectors for tuning, one vector of candidate values per hyperparameter. If you are unsure how to specify the hyperparameters, see the hmda.suggest.param and hmda.search.param functions, which provide suggestions based on default values or random search.
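
As a rough illustration of how such a list translates into a number of candidate models, the sketch below enumerates the combinations with base R. It assumes an exhaustive (Cartesian) search; the hyperparameter values are copied from the Examples section, and the expand.grid() step is for illustration only and is not something hmda.grid() requires.

```r
# Hyperparameter list in the shape expected by `hyper_params`
params <- list(
  learn_rate  = c(0.01, 0.1),
  max_depth   = c(3, 5, 9),
  sample_rate = c(0.8, 1.0)
)

# Enumerate every combination with base R (illustration only)
combos <- expand.grid(params)

# Number of candidate models: the product of the vector lengths
n_models <- nrow(combos)
print(n_models)
```

Because the count grows multiplicatively with each added hyperparameter, it can be worth checking it before launching a cross-validated grid.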

nfolds

Integer. Number of folds for cross-validation. Default is 10.

seed

Integer. A seed for reproducibility. Default is NULL.

keep_cross_validation_predictions

Logical. Whether to keep cross-validation predictions. Default is TRUE.

recovery_dir

Character. Directory path to save the grid search output. If provided, the grid is saved using h2o.saveGrid().

sort_by

Character. Metric used to sort the grid. Default is "logloss".

...

Additional arguments passed to h2o.grid().

Details

The function executes the following steps:

  1. Input Validation: Ensures only one algorithm is specified and verifies that the training frame is an H2OFrame.

  2. Grid ID Generation: If no grid_id is provided, it creates one using the algorithm name and the current time.

  3. Grid Search Execution: Calls h2o.grid() with the provided hyperparameters and cross-validation settings.

  4. Grid Saving: If a recovery directory is specified, the grid is saved to disk using h2o.saveGrid().

The output is an H2O grid object that contains all the trained models.
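
Steps 1 and 2 can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the package's actual implementation: resolve_grid_id is a hypothetical helper, and the exact ID format (algorithm name, "_grid_", and a timestamp) is assumed for illustration.

```r
# Hypothetical helper, not part of HMDA: mirrors steps 1 and 2 above
resolve_grid_id <- function(algorithm, grid_id = NULL) {
  # Step 1: accept exactly one supported algorithm, ignoring case
  if (length(algorithm) != 1L) {
    stop("specify exactly one algorithm")
  }
  algorithm <- match.arg(tolower(algorithm), c("drf", "gbm"))

  # Step 2: build an ID from the algorithm name and the current time
  # (the format below is an assumption made for this sketch)
  if (is.null(grid_id)) {
    stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
    grid_id <- paste0(algorithm, "_grid_", stamp)
  }

  list(algorithm = algorithm, grid_id = grid_id)
}

# A user-supplied ID is kept; a missing one is generated
res <- resolve_grid_id("GBM")
print(res$grid_id)
```

Supplying an explicit grid_id, as in the Examples section, makes a saved grid easier to locate later than a time-based ID.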

Value

An object of class H2OGrid containing the grid search results.

Author(s)

E. F. Haghish

Examples

## Not run: 
  library(HMDA)
  library(h2o)
  hmda.init()

  # Import a sample binary outcome dataset into H2O
  train <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
  test <- h2o.importFile(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")

  # Identify predictors and response
  y <- "response"
  x <- setdiff(names(train), y)

  # For binary classification, response should be a factor
  train[, y] <- as.factor(train[, y])
  test[, y] <- as.factor(test[, y])

  params <- list(learn_rate = c(0.01, 0.1),
                 max_depth = c(3, 5, 9),
                 sample_rate = c(0.8, 1.0)
  )

  # Train and validate a cartesian grid of GBMs
  hmda_grid1 <- hmda.grid(algorithm = "gbm", x = x, y = y,
                          grid_id = "hmda_grid1",
                          training_frame = train,
                          nfolds = 10,
                          ntrees = 100,
                          seed = 1,
                          hyper_params = params)

  # Assess the performances of the models
  grid_performance <- hmda.grid.analysis(hmda_grid1)

  # Return the best 2 models according to each metric
  hmda.best.models(grid_performance, n_models = 2)

## End(Not run)


HMDA documentation built on April 4, 2025, 6:06 a.m.