View source: R/hmda.search.param.R
hmda.search.param | R Documentation |
Runs an automated hyperparameter search, merges the resulting leaderboard with the detailed hyperparameters of each model, and then produces multiple summaries based on different strategies. These strategies include:
Best of family: selects the best model for each performance metric (avoiding duplicate model IDs).
Top 2: extracts hyperparameter settings from the top 2 models (according to a specified ranking metric).
Top 5: extracts hyperparameter settings from the top 5 models.
Top 10: extracts hyperparameter settings from the top 10 models.
These summaries help in identifying candidate hyperparameter ranges for further manual tuning. Note that the quality of these suggestions depends on the extent of the random search you carry out.
hmda.search.param(
algorithm = c("drf", "gbm"),
sort_by = "logloss",
x,
y,
training_frame = h2o.getFrame("hmda.train.hex"),
validation_frame = NULL,
max_models = 100,
max_runtime_secs = 3600,
nfolds = 10,
seed = NULL,
fold_column = NULL,
weights_column = NULL,
keep_cross_validation_predictions = TRUE,
stopping_rounds = NULL,
stopping_metric = "AUTO",
stopping_tolerance = NULL,
...
)
algorithm |
Character vector. The algorithm(s) to include in the random search. Supported values are "drf" (Distributed Random Forest) and "gbm" (Gradient Boosting Machine). The input is case-insensitive. |
sort_by |
Character string specifying the metric used to rank models. Default is "logloss". For "logloss", "mean_per_class_error", "rmse", and "mse", lower values rank better; for all other metrics, higher values rank better. |
x |
Vector of predictor column names or indices. |
y |
Character string specifying the response column. |
training_frame |
An H2OFrame containing the training data.
Default is h2o.getFrame("hmda.train.hex"). |
validation_frame |
An H2OFrame for early stopping.
Default is NULL. |
max_models |
Integer. Maximum number of models to build. Default is 100. |
max_runtime_secs |
Integer. Maximum amount of time (in seconds) that the search should run. Default is 3600. |
nfolds |
Integer. Number of folds for cross-validation. Default is 10. |
seed |
Integer. A seed for reproducibility.
Default is NULL. |
fold_column |
Character. Column name for cross-validation fold
assignment. Default is NULL. |
weights_column |
Character. Column name for observation weights.
Default is NULL. |
keep_cross_validation_predictions |
Logical. Whether to keep
cross-validation predictions. Default is TRUE. |
stopping_rounds |
Integer. Number of rounds with no improvement
before early stopping. Default is NULL. |
stopping_metric |
Character. Metric to use for early stopping. Default is "AUTO". |
stopping_tolerance |
Numeric. Relative tolerance for early stopping.
Default is NULL. |
... |
Additional arguments passed to the underlying H2O AutoML search. |
The function executes an automated hyperparameter search for the specified
algorithm. It then extracts the leaderboard from the H2OAutoML object and
retrieves detailed hyperparameter information for each model using automlModelParam()
from the h2otools package. The leaderboard and hyperparameter data are merged by the
model_id column. Sorting of the merged results is based on the sort_by
metric: for "logloss", "mean_per_class_error", "rmse", and "mse", lower values
are considered better; for all other metrics, higher values are preferred.
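The sorting rule can be sketched in plain R as follows. The helper name rank_models() and the toy leaderboard are illustrative only, not part of the package API:

```r
# Sketch of the ranking rule: for "logloss", "mean_per_class_error",
# "rmse", and "mse" lower values rank first; for any other metric
# (e.g. AUC) higher values rank first.
rank_models <- function(leaderboard, sort_by) {
  lower_is_better <- c("logloss", "mean_per_class_error", "rmse", "mse")
  decreasing <- !(sort_by %in% lower_is_better)
  leaderboard[order(leaderboard[[sort_by]], decreasing = decreasing), ]
}

# Toy leaderboard with two models
lb <- data.frame(model_id = c("m1", "m2"),
                 logloss  = c(0.40, 0.25),
                 auc      = c(0.80, 0.91))
rank_models(lb, "logloss")$model_id  # "m2" first (lower logloss wins)
rank_models(lb, "auc")$model_id      # "m2" first (higher AUC wins)
```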
After sorting, the function applies three strategies to summarize the hyperparameter search:
Best of Family: selects the best model for each performance metric, ensuring that no model ID appears more than once.
Top 2: gathers hyperparameter settings from the top 2 models.
Top 5 and Top 10: similarly, collect hyperparameter settings from the top 5 and top 10 models, respectively.
All: lists every hyperparameter combination that was tried.
These strategies provide different levels of granularity for analyzing the hyperparameter space and can be used for prototyping and further manual tuning.
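The "best of family" deduplication idea can be illustrated with a self-contained sketch; best_of_family() and the toy data frame below are hypothetical, and the package's internal implementation may differ:

```r
# Pick the best model per metric, skipping model IDs that were
# already selected for an earlier metric.
best_of_family <- function(df, metrics,
                           lower_is_better = c("logloss",
                                               "mean_per_class_error",
                                               "rmse", "mse")) {
  picked <- character(0)
  for (m in metrics) {
    ord <- order(df[[m]], decreasing = !(m %in% lower_is_better))
    # best-ranked model not yet selected for another metric
    candidate <- setdiff(df$model_id[ord], picked)[1]
    picked <- c(picked, candidate)
  }
  setNames(picked, metrics)
}

df <- data.frame(model_id = c("m1", "m2", "m3"),
                 logloss  = c(0.30, 0.25, 0.28),
                 auc      = c(0.90, 0.92, 0.85))
best_of_family(df, c("logloss", "auc"))
# logloss -> "m2"; auc -> "m1" (m2 already taken, so next-best AUC wins)
```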
A list with the following components:
The H2OAutoML object returned by the random search.
A merged data frame that combines leaderboard performance metrics with hyperparameter settings for each model. The data frame is sorted based on the specified ranking metric.
A summary list of the best hyperparameter settings for each performance metric. This strategy selects the best model per metric while avoiding duplicate model IDs.
A list of hyperparameter settings from the top 2 models as ranked by the chosen metric.
A list of hyperparameter settings from the top 5 models.
A list of hyperparameter settings from the top 10 models.
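Once the hyperparameters of the top models are collected, candidate ranges for manual tuning can be derived per parameter. A minimal sketch, using a toy data frame standing in for the returned top-model settings (the column names are illustrative):

```r
# Toy hyperparameter settings from five top-ranked models
top5 <- data.frame(max_depth = c(8, 10, 12, 10, 9),
                   ntrees    = c(150, 200, 180, 220, 160))

# Candidate range per hyperparameter for further manual tuning
ranges <- lapply(top5, range)
ranges$max_depth  # -> 8 12
ranges$ntrees     # -> 150 220
```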
## Not run:
# NOTE: This example may take a long time to run on your machine
# Initialize H2O (if not already running)
library(HMDA)
library(h2o)
hmda.init()
# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
test <- h2o.importFile(
"https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
# Run the hyperparameter search using the GBM algorithm.
result <- hmda.search.param(algorithm = c("gbm"),
x = x,
y = y,
training_frame = train,
max_models = 100,
nfolds = 10,
stopping_metric = "AUC",
stopping_rounds = 3)
# Access the hyperparameter list of the best_of_family strategy:
result$best_of_family
# Access the hyperparameters of the top 5 models based on the specified ranking metric
result$top5
## End(Not run)