gpb.grid.search.tune.parameters: Function for choosing tuning parameters
In gpboost: Combining Tree-Boosting with Gaussian Process and Mixed Effects Models

Description

Function for choosing tuning parameters from a grid, in a deterministic or random way, using cross-validation or validation data sets.
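
To make the difference between the two search modes concrete, the following base R sketch counts the combinations in the grid used in the Examples section and draws a random subset of them. It only illustrates the idea of a grid and of random sampling from it; it is an assumption that the package enumerates or samples combinations in a comparable way, and `all_combinations`, `n_combinations` and `tried` are names chosen for this sketch.

```r
# Grid taken from the Examples section of this page.
param_grid <- list(learning_rate = c(1, 0.1, 0.01),
                   min_data_in_leaf = c(10, 100, 1000),
                   max_depth = c(1, 2, 3, 5, 10),
                   lambda_l2 = c(0, 1, 10))

# Deterministic search: one row per combination of candidate values.
all_combinations <- expand.grid(param_grid)
n_combinations <- nrow(all_combinations)  # 3 * 3 * 5 * 3 combinations

# Random search: evaluate only a random subset of the rows.
# 'num_try_random' is a local variable here, mirroring the argument name.
set.seed(1)
num_try_random <- 10L
tried <- all_combinations[sample.int(n_combinations, num_try_random), ]
```

With cross-validation, every tried combination is fitted once per fold, so a random search with a small `num_try_random` can be much cheaper than a full pass over a large grid.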

Usage

gpb.grid.search.tune.parameters(param_grid, data, params = list(),
  num_try_random = NULL, nrounds = 100L, gp_model = NULL,
  line_search_step_length = FALSE, use_gp_model_for_validation = TRUE,
  train_gp_model_cov_pars = TRUE, folds = NULL, nfold = 4L,
  label = NULL, weight = NULL, obj = NULL, eval = NULL,
  verbose_eval = 1L, stratified = TRUE, init_model = NULL,
  colnames = NULL, categorical_feature = NULL,
  early_stopping_rounds = NULL, callbacks = list(),
  return_all_combinations = FALSE, ...)

Arguments

`param_grid`	`list` with candidate parameters defining the grid over which a search is done
`data`	a `gpb.Dataset` object, used for training. Some functions, such as `gpb.cv`, may allow you to pass other types of data like `matrix` and then separately supply `label` as a keyword argument.
`params`	`list` with other parameters not included in `param_grid`
`num_try_random`	`integer` with the number of random trials on the parameter grid. If NULL, a deterministic search is done
`nrounds`	number of boosting iterations (= number of trees). This is the most important tuning parameter for boosting
`gp_model`	A `GPModel` object that contains the random effects (Gaussian process and / or grouped random effects) model
`line_search_step_length`	Boolean. If TRUE, a line search is done to find the optimal step length for every boosting update (see, e.g., Friedman 2001). This is then multiplied by the `learning_rate`. Applies only to the GPBoost algorithm
`use_gp_model_for_validation`	Boolean. If TRUE, the `gp_model` (Gaussian process and/or random effects) is also used (in addition to the tree model) for calculating predictions on the validation data. If FALSE, the `gp_model` (random effects part) is ignored for making predictions and only the tree ensemble is used for making predictions for calculating the validation / test error.
`train_gp_model_cov_pars`	Boolean. If TRUE, the covariance parameters of the `gp_model` (Gaussian process and/or random effects) are estimated in every boosting iteration, otherwise the `gp_model` parameters are not estimated. In the latter case, you need to either estimate them beforehand or provide the values via the `init_cov_pars` parameter when creating the `gp_model`
`folds`	`list` of pre-defined CV folds (each element must be a vector with the indices of one test fold). When folds are supplied, the `nfold` and `stratified` parameters are ignored.
`nfold`	the original dataset is randomly partitioned into `nfold` equal size subsamples.
`label`	Vector of labels, used if `data` is not a `gpb.Dataset`
`weight`	vector of observation weights. If not NULL, it will be set on the dataset
`obj`	(character) The distribution of the response variable (=label) conditional on fixed and random effects. This only needs to be set when doing independent boosting without random effects / Gaussian processes.
`eval`	Evaluation metric to be monitored when doing CV and parameter tuning. This can be a string, a function, or a list with a mixture of strings and functions. (a) Character vector: a non-exhaustive list of supported metrics is "test_neg_log_likelihood", "mse", "rmse", "mae", "auc", "average_precision", "binary_logloss", "binary_error". See the "metric" section of the parameter documentation for a complete list of valid metrics. (b) Function: you can provide a custom evaluation function. It should accept the keyword arguments `preds` and `dtrain` and return a named list with three elements: `name`, a string with the name of the metric, used for printing and storing results; `value`, a single number indicating the value of the metric for the given predictions and true values; and `higher_better`, a boolean indicating whether higher values indicate a better fit (for example, this would be `FALSE` for metrics like MAE or RMSE). (c) List: if a list is given, it should only contain character vectors and functions, each following the requirements described above.
`verbose_eval`	`integer`. Whether to display information on the progress of tuning parameter choice. If NULL or 0, verbose output is off. If = 1, summary progress information is displayed for every parameter combination. If >= 2, detailed progress is displayed at every boosting stage for every parameter combination.
`stratified`	a `boolean` indicating whether sampling of folds should be stratified by the values of outcome labels.
`init_model`	path of a model file or a `gpb.Booster` object; training will continue from this model
`colnames`	feature names, if not null, will use this to overwrite the names in dataset
`categorical_feature`	categorical features. This can either be a character vector of feature names or an integer vector with the indices of the features (e.g. `c(1L, 10L)` to say "the first and tenth columns").
`early_stopping_rounds`	int. Activates early stopping. Requires at least one validation data set and one metric. When this parameter is non-null, training will stop if the evaluation of any metric on any validation set fails to improve for `early_stopping_rounds` consecutive boosting rounds. If training stops early, the returned model will have attribute `best_iter` set to the iteration number of the best iteration.
`callbacks`	List of callback functions that are applied at each iteration.
`return_all_combinations`	a `boolean` indicating whether all tried parameter combinations are returned
`...`	other parameters, see Parameters.rst for more information.
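
As a sketch of the custom function form described for `eval`, the block below defines a mean absolute error metric that returns the three required elements. The names `mae_value`, `mae_eval` and `"custom_mae"` are made up for this illustration, and reading the labels with `gpboost::getinfo(dtrain, "label")` is an assumption about how the true values are retrieved from the dataset object.

```r
# Plain helper: mean absolute error between predictions and labels.
mae_value <- function(preds, labels) {
  mean(abs(labels - preds))
}

# Custom evaluation function with the documented interface:
# it takes 'preds' and 'dtrain' and returns name, value and higher_better.
mae_eval <- function(preds, dtrain) {
  # Assumption: labels are retrieved from the dataset with getinfo().
  labels <- gpboost::getinfo(dtrain, "label")
  list(name = "custom_mae",
       value = mae_value(preds, labels),
       higher_better = FALSE)  # lower MAE means a better fit
}
```

Such a function would then be passed as `eval = mae_eval`, or inside a list together with metric names.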

Value

A list with the best parameter combination and score. The list has the following format: list("best_params" = best_params, "best_iter" = best_iter, "best_score" = best_score). If return_all_combinations is TRUE, the list contains an additional entry 'all_combinations'.
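
A typical next step is to combine the chosen parameters with the fixed ones before fitting a final model. The sketch below uses a hand-written list with the documented shape; its values are placeholders for illustration, not output of an actual run, and `final_params` is a name chosen here.

```r
# Placeholder result with the documented structure (values are made up).
opt_params <- list(best_params = list(learning_rate = 0.1, max_depth = 3),
                   best_iter = 50L,
                   best_score = 0.5)

# Parameters that were held fixed during the search.
other_params <- list(num_leaves = 2^10)

# Merge both lists; entries in best_params override entries with the same name.
final_params <- modifyList(other_params, opt_params$best_params)

# Assumed follow-up call, commented out because it needs gpboost and data:
# bst <- gpb.train(data = dataset, gp_model = gp_model,
#                  nrounds = opt_params$best_iter, params = final_params)
```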

Early Stopping

"Early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.

If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping.

If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).
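
The stopping rule itself can be sketched in a few lines of base R. This is an illustration of the rule for a single metric on a single validation set, not the package's implementation; `early_stop` is a hypothetical helper and the scores are made-up numbers.

```r
# Return the best iteration and the iteration at which training would stop,
# given one validation score per boosting iteration.
early_stop <- function(scores, early_stopping_rounds, higher_better = FALSE) {
  best_score <- if (higher_better) -Inf else Inf
  best_iter <- 0L
  for (i in seq_along(scores)) {
    improved <- if (higher_better) scores[i] > best_score else scores[i] < best_score
    if (improved) {
      best_score <- scores[i]
      best_iter <- i
    } else if (i - best_iter >= early_stopping_rounds) {
      # No improvement for 'early_stopping_rounds' consecutive iterations.
      return(list(best_iter = best_iter, best_score = best_score, stopped_at = i))
    }
  }
  # The scores ran out before the stopping rule was triggered.
  list(best_iter = best_iter, best_score = best_score, stopped_at = length(scores))
}

# Illustrative validation errors (lower is better).
scores <- c(1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6)
res <- early_stop(scores, early_stopping_rounds = 3)
```

In this toy run the rule stops before reaching the last, lower score, which shows why a very small `early_stopping_rounds` can end training prematurely.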

Author(s)

Fabio Sigrist

Examples

# See https://github.com/fabsig/GPBoost/tree/master/R-package for more examples

library(gpboost)
data(GPBoost_data, package = "gpboost")

# Create random effects model, dataset, and define parameter grid
gp_model <- GPModel(group_data = group_data[,1], likelihood="gaussian")
dataset <- gpb.Dataset(X, label = y)
param_grid = list("learning_rate" = c(1,0.1,0.01), 
                  "min_data_in_leaf" = c(10,100,1000),
                  "max_depth" = c(1,2,3,5,10),
                  "lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
# Note: here we try different values for 'max_depth' and thus set 'num_leaves' to a large value.
#       An alternative strategy is to impose no limit on 'max_depth', 
#       and try different values for 'num_leaves' as follows:
# param_grid = list("learning_rate" = c(1,0.1,0.01), 
#                   "min_data_in_leaf" = c(10,100,1000),
#                   "num_leaves" = 2^(1:10),
#                   "lambda_l2" = c(0,1,10))
# other_params <- list(max_depth = -1)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, nfold = 4,
                                              data = dataset, gp_model = gp_model,
                                              use_gp_model_for_validation=TRUE, verbose_eval = 1,
                                              nrounds = 1000, early_stopping_rounds = 10)
print(paste0("Best parameters: ",
             paste0(unlist(lapply(seq_along(opt_params$best_params), 
                                  function(y, n, i) { paste0(n[[i]],": ", y[[i]]) }, 
                                  y=opt_params$best_params, 
                                  n=names(opt_params$best_params))), collapse=", ")))
print(paste0("Best number of iterations: ", opt_params$best_iter))
print(paste0("Best score: ", round(opt_params$best_score, digits=3)))
# Note: other scoring / evaluation metrics can be chosen using the 
#       'metric' argument, e.g., metric = "l1"

# Using manually defined validation data instead of cross-validation
valid_tune_idx <- sample.int(length(y), as.integer(0.2*length(y)))
folds = list(valid_tune_idx)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, folds = folds,
                                              data = dataset, gp_model = gp_model,
                                              use_gp_model_for_validation=TRUE, verbose_eval = 1,
                                              nrounds = 1000, early_stopping_rounds = 10)

gpboost documentation built on Sept. 11, 2024, 7:04 p.m.