gpb_shared_params: Shared parameter docs

gpb_shared_paramsR Documentation

Shared parameter docs

Description

Parameter docs shared by gpb.train, gpb.cv, and gpboost

Arguments

callbacks

List of callback functions that are applied at each iteration.

data

a gpb.Dataset object, used for training. Some functions, such as gpb.cv, may allow you to pass other types of data like matrix and then separately supply label as a keyword argument.

early_stopping_rounds

int. Activates early stopping. Requires at least one validation data and one metric. When this parameter is non-null, training will stop if the evaluation of any metric on any validation set fails to improve for early_stopping_rounds consecutive boosting rounds. If training stops early, the returned model will have attribute best_iter set to the iteration number of the best iteration.

eval

evaluation function(s). This can be a character vector, function, or list with a mixture of strings and functions.

  • a. character vector: If you provide a character vector to this argument, it should contain strings with valid evaluation metrics. See the "metric" section of the parameter documentation for a list of valid metrics.

  • b. function: You can provide a custom evaluation function. This should accept the keyword arguments preds and dtrain and should return a named list with three elements:

    • name: A string with the name of the metric, used for printing and storing results.

    • value: A single number indicating the value of the metric for the given predictions and true values

    • higher_better: A boolean indicating whether higher values indicate a better fit. For example, this would be FALSE for metrics like MAE or RMSE.

  • c. list: If a list is given, it should only contain character vectors and functions. These should follow the requirements from the descriptions above.

eval_freq

evaluation output frequency, only effect when verbose > 0

valids

a list of gpb.Dataset objects, used for validation

record

Boolean, TRUE will record iteration message to booster$record_evals

colnames

feature names, if not null, will use this to overwrite the names in dataset

categorical_feature

categorical features. This can either be a character vector of feature names or an integer vector with the indices of the features (e.g. c(1L, 10L) to say "the first and tenth columns").

init_model

path of model file of gpb.Booster object, will continue training from this model

nrounds

number of boosting iterations (= number of trees). This is the most important tuning parameter for boosting

obj

(character) The distribution of the response variable (=label) conditional on fixed and random effects. This only needs to be set when doing independent boosting without random effects / Gaussian processes.

params

list of "tuning" parameters. See the parameter documentation for more information. A few key parameters:

  • learning_rate: The learning rate, also called shrinkage or damping parameter (default = 0.1). An important tuning parameter for boosting. Lower values usually lead to higher predictive accuracy but more boosting iterations are needed

  • num_leaves: Number of leaves in a tree. Tuning parameter for tree-boosting (default = 31)

  • max_depth: Maximal depth of a tree. Tuning parameter for tree-boosting (default = no limit)

  • min_data_in_leaf: Minimal number of samples per leaf. Tuning parameter for tree-boosting (default = 20)

  • lambda_l2: L2 regularization (default = 0)

  • lambda_l1: L1 regularization (default = 0)

  • max_bin: Maximal number of bins that feature values will be bucketed in (default = 255)

  • train_gp_model_cov_pars: If TRUE, the covariance parameters of the Gaussian process are stimated in every boosting iterations, otherwise the gp_model parameters are not estimated. In the latter case, you need to either esimate them beforehand or provide the values via the 'init_cov_pars' parameter when creating the gp_model (default = TRUE).

  • use_gp_model_for_validation: If TRUE, the Gaussian process is also used (in addition to the tree model) for calculating predictions on the validation data (default = TRUE)

  • leaves_newton_update: Set this to TRUE to do a Newton update step for the tree leaves after the gradient step. Applies only to Gaussian process boosting (GPBoost algorithm)

  • num_threads: Number of threads. For the best speed, set this to the number of real CPU cores(parallel::detectCores(logical = FALSE)), not the number of threads (most CPU using hyper-threading to generate 2 threads per CPU core).

verbose

verbosity for output, if <= 0, also will disable the print of evaluation during training

gp_model

A GPModel object that contains the random effects (Gaussian process and / or grouped random effects) model

use_gp_model_for_validation

Boolean. If TRUE, the gp_model (Gaussian process and/or random effects) is also used (in addition to the tree model) for calculating predictions on the validation data. If FALSE, the gp_model (random effects part) is ignored for making predictions and only the tree ensemble is used for making predictions for calculating the validation / test error.

train_gp_model_cov_pars

Boolean. If TRUE, the covariance parameters of the gp_model (Gaussian process and/or random effects) are estimated in every boosting iterations, otherwise the gp_model parameters are not estimated. In the latter case, you need to either estimate them beforehand or provide the values via the init_cov_pars parameter when creating the gp_model

Early Stopping

"early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.

If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping.

If you want to only consider the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).


gpboost documentation built on Oct. 24, 2023, 9:09 a.m.