generalized_boosted_machine: Fit a Generalized Boosted Machine (GBM)

View source: R/generalized_boosted_machine.R


Fit a Generalized Boosted Machine (GBM)

Description

generalized_boosted_machine() is a wrapper of the gbm::gbm() function that fits a generalized boosted machine with the ability to tune the hyperparameters with grid search or Bayesian optimization in a simple way. You can fit univariate models for continuous, count, binary and categorical response variables.

All the parameters marked as (tunable) accept a vector of values with which the grid is generated for grid search tuning, or a list with the min and max values for Bayesian optimization tuning. The returned object contains a data.frame with the evaluated hyperparameter combinations. In the end, the best combination of hyperparameters is used to fit the final model, which is also returned and can be used to make new predictions.

Usage

generalized_boosted_machine(
  x,
  y,
  trees_number = 500,
  max_depth = 1,
  node_size = 10,
  shrinkage = 0.1,
  sampled_records_proportion = 0.5,
  tune_type = "Grid_search",
  tune_cv_type = "K_fold",
  tune_folds_number = 5,
  tune_testing_proportion = 0.2,
  tune_folds = NULL,
  tune_loss_function = NULL,
  tune_grid_proportion = 1,
  tune_bayes_samples_number = 10,
  tune_bayes_iterations_number = 10,
  predictors_relationship = NULL,
  validate_params = TRUE,
  seed = NULL,
  verbose = TRUE
)

Arguments

x

(matrix) The predictor (independent) variable(s). It must be a numeric matrix. You can use the to_matrix() function to convert your data to a matrix.

y

(data.frame | vector | matrix) The response (dependent) variable. If it is a data.frame or a matrix, it must have only one column. If y is character, logical or factor, a categorical response is assumed. When the response is categorical with only two classes, a binary distribution is assumed; with more than two classes, a multinomial distribution. When the response variable is numeric with only integer values greater than or equal to zero, a Poisson distribution is assumed; otherwise, a Gaussian distribution.
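For example, the following response vectors would lead to the distributions described above (a minimal sketch of the inference rules; the values are illustrative):

y <- factor(c("yes", "no", "yes", "no")) # two classes: binary distribution
y <- iris$Species                        # more than two classes: multinomial distribution
y <- c(0, 3, 1, 7, 2)                    # non-negative integers: Poisson distribution
y <- c(1.2, -0.5, 3.8, 0.4)              # other numeric values: Gaussian distribution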

trees_number

(numeric) (tunable) Number of trees. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. 500 by default.

max_depth

(numeric) (tunable) Maximum depth of each tree (the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. 1 by default.

node_size

(numeric) (tunable) Also known as n.minobsinnode. Minimum number of observations in the terminal nodes of the trees. 10 by default.

shrinkage

(numeric) (tunable) A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction. Values between 0.001 and 0.1 usually work, but a smaller learning rate typically requires more trees. 0.1 by default.

sampled_records_proportion

(numeric) (tunable) Also known as bag.fraction. The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. 0.5 by default.

tune_type

(character(1)) (case insensitive) The type of tuning to perform. The options are "Grid_search" and "Bayesian_optimization". "Grid_search" by default.

tune_cv_type

(character(1)) (case insensitive) The type of cross validation to use when tuning the model. The options are "K_fold", "K_fold_strata" (only for univariate categorical response variables) and "Random". "K_fold" by default.

tune_folds_number

(numeric(1)) The number of folds to tune each hyperparameter combination (k in k-fold cross validation). 5 by default.

tune_testing_proportion

(numeric(1)) A number in (0, 1) specifying the proportion of records to use as the validation set when tune_cv_type is "Random". 0.2 by default.

tune_folds

(list) Custom folds for tuning. It must be a list of lists where each entry represents a fold. Each inner list has to contain the fields "training" and "testing" with numeric vectors of indices of the records to be used for training and testing in that fold. Note that when this parameter is set, tune_cv_type, tune_folds_number and tune_testing_proportion are ignored. NULL by default.
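A minimal sketch of the expected structure (the index values are illustrative only; x and y as in the examples below):

custom_folds <- list(
  list(training = 1:100, testing = 101:150),
  list(training = 51:150, testing = 1:50)
)
model <- generalized_boosted_machine(x, y, tune_folds = custom_folds)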

tune_loss_function

(character(1)) (case insensitive) The loss function to use in tuning. The options are "mse", "maape", "mae", "nrmse", "rmse" or "pearson" when y is a numeric response variable, "accuracy" or "kappa_coeff" when y is a categorical response variable (including binary) and "f1_score", "roc_auc" or "pr_auc" when y is a binary response variable. NULL by default, which uses "mse" for numeric variables and "accuracy" for categorical variables.

tune_grid_proportion

(numeric(1)) Only when tune_type is "Grid_search", a number between (0, 1] to specify the proportion of hyperparameters combinations to sample from the grid and evaluate in tuning (useful when the grid is big). 1 by default (full grid).
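For instance, the following call (a sketch; x and y as in the examples below) would sample and evaluate a random half of the 9 combinations in the 3 x 3 grid:

model <- generalized_boosted_machine(
  x,
  y,
  trees_number = c(100, 300, 500),
  shrinkage = c(0.01, 0.05, 0.1),
  tune_grid_proportion = 0.5
)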

tune_bayes_samples_number

(numeric(1)) Only when tune_type is "Bayesian_optimization", the number of initial random hyperparameters combinations to evalute before the Bayesian optimization process. 10 by default.

tune_bayes_iterations_number

(numeric(1)) Only when tune_type is "Bayesian_optimization", the number of optimization iterations to evaluate after the initial random samples specified in tune_bayes_samples_number. 10 by default.

predictors_relationship

(numeric) Also known as var.monotone. A vector with the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome. NULL by default.
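For example, with three predictors the following (illustrative) call constrains the first predictor to a monotone increasing relationship, the second to a decreasing one and leaves the third unconstrained:

model <- generalized_boosted_machine(
  x,
  y,
  predictors_relationship = c(1, -1, 0)
)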

validate_params

(logical(1)) Should the parameters be validated? It is not recommended to set this parameter to FALSE because, if something fails, a non-meaningful error will be thrown. TRUE by default.

seed

(numeric(1)) A value to be used as internal seed for reproducible results. NULL by default.

verbose

(logical(1)) Should the progress information be printed? TRUE by default.

Details

Consider that, before tuning, all columns without variance (where all the records have the same value) are removed. The positions of such columns are returned in the removed_x_cols field of the returned object.

All records with missing values (NA), either in x or in y, will be removed. The positions of the removed records are returned in the removed_rows field of the returned object.
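After fitting, you can inspect both fields directly, for example:

model <- generalized_boosted_machine(x, y)
model$removed_rows   # indices of records removed because of NA values
model$removed_x_cols # indices of zero-variance columns removed before tuning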

Tuning

The general tuning algorithm works as follows:

[Figure: Tuning algorithm]

For grid search tuning, the hyperparameter grid is generated (step one in the algorithm) as the cartesian product of all the provided values (all the possible combinations) of all tunable parameters. If only one value of each tunable parameter is provided, no tuning is done. tune_grid_proportion allows you to specify the proportion of combinations you want to sample from the full grid and tune; by default all combinations are evaluated.
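Conceptually, the generated grid corresponds to what base R's expand.grid() would produce over the provided values (illustrative only; the function builds the grid internally):

grid <- expand.grid(
  trees_number = c(100, 200, 300),
  shrinkage = c(0.1, 0.3)
)
nrow(grid) # 6 combinations to evaluate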

For Bayesian optimization tuning, step one in the algorithm works a little differently. At the start, tune_bayes_samples_number different hyperparameter combinations are generated and evaluated; then tune_bayes_iterations_number new hyperparameter combinations are generated and evaluated iteratively based on the Bayesian optimization algorithm, but the overall process is equivalent to that described in the general tuning algorithm. Note that only the hyperparameters for which a list of min and max values was provided are tuned, and their values fall within the specified boundaries.

In univariate models with a numeric response variable, Mean Squared Error (mse()) is used by default as loss function. In univariate models with a categorical response variable, either binary or with more than two categories, accuracy (accuracy()) is used by default. You can change the default loss function used in tuning with the tune_loss_function parameter.
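For example (a sketch; a numeric y as in the examples below), you could tune with RMSE instead of the default MSE:

model <- generalized_boosted_machine(
  x,
  y,
  trees_number = c(100, 200),
  tune_loss_function = "rmse"
)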

Value

An object of class "GeneralizedBoostedMachineModel" that inherits from classes "Model" and "R6" with the fields:

  • fitted_model: An object of class gbm (as returned by gbm::gbm.fit()) with the fitted model.

  • x: The final matrix used to fit the model.

  • y: The final vector used to fit the model.

  • hyperparams_grid: A data.frame with all the evaluated combinations of hyperparameters and one additional column called "loss" with the value of the loss function for each combination. The data is ordered with the best combinations first, sometimes with the lowest values first and other times with the greatest values first, depending on the loss function.

  • best_hyperparams: A list with the combination of hyperparameters with the best loss value (the first row in hyperparams_grid).

  • execution_time: A difftime object with the total time taken to tune and fit the model.

  • removed_rows: A numeric vector with the records' indices (as originally provided) that were deleted and not taken into account in tuning nor training.

  • removed_x_cols: A numeric vector with the columns' indices (as originally provided) that were deleted and not taken into account in tuning nor training.

  • ...: Some other parameters for internal use.

See Also

predict.Model()

Other models: bayesian_model(), deep_learning(), generalized_linear_model(), mixed_model(), partial_least_squares(), random_forest(), support_vector_machine()

Examples

# Use all default hyperparameters (no tuning) -------------------------------
x <- to_matrix(iris[, -5])
y <- iris$Species
model <- generalized_boosted_machine(x, y)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted
# Obtain the predicted probabilities
predictions$probabilities

# Tune with grid search -----------------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- generalized_boosted_machine(
  x,
  y,
  trees_number = c(100, 200, 300),
  node_size = c(1, 2),
  shrinkage = c(0.1, 0.3),
  tune_type = "grid_search",
  tune_cv_type = "k_fold",
  tune_folds_number = 5
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Tune with Bayesian optimization -------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- generalized_boosted_machine(
  x,
  y,
  trees_number = list(min = 100, max = 500),
  node_size = list(min = 1, max = 10),
  tune_type = "bayesian_optimization",
  tune_bayes_samples_number = 5,
  tune_bayes_iterations_number = 5,
  tune_cv_type = "random",
  tune_folds_number = 4
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Obtain the execution time taken to tune and fit the model
model$execution_time

# Genomic selection ------------------------------------------------------------
data(Maize)

# Data preparation of G
Line <- model.matrix(~ 0 + Line, data = Maize$Pheno)
# Compute cholesky
Geno <- cholesky(Maize$Geno)
# G matrix
X <- Line %*% Geno
y <- Maize$Pheno$Y

# Set seed for reproducible results
set.seed(2022)
folds <- cv_kfold(records_number = nrow(X), k = 5)

Predictions <- data.frame()
Hyperparams <- data.frame()

# Model training and predictions
for (i in seq_along(folds)) {
  cat("*** Fold:", i, "***\n")
  fold <- folds[[i]]

  # Identify the training and testing sets
  X_training <- X[fold$training, ]
  X_testing <- X[fold$testing, ]
  y_training <- y[fold$training]
  y_testing <- y[fold$testing]

  # Model training
  model <- generalized_boosted_machine(
    x = X_training,
    y = y_training,

    # Specify the hyperparameters
    trees_number = c(30, 50, 80),
    node_size = c(5, 10, 15),
    max_depth = 5,
    shrinkage = c(0.001, 0.01, 0.1),
    tune_type = "grid_search",
    tune_cv_type = "k_fold",
    tune_folds_number = 5,

    # In this example the iterations won't be shown
    verbose = FALSE
  )

  # Prediction of testing set
  predictions <- predict(model, X_testing)

  # Predictions for the i-th fold
  FoldPredictions <- data.frame(
    Fold = i,
    Line = Maize$Pheno$Line[fold$testing],
    Env = Maize$Pheno$Env[fold$testing],
    Observed = y_testing,
    Predicted = predictions$predicted
  )
  Predictions <- rbind(Predictions, FoldPredictions)

  # Hyperparams
  HyperparamsFold <- model$hyperparams_grid %>%
    mutate(Fold = i)
  Hyperparams <- rbind(Hyperparams, HyperparamsFold)

  # Best hyperparams of the model
  cat("*** Optimal hyperparameters: ***\n")
  print(model$best_hyperparams)
}

head(Predictions)
# Compute the summary of all predictions
summaries <- gs_summaries(Predictions)

# Summaries by Line
head(summaries$line)

# Summaries by Environment
summaries$env

# Summaries by Fold
summaries$fold

# First rows of Hyperparams
head(Hyperparams)
# Last rows of Hyperparams
tail(Hyperparams)
