random_forest: Fit a Random Forest Model
In brandon-mosqueda/SKM: Sparse Kernels Methods

Description

random_forest() is a wrapper around the randomForestSRC::rfsrc() function that fits a random forest model and can tune its hyperparameters with grid search or Bayesian optimization in a simple way. You can fit univariate or multivariate models for numeric and/or categorical response variables.

All the parameters marked as (tunable) accept either a vector of values, from which the grid is generated for grid search tuning, or a list with the min and max values for Bayesian optimization tuning. The returned object contains a data.frame with the hyperparameter combinations evaluated. At the end, the best combination of hyperparameters is used to fit the final model, which is also returned and can be used to make new predictions.

Usage

random_forest(
  x,
  y,
  trees_number = 500,
  node_size = 5,
  node_depth = NULL,
  sampled_x_vars_number = NULL,
  tune_type = "Grid_search",
  tune_cv_type = "K_fold",
  tune_folds_number = 5,
  tune_testing_proportion = 0.2,
  tune_folds = NULL,
  tune_loss_function = NULL,
  tune_grid_proportion = 1,
  tune_bayes_samples_number = 10,
  tune_bayes_iterations_number = 10,
  split_rule = NULL,
  splits_number = 10,
  x_vars_weights = NULL,
  records_weights = NULL,
  na_action = "omit",
  validate_params = TRUE,
  seed = NULL,
  verbose = TRUE
)

Arguments

`x`	(`matrix`) The predictor (independent) variable(s). It must be a numeric matrix. You can use the `to_matrix()` function to convert your data to a `matrix`.
`y`	(`data.frame` \| `vector` \| `matrix`) The response (dependent) variable(s). If it is a `data.frame` or a `matrix` with 2 or more columns, a multivariate model is assumed, a univariate model otherwise. If `y` is (or contains columns of type) `character`, `logical` or `factor`, a categorical variable is assumed, numeric otherwise. In multivariate models categorical and numeric variables can be combined for a mixed model.
`trees_number`	(`numeric`) (tunable) Number of trees. 500 by default.
`node_size`	(`numeric`) (tunable) Minimum size of terminal nodes. 5 by default.
`node_depth`	(`numeric`) (tunable) Maximum depth to which a tree should be grown. `NULL` (ignored) by default.
`sampled_x_vars_number`	(`numeric`) (tunable) Also known as `mtry`. Number of variables randomly selected as candidates for splitting a node. You can specify values between (0, 1] with the proportion of variables in `x` or directly the number of variables to use or a combination of both. `NULL` by default, which uses `p / 3` with numeric response variables or `sqrt(p)` otherwise, where `p` is the number of variables in `x`.
`tune_type`	(`character(1)`) (case not sensitive) The type of tuning to perform. The options are `"Grid_search"` and `"Bayesian_optimization"`. `"Grid_search"` by default.
`tune_cv_type`	(`character(1)`) (case not sensitive) The type of cross validation to tune the model. The options are `"K_fold"`, `"K_fold_strata"` (only for univariate categorical response variables) and `"Random"`. `"K_fold"` by default.
`tune_folds_number`	(`numeric(1)`) The number of folds to tune each hyperparameter combination (k in k-fold cross validation). 5 by default.
`tune_testing_proportion`	(`numeric(1)`) A number between (0, 1) to specify the proportion of records to use as validation set when `tune_cv_type` is `"Random"`. 0.2 by default.
`tune_folds`	(`list`) Custom folds for tuning. It must be a `list` of `list`'s where each entry will represent a fold. Each inner `list` has to contain the fields `"training"` and `"testing"` with numeric vectors of indices of those entries to be used as training and testing in each fold. Note that when this parameter is set, `tune_cv_type`, `tune_folds_number` and `tune_testing_proportion` are ignored. `NULL` by default.
`tune_loss_function`	(`character(1)`) (case not sensitive) The loss function to use in tuning. The options are `"mse"`, `"maape"`, `"mae"`, `"nrmse"`, `"rmse"` or `"pearson"` when `y` is a numeric response variable, `"accuracy"` or `"kappa_coeff"` when `y` is a categorical response variable (including binary) and `"f1_score"`, `"roc_auc"` or `"pr_auc"` when `y` is a binary response variable. `NULL` by default which uses `"mse"` for numeric variables and `"accuracy"` for categorical variables.
`tune_grid_proportion`	(`numeric(1)`) Only when `tune_type` is `"Grid_search"`, a number between (0, 1] to specify the proportion of hyperparameters combinations to sample from the grid and evaluate in tuning (useful when the grid is big). 1 by default (full grid).
`tune_bayes_samples_number`	(`numeric(1)`) Only when `tune_type` is `"Bayesian_optimization"`, the number of initial random hyperparameter combinations to evaluate before the Bayesian optimization process. 10 by default.
`tune_bayes_iterations_number`	(`numeric(1)`) Only when `tune_type` is `"Bayesian_optimization"`, the number of optimization iterations to evaluate after the initial random samples specified in `tune_bayes_samples_number`. 10 by default.
`split_rule`	(`character(1)`) (case not sensitive) Splitting rule. The available options are `"mse"`, `"gini"`, `"auc"` and `"entropy"`. `NULL` by default, which selects the best rule depending on the type of response variable. For more information, see the Details section below.
`splits_number`	(`numeric(1)`) Non-negative integer value for number of random splits to consider for each candidate splitting variable. 10 by default.
`x_vars_weights`	(`numeric`) Vector of non-negative weights (does not have to sum to 1) representing the probability of selecting a variable for splitting. `NULL` by default (uniform weights).
`records_weights`	(`numeric`) Vector of non-negative weights (does not have to sum to 1) for sampling cases. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples. `NULL` by default (uniform weights).
`na_action`	(`character(1)`) (case not sensitive) Action taken if the data contains `NA`'s. The available options are `"omit"` (remove all records with `NA`'s) and `"impute"` (impute missing values). `"omit"` by default.
`validate_params`	(`logical(1)`) Should the parameters be validated? It is not recommended to set this parameter to `FALSE` because, if something fails, an uninformative error will be thrown. `TRUE` by default.
`seed`	(`numeric(1)`) A value to be used as internal seed for reproducible results. `NULL` by default.
`verbose`	(`logical(1)`) Should the progress information be printed? `TRUE` by default.
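
As an illustration of the structure that `tune_folds` expects, the following base R sketch builds five custom folds by hand. It assumes that the indices refer to rows of `x`; the names `records_number`, `fold_ids` and `custom_folds` are hypothetical and only used for this example.

```r
# Build custom folds in the structure described for `tune_folds`:
# a list of lists, each with "training" and "testing" index vectors.
set.seed(1)
records_number <- 150 # assumed number of rows in x, e.g. nrow(iris)
folds_number <- 5

# Randomly assign each record to exactly one fold
fold_ids <- sample(rep(seq_len(folds_number), length.out = records_number))

custom_folds <- lapply(seq_len(folds_number), function(k) {
  list(
    training = which(fold_ids != k),
    testing = which(fold_ids == k)
  )
})

# Inspect the first fold: two integer vectors of row indices
str(custom_folds[[1]])
```

A list built this way could then be passed as `tune_folds = custom_folds`, in which case `tune_cv_type`, `tune_folds_number` and `tune_testing_proportion` are ignored, as noted above.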

Details

Tuning

The general tuning algorithm works as follows:

(Figure: Tuning algorithm)

For grid search tuning, the hyperparameters grid is generated (step one in the algorithm) as the Cartesian product of the values provided for all tunable parameters (all the possible combinations). If only one value is provided for each tunable parameter, no tuning is done. tune_grid_proportion allows you to specify the proportion of all combinations to sample from the full grid and evaluate; by default all combinations are evaluated.
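
The grid construction can be pictured with base R's `expand.grid()`. This is only a sketch of the idea, not the package's internal code: the variable names are hypothetical, and rounding the sampled size up with `ceiling()` is an assumption.

```r
# Candidate values for each tunable parameter
candidates <- list(
  trees_number = c(100, 200, 300),
  node_size = c(1, 2),
  node_depth = c(10, 15)
)

# Cartesian product: one row per hyperparameter combination (3 * 2 * 2 rows)
grid <- expand.grid(candidates)
nrow(grid)

# Emulate tune_grid_proportion = 0.5 by sampling half of the combinations
set.seed(1)
tune_grid_proportion <- 0.5
n_sampled <- ceiling(nrow(grid) * tune_grid_proportion) # rounding is assumed
sampled_grid <- grid[sample(nrow(grid), n_sampled), , drop = FALSE]
nrow(sampled_grid)
```

With larger grids, a proportion below 1 trades an exhaustive search for a shorter tuning time.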

For Bayesian optimization tuning, step one in the algorithm works a little differently. At the start, tune_bayes_samples_number different hyperparameter combinations are generated and evaluated; then tune_bayes_iterations_number new hyperparameter combinations are generated and evaluated iteratively based on the Bayesian optimization algorithm. Otherwise, the process is equivalent to that described in the general tuning algorithm. Note that only the hyperparameters for which a list of min and max values was provided are tuned, and their values fall within the specified boundaries.

In univariate models with a numeric response variable, Mean Squared Error (mse()) is used by default as the loss function. In univariate models with a categorical response variable, either binary or with more than two categories, accuracy (accuracy()) is used by default. You can change the default loss function used in tuning with the tune_loss_function parameter.
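
For reference, the two default losses can be written in a few lines of base R. The helpers below, `mse_sketch()` and `accuracy_sketch()`, are hypothetical names that use the textbook definitions; they are not the package's own mse() and accuracy() functions.

```r
# Mean Squared Error: average squared difference (lower is better)
mse_sketch <- function(observed, predicted) {
  mean((observed - predicted)^2)
}

# Accuracy: proportion of correctly predicted classes (higher is better)
accuracy_sketch <- function(observed, predicted) {
  mean(as.character(observed) == as.character(predicted))
}

# Squared errors are 0, 0 and 4, so the mean is 4 / 3
mse_sketch(c(1, 2, 3), c(1, 2, 5))

# Two of the three predictions match, so the accuracy is 2 / 3
accuracy_sketch(c("a", "b", "b"), c("a", "b", "a"))
```

Note the opposite directions: a smaller MSE is better, while a larger accuracy is better, which matters when ranking hyperparameter combinations.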

split_rule

"mse": Implements weighted Mean Squared Error splitting for numeric response variables.
"gini": Implements Gini index splitting for categorical response variables.
"auc": AUC (area under the ROC curve) splitting for both two-class and multiclass settings. AUC splitting is appropriate for imbalanced data.
"entropy": entropy splitting for categorical response variables.
Multivariate analysis: When the responses are numeric, categorical or a mix of both, a multivariate normalized composite split rule of Mean Squared Error and Gini index splitting is invoked.
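
To give some intuition for the "gini" and "entropy" rules, the sketch below computes the textbook Gini and entropy impurities of a set of class labels. It is an assumption-laden illustration with hypothetical helper names; it does not reproduce how randomForestSRC weights or evaluates candidate splits.

```r
# Proportion of each class among the labels in a node
class_proportions <- function(labels) {
  as.numeric(table(labels)) / length(labels)
}

# Gini impurity: 1 - sum(p_k^2); it is 0 for a pure node
gini_impurity <- function(labels) {
  p <- class_proportions(labels)
  1 - sum(p^2)
}

# Entropy impurity in bits: -sum(p_k * log2(p_k)); it is 0 for a pure node
entropy_impurity <- function(labels) {
  p <- class_proportions(labels)
  p <- p[p > 0] # drop empty classes to avoid 0 * log2(0)
  -sum(p * log2(p))
}

pure_node <- c("a", "a", "a", "a")
mixed_node <- c("a", "a", "b", "b")

gini_impurity(pure_node)
gini_impurity(mixed_node)
entropy_impurity(mixed_node)
```

Both measures are smallest for pure nodes and largest for evenly mixed ones, so a split that lowers them separates the classes better.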

Value

An object of class "RandomForestModel" that inherits from classes "Model" and "R6" with the fields:

fitted_model: The object returned by randomForestSRC::rfsrc() with the fitted model.
x: The final matrix used to fit the model.
y: The final vector or data.frame used to fit the model.
hyperparams_grid: A data.frame with all the computed combinations of hyperparameters and one more column called "loss" with the value of the loss function for each combination. The rows are ordered with the best combinations first, which means lowest values first or greatest values first, depending on the loss function.
best_hyperparams: A list with the combination of hyperparameters with the best loss value (the first row in hyperparams_grid).
execution_time: A difftime object with the total time taken to tune and fit the model.
removed_rows: A numeric vector with the records' indices (in the provided position) that were deleted and not taken into account in tuning or training.
removed_x_cols: A numeric vector with the columns' indices (in the provided positions) that were deleted and not taken into account in tuning or training.
...: Some other parameters for internal use.
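
The ordering of hyperparams_grid and its relation to best_hyperparams can be pictured with a small data.frame. The values below are made up for illustration, and the object names are hypothetical; the real grid is produced by the tuning process.

```r
# A made-up grid with the extra "loss" column
grid <- data.frame(
  trees_number = c(100, 200, 300),
  node_size = c(1, 2, 1),
  loss = c(0.30, 0.12, 0.21)
)

# Error-type losses such as "mse": lower is better, so sort ascending
by_error <- grid[order(grid$loss), ]

# Score-type losses such as "accuracy": higher is better, so sort descending
by_score <- grid[order(grid$loss, decreasing = TRUE), ]

# In either case the first row plays the role of best_hyperparams
best_hyperparams <- as.list(by_error[1, c("trees_number", "node_size")])
best_hyperparams
```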

Examples

# Use all default hyperparameters (no tuning) ----------------------------------
x <- to_matrix(iris[, -5])
y <- iris$Species
model <- random_forest(x, y)

# Obtain the variable importance
coef(model)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted
# Obtain the predicted probabilities
predictions$probabilities

# Tune with grid search --------------------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- random_forest(
  x,
  y,
  trees_number = c(100, 200, 300),
  node_size = c(1, 2),
  node_depth = c(10, 15),
  tune_type = "grid_search",
  tune_cv_type = "k_fold",
  tune_folds_number = 5
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Tune with Bayesian optimization ----------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- random_forest(
  x,
  y,
  trees_number = list(min = 100, max = 500),
  node_size = list(min = 1, max = 10),
  tune_type = "bayesian_optimization",
  tune_bayes_samples_number = 5,
  tune_bayes_iterations_number = 5,
  tune_cv_type = "random",
  tune_folds_number = 4
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Obtain the variable importance
coef(model)

# Obtain the execution time taken to tune and fit the model
model$execution_time

# Multivariate analysis --------------------------------------------------------
x <- to_matrix(iris[, -c(1, 5)])
y <- iris[, c(1, 5)]
model <- random_forest(x, y, trees_number = 100)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values of the first response
predictions$Sepal.Length$predicted
# Obtain the predicted values and probabilities of the second response
predictions$Species$predicted
predictions$Species$probabilities

# Obtain the predictions as a data.frame instead of a list
predictions <- predict(model, x, format = "data.frame")
head(predictions)

# Genomic selection ------------------------------------------------------------
data(Wheat)

# Data preparation of G
Line <- model.matrix(~ 0 + Line, data = Wheat$Pheno)
# Compute cholesky
Geno <- cholesky(Wheat$Geno)
# G matrix
X <- Line %*% Geno
y <- Wheat$Pheno$Y

# Set seed for reproducible results
set.seed(2022)
folds <- cv_random(
  records_number = nrow(X),
  folds_number = 5,
  testing_proportion = 0.2
)

Predictions <- data.frame()
Hyperparams <- data.frame()

# Model training and predictions
for (i in seq_along(folds)) {
  cat("*** Fold:", i, "***\n")
  fold <- folds[[i]]

  # Identify the training and testing sets
  X_training <- X[fold$training, ]
  X_testing <- X[fold$testing, ]
  y_training <- y[fold$training]
  y_testing <- y[fold$testing]

  # Model training
  model <- random_forest(
    x = X_training,
    y = y_training,

    # Specify the hyperparameters for tuning
    trees_number = c(30, 50, 80),
    node_size = c(5, 10),
    tune_type = "grid_search"
  )

  # Prediction of testing set
  predictions <- predict(model, X_testing)

  # Predictions for the i-th fold
  FoldPredictions <- data.frame(
    Fold = i,
    Line = Wheat$Pheno$Line[fold$testing],
    Env = Wheat$Pheno$Env[fold$testing],
    Observed = y_testing,
    Predicted = predictions$predicted
  )
  Predictions <- rbind(Predictions, FoldPredictions)

  # Hyperparams of the i-th fold (base R, so no extra packages are needed)
  HyperparamsFold <- cbind(model$hyperparams_grid, Fold = i)
  Hyperparams <- rbind(Hyperparams, HyperparamsFold)

  # Best hyperparams of the model
  cat("*** Optimal hyperparameters: ***\n")
  print(model$best_hyperparams)
}

head(Predictions)
# Compute the summary of all predictions
summaries <- gs_summaries(Predictions)

# Summaries by Line
head(summaries$line)

# Summaries by Environment
summaries$env

# Summaries by Fold
summaries$fold

# First rows of Hyperparams
head(Hyperparams)
# Last rows of Hyperparams
tail(Hyperparams)

brandon-mosqueda/SKM documentation built on Feb. 8, 2025, 5:24 p.m.