generalized_linear_model: Fit a Penalized Generalized Linear Model

View source: R/generalized_linear_model.R

Description

generalized_linear_model() is a wrapper of the glmnet::glmnet() function to fit a generalized linear model with the ability to tune the hyperparameters with grid search or Bayesian optimization in a simple way. You can fit univariate models for continuous, count, binary and categorical response variables, and multivariate models for numeric responses only.

All the parameters marked as (tunable) accept a vector of values with which the grid is generated for grid search tuning, or a list with the min and max values for Bayesian optimization tuning, as sketched below. The returned object contains a data.frame with the hyperparameter combinations evaluated. In the end, the best combination of hyperparameters is used to fit the final model, which is also returned and can be used to make new predictions.
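For instance, a minimal sketch of the two ways of specifying a tunable parameter such as alpha (assuming x and y are already prepared, as in the Examples section below):

# Grid search: a vector with the values to evaluate
model <- generalized_linear_model(x, y, alpha = c(0, 0.5, 1))

# Bayesian optimization: a list with the min and max bounds
model <- generalized_linear_model(
  x,
  y,
  alpha = list(min = 0, max = 1),
  tune_type = "bayesian_optimization"
)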

Usage

generalized_linear_model(
  x,
  y,
  alpha = 1,
  tune_type = "Grid_search",
  tune_folds_number = 5,
  tune_folds = NULL,
  tune_loss_function = NULL,
  tune_grid_proportion = 1,
  tune_bayes_samples_number = 10,
  tune_bayes_iterations_number = 10,
  lambdas_number = 100,
  records_weights = NULL,
  standardize = TRUE,
  fit_intercept = TRUE,
  validate_params = TRUE,
  seed = NULL,
  verbose = TRUE
)

Arguments

x

(matrix) The predictor (independent) variable(s). It must be a numeric matrix. You can use the to_matrix() function to convert your data to a matrix.

y

(data.frame | vector | matrix) The response (dependent) variable(s). If it is a data.frame or a matrix with 2 or more columns, a multivariate model is assumed; a univariate model otherwise. In univariate models, if y is character, logical or factor, a categorical response is assumed. When the response is categorical with only two classes, a logistic regression is assumed; with more than two classes, a multinomial regression. When the response variable is numeric with only integer values greater than or equal to zero, a poisson regression is assumed; a linear (gaussian) regression otherwise. In multivariate models all responses are coerced to numeric and a multi-response gaussian regression is assumed.
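For illustration, a sketch of how the type of y determines the assumed model (using the iris data, as in the Examples section; the count response is simulated only for this example):

x <- to_matrix(iris[, -5])

# Factor with more than two classes: multinomial regression
model <- generalized_linear_model(x, iris$Species)

# Non-negative integer values: poisson regression (simulated counts)
y_counts <- rpois(nrow(x), lambda = 3)
model <- generalized_linear_model(x, y_counts)

# Numeric response: linear (gaussian) regression
model <- generalized_linear_model(to_matrix(iris[, -1]), iris$Sepal.Length)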

alpha

(numeric) (tunable) The elasticnet mixing parameter, with 0 <= alpha <= 1. The penalty is defined as:

(1 - \alpha)/2 ||\beta||_2^2 + \alpha ||\beta||_1

alpha = 1 is the lasso penalty, alpha = 0 is the ridge penalty and 0 < alpha < 1 is the elasticnet penalty. 1 by default.
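For example, a sketch of selecting each penalty with a fixed alpha (no tuning; x and y as in the Examples section):

# Pure ridge penalty (L2 only)
ridge_model <- generalized_linear_model(x, y, alpha = 0)
# Pure lasso penalty (L1 only), the default
lasso_model <- generalized_linear_model(x, y, alpha = 1)
# Elasticnet: a mix of both penalties
elasticnet_model <- generalized_linear_model(x, y, alpha = 0.5)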

tune_type

(character(1)) (case insensitive) The type of tuning to perform. The options are "Grid_search" and "Bayesian_optimization". "Grid_search" by default.

tune_folds_number

(numeric(1)) The number of folds to tune each hyperparameter combination (k in k-fold cross validation). 5 by default.

tune_folds

(list) Custom folds for tuning. It must be a list of lists where each entry represents a fold. Each inner list has to contain the fields "training" and "testing" with numeric vectors of the indices of the records to be used as training and testing in that fold. Note that when this parameter is set, tune_folds_number is ignored. NULL by default.
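A minimal sketch of the expected structure (the indices are illustrative; x and y as in the Examples section):

folds <- list(
  list(training = 1:100, testing = 101:150),
  list(training = 51:150, testing = 1:50)
)
model <- generalized_linear_model(x, y, tune_folds = folds)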

tune_loss_function

(character(1)) (case insensitive) The loss function to use in tuning. The options are "mse", "maape", "mae", "nrmse", "rmse" or "pearson" when y is a numeric response variable, "accuracy" or "kappa_coeff" when y is a categorical response variable (including binary) and "f1_score", "roc_auc" or "pr_auc" when y is a binary response variable. NULL by default, which uses "mse" for numeric variables and "accuracy" for categorical variables.
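For example, to tune a numeric response with Root Mean Squared Error instead of the default "mse" (x and y as in the Examples section):

model <- generalized_linear_model(
  x,
  y,
  alpha = c(0, 0.5, 1),
  tune_loss_function = "rmse"
)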

tune_grid_proportion

(numeric(1)) Only when tune_type is "Grid_search", a number between (0, 1] to specify the proportion of hyperparameter combinations to sample from the grid and evaluate in tuning (useful when the grid is big). 1 by default (full grid).

tune_bayes_samples_number

(numeric(1)) Only when tune_type is "Bayesian_optimization", the number of initial random hyperparameter combinations to evaluate before the Bayesian optimization process. 10 by default.

tune_bayes_iterations_number

(numeric(1)) Only when tune_type is "Bayesian_optimization", the number of optimization iterations to evaluate after the initial random samples specified in tune_bayes_samples_number. 10 by default.

lambdas_number

(numeric(1)) The number of lambda values to be generated and evaluated in tuning. 100 by default.

records_weights

(numeric) Observation weights. NULL by default (1 for each observation).
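For example, a sketch that gives double weight to the first 10 observations (an arbitrary choice, for illustration only; x and y as in the Examples section):

weights <- rep(1, nrow(x))
weights[1:10] <- 2
model <- generalized_linear_model(x, y, records_weights = weights)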

standardize

(logical(1)) Should the x variables be standardized? The coefficients are always returned on the original scale. If variables are in the same units already, you might not wish to standardize. TRUE by default.

fit_intercept

(logical(1)) Should intercept be fitted? TRUE by default.

validate_params

(logical(1)) Should the parameters be validated? It is not recommended to set this parameter to FALSE because if something fails, a non-meaningful error may be thrown. TRUE by default.

seed

(numeric(1)) A value to be used as internal seed for reproducible results. NULL by default.

verbose

(logical(1)) Should the progress information be printed? TRUE by default.

Details

You have to consider that, before tuning, all columns without variance (where all the records have the same value) are removed. The positions of such columns are returned in the removed_x_cols field of the returned object.

All records with missing values (NA), either in x or in y, will be removed. The positions of the removed records are returned in the removed_rows field of the returned object.
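After fitting, these fields can be inspected directly (x and y as in the Examples section):

model <- generalized_linear_model(x, y)

# Positions of the zero-variance columns removed from x
model$removed_x_cols
# Positions of the records removed because of missing values
model$removed_rows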

Tuning

The general tuning algorithm works as follows:

[Figure: Tuning algorithm]

For grid search tuning, the hyperparameters grid is generated (step one in the algorithm) as the Cartesian product of all the provided values (all the possible combinations) of all tunable parameters. If only one value of each tunable parameter is provided, no tuning is done. tune_grid_proportion allows you to specify the proportion of all the combinations you want to sample from the full grid and tune; by default all combinations are evaluated. A sketch of the grid construction is shown below.
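Conceptually, the grid is equivalent to what base R's expand.grid() produces; a sketch with a second, purely hypothetical tunable parameter (in this function only alpha is tunable, since the lambda values are generated internally):

grid <- expand.grid(alpha = c(0, 0.3, 0.6, 1), other_param = c(10, 20))
nrow(grid) # 8 combinations, the Cartesian product
# With tune_grid_proportion = 0.5, a random half of these rows would be evaluated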

For Bayesian optimization tuning, step one in the algorithm works a little differently. At the start, tune_bayes_samples_number different hyperparameter combinations are generated and evaluated; then tune_bayes_iterations_number new hyperparameter combinations are generated and evaluated iteratively based on the Bayesian optimization algorithm, but this process is equivalent to that described in the general tuning algorithm. Note that only the hyperparameters for which a list of min and max values was provided are tuned, and their values fall within the specified boundaries.

For efficiency, tuning is done using the glmnet::cv.glmnet() function, which uses k-fold cross validation, so the tune_cv_type and tune_testing_proportion parameters are not included in this function.

In univariate models with a numeric response variable, Mean Squared Error (mse()) is used by default as the loss function. In univariate models with a categorical response variable, either binary or with more than two categories, accuracy (accuracy()) is used by default. You can change the default loss function used in tuning with the tune_loss_function parameter.

Value

An object of class "GeneralizedLinearModel" that inherits from classes "Model" and "R6" with the fields:

  • fitted_model: The fitted model, as returned by glmnet::glmnet().

  • x: The final matrix used to fit the model.

  • y: The final vector or matrix used to fit the model.

  • hyperparams_grid: A data.frame with all the evaluated combinations of hyperparameters and one more column called "loss" with the value of the loss function for each combination. The data is ordered with the best combinations first; depending on the loss function, that can be the lowest values first or the greatest values first.

  • best_hyperparams: A list with the combination of hyperparameters with the best loss value (the first row in hyperparams_grid).

  • execution_time: A difftime object with the total time taken to tune and fit the model.

  • removed_rows: A numeric vector with the records' indices (in the provided positions) that were deleted and not taken into account in tuning nor training.

  • removed_x_cols: A numeric vector with the columns' indices (in the provided positions) that were deleted and not taken into account in tuning nor training.

  • ...: Some other parameters for internal use.

See Also

predict.Model(), coef.Model()

Other models: bayesian_model(), deep_learning(), generalized_boosted_machine(), mixed_model(), partial_least_squares(), random_forest(), support_vector_machine()

Examples

# Use all default hyperparameters (no tuning) -------------------------------
x <- to_matrix(iris[, -5])
y <- iris$Species
model <- generalized_linear_model(x, y)

# Obtain the variables importance
coef(model)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted
# Obtain the predicted probabilities
predictions$probabilities

# Tune with grid search -----------------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- generalized_linear_model(
  x,
  y,
  alpha = c(0, 0.3, 0.6, 1),
  tune_type = "grid_search",
  tune_folds_number = 5
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Tune with Bayesian optimization -------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- generalized_linear_model(
  x,
  y,
  alpha = list(min = 0, max = 1),
  tune_type = "bayesian_optimization",
  tune_bayes_samples_number = 5,
  tune_bayes_iterations_number = 5
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Obtain the variables importance
coef(model)

# Obtain the execution time taken to tune and fit the model
model$execution_time

# Multivariate analysis -----------------------------------------------------
x <- to_matrix(iris[, -c(1, 2)])
y <- iris[, c(1, 2)]
model <- generalized_linear_model(x, y, alpha = 1)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values of the first response variable
predictions$Sepal.Length$predicted
# Obtain the predicted values of the second response variable
predictions$Sepal.Width$predicted

# Obtain the predictions in a data.frame not in a list
predictions <- predict(model, x, format = "data.frame")
head(predictions)

# Genomic selection ------------------------------------------------------------
data(Wheat)

# Data preparation of G
Line <- model.matrix(~ 0 + Line, data = Wheat$Pheno)
# Compute cholesky
Geno <- cholesky(Wheat$Geno)
# G matrix
X <- Line %*% Geno
y <- Wheat$Pheno$Y

# Set seed for reproducible results
set.seed(2022)
folds <- cv_random(
  records_number = nrow(X),
  folds_number = 5,
  testing_proportion = 0.3
)

Predictions <- data.frame()
Hyperparams <- data.frame()

# Model training and predictions
for (i in seq_along(folds)) {
  cat("*** Fold:", i, "***\n")
  fold <- folds[[i]]

  # Identify the training and testing sets
  X_training <- X[fold$training, ]
  X_testing <- X[fold$testing, ]
  y_training <- y[fold$training]
  y_testing <- y[fold$testing]

  # Model training
  model <- generalized_linear_model(
    x = X_training,
    y = y_training,

    # Specify the hyperparameters
    alpha = c(0, 0.25, 0.50, 0.75, 1),
    lambdas_number = 100,
    tune_type = "grid_search"
  )

  # Prediction of testing set
  predictions <- predict(model, X_testing)

  # Predictions for the i-th fold
  FoldPredictions <- data.frame(
    Fold = i,
    Line = Wheat$Pheno$Line[fold$testing],
    Env = Wheat$Pheno$Env[fold$testing],
    Observed = y_testing,
    Predicted = predictions$predicted
  )
  Predictions <- rbind(Predictions, FoldPredictions)

  # Hyperparams
  HyperparamsFold <- model$hyperparams_grid
  HyperparamsFold$Fold <- i
  Hyperparams <- rbind(Hyperparams, HyperparamsFold)

  # Best hyperparams of the model
  cat("*** Optimal hyperparameters: ***\n")
  print(model$best_hyperparams)
}

head(Predictions)
# Compute the summary of all predictions
summaries <- gs_summaries(Predictions)

# Summaries by Line
head(summaries$line)

# Summaries by Environment
summaries$env

# Summaries by Fold
summaries$fold

# First rows of Hyperparams
head(Hyperparams)
# Last rows of Hyperparams
tail(Hyperparams)
