View source: R/generalized_linear_model.R
generalized_linear_model | R Documentation |
generalized_linear_model() is a wrapper of the glmnet::glmnet() function that fits a generalized linear model and can tune its hyperparameters with grid search or Bayesian optimization in a simple way. You can fit univariate models for continuous, count, binary and categorical response variables, and multivariate models for numeric responses only.
All the parameters marked as (tunable) accept either a vector of values, from which the grid is generated for grid search tuning, or a list with the min and max values for Bayesian optimization tuning. The returned object contains a data.frame with the hyperparameter combinations evaluated. At the end, the best combination of hyperparameters is used to fit the final model, which is also returned and can be used to make new predictions.
generalized_linear_model(
  x,
  y,
  alpha = 1,
  tune_type = "Grid_search",
  tune_folds_number = 5,
  tune_folds = NULL,
  tune_loss_function = NULL,
  tune_grid_proportion = 1,
  tune_bayes_samples_number = 10,
  tune_bayes_iterations_number = 10,
  lambdas_number = 100,
  records_weights = NULL,
  standardize = TRUE,
  fit_intercept = TRUE,
  validate_params = TRUE,
  seed = NULL,
  verbose = TRUE
)
x |
( |
y |
( |
alpha |
(
|
tune_type |
( |
tune_folds_number |
( |
tune_folds |
( |
tune_loss_function |
( |
tune_grid_proportion |
( |
tune_bayes_samples_number |
( |
tune_bayes_iterations_number |
( |
lambdas_number |
( |
records_weights |
( |
standardize |
( |
fit_intercept |
( |
validate_params |
( |
seed |
( |
verbose |
( |
tune_cv_type |
( |
tune_testing_proportion |
( |
Before tuning, all columns without variance (where all the records have the same value) are removed. The positions of those columns are returned in the removed_x_cols field of the returned object.
All records with missing values (NA), either in x or in y, are removed. The positions of the removed records are returned in the removed_rows field of the returned object.
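The cleaning described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data, the helper variable names, and the order of the two steps (records first, then columns) are assumptions made for illustration only.

```r
# Toy predictors: column "b" is constant and row 3 has a missing value
x <- cbind(
  a = c(1, 2, NA, 4, 5),
  b = c(7, 7, 7, 7, 7),
  c = c(0.5, 1.5, 2.5, 3.5, 4.5)
)
# Toy response with a missing value in row 2
y <- c(10, NA, 30, 40, 50)

# Records with NA in x or in y (assumed to be dropped first)
keep_rows <- complete.cases(x) & !is.na(y)
removed_rows <- which(!keep_rows)
x_clean <- x[keep_rows, , drop = FALSE]
y_clean <- y[keep_rows]

# Columns whose remaining values are all identical (zero variance)
keep_cols <- apply(x_clean, 2, var) > 0
removed_x_cols <- unname(which(!keep_cols))
x_final <- x_clean[, keep_cols, drop = FALSE]

removed_rows
removed_x_cols
dim(x_final)
```

Using logical masks instead of negative indices keeps the sketch correct when nothing has to be removed.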
The general tuning algorithm works as follows:
For grid search tuning, the hyperparameters grid is generated (step one in the algorithm) as the Cartesian product of all the provided values (all the possible combinations) of all tunable parameters. If only one value is provided for each tunable parameter, no tuning is done. tune_grid_proportion allows you to specify the proportion of all combinations to sample from the full grid and tune; by default all combinations are evaluated.
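As a rough illustration of the grid construction, the following base R sketch builds a Cartesian product and samples a proportion of it. The second parameter (other_param) is hypothetical and only there to make the product visible, and the rounding rule used for the sample size is an assumption, not necessarily what the package does.

```r
set.seed(2022)

# Candidate values per tunable parameter; "other_param" is hypothetical
tunable_values <- list(
  alpha = c(0, 0.3, 0.6, 1),
  other_param = c("a", "b", "c")
)

# Cartesian product: one row per combination (4 * 3 = 12 rows)
full_grid <- expand.grid(tunable_values, stringsAsFactors = FALSE)

# Keep only a proportion of the combinations (rounding up is an assumption)
tune_grid_proportion <- 0.5
sample_size <- max(1, ceiling(nrow(full_grid) * tune_grid_proportion))
sampled_rows <- sample(nrow(full_grid), sample_size)
sampled_grid <- full_grid[sampled_rows, , drop = FALSE]

nrow(full_grid)
nrow(sampled_grid)
```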
For Bayesian optimization tuning, step one in the algorithm works a little differently. At the start, tune_bayes_samples_number different hyperparameter combinations are generated and evaluated; then tune_bayes_iterations_number new hyperparameter combinations are generated and evaluated iteratively based on the Bayesian optimization algorithm. Otherwise the process is equivalent to that described in the general tuning algorithm. Note that only the hyperparameters for which a list of min and max values was provided are tuned, and their values fall within the specified boundaries.
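The two-phase evaluation budget can be sketched in base R as follows. This is not Bayesian optimization: the proposal step is replaced by a naive perturbation of the best point found so far, purely to show how the initial samples, the iterations and the min/max bounds fit together. The toy loss function and all variable names are assumptions.

```r
set.seed(1)

bounds <- list(min = 0, max = 1)   # same shape as alpha = list(min = 0, max = 1)
samples_number <- 5                # plays the role of tune_bayes_samples_number
iterations_number <- 5             # plays the role of tune_bayes_iterations_number

# Toy loss standing in for the cross-validated loss of a fitted model
toy_loss <- function(alpha) (alpha - 0.3)^2

# Phase 1: initial combinations drawn inside the bounds
evaluated <- data.frame(alpha = runif(samples_number, bounds$min, bounds$max))
evaluated$loss <- toy_loss(evaluated$alpha)

# Phase 2: one new combination per iteration, proposed from past results.
# A real Bayesian optimizer would use a surrogate model here; this stand-in
# only perturbs the current best point and clamps it to the bounds.
for (i in seq_len(iterations_number)) {
  best_alpha <- evaluated$alpha[which.min(evaluated$loss)]
  candidate <- best_alpha + rnorm(1, sd = 0.1)
  candidate <- min(max(candidate, bounds$min), bounds$max)
  evaluated <- rbind(
    evaluated,
    data.frame(alpha = candidate, loss = toy_loss(candidate))
  )
}

# Total evaluations = initial samples + iterations
nrow(evaluated)
```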
For efficiency, tuning is done using the glmnet::cv.glmnet() function, which uses k-fold cross validation, so the tune_cv_type and tune_testing_proportion parameters are not included in this function.
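To give an idea of what tuning alpha through glmnet::cv.glmnet() can look like, here is a minimal sketch that calls glmnet directly. It is an assumption about the general approach, not the wrapper's actual code: the choice of predictors, the shared fold assignment and the way the best lambda is picked are illustrative.

```r
library(glmnet)

set.seed(2022)
x <- as.matrix(iris[, 2:4])
y <- iris$Sepal.Length

alphas <- c(0, 0.5, 1)
folds_number <- 5

# Reuse one fold assignment so every alpha is compared on the same folds
fold_ids <- sample(rep(seq_len(folds_number), length.out = nrow(x)))

# For each alpha, cross-validate the whole lambda path and keep the best point
results <- do.call(rbind, lapply(alphas, function(alpha) {
  cv_fit <- cv.glmnet(
    x, y,
    alpha = alpha,
    nlambda = 100,
    foldid = fold_ids,
    type.measure = "mse"
  )
  best_index <- which.min(cv_fit$cvm)
  data.frame(
    alpha = alpha,
    lambda = cv_fit$lambda[best_index],
    loss = cv_fit$cvm[best_index]
  )
}))

# Lowest cross-validated MSE first
results <- results[order(results$loss), ]
results
```

Because cv.glmnet() evaluates the full lambda path in one call, only alpha needs an outer loop, which is likely the efficiency gain the paragraph above refers to.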
In univariate models with a numeric response variable, Mean Squared Error (mse()) is used by default as the loss function. In univariate models with a categorical response variable, either binary or with more than two categories, accuracy (accuracy()) is used by default. You can change the default loss function used in tuning with the tune_loss_function parameter.
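The two default metrics can be written in a few lines of base R. The helpers below are hypothetical stand-ins (mse_sketch and accuracy_sketch) that follow the textbook definitions; the package's own mse() and accuracy() may handle arguments, factors or missing values differently.

```r
# Mean squared error: average of the squared differences
mse_sketch <- function(observed, predicted) {
  mean((observed - predicted)^2)
}

# Accuracy: proportion of records whose predicted class matches the observed one
accuracy_sketch <- function(observed, predicted) {
  mean(as.character(observed) == as.character(predicted))
}

observed_numeric <- c(1, 2, 3)
predicted_numeric <- c(1, 2, 5)
mse_value <- mse_sketch(observed_numeric, predicted_numeric)  # (0 + 0 + 4) / 3

observed_classes <- factor(c("a", "b", "b", "c"))
predicted_classes <- factor(c("a", "b", "c", "c"))
accuracy_value <- accuracy_sketch(observed_classes, predicted_classes)  # 3 of 4 match
```

Note the opposite directions: a lower MSE is better, while a higher accuracy is better.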
An object of class "GeneralizedLinearModel" that inherits from classes "Model" and "R6" with the fields:
fitted_model: An object of class glmnet::glmnet() with the model.
x: The final matrix used to fit the model.
y: The final vector or matrix used to fit the model.
hyperparams_grid: A data.frame with all the computed combinations of hyperparameters and one more column called "loss" with the value of the loss function for each combination. The rows are ordered with the best combinations first, which means the lowest values first for some loss functions and the greatest values first for others.
best_hyperparams: A list with the combination of hyperparameters with the best loss value (the first row in hyperparams_grid).
execution_time: A difftime object with the total time taken to tune and fit the model.
removed_rows: A numeric vector with the records' indices (in the provided position) that were deleted and not taken into account in tuning or training.
removed_x_cols: A numeric vector with the columns' indices (in the provided positions) that were deleted and not taken into account in tuning or training.
...: Some other parameters for internal use.
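The relation between hyperparams_grid and best_hyperparams can be sketched in base R. The helper order_grid_sketch() and the loss values are made up for illustration; they only show how the ordering direction depends on whether a lower or a higher loss is better.

```r
# Hypothetical helper: put the best combinations first
order_grid_sketch <- function(grid, lower_is_better) {
  grid[order(grid$loss, decreasing = !lower_is_better), , drop = FALSE]
}

# Made-up MSE values: lower is better, so sort ascending
mse_grid <- data.frame(alpha = c(0, 0.5, 1), loss = c(0.42, 0.31, 0.35))
ordered_mse <- order_grid_sketch(mse_grid, lower_is_better = TRUE)

# Made-up accuracy values: higher is better, so sort descending
accuracy_grid <- data.frame(alpha = c(0, 0.5, 1), loss = c(0.90, 0.94, 0.88))
ordered_accuracy <- order_grid_sketch(accuracy_grid, lower_is_better = FALSE)

# The best combination is the first row, without the loss column
best_mse <- as.list(ordered_mse[1, "alpha", drop = FALSE])
best_accuracy <- as.list(ordered_accuracy[1, "alpha", drop = FALSE])
```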
predict.Model(), coef.Model()
Other models: bayesian_model(), deep_learning(), generalized_boosted_machine(), mixed_model(), partial_least_squares(), random_forest(), support_vector_machine()
# Use all default hyperparameters (no tuning) -------------------------------
x <- to_matrix(iris[, -5])
y <- iris$Species
model <- generalized_linear_model(x, y)
# Obtain the variables importance
coef(model)
# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted
# Obtain the predicted probabilities
predictions$probabilities
# Tune with grid search -----------------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- generalized_linear_model(
  x,
  y,
  alpha = c(0, 0.3, 0.6, 1),
  tune_type = "grid_search",
  tune_folds_number = 5
)
# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams
# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted
# Tune with Bayesian optimization -------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- generalized_linear_model(
  x,
  y,
  alpha = list(min = 0, max = 1),
  tune_type = "bayesian_optimization",
  tune_bayes_samples_number = 5,
  tune_bayes_iterations_number = 5
)
# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams
# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted
# Obtain the variables importance
coef(model)
# Obtain the execution time taken to tune and fit the model
model$execution_time
# Multivariate analysis -----------------------------------------------------
x <- to_matrix(iris[, -c(1, 2)])
y <- iris[, c(1, 2)]
model <- generalized_linear_model(x, y, alpha = 1)
# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values of the first response variable
predictions$Sepal.Length$predicted
# Obtain the predicted values of the second response variable
predictions$Sepal.Width$predicted
# Obtain the predictions in a data.frame not in a list
predictions <- predict(model, x, format = "data.frame")
head(predictions)
# Genomic selection ------------------------------------------------------------
# dplyr provides %>% and mutate(), used below to collect the hyperparameters
library(dplyr)
data(Wheat)
# Data preparation of G
Line <- model.matrix(~ 0 + Line, data = Wheat$Pheno)
# Compute cholesky
Geno <- cholesky(Wheat$Geno)
# G matrix
X <- Line %*% Geno
y <- Wheat$Pheno$Y
# Set seed for reproducible results
set.seed(2022)
folds <- cv_random(
  records_number = nrow(X),
  folds_number = 5,
  testing_proportion = 0.3
)
Predictions <- data.frame()
Hyperparams <- data.frame()
# Model training and predictions
for (i in seq_along(folds)) {
  cat("*** Fold:", i, "***\n")
  fold <- folds[[i]]

  # Identify the training and testing sets
  X_training <- X[fold$training, ]
  X_testing <- X[fold$testing, ]
  y_training <- y[fold$training]
  y_testing <- y[fold$testing]

  # Model training
  model <- generalized_linear_model(
    x = X_training,
    y = y_training,

    # Specify the hyperparameters
    alpha = c(0, 0.25, 0.50, 0.75, 1),
    lambdas_number = 100,
    tune_type = "grid_search"
  )

  # Prediction of testing set
  predictions <- predict(model, X_testing)

  # Predictions for the i-th fold
  FoldPredictions <- data.frame(
    Fold = i,
    Line = Wheat$Pheno$Line[fold$testing],
    Env = Wheat$Pheno$Env[fold$testing],
    Observed = y_testing,
    Predicted = predictions$predicted
  )
  Predictions <- rbind(Predictions, FoldPredictions)

  # Hyperparams
  HyperparamsFold <- model$hyperparams_grid %>%
    mutate(Fold = i)
  Hyperparams <- rbind(Hyperparams, HyperparamsFold)

  # Best hyperparams of the model
  cat("*** Optimal hyperparameters: ***\n")
  print(model$best_hyperparams)
}
head(Predictions)
# Compute the summary of all predictions
summaries <- gs_summaries(Predictions)
# Summaries by Line
head(summaries$line)
# Summaries by Environment
summaries$env
# Summaries by Fold
summaries$fold
# First rows of Hyperparams
head(Hyperparams)
# Last rows of Hyperparams
tail(Hyperparams)