deep_learning: Fit a Deep Learning Model

View source: R/deep_learning.R


Fit a Deep Learning Model

Description

deep_learning() is a wrapper of the keras::keras_model_sequential() function that fits a deep learning model and lets you tune the hyperparameters with grid search or Bayesian optimization in a simple way. You can fit univariate and multivariate models for numeric and/or categorical response variables.

All the parameters marked as (tunable) accept a vector of values with which the grid is generated for grid search tuning, or a list with the min and max values for Bayesian optimization tuning. The returned object contains a data.frame with the hyperparameter combinations evaluated. In the end the best combination of hyperparameters is used to fit the final model, which is also returned and can be used to make new predictions.

Usage

deep_learning(
  x,
  y,
  learning_rate = 0.001,
  epochs_number = 500,
  batch_size = 32,
  layers = list(list(neurons_number = 50, neurons_proportion = NULL, activation = "relu",
    dropout = 0, ridge_penalty = 0, lasso_penalty = 0)),
  output_penalties = list(ridge_penalty = 0, lasso_penalty = 0),
  tune_type = "Grid_search",
  tune_cv_type = "K_fold",
  tune_folds_number = 5,
  tune_testing_proportion = 0.2,
  tune_folds = NULL,
  tune_grid_proportion = 1,
  tune_bayes_samples_number = 10,
  tune_bayes_iterations_number = 10,
  optimizer = "adam",
  loss_function = NULL,
  with_platt_scaling = FALSE,
  platt_proportion = 0.3,
  shuffle = TRUE,
  early_stop = FALSE,
  early_stop_patience = 50,
  validate_params = TRUE,
  seed = NULL,
  verbose = TRUE
)

Arguments

x

(matrix) The predictor (independent) variable(s). It must be a numeric matrix. You can use the to_matrix() function to convert your data to a matrix.

y

(data.frame | vector | matrix) The response (dependent) variable(s). If it is a data.frame or a matrix with 2 or more columns, a multivariate model is assumed; otherwise a univariate model. In univariate models, if y is character, logical or factor, a categorical response is assumed. When the response is categorical with only two classes a binary response is assumed; with more than two classes, a categorical response. When the response variable is numeric with only integer values greater than or equal to zero, a count response is assumed; otherwise a continuous response. In multivariate models the response variables can be of different types, which yields mixed models.
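
For illustration, a few response vectors and the type of model these rules would lead deep_learning() to assume (values and names here are only illustrative):

y_continuous <- iris$Sepal.Length # numeric with decimals: continuous response
y_count <- rpois(150, lambda = 3) # non-negative integers: count response
y_binary <- factor(iris$Species == "setosa") # two classes: binary response
y_categorical <- iris$Species # three classes: categorical response
y_multivariate <- iris[, c("Sepal.Length", "Species")] # 2+ columns: multivariate (mixed) model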

learning_rate

(numeric) (tunable) This hyperparameter controls how much to change the model in response to the estimated error each time the model weights are updated. 0.001 by default.

epochs_number

(numeric) (tunable) The number of epochs used to train the model. An epoch is an arbitrary cutoff, generally defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation. 500 by default.

batch_size

(numeric) (tunable) A hyperparameter of gradient descent that controls the number of training samples to work through before the model's internal parameters are updated. 32 by default.

layers

(list) The hidden layers. It must be a list of lists where each entry represents a hidden layer in the neural network. Each inner list can contain the following fields, each accepting a vector of values:

  • "neurons_number": (numeric) (tunable) The number of neurons in that layer. 50 by default.

  • "neurons_proportion": (numeric) (tunable) Similar to "neurons_number" but the provided values will be the proportion specified times the number of columns in x, so a value of 1 means "use as many neurons as columns in x", 0.5 means use as neurons number the half of number of columns in x. This is combined with the values of "neurons_number" for tuning. NULL by default.

  • "activation": (character) (tunable) (case not sensitive) The name of the activation function to apply in this layer. The available activation functions are "linear", "relu", "elu", "selu", "hard_sigmoid", "sigmoid", "softmax", "softplus", "softsign", "tanh", "exponential". This hyperparameter can only be tuned with grid search tuning, with bayesian optimization a fixed value have to be provided. "relu" by default.

  • "dropout": (numeric) (tunable) The proportion of neurons randomly selected and set to 0 at each step during training process, which helps prevent overfitting. 0 by default.

  • "lasso_penalty": (numeric) (tunable) The regularization value between [0, 1] for penalizing that layer with Lasso (a.k.a L1) penalty. 0 by default (no penalty).

  • "ridge_penalty": (numeric) (tunable) The regularization value between [0, 1] for penalizing that layer with Ridge (a.k.a L2) penalty. Note that if both penalization params (Ridge and Lasso) are sent, the ElasticNet penalization is implemented, that is a combination of both of them. 0 by default (no penalty).

You can provide as many lists as you want, each of them representing a hidden layer, and you do not need to provide all the parameters. If a parameter is not provided, the default value described above is used. By default the following list is used:

list( list( neurons_number = 50, neurons_proportion = NULL, activation = "relu", dropout = 0, ridge_penalty = 0, lasso_penalty = 0 ) )
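
For instance, a sketch of a two-hidden-layer specification mixing fixed and tunable values (grid search style, with illustrative values); any field that is not provided falls back to the defaults described above:

layers <- list(
  # First hidden layer: tune the number of neurons and the dropout rate
  list(neurons_number = c(32, 64), dropout = c(0, 0.2), activation = "relu"),
  # Second hidden layer: fixed size with a light Ridge penalty
  list(neurons_number = 16, ridge_penalty = 0.1)
)
# model <- deep_learning(x, y, layers = layers)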

output_penalties

(list) The penalty values for the output layer. The list can contain the following two fields:

  • "lasso_penalty": (numeric) (tunable) The regularization value between [0, 1] for penalizing that layer with Lasso (a.k.a L1) penalty. 0 by default (no penalty).

  • "ridge_penalty": (numeric) (tunable) The regularization value between [0, 1] for penalizing that layer with Ridge (a.k.a L2) penalty. Note that if both penalization params (Ridge and Lasso) are sent, the ElasticNet penalization, is implemented that is a combination of both of them. 0 by default (no penalty).

You do not have to provide both values; if one of them is not provided, the default value is used. By default the following list is used:

list( ridge_penalty = 0, lasso_penalty = 0 )

tune_type

(character(1)) (case insensitive) The type of tuning to perform. The options are "Grid_search" and "Bayesian_optimization". "Grid_search" by default.

tune_cv_type

(character(1)) (case insensitive) The type of cross validation to use when tuning the model. The options are "K_fold", "K_fold_strata" (only for univariate categorical response variables) and "Random". "K_fold" by default.

tune_folds_number

(numeric(1)) The number of folds to tune each hyperparameter combination (k in k-fold cross validation). 5 by default.

tune_testing_proportion

(numeric(1)) A number between (0, 1) to specify the proportion of records to use as validation set when tune_cv_type is "Random". 0.2 by default.

tune_folds

(list) Custom folds for tuning. It must be a list of lists where each entry represents a fold. Each inner list has to contain the fields "training" and "testing" with numeric vectors of the indices of the records to be used as training and testing in that fold. Note that when this parameter is set, tune_cv_type, tune_folds_number and tune_testing_proportion are ignored. NULL by default.
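
For example, a sketch of custom tuning folds for a dataset with 100 records, with two folds given as training/testing index vectors (indices here are only illustrative):

custom_folds <- list(
  list(training = 1:80, testing = 81:100),
  list(training = 21:100, testing = 1:20)
)
# model <- deep_learning(x, y, tune_folds = custom_folds)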

tune_grid_proportion

(numeric(1)) Only when tune_type is "Grid_search", a number between (0, 1] to specify the proportion of hyperparameters combinations to sample from the grid and evaluate in tuning (useful when the grid is big). 1 by default (full grid).

tune_bayes_samples_number

(numeric(1)) Only when tune_type is "Bayesian_optimization", the number of initial random hyperparameters combinations to evalute before the Bayesian optimization process. 10 by default.

tune_bayes_iterations_number

(numeric(1)) Only when tune_type is "Bayesian_optimization", the number of optimization iterations to evaluate after the initial random samples specified in tune_bayes_samples_number. 10 by default.

optimizer

(character(1)) (case insensitive) The algorithm used to reduce the loss function and update the weights in backpropagation. The available options are "adadelta", "adagrad", "adamax", "adam", "nadam", "rmsprop" and "sgd". "adam" by default.

loss_function

(character(1)) (case insensitive) The name of the loss function the model will seek to minimize during training and tuning. You can find the complete list of available loss functions in the Details section below. This parameter can be used only in univariate analysis. NULL by default, which selects one automatically based on the type of the response variable y: "mean_squared_error" for continuous, "poisson" for counts, "binary_crossentropy" for binary and "categorical_crossentropy" for categorical responses.

with_platt_scaling

(logical(1)) Should Platt scaling be used to fit the model and adjust the predictions? Only available for univariate models with a numeric or binary response variable. For more information, see Details section below. FALSE by default.

platt_proportion

(numeric(1)) The proportion of individuals used to fit the linear model required for Platt scaling. Note that this parameter is used only when with_platt_scaling is TRUE. 0.3 by default.

shuffle

(logical(1)) Should the training data be shuffled before each epoch? TRUE by default.

early_stop

(logical(1)) Should the model stop training when the loss function has stopped improving? FALSE by default.

early_stop_patience

(numeric(1)) The number of epochs with no improvement after which training will be stopped. Note that this parameter is used only when early_stop is TRUE. 50 by default.

validate_params

(logical(1)) Should the parameters be validated? It is not recommended to set this parameter to FALSE because if something fails, a non-meaningful error will be thrown. TRUE by default.

seed

(numeric(1)) A value to be used as internal seed for reproducible results. NULL by default.

verbose

(logical(1)) Should the progress information be printed? TRUE by default.

tune_loss_function

(character(1)) (case insensitive) The loss function to use in tuning. The options are "mse", "maape", "mae", "nrmse", "rmse" or "pearson" when y is a numeric response variable; "accuracy" or "kappa_coeff" when y is a categorical response variable (including binary); and "f1_score", "roc_auc" or "pr_auc" when y is a binary response variable. NULL by default, which uses "mse" for numeric variables and "accuracy" for categorical variables.

Details

You have to consider that before tuning, all columns without variance (where all the records have the same value) are removed. The positions of such columns are returned in the removed_x_cols field of the returned object.

All records with missing values (NA), either in x or in y, are removed. The positions of the removed records are returned in the removed_rows field of the returned object.

Tuning

The general tuning algorithm works as follows:

[Tuning algorithm (numbered steps not reproduced here)]

For grid search tuning, the hyperparameter grid is generated (step one in the algorithm) as the Cartesian product of all the provided values (all the possible combinations) of all tunable parameters. If only one value of each tunable parameter is provided, no tuning is done. tune_grid_proportion allows you to specify the proportion of combinations to sample from the full grid and tune; by default all combinations are evaluated.
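
Conceptually, for two tunable parameters provided as learning_rate = c(0.001, 0.01) and epochs_number = c(10, 20), the generated grid (built internally, together with any layer-level tunable fields) is equivalent to the following Cartesian product:

expand.grid(
  learning_rate = c(0.001, 0.01),
  epochs_number = c(10, 20)
)
#   learning_rate epochs_number
# 1         0.001            10
# 2         0.010            10
# 3         0.001            20
# 4         0.010            20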

For Bayesian optimization tuning, step one in the algorithm works a little differently. At the start, tune_bayes_samples_number different hyperparameter combinations are generated and evaluated, and then tune_bayes_iterations_number new hyperparameter combinations are generated and evaluated iteratively based on the Bayesian optimization algorithm; this process is equivalent to that described in the general tuning algorithm. Note that only the hyperparameters for which a list of min and max values was provided are tuned, and their values fall within the specified boundaries.

Important: Unlike the other models, when tuning deep learning models steps 6 and 7 of the algorithm are omitted; instead, the training and testing datasets are sent to keras, the first one to fit the model and the second one to compute the loss function at the end of each epoch, so the value saved in step 8 is the validation loss returned by keras in the last epoch. The tune_loss_function parameter cannot be used in the deep_learning function since the loss function specified in the loss_function parameter (the one evaluated at each epoch) is also used for tuning.

Last (output) layer

By default this function selects the activation function and the number of neurons of the last layer of the model based on the response variable(s) type(s). For continuous responses the "linear" (identity) activation function is used with one neuron, for count responses "exponential" with one neuron, for binary responses "sigmoid" with one neuron and for categorical responses "softmax" with as many neurons as categories.
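
As a rough keras-style illustration (not the SKM internals), the default output layer corresponds to something like the following, here shown for a categorical response with 3 classes, one hidden layer and four predictor columns (all values illustrative):

library(keras)

# Output layer by response type (per the rules above):
#   continuous:  layer_dense(units = 1, activation = "linear")
#   count:       layer_dense(units = 1, activation = "exponential")
#   binary:      layer_dense(units = 1, activation = "sigmoid")
#   categorical: layer_dense(units = <number of classes>, activation = "softmax")
sketch <- keras_model_sequential() %>%
  layer_dense(units = 50, activation = "relu", input_shape = 4) %>%
  layer_dense(units = 3, activation = "softmax")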

Loss functions

The available options of the loss_function parameter are:

Probabilistic losses

  • "binary_crossentropy"

  • "categorical_crossentropy"

  • "sparse_categorical_crossentropy"

  • "poisson"

  • "kl_divergence"

Regression losses

  • "mean_squared_error"

  • "mean_absolute_error"

  • "mean_absolute_percentage_error"

  • "mean_squared_logarithmic_error"

  • "cosine_similarity"

  • "huber"

  • "log_cosh"

Hinge losses for "maximum-margin" classification

  • "hinge"

  • "squared_hinge"

  • "categorical_hinge"

Platt scaling

Platt scaling is a way of improving the training process of deep learning models: it applies a calibration, based on a model that is already trained, as a post-processing operation.

After tuning, Platt scaling divides the dataset into Training and Calibration sets, uses Training to fit the deep learning model with the best hyperparameter combination and, with this model, computes predictions for the Calibration set. Finally, with the predicted and observed values, a linear model is fitted (observed as a function of predicted); this linear model is the calibration. When a new prediction is made, the deep learning model is applied first and the resulting predicted value is then calibrated with the linear model.

Note that Platt scaling calibration only works for numeric and binary response variables of univariate models.
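
A conceptual sketch of this calibration for a numeric response (not the SKM internals; object names are illustrative):

# Split into training and calibration sets (30% for calibration)
calibration_rows <- sample(nrow(x), size = round(0.3 * nrow(x)))
dl_model <- deep_learning(x[-calibration_rows, ], y[-calibration_rows])

# Fit the calibration linear model: observed as a function of predicted
calibration_data <- data.frame(
  observed = y[calibration_rows],
  predicted = predict(dl_model, x[calibration_rows, ])$predicted
)
calibration_model <- lm(observed ~ predicted, data = calibration_data)

# Calibrate the predictions for new data (here x itself, for illustration)
raw_predictions <- predict(dl_model, x)$predicted
calibrated_predictions <- predict(
  calibration_model,
  newdata = data.frame(predicted = raw_predictions)
)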

Value

An object of class "DeepLearningModel" that inherits from classes "Model" and "R6" with the fields:

  • fitted_model: An object returned by keras::keras_model_sequential() containing the fitted model.

  • x: The final matrix used to fit the model.

  • y: The final vector or matrix used to fit the model.

  • hyperparams_grid: A data.frame with all the evaluated combinations of hyperparameters and one additional column called "loss" with the value of the loss function for each combination. The data is ordered with the best combinations first; depending on the loss function, either the lowest or the greatest values come first.

  • best_hyperparams: A list with the combination of hyperparameters with the best loss value (the first row in hyperparams_grid).

  • execution_time: A difftime object with the total time taken to tune and fit the model.

  • removed_rows: A numeric vector with the records' indices (in the provided positions) that were deleted and not taken into account in tuning or training.

  • removed_x_cols: A numeric vector with the columns' indices (in the provided positions) that were deleted and not taken into account in tuning or training.

  • ...: Some other parameters for internal use.

See Also

predict.Model()

Other models: bayesian_model(), generalized_boosted_machine(), generalized_linear_model(), mixed_model(), partial_least_squares(), random_forest(), support_vector_machine()

Examples

# Use all default hyperparameters (no tuning) -------------------------------
x <- to_matrix(iris[, -5])
y <- iris$Species
model <- deep_learning(x, y)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted
# Obtain the predicted probabilities
predictions$probabilities

# Tune with grid search -----------------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- deep_learning(
  x,
  y,
  epochs_number = c(10, 20),
  learning_rate = c(0.001, 0.01),
  layers = list(
    # First hidden layer
    list(neurons_number = c(10, 20)),
    # Second hidden layer
    list(neurons_number = c(10))
  ),
  tune_type = "grid_search",
  tune_cv_type = "k_fold",
  tune_folds_number = 5
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Tune with Bayesian optimization -------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- deep_learning(
  x,
  y,
  epochs_number = list(min = 10, max = 50),
  learning_rate = list(min = 0.001, max = 0.5),
  layers = list(
    list(
      neurons_number = list(min = 10, max = 20),
      dropout = list(min = 0, max = 1),
      activation_layer = "sigmoid"
    )
  ),
  tune_type = "bayesian_optimization",
  tune_bayes_samples_number = 5,
  tune_bayes_iterations_number = 5,
  tune_cv_type = "random",
  tune_folds_number = 2
)

# Obtain the whole grid with the loss values
model$hyperparams_grid
# Obtain the hyperparameters combination with the best loss value
model$best_hyperparams

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Obtain the execution time taken to tune and fit the model
model$execution_time

# Multivariate analysis -----------------------------------------------------
x <- to_matrix(iris[, -c(1, 5)])
y <- iris[, c(1, 5)]
model <- deep_learning(
  x,
  y,
  epochs_number = 10,
  layers = list(
    list(
      neurons_number = 50,
      dropout = 0.5,
      activation = "relu",
      ridge_penalty = 0.5,
      lasso_penalty = 0.5
    )
  ),
  optimizer = "adadelta"
)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values of the first response
predictions$Sepal.Length$predicted
# Obtain the predicted values and probabilities of the second response
predictions$Species$predicted
predictions$Species$probabilities

# Obtain the predictions in a data.frame not in a list
predictions <- predict(model, x, format = "data.frame")
head(predictions)

# With Platt scaling --------------------------------------------------------
x <- to_matrix(iris[, -1])
y <- iris$Sepal.Length
model <- deep_learning(
  x,
  y,
  with_platt_scaling = TRUE,
  platt_proportion = 0.25
)

# Predict using the fitted model
predictions <- predict(model, x)
# Obtain the predicted values
predictions$predicted

# Genomic selection ------------------------------------------------------------
data(Maize)

# Data preparation of G
Line <- model.matrix(~ 0 + Line, data = Maize$Pheno)
# Compute cholesky
Geno <- cholesky(Maize$Geno)
# G matrix
X <- Line %*% Geno
y <- Maize$Pheno$Y

# Set seed for reproducible results
set.seed(2022)
folds <- cv_kfold(records_number = nrow(X), k = 4)

Predictions <- data.frame()
Hyperparams <- data.frame()

# Model training and predictions
for (i in seq_along(folds)) {
  cat("*** Fold:", i, "***\n")
  fold <- folds[[i]]

  # Identify the training and testing sets
  X_training <- X[fold$training, ]
  X_testing <- X[fold$testing, ]
  y_training <- y[fold$training]
  y_testing <- y[fold$testing]

  # Model training
  model <- deep_learning(
    X_training,
    y_training,
    epochs_number = list(min = 50, max = 100),
    learning_rate = list(min = 0.0001, max = 0.1),
    layers = list(
      list(
        neurons_number = list(min = 2, max = 5),
        activation = c("linear")
      ),
      list(
        neurons_number = list(min = 2, max = 10),
        activation = c("linear")
      )
    ),
    tune_type = "Bayesian_Optimization",
    tune_bayes_iterations_number = 5,
    tune_bayes_samples_number = 5,
    tune_cv_type = "k_fold",
    tune_folds_number = 3
  )

  # Prediction of testing set
  predictions <- predict(model, X_testing)

  # Predictions for the i-th fold
  FoldPredictions <- data.frame(
    Fold = i,
    Line = Maize$Pheno$Line[fold$testing],
    Env = Maize$Pheno$Env[fold$testing],
    Observed = y_testing,
    Predicted = predictions$predicted
  )
  Predictions <- rbind(Predictions, FoldPredictions)

  # Hyperparams
  HyperparamsFold <- model$hyperparams_grid %>%
    mutate(Fold = i)
  Hyperparams <- rbind(Hyperparams, HyperparamsFold)

  # Best hyperparams of the model
  cat("*** Optimal hyperparameters: ***\n")
  print(model$best_hyperparams)
}

head(Predictions)
# Compute the summary of all predictions
summaries <- gs_summaries(Predictions)

# Summaries by Line
head(summaries$line)

# Summaries by Environment
summaries$env

# Summaries by Fold
summaries$fold

# First rows of Hyperparams
head(Hyperparams)
# Last rows of Hyperparams
tail(Hyperparams)
