Lextravagenza: Laurae's Extravagenza machine learning model


Description

This function trains a machine learning model that uses xgboost with a dynamic tree depth, but ignores the gradient boosting enhancements of xgboost. It outperforms xgboost in nearly every scenario where the number of boosting iterations is small. When the number of boosting iterations is large (for instance, 100), this model performs worse than typical gradient boosted tree implementations. It does not work on multiclass problems.

Usage

Lextravagenza(train, valid, test, maximize = FALSE, personal_rounds = 100,
  personal_depth = 1:10, personal_eta = 0.2, auto_stop = 10,
  base_margin = 0.5, seed = 0, ...)

Arguments

train

Type: xgb.DMatrix. The training data. It will be used for training the models.

valid

Type: xgb.DMatrix. The validation data. It will be used for selecting the model depth per iteration to assess generalization.

test

Type: xgb.DMatrix. The testing data. It will be used for early stopping.

maximize

Type: boolean. Whether to maximize or minimize the loss function. Defaults to FALSE.

personal_rounds

Type: integer. The number of separate boosting iterations. Defaults to 100.

personal_depth

Type: vector of integers. The tree depth values that may be selected at each boosting iteration. Defaults to 1:10, which means a depth between 1 and 10, i.e. c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).

personal_eta

Type: numeric. The shrinkage (learning rate). Lower values mean lower overfitting. Defaults to 0.20.

auto_stop

Type: integer. The early stopping value. When the metric does not improve for auto_stop iterations, training is interrupted and the model is returned to the user. Defaults to 10.

base_margin

Type: numeric. The base prediction value. For binary classification, it is recommended to set it to the number of observations with label 1 divided by the total number of observations, although this is not mandatory (a sketch of this computation follows the argument list). Defaults to 0.5.

seed

Type: integer. Random seed used for training. Defaults to 0.

...

Other arguments to pass to xgb.train. Examples: nthread = 1, eta = 0.4...
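
For binary classification, a minimal sketch of the recommended base_margin computation (using xgboost's bundled agaricus data purely for illustration) could look like this:

library(xgboost)
data(agaricus.train, package = "xgboost")
labels <- agaricus.train$label
# Recommended base_margin: number of label-1 observations divided by the
# total number of observations (the mean of a 0/1 label vector)
recommended_margin <- sum(labels == 1) / length(labels)
# then pass base_margin = recommended_margin to Lextravagenza()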

Details

Dynamic depth allows training boosted trees whose depth adapts at each iteration, so they fit the data better early on and therefore overfit it quickly. Because the validation set is used as feedback during training (to choose the depth), a second held-out set (the test set) is needed for early stopping, an uncommon scenario in machine learning.

The Extravagenza model does not leverage the gradient and hessian to optimize the learning appropriately, hence it overfits faster while using no knowledge of previous trainings other than the last tree.

Do not use this method when you need large numbers of trees, as not being able to reuse the previous gradients/hessians leads to poor generalization (although still better than most non-ensemble models). Typically, where an xgboost model needs only 75 iterations, the Extravagenza machine learning model will require about 200 iterations to (potentially) outperform the initial xgboost model.

For example, on the House Prices data set using RMSE, you can try to beat xgboost.

In addition, you will need a recent xgboost build (at least pull request 1964) if you want to train without spamming the console, since verbose = 0 previously prevented the metric from being recorded.
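
A minimal sketch of such a comparison, using xgboost's bundled agaricus data instead of House Prices (parameter choices here are illustrative, not tuned), could look like this:

library(Laurae)
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data[1:5000, ], label = agaricus.train$label[1:5000])
dval <- xgb.DMatrix(agaricus.train$data[5001:6513, ], label = agaricus.train$label[5001:6513])
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

# Baseline: plain xgboost with a fixed depth
xgb_model <- xgb.train(params = list(eta = 0.40, max_depth = 6,
                                     objective = "binary:logistic",
                                     eval_metric = "rmse"),
                       data = dtrain, nrounds = 50,
                       watchlist = list(test = dtest), verbose = 0)

# Challenger: Lextravagenza with dynamic depth
lex_model <- Lextravagenza(train = dtrain, valid = dval, test = dtest,
                           maximize = FALSE, personal_rounds = 50,
                           personal_depth = 1:8, personal_eta = 0.40,
                           auto_stop = 5, base_margin = 0.5, seed = 0,
                           nthread = 1, eta = 0.40, booster = "gbtree",
                           objective = "binary:logistic", eval_metric = "rmse")

# Compare test RMSE: last xgboost iteration vs best Lextravagenza iteration
xgb_rmse <- sqrt(mean((predict(xgb_model, dtest) - agaricus.test$label)^2))
lex_rmse <- lex_model$test[lex_model$best_iter]
c(xgboost = xgb_rmse, Lextravagenza = lex_rmse)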

Value

A list with the model (model), the parameters (eta, base_margin), the best training iteration for generalization (best_iter), the depth evolution over the number of iterations (depth_tree), the validation score (valid_loss), and the test score (test_loss).

Examples

## Not run: 
library(Laurae)
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data[1:5000, ], label = agaricus.train$label[1:5000])
dval <- xgb.DMatrix(agaricus.train$data[5001:6513, ], label = agaricus.train$label[5001:6513])
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
Lex_model <- Lextravagenza(train = dtrain, # Train data
                           valid = dval, # Validation data = depth tuner
                           test = dtest, # Test data = early stopper
                           maximize = FALSE, # Not maximizing RMSE
                           personal_rounds = 50, # Boosting for 50 iterations
                           personal_depth = 1:8, # Dynamic depth between 1 and 8
                           personal_eta = 0.40, # Shrinkage of boosting to 0.40
                           auto_stop = 5, # Early stopping of 5 iterations
                           base_margin = 0.5, # Start with 0.5 probabilities
                           seed = 0, # Random seed
                           nthread = 1, # 1 thread for training
                           eta = 0.40, # xgboost shrinkage of 0.40 (avoid fast overfit)
                           booster = "gbtree", # train trees, can't work with GLM
                           objective = "binary:logistic", # classification, binary
                           eval_metric = "rmse" # RMSE metric to optimize
)

str(Lex_model, max.level = 1) # Get list of the model structure

predictedValues <- pred.Lextravagenza(Lex_model, dtest, nrounds = Lex_model$best_iter)
all.equal(sqrt(mean((predictedValues - agaricus.test$label)^2)),
          Lex_model$test[Lex_model$best_iter])

# Get depth evolution vs number of boosting iterations
plot(x = 1:length(Lex_model$depth),
     y = Lex_model$depth,
     main = "Depth vs iterations",
     xlab = "Iterations",
     ylab = "Depth")

# Get validation evolution vs number of boosting iterations
plot(x = 1:length(Lex_model$valid),
     y = Lex_model$valid,
     main = "Validation loss vs iterations",
     xlab = "Iterations",
     ylab = "Validation loss")

# Get testing evolution vs number of boosting iterations
plot(x = 1:length(Lex_model$test),
     y = Lex_model$test,
     main = "Testing loss vs iterations",
     xlab = "Iterations",
     ylab = "Testing loss")

## End(Not run)
