Laurae.xgb.train: xgboost Model Trainer


View source: R/Laurae.xgb.train.R

Description

Trains an xgboost model. Requires the Matrix and xgboost packages.

Usage

Laurae.xgb.train(train, watchlist = NULL, clean_mem = FALSE, seed = 1,
  verbose = 1, verbose_iterations = 1, objective = "reg:linear",
  metric = "rmse", maximize = NULL, boost_method = "gbtree",
  boost_tree = "hist", boost_grow = "depthwise", boost_bin = 255,
  boost_memory = "uint32", boost_weighting = 1, learn_threads = 1,
  learn_shrink = 0.05, iteration_max = 100, iteration_trees = 1,
  iteration_stop = 20, tree_depth = 6, tree_leaves = 0, sample_row = 1,
  sample_col = 1, reg_l1 = 0, reg_l2 = 0, reg_l2_bias = 0,
  reg_loss = 0, reg_hessian = 1, dart_rate_drop = 0, dart_skip_drop = 0,
  dart_sampling = "uniform", dart_norm = "tree", dart_min_1 = 0, ...)

Arguments

train

Type: xgb.DMatrix. The training data.

watchlist

Type: list of xgb.DMatrix. The data to monitor through the metrics, defaults to list().

clean_mem

Type: logical. Whether to force garbage collection before and after training in order to reclaim RAM. Defaults to FALSE.

seed

Type: numeric. Seed for the random number generator for reproducibility, defaults to 1.

verbose

Type: numeric. Whether to print messages. Defaults to 1.

verbose_iterations

Type: numeric. How many iterations to wait before printing to the console again. Defaults to 1.

objective

Type: character or function. The objective to optimize, defaults to "reg:linear". See the multiclass sketch after this list.

  • "reg:linear": Linear Regression.

  • "reg:logistic": Logistic Regression.

  • "binary:logistic": Logistic Regression (binary classification, probabilities).

  • "binary:logitraw": Logistic Regression (binary classification, raw score).

  • "multi:softmax": Multiclass Logistic Regression (multiclass classification, best class).

  • "multi:softprob": Multiclass Logistic Regression (multiclass classification, probability matrix).

  • "rank:pairwise": LambdaMART-like Ranking (pairwise loss).

  • "count:poisson": Poisson Regression (count).

  • "poisson-nloglik": Negative Log Likelihood (Poisson Regression).

  • "reg:gamma": Gamma Regression with Log-link.

  • "gamma-nloglik": Negative Log Likelihood (Gamma Regression).

  • "gamma-deviance": Residual Deviance (Gamma Regression).

  • "reg:tweedie": Tweedie Regression with Log-link.

  • "tweedie-nloglik": Negative Log Likelihood (Tweedie Regression).
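
For the multiclass objectives, xgboost also needs the number of classes. Below is a minimal, hypothetical sketch: the xgb.DMatrix dtrain_multi (with integer labels 0, 1, 2) is assumed, and num_class is forwarded to xgboost through the ... argument.

# Hypothetical sketch: multiclass training.
# `dtrain_multi` is assumed to hold integer labels 0 .. num_class - 1;
# `num_class` is passed through `...` to xgboost's params.
model <- Laurae.xgb.train(train = dtrain_multi,
                          objective = "multi:softprob",
                          metric = "mlogloss",
                          iteration_max = 10,
                          num_class = 3)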

metric

Type: character or function. The metric to print against the watchlist, defaults to rmse.

  • "rmse": Root Mean Squared Error.

  • "mae": Mean Absolute Error.

  • "logloss": Negative Log Likelihood.

  • "error": Binary classification Error Rate.

  • "error@t": Binary classification Error Rate at t.

  • "merror": Multiclass classification Error Rate.

  • "mlogloss": Multiclass Negative Log Likelihood.

  • "auc": Area Under the Curve.

  • "ndcg@n": Normalized Discounted Cumulative Gain at n.

  • "map@n": Mean Average Precision at n.

maximize

Type: logical. Whether to maximize the metric, defaults to NULL.

boost_method

Type: character. Boosting method, defaults to "gbtree". See the sketch after this list.

  • Boosting Method.

  • xgboost has access to three different boosting methods:

    • "gblinear": Generalized Linear Model, which uses Shotgun (Parallel Stochastic Gradient Descent).

    • "gbtree": Gradient Boosted Trees, the default boosting method, using Decision Trees and Stochastic Gradient Descent.

    • "dart": Dropout Additive Regression Trees, a method employing the Dropout technique from Neural Networks.

  • The booster method has a huge impact on training performance.

  • The booster method defines the algorithm you will use for boosting or training the model.

  • For instance, a linear boosted model is obviously better for linear problems.

  • Tree-based boosted models are better for non-linear problems, as they have the ability to approximate them.

  • DART (Dropout Additive Regression Trees) is similar to Dropout in neural networks, except you are applying this idea to trees (dropping trees randomly).
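
As an illustration, switching to DART only requires changing boost_method and, optionally, the dart_* parameters. A sketch assuming the dtrain and watchlist objects from the Examples section; the dropout values are illustrative only.

# Sketch: train with the DART booster instead of plain gradient boosted trees.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_method = "dart",
                          dart_rate_drop = 0.1,  # drop about 10% of trees per iteration
                          dart_skip_drop = 0.5,  # skip dropping half of the time
                          iteration_max = 20)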

boost_tree

Type: character. Tree method, defaults to "hist".

  • Tree Method.

  • Tips: leave it alone unless you know what you are doing.

  • This parameter is exclusive to the xgboost implementation and takes three different values:

    • "exact": for training the exact original xgboost.

    • "approx": for training the approximate/distributed xgboost.

    • "hist": for training xgboost in fast histogram mode, similarly to LightGBM.

  • The tree method has a huge impact on training speed.

  • The way trees are built is essential to maximize or lower performance for training.

  • In addition, it has a huge impact on training speed: trading some feature precision for faster passes during the training loops allows models to be trained significantly faster.

boost_grow

Type: character. Growing method, defaults to "depthwise". See the sketch after this list.

  • Growing Method.

  • Tips: leave it alone unless you know what you are doing.

  • The original xgboost uses the depthwise growing policy, which does not grow deeper trees until all nodes at the current depth have been split.

  • The depthwise policy (grow_policy = "depthwise") acts as a regularizer which lowers the fitting performance, while potentially providing higher generalization performance.

  • To behave the same as LightGBM (growing at the best loss instead of at the best depth), set grow_policy = "lossguide".

  • The tree growing method allows switching between two ways of training:

    • depth-wise method: the original xgboost training way, which performs well on datasets not relying on distribution rules (far from synthetic).

    • loss-guide method: the original LightGBM training way, which performs well on datasets relying on distribution rules (close to synthetic).

  • The xgboost way of training aims to minimize depth, where growing an additional depth is considered a last resort.

  • The LightGBM way of training aims to minimize loss, where growing an additional depth is not considered a last resort.
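
A sketch of LightGBM-like, loss-guided growth, assuming the dtrain and watchlist objects from the Examples section; the depth and leaf values are illustrative only.

# Sketch: loss-guided growth with unlimited depth, capping tree size via leaves.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_tree = "hist",       # lossguide is typically used with the histogram method
                          boost_grow = "lossguide",
                          tree_depth = 0,            # 0 = unlimited depth
                          tree_leaves = 255,         # regularize through the leaf count instead
                          iteration_max = 20)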

boost_bin

Type: numeric. Maximum number of unique values per feature, defaults to 255. See the sketch after this list.

  • Number of maximum unique values per feature.

  • Tips: leave it alone unless you know what you are doing.

  • xgboost does not optimize the dataset storage depending on the max_bin parameter.

  • As such, it requires 4GB RAM to train a model on Higgs 3.5M.

  • By providing fewer unique values per feature, the model can be trained significantly faster without a large loss in performance.

  • In cases where the dataset is closer to a synthetic dataset, the model might perform even better than without binning.
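
A sketch trading binning accuracy for speed, assuming the dtrain and watchlist objects from the Examples section; the bin count is illustrative only.

# Sketch: fewer split candidates per feature for faster histogram training.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_tree = "hist",
                          boost_bin = 63,   # 63 unique values per feature instead of 255
                          iteration_max = 20)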

boost_memory

Type: character. Memory used for binning, defaults to "uint32".

  • Memory pressure of bins.

  • Tips: leave it alone unless you know what you are doing.

  • The matrix data type defines the memory pressure of the binned data, while also determining the maximum number of bins.

  • The default binning is 32 bit, which means 255 bins are possible per column.

  • Lowering it to 16 bit (127 bins) or 8 bit (63 bins) lowers the maximum number of bins, therefore lowering accuracy and improving memory pressure.

boost_weighting

Type: numeric. Weighting of positive labels, defaults to 1. See the sketch after this list.

  • Multiplication applied to every positive label weight.

  • Tips: leave it alone unless you know what you are doing.

  • The positive label should be the rare label.

  • By performing a weight multiplication on the positive label, the model performs cost-sensitive training.

  • The cost-sensitive training is applied to the booster model, which directly impacts the trained models.

  • It implies potentially higher performance, especially for ranking tasks such as those evaluated with AUC.
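
A sketch of cost-sensitive weighting computed from class imbalance, assuming the dtrain and watchlist objects from the Examples section and a binary label vector where 1 is the rare positive class; the negatives/positives ratio is a common heuristic, not a rule.

# Sketch: weight positive labels by the ratio of negatives to positives.
labels <- getinfo(dtrain, "label")
pos_weight <- sum(labels == 0) / sum(labels == 1)
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_weighting = pos_weight,
                          iteration_max = 20)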

learn_threads

Type: numeric. Number of threads, defaults to 1. See the sketch after this list.

  • Number of threads used for training models.

  • Tips: larger data benefit from more threads, but smaller data has reverse benefits.

  • Intel CPUs benefit from hyperthreading and you should use the number of logical cores in your computer instead of the number of physical cores.

  • The old rationale "number of threads = physical cores" dates from when multithreading was so poor that the overhead was too large. Nowadays, this is not true for most cases (otherwise nobody would multithread anymore).

  • Using multithreaded training allows models to be trained faster.

  • This is not always true in the case of small datasets, where training is so fast that the overhead is too large.

  • In addition, when using many threads (like 40 on 1Mx1K dataset), be careful of the number of leaves parameter combined with unlimited depth, as it will massively slow down the training.

  • To find the best number of threads, you can benchmark manually the training speed by changing the number of threads.

  • Choosing the number of threads depends both on your CPU and the dataset. Do not overallocate logical cores.
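
A sketch for benchmarking thread counts manually, assuming the dtrain object from the Examples section; the candidate thread counts are illustrative only.

# Sketch: time the same training run with different thread counts.
library(parallel)
for (threads in unique(c(1, 2, detectCores(logical = TRUE)))) {
  elapsed <- system.time(
    Laurae.xgb.train(train = dtrain,
                     objective = "binary:logistic",
                     metric = "auc",
                     learn_threads = threads,
                     iteration_max = 20)
  )[["elapsed"]]
  cat("threads =", threads, "->", elapsed, "seconds\n")
}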

learn_shrink

Type: numeric. Learning rate, defaults to 0.05.

  • Multiplication performed on each boosting iteration.

  • Tips: set this larger for hyperparameter tuning.

  • Once your learning rate is fixed, do not change it.

  • It is not a good practice to consider the learning rate as a hyperparameter to tune.

  • Learning rate should be tuned according to your training speed and performance tradeoff.

  • Do not let an optimizer tune it: you should not expect to end up with an oddly specific, overfitted learning rate such as 0.0202048.

  • Each iteration is supposed to provide an improvement to the training loss.

  • Such improvement is multiplied with the learning rate in order to perform smaller updates.

  • Smaller updates make the model overfit the data more slowly, but require more iterations for training.

  • For instance, doing 50 iterations at a learning rate of 0.1 would roughly require doing 5,000 iterations at a learning rate of 0.001, which might be obnoxious for large datasets.

  • Typically, we use a learning rate of 0.05 or lower for training, while a learning rate of 0.10 or larger is used for tinkering with the hyperparameters.

iteration_max

Type: numeric. Number of boosting iterations, defaults to 100.

  • Number of boosting iterations.

  • Tips: combine with early stopping to stop boosting automatically.

  • Larger is not always better.

  • Keep an eye on overfitting.

  • It is better to perform cross-validation one model at a time, in order to get the number of iterations per fold. In addition, this gives a precise idea of how noisy the data is.

  • When selecting the number of iterations, it is typical to select 1.10x the mean of the number of iterations found via cross-validation.

iteration_trees

Type: numeric. Averaged trees per iteration, defaults to 1. See the sketch after this list.

  • Number of trees per boosting iteration.

  • Tips: Do not tune it unless you know what you are doing.

  • To achieve Random Forest, one should use sampling parameters to not get identical trees.

  • The combination of Random Forest and Gradient Boosting is a well-known "not so good" combination.

  • In fact, Gradient Boosted Trees is supposed to be an extension of Decision Trees and Random Forest, using mathematical optimization.

  • Therefore, it does not make practical sense to use Gradient Boosted Random Forests.

  • To achieve a similar performance to Random Forests, one should use a row sampling of 0.632 (.632 Bootstrap) and a column sampling depending on the task.

  • For regression, it is recommended to use 1/3 features per tree.

  • For classification, it is recommended to use sqrt(number of features)/(number of features) features per tree.

  • For other cases, no recommendations exist.
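
A sketch of a Random-Forest-like configuration following the recommendations above, assuming the dtrain object from the Examples section (a binary classification task); the number of trees is illustrative only.

# Sketch: one boosting round averaging many randomized trees.
n_features <- ncol(dtrain)
model <- Laurae.xgb.train(train = dtrain,
                          objective = "binary:logistic",
                          metric = "auc",
                          iteration_max = 1,      # a single boosting iteration...
                          iteration_trees = 100,  # ...averaging 100 trees
                          learn_shrink = 1,
                          sample_row = 0.632,     # .632 bootstrap
                          sample_col = sqrt(n_features) / n_features)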

iteration_stop

Type: numeric. Number of iterations without improvement before stopping, defaults to 20. See the sketch after this list.

  • Number of maximum iterations without improvements.

  • Tips: make sure you added a validation dataset to watch, otherwise this parameter is useless.

  • Setting early stopping too large risks overfitting, because it keeps training from stopping even when improvements are due to luck.

  • Scale this parameter appropriately with the learning rate (usually: linearly).

  • Early stopping prevents a model from training until the end when the validation metric has not improved for a specified number of iterations.

  • By keeping this value low enough, boosting will quickly give up training when there is no improvement over time.

  • When it is large enough, boosting will refuse to give up training, even though some improvements over the best iteration might be pure luck.

  • This value should be scaled according to the number of iterations.
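
A sketch of early stopping, assuming the dtrain and dtest objects from the Examples section; the iteration budget and patience are illustrative only.

# Sketch: early stopping requires a validation set in the watchlist.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = list(eval = dtest),
                          objective = "binary:logistic",
                          metric = "auc",
                          learn_shrink = 0.05,
                          iteration_max = 1000,  # upper bound only
                          iteration_stop = 20)   # stop after 20 iterations without improvement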

tree_depth

Type: numeric. Maximum tree depth, defaults to 6.

  • Maximum depth of each trained tree.

  • Tips: use unlimited depth when needing deep branched trees.

  • Unlimited depth is essential for training models whose branching is one-sided (instead of balanced), such as when a long chain of features (e.g. 50 consecutive splits) is needed to reach the real underlying rule.

  • Each model trained at each iteration will have that maximum depth and cannot bypass it.

  • As the maximum depth increases, the model is able to fit the training data better.

  • However, fitting the training data better does not guarantee better generalization to unseen data.

  • In addition, this is the most sensitive hyperparameter for gradient boosting: tune this first.

  • xgboost lossguide training allows 0 depth training (unlimited depth).

  • The maximum number of leaves allowed, if depth is not unlimited, is equal to 2^depth (e.g. a maximum depth of 10 leads to a maximum of 1,024 leaves).

tree_leaves

Type: numeric. Maximum tree leaves, defaults to 0. See the sketch after this list.

  • Maximum leaves for each trained tree.

  • Tips: adjust depth accordingly by allowing a slightly higher depth than the theoretical number of leaves.

  • Restricting the number of leaves acts as a regularization in order to not grow very deep trees.

  • It also prevents growing gigantic trees when the maximum depth is large (if not unlimited).

  • Each model trained at each iteration will have that maximum leaves and cannot bypass it.

  • As the maximum number of leaves increases, the model is able to fit the training data better.

  • However, fitting the training data better does not guarantee better generalization to unseen data.

  • In addition, this is the second most sensitive hyperparameter for gradient boosting: tune it together with the maximum depth.
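
A sketch pairing a leaf cap with a slightly higher depth, assuming the dtrain and watchlist objects from the Examples section; 63 leaves fit inside a depth-6 tree (2^6 = 64 leaves), so depth 7 leaves some headroom for unbalanced branching. The values are illustrative only.

# Sketch: cap tree size through leaves, with a slightly larger depth budget.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_grow = "lossguide",
                          tree_depth = 7,
                          tree_leaves = 63,
                          iteration_max = 20)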

sample_row

Type: numeric. Row sampling, defaults to 1.

  • Percentage of rows used per iteration.

  • Tips: adjust it roughly but not precisely.

  • Stochastic Gradient Descent is not always better than Gradient Descent.

  • The name "Stochastic Gradient Descent" is technically both right and wrong.

  • Each model trained at each iteration will see only the specified fraction of rows.

  • By training over random partitions of the data, abusing the stochastic nature of the process, the resulting model might fit better the data.

  • In addition, this is the third most sensitive hyperparameter for gradient boosting: tune it together with the column sampling.

  • Tuning to a very peculiar sampling value (like 0.728472) in combination with a specific seed is a sign of overfitting, as such precision does not make sense.

sample_col

Type: numeric. Column sampling per tree, defaults to 1. See the sketch after this list.

  • Percentage of columns used per iteration.

  • Tips: adjust it roughly but not precisely.

  • Stochastic Gradient Descent is not always better than Gradient Descent.

  • The name "Stochastic Gradient Descent" is technically both right and wrong.

  • Each model trained at each iteration will see only the specified fraction of columns.

  • By training over random partitions of the data, abusing the stochastic nature of the process, the resulting model might fit better the data.

  • In addition, this is the third most sensitive hyperparameter for gradient boosting: tune it together with the row sampling.

  • Tuning to a very peculiar sampling value (like 0.728472) in combination with a specific seed is a sign of overfitting, as such precision does not make sense.
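
A sketch of rough stochastic sampling, assuming the dtrain and watchlist objects from the Examples section; prefer round values such as 0.8 over precise ones.

# Sketch: sample 80% of rows per iteration and 80% of columns per tree.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          sample_row = 0.8,
                          sample_col = 0.8,
                          iteration_max = 20)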

reg_l1

Type: numeric. L1 regularization, defaults to 0.

  • L1 Regularization for boosting.

  • Tips: leave it alone unless you know what you are doing.

  • Adding regularization is not always better.

  • The regularization scaling is dataset-dependent and weight-dependent.

  • Gradient Boosting applies regularization to the numerator of the gain computation.

  • In addition, it is added to the numerator multiplied by the weight of the sample.

  • Each sample has its own pair of gradient/hessian, unlike typical gradient descent methods where that statistic pair is summed up for immediate output and parameter adjustment.

reg_l2

Type: numeric. L2 regularization, defaults to 0.

  • L2 Regularization for boosting.

  • Tips: leave it alone unless you know what you are doing.

  • Adding regularization is not always better.

  • The regularization scaling is dataset-dependent and weight-dependent.

  • Gradient Boosting applies regularization to the numerator of the gain computation.

  • In addition, it is added to the numerator multiplied by the weight of the sample.

  • Each sample has its own pair of gradient/hessian, unlike typical gradient descent methods where that statistic pair is summed up for immediate output and parameter adjustment.

reg_l2_bias

Type: numeric. L2 Bias regularization (not for GBDT models), defaults to 0.

  • L2 Bias Regularization for boosting.

  • Tips: leave it alone unless you know what you are doing.

  • Adding regularization is not always better.

  • The regularization scaling is dataset-dependent and weight-dependent.

  • Gradient Boosting applies regularization to the numerator of the gain computation.

  • In addition, it is added to the numerator multiplied by the weight of the sample.

  • Each sample has its own pair of gradient/hessian, unlike typical gradient descent methods where that statistic pair is summed up for immediate output and parameter adjustment.

reg_loss

Type: numeric. Minimum Loss per Split, defaults to 0.

  • Prune by minimum loss requirement.

  • Tips: leave it alone unless you know what you are doing.

  • Adding pruning threshold is not always better.

  • Gamma (loss) regularization happens after growing (it blocks branches from being kept), unlike Hessian regularization.

  • Loss regularization is a direct regularization technique allowing the model to prune any leaves which do not meet the minimal gain to split criteria.

  • This is extremely useful when you are trying to build deep trees but trying also to avoid building useless branches of the trees (overfitting).

reg_hessian

Type: numeric. Minimum Hessian per Split, defaults to 1. See the sketch after this list.

  • Prune by minimum hessian requirement.

  • Tips: leave it alone unless you know what you are doing.

  • Adding pruning threshold is not always better.

  • Hessian regularization happens on the fly (it blocks branches from growing), unlike Loss regularization.
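
A sketch combining both pruning thresholds when growing deeper trees, assuming the dtrain and watchlist objects from the Examples section; the thresholds are illustrative only.

# Sketch: prune deeper trees with a minimum gain and a minimum hessian per split.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          tree_depth = 10,
                          reg_loss = 1,      # minimum loss reduction to keep a split
                          reg_hessian = 5,   # minimum hessian weight required per split
                          iteration_max = 20)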

dart_rate_drop

Type: numeric. DART booster tree drop rate, defaults to 0.

  • Probability of dropping a tree on each iteration.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Defines the dropping probability of each tree during each DART iteration to regenerate gradient/hessian statistics.

dart_skip_drop

Type: numeric. DART booster tree skip rate, defaults to 0.

  • Probability of skipping any drop on each iteration.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Defines the probability of skipping dropping during each DART iteration to regenerate gradient/hessian statistics.

dart_sampling

Type: character. DART booster sampling distribution, defaults to "uniform". The other choice is "weighted".

  • Uniform weight application for trees.

  • Tips: leave it alone unless you know what you are doing.

  • Use sample_type = "uniform" to set up uniform sampling for dropped trees.

  • You may also use sample_type = "weighted" to drop trees in proportion to their weights, as defined by normalize_type.

  • Smaller/Larger is not always better.

  • Defines the sampling distribution used to select which trees are dropped during each DART iteration.

dart_norm

Type: character. DART booster weight normalization, defaults to "tree". The other choice is "forest".

  • Weight normalization method for trees.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Normalizing the weight of trees differently allows putting an emphasis on the earliest/latest trees built, leading to different tree structures.

dart_min_1

Type: numeric. DART booster drop at least one tree, defaults to 0. The other choice is 1.

  • Minimum of one dropped tree at any iteration.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Dropping at least one tree at each iteration allows different trees to be built.

...

Other parameters to pass to xgboost's params.

Details

Some xgboost parameters were removed from this function's signature. You may still pass them through ... without any issues, unlike other parameters.

Value

The xgboost model.

Examples

library(Matrix)
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

# Custom objective: binary logistic regression (gradient and hessian of the log loss).
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1/(1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}

# Custom evaluation metric: binary classification error rate on raw scores.
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
  return(list(metric = "error", value = err))
}

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = "binary:logistic",
                          metric = "auc",
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5)

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = logregobj,
                          metric = "auc",
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5)

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = "binary:logistic",
                          metric = evalerror,
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5,
                          maximize = FALSE)

# CANNOT DO THIS: any metric other than the first one is ignored
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = logregobj,
                          metric = c("rmse", "auc"),
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5)

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = logregobj,
                          metric = evalerror,
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5,
                          maximize = FALSE)

# CAN'T DO THIS
# model <- Laurae.xgb.train(train = dtrain,
#                           watchlist = watchlist,
#                           verbose = 1,
#                           objective = logregobj,
#                           metric = c(evalerror, "auc"),
#                           tree_depth = 2,
#                           learn_shrink = 1,
#                           learn_threads = 1,
#                           iteration_max = 5,
#                           maximize = FALSE)
