Laurae.xgb.train: xgboost Model Trainer


View source: R/Laurae.xgb.train.R

Description

Trains an xgboost model. Requires the Matrix and xgboost packages.

Usage

Laurae.xgb.train(train, watchlist = NULL, clean_mem = FALSE, seed = 1,
  verbose = 1, verbose_iterations = 1, objective = "reg:linear",
  metric = "rmse", maximize = NULL, boost_method = "gbtree",
  boost_tree = "hist", boost_grow = "depthwise", boost_bin = 255,
  boost_memory = "uint32", boost_weighting = 1, learn_threads = 1,
  learn_shrink = 0.05, iteration_max = 100, iteration_trees = 1,
  iteration_stop = 20, tree_depth = 6, tree_leaves = 0, sample_row = 1,
  sample_col = 1, reg_l1 = 0, reg_l2 = 0, reg_l2_bias = 0,
  reg_loss = 0, reg_hessian = 1, dart_rate_drop = 0, dart_skip_drop = 0,
  dart_sampling = "uniform", dart_norm = "tree", dart_min_1 = 0, ...)

Arguments

train

Type: xgb.DMatrix. The training data.

watchlist

Type: list of xgb.DMatrix. The data to monitor through the metrics, defaults to list().

clean_mem

Type: logical. Whether to force garbage collection before and after training in order to reclaim RAM. Defaults to FALSE.

seed

Type: numeric. Seed for the random number generator for reproducibility, defaults to 1.

verbose

Type: numeric. Whether to print messages. Defaults to 1.

verbose_iterations

Type: numeric. How many iterations to wait before printing to the console again. Defaults to 1.

objective

Type: character or function. The objective to optimize, defaults to "reg:linear". See the multiclass sketch after this list.

  • "reg:linear": Linear Regression.

  • "reg:logistic": Logistic Regression.

  • "binary:logistic": Logistic Regression (binary classification, probabilities).

  • "binary:logitraw": Logistic Regression (binary classification, raw score).

  • "multi:softmax": Multiclass Logistic Regression (multiclass classification, best class).

  • "multi:softprob": Multiclass Logistic Regression (multiclass classification, probability matrix).

  • "rank:pairwise": LambdaMART-like Ranking (pairwise loss).

  • "count:poisson": Poisson Regression (count).

  • "poisson-nloglik": Negative Log Likelihood (Poisson Regression).

  • "reg:gamma": Gamma Regression with Log-link.

  • "gamma-nloglik": Negative Log Likelihood (Gamma Regression).

  • "gamma-deviance": Residual Deviance (Gamma Regression).

  • "reg:tweedie": Tweedie Regression with Log-link.

  • "tweedie-nloglik": Negative Log Likelihood (Tweedie Regression).
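
For the multiclass objectives, xgboost also needs the number of classes. Below is a minimal, hypothetical sketch: the xgb.DMatrix dtrain_multi (with integer labels 0, 1, 2) is assumed, and num_class is forwarded to xgboost through the ... argument.

# Hypothetical sketch: multiclass training.
# `dtrain_multi` is assumed to hold integer labels 0 .. num_class - 1;
# `num_class` is passed through `...` to xgboost's params.
model <- Laurae.xgb.train(train = dtrain_multi,
                          objective = "multi:softprob",
                          metric = "mlogloss",
                          iteration_max = 10,
                          num_class = 3)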

metric

Type: character or function. The metric to print against the watchlist, defaults to rmse.

  • "rmse": Root Mean Squared Error.

  • "mae": Mean Absolute Error.

  • "logloss": Negative Log Likelihood.

  • "error": Binary classification Error Rate.

  • "error@t": Binary classification Error Rate at t.

  • "merror": Multiclass classification Error Rate.

  • "mlogloss": Multiclass Negative Log Likelihood.

  • "auc": Area Under the Curve.

  • "ndcg@n": Normalized Discounted Cumulative Gain at n.

  • "map@n": Mean Average Precision at n.

maximize

Type: logical. Whether to maximize the metric, defaults to NULL.

boost_method

Type: character. Boosting method, defaults to "gbtree". See the sketch after this list.

  • Boosting Method.

  • xgboost has access to three different boosting methods:

    • "gblinear": Generalized Linear Model, which uses Shotgun (Parallel Stochastic Gradient Descent).

    • "gbtree": Gradient Boosted Trees, the default boosting method, using Decision Trees and Stochastic Gradient Descent.

    • "dart": Dropout Additive Regression Trees, a method employing the Dropout technique from Neural Networks.

  • The booster method has a huge impact on training performance.

  • The booster method defines the algorithm you will use for boosting or training the model.

  • For instance, a linear boosted model is obviously better for linear problems.

  • Tree-based boosted models are better for non-linear problems, as they have the ability to approximate them.

  • DART (Dropout Additive Regression Trees) is similar to Dropout in neural networks, except you are applying this idea to trees (dropping trees randomly).
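
As an illustration, switching to DART only requires changing boost_method and, optionally, the dart_* parameters. A sketch assuming the dtrain and watchlist objects from the Examples section; the dropout values are illustrative only.

# Sketch: train with the DART booster instead of plain gradient boosted trees.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_method = "dart",
                          dart_rate_drop = 0.1,  # drop about 10% of trees per iteration
                          dart_skip_drop = 0.5,  # skip dropping half of the time
                          iteration_max = 20)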

boost_tree

Type: character. Tree method, defaults to "hist".

  • Tree Method.

  • Tips: leave it alone unless you know what you are doing.

  • This parameter is exclusive to the xgboost implementation and takes three different values:

    • "exact": for training the exact original xgboost.

    • "approx": for training the approximate/distributed xgboost.

    • "hist": for training xgboost in fast histogram mode, similarly to LightGBM.

  • The tree method has a huge impact on training speed.

  • The way trees are built is essential to maximize or lower performance for training.

  • In addition, it has a huge impact on training speed: trading some feature precision for faster passes during the training loops allows models to be trained significantly faster.

boost_grow

Type: character. Growing method, defaults to "depthwise". See the sketch after this list.

  • Growing Method.

  • Tips: leave it alone unless you know what you are doing.

  • The original xgboost uses the depthwise growing policy, which does not grow deeper trees until all nodes at the current depth have been split.

  • The depthwise policy (grow_policy = "depthwise") acts as a regularizer which lowers the fitting performance, while potentially providing higher generalization performance.

  • To behave the same as LightGBM (growing at the best loss instead of at the best depth), set grow_policy = "lossguide".

  • The tree growing method allows switching between two ways of training:

    • depth-wise method: the original xgboost training way, which performs well on datasets not relying on distribution rules (far from synthetic).

    • loss-guide method: the original LightGBM training way, which performs well on datasets relying on distribution rules (close to synthetic).

  • The xgboost way of training aims to minimize depth, where growing an additional depth is considered a last resort.

  • The LightGBM way of training aims to minimize loss, where growing an additional depth is not considered a last resort.
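
A sketch of LightGBM-like, loss-guided growth, assuming the dtrain and watchlist objects from the Examples section; the depth and leaf values are illustrative only.

# Sketch: loss-guided growth with unlimited depth, capping tree size via leaves.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_tree = "hist",       # lossguide is typically used with the histogram method
                          boost_grow = "lossguide",
                          tree_depth = 0,            # 0 = unlimited depth
                          tree_leaves = 255,         # regularize through the leaf count instead
                          iteration_max = 20)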

boost_bin

Type: numeric. Maximum number of unique values per feature, defaults to 255. See the sketch after this list.

  • Number of maximum unique values per feature.

  • Tips: leave it alone unless you know what you are doing.

  • xgboost does not optimize the dataset storage depending on the max_bin parameter.

  • As such, it requires 4GB RAM to train a model on Higgs 3.5M.

  • By providing fewer unique values per feature, the model can be trained significantly faster without a large loss in performance.

  • In cases where the dataset is closer to a synthetic dataset, the model might perform even better than without binning.
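
A sketch trading binning accuracy for speed, assuming the dtrain and watchlist objects from the Examples section; the bin count is illustrative only.

# Sketch: fewer split candidates per feature for faster histogram training.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_tree = "hist",
                          boost_bin = 63,   # 63 unique values per feature instead of 255
                          iteration_max = 20)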

boost_memory

Type: character. Memory used for binning, defaults to "uint32".

  • Memory pressure of bins.

  • Tips: leave it alone unless you know what you are doing.

  • The matrix data type defines the memory pressure of the binned data, while also determining the maximum number of bins.

  • The default binning is 32 bit, which means 255 bins are possible per column.

  • Lowering it to 16 bit (127 bins) or 8 bit (63 bins) lowers the maximum number of bins, therefore lowering accuracy and improving memory pressure.

boost_weighting

Type: numeric. Weighting of positive labels, defaults to 1. See the sketch after this list.

  • Multiplication applied to every positive label weight.

  • Tips: leave it alone unless you know what you are doing.

  • The positive label should be the rare label.

  • By performing a weight multiplication on the positive label, the model performs cost-sensitive training.

  • The cost-sensitive training is applied to the booster model, which directly impacts the trained models.

  • It implies potentially higher performance, especially for ranking tasks such as those evaluated with AUC.
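
A sketch of cost-sensitive weighting computed from class imbalance, assuming the dtrain and watchlist objects from the Examples section and a binary label vector where 1 is the rare positive class; the negatives/positives ratio is a common heuristic, not a rule.

# Sketch: weight positive labels by the ratio of negatives to positives.
labels <- getinfo(dtrain, "label")
pos_weight <- sum(labels == 0) / sum(labels == 1)
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_weighting = pos_weight,
                          iteration_max = 20)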

learn_threads

Type: numeric. Number of threads, defaults to 1. See the sketch after this list.

  • Number of threads used for training models.

  • Tips: larger data benefit from more threads, but smaller data has reverse benefits.

  • Intel CPUs benefit from hyperthreading and you should use the number of logical cores in your computer instead of the number of physical cores.

  • The old rationale "number of threads = physical cores" dates from when multithreading was so poor that the overhead was too large. Nowadays, this is not true for most cases (otherwise nobody would multithread anymore).

  • Using multithreaded training allows models to be trained faster.

  • This is not always true in the case of small datasets, where training is so fast that the overhead is too large.

  • In addition, when using many threads (like 40 on 1Mx1K dataset), be careful of the number of leaves parameter combined with unlimited depth, as it will massively slow down the training.

  • To find the best number of threads, you can benchmark manually the training speed by changing the number of threads.

  • Choosing the number of threads depends both on your CPU and the dataset. Do not overallocate logical cores.
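
A sketch for benchmarking thread counts manually, assuming the dtrain object from the Examples section; the candidate thread counts are illustrative only.

# Sketch: time the same training run with different thread counts.
library(parallel)
for (threads in unique(c(1, 2, detectCores(logical = TRUE)))) {
  elapsed <- system.time(
    Laurae.xgb.train(train = dtrain,
                     objective = "binary:logistic",
                     metric = "auc",
                     learn_threads = threads,
                     iteration_max = 20)
  )[["elapsed"]]
  cat("threads =", threads, "->", elapsed, "seconds\n")
}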

learn_shrink

Type: numeric. Learning rate, defaults to 0.05.

  • Multiplication performed on each boosting iteration.

  • Tips: set this larger for hyperparameter tuning.

  • Once your learning rate is fixed, do not change it.

  • It is not a good practice to consider the learning rate as a hyperparameter to tune.

  • Learning rate should be tuned according to your training speed and performance tradeoff.

  • Do not let an optimizer tune it: you should not expect to end up with an oddly specific, overfitted learning rate such as 0.0202048.

  • Each iteration is supposed to provide an improvement to the training loss.

  • Such improvement is multiplied with the learning rate in order to perform smaller updates.

  • Smaller updates make the model overfit the data more slowly, but require more iterations for training.

  • For instance, doing 50 iterations at a learning rate of 0.1 would roughly require doing 5,000 iterations at a learning rate of 0.001, which might be obnoxious for large datasets.

  • Typically, we use a learning rate of 0.05 or lower for training, while a learning rate of 0.10 or larger is used for tinkering with the hyperparameters.

iteration_max

Type: numeric. Number of boosting iterations, defaults to 100.

  • Number of boosting iterations.

  • Tips: combine with early stopping to stop boosting automatically.

  • Larger is not always better.

  • Keep an eye on overfitting.

  • It is better to perform cross-validation one model at a time, in order to get the number of iterations per fold. In addition, this gives a precise idea of how noisy the data is.

  • When selecting the number of iterations, it is typical to select 1.10x the mean of the number of iterations found via cross-validation.

iteration_trees

Type: numeric. Averaged trees per iteration, defaults to 1. See the sketch after this list.

  • Number of trees per boosting iteration.

  • Tips: Do not tune it unless you know what you are doing.

  • To achieve Random Forest, one should use sampling parameters to not get identical trees.

  • The combination of Random Forest and Gradient Boosting is a well-known "not so good" combination.

  • In fact, Gradient Boosted Trees is supposed to be an extension of Decision Trees and Random Forest, using mathematical optimization.

  • Therefore, it does not make practical sense to use Gradient Boosted Random Forests.

  • To achieve a similar performance to Random Forests, one should use a row sampling of 0.632 (.632 Bootstrap) and a column sampling depending on the task.

  • For regression, it is recommended to use 1/3 features per tree.

  • For classification, it is recommended to use sqrt(number of features)/(number of features) features per tree.

  • For other cases, no recommendations exist.
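
A sketch of a Random-Forest-like configuration following the recommendations above, assuming the dtrain object from the Examples section (a binary classification task); the number of trees is illustrative only.

# Sketch: one boosting round averaging many randomized trees.
n_features <- ncol(dtrain)
model <- Laurae.xgb.train(train = dtrain,
                          objective = "binary:logistic",
                          metric = "auc",
                          iteration_max = 1,      # a single boosting iteration...
                          iteration_trees = 100,  # ...averaging 100 trees
                          learn_shrink = 1,
                          sample_row = 0.632,     # .632 bootstrap
                          sample_col = sqrt(n_features) / n_features)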

iteration_stop

Type: numeric. Number of iterations without improvement before stopping, defaults to 20. See the sketch after this list.

  • Number of maximum iterations without improvements.

  • Tips: make sure you added a validation dataset to watch, otherwise this parameter is useless.

  • Setting early stopping too large risks overfitting, because it keeps training from stopping even when improvements are due to luck.

  • Scale this parameter appropriately with the learning rate (usually: linearly).

  • Early stopping prevents a model from training until the end when the validation metric has not improved for a specified number of iterations.

  • By keeping this value low enough, boosting will quickly give up training when there is no improvement over time.

  • When it is large enough, boosting will refuse to give up training, even though some improvements over the best iteration might be pure luck.

  • This value should be scaled according to the number of iterations.
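
A sketch of early stopping, assuming the dtrain and dtest objects from the Examples section; the iteration budget and patience are illustrative only.

# Sketch: early stopping requires a validation set in the watchlist.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = list(eval = dtest),
                          objective = "binary:logistic",
                          metric = "auc",
                          learn_shrink = 0.05,
                          iteration_max = 1000,  # upper bound only
                          iteration_stop = 20)   # stop after 20 iterations without improvement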

tree_depth

Type: numeric. Maximum tree depth, defaults to 6.

  • Maximum depth of each trained tree.

  • Tips: use unlimited depth when needing deep branched trees.

  • Unlimited depth is essential for training models whose branching is one-sided (instead of balanced), such as when a long chain of features (e.g. 50 consecutive splits) is needed to reach the real underlying rule.

  • Each model trained at each iteration will have that maximum depth and cannot bypass it.

  • As the maximum depth increases, the model is able to fit the training data better.

  • However, fitting the training data better does not guarantee better generalization to unseen data.

  • In addition, this is the most sensitive hyperparameter for gradient boosting: tune this first.

  • xgboost lossguide training allows 0 depth training (unlimited depth).

  • The maximum number of leaves allowed, if depth is not unlimited, is equal to 2^depth (e.g. a maximum depth of 10 leads to a maximum of 1,024 leaves).

tree_leaves

Type: numeric. Maximum tree leaves, defaults to 0. See the sketch after this list.

  • Maximum leaves for each trained tree.

  • Tips: adjust depth accordingly by allowing a slightly higher depth than the theoretical number of leaves.

  • Restricting the number of leaves acts as a regularization in order to not grow very deep trees.

  • It also prevents growing gigantic trees when the maximum depth is large (if not unlimited).

  • Each model trained at each iteration will have that maximum leaves and cannot bypass it.

  • As the maximum number of leaves increases, the model is able to fit the training data better.

  • However, fitting the training data better does not guarantee better generalization to unseen data.

  • In addition, this is the second most sensitive hyperparameter for gradient boosting: tune it together with the maximum depth.
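
A sketch pairing a leaf cap with a slightly higher depth, assuming the dtrain and watchlist objects from the Examples section; 63 leaves fit inside a depth-6 tree (2^6 = 64 leaves), so depth 7 leaves some headroom for unbalanced branching. The values are illustrative only.

# Sketch: cap tree size through leaves, with a slightly larger depth budget.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          boost_grow = "lossguide",
                          tree_depth = 7,
                          tree_leaves = 63,
                          iteration_max = 20)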

sample_row

Type: numeric. Row sampling, defaults to 1.

  • Percentage of rows used per iteration.

  • Tips: adjust it roughly but not precisely.

  • Stochastic Gradient Descent is not always better than Gradient Descent.

  • The name "Stochastic Gradient Descent" is technically both right and wrong.

  • Each model trained at each iteration will see only the specified fraction of rows.

  • By training over random partitions of the data, abusing the stochastic nature of the process, the resulting model might fit better the data.

  • In addition, this is the third most sensitive hyperparameter for gradient boosting: tune it together with the column sampling.

  • Tuning to a very peculiar sampling value (like 0.728472) in combination with a specific seed is a sign of overfitting, as such precision does not make sense.

sample_col

Type: numeric. Column sampling per tree, defaults to 1. See the sketch after this list.

  • Percentage of columns used per iteration.

  • Tips: adjust it roughly but not precisely.

  • Stochastic Gradient Descent is not always better than Gradient Descent.

  • The name "Stochastic Gradient Descent" is technically both right and wrong.

  • Each model trained at each iteration will see only the specified fraction of columns.

  • By training over random partitions of the data, abusing the stochastic nature of the process, the resulting model might fit better the data.

  • In addition, this is the third most sensitive hyperparameter for gradient boosting: tune it together with the row sampling.

  • Tuning to a very peculiar sampling value (like 0.728472) in combination with a specific seed is a sign of overfitting, as such precision does not make sense.
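
A sketch of rough stochastic sampling, assuming the dtrain and watchlist objects from the Examples section; prefer round values such as 0.8 over precise ones.

# Sketch: sample 80% of rows per iteration and 80% of columns per tree.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          sample_row = 0.8,
                          sample_col = 0.8,
                          iteration_max = 20)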

reg_l1

Type: numeric. L1 regularization, defaults to 0.

  • L1 Regularization for boosting.

  • Tips: leave it alone unless you know what you are doing.

  • Adding regularization is not always better.

  • The regularization scaling is dataset-dependent and weight-dependent.

  • Gradient Boosting applies regularization to the numerator of the gain computation.

  • In addition, it is added to the numerator multiplied by the weight of the sample.

  • Each sample has its own pair of gradient/hessian, unlike typical gradient descent methods where that statistic pair is summed up for immediate output and parameter adjustment.

reg_l2

Type: numeric. L2 regularization, defaults to 0.

  • L2 Regularization for boosting.

  • Tips: leave it alone unless you know what you are doing.

  • Adding regularization is not always better.

  • The regularization scaling is dataset-dependent and weight-dependent.

  • Gradient Boosting applies regularization to the numerator of the gain computation.

  • In addition, it is added to the numerator multiplied by the weight of the sample.

  • Each sample has its own pair of gradient/hessian, unlike typical gradient descent methods where that statistic pair is summed up for immediate output and parameter adjustment.

reg_l2_bias

Type: numeric. L2 Bias regularization (not for GBDT models), defaults to 0.

  • L2 Bias Regularization for boosting.

  • Tips: leave it alone unless you know what you are doing.

  • Adding regularization is not always better.

  • The regularization scaling is dataset-dependent and weight-dependent.

  • Gradient Boosting applies regularization to the numerator of the gain computation.

  • In addition, it is added to the numerator multiplied by the weight of the sample.

  • Each sample has its own pair of gradient/hessian, unlike typical gradient descent methods where that statistic pair is summed up for immediate output and parameter adjustment.

reg_loss

Type: numeric. Minimum Loss per Split, defaults to 0.

  • Prune by minimum loss requirement.

  • Tips: leave it alone unless you know what you are doing.

  • Adding pruning threshold is not always better.

  • Gamma (loss) regularization happens after growing (it blocks branches from being kept), unlike Hessian regularization.

  • Loss regularization is a direct regularization technique allowing the model to prune any leaves which do not meet the minimal gain to split criteria.

  • This is extremely useful when you are trying to build deep trees but trying also to avoid building useless branches of the trees (overfitting).

reg_hessian

Type: numeric. Minimum Hessian per Split, defaults to 1. See the sketch after this list.

  • Prune by minimum hessian requirement.

  • Tips: leave it alone unless you know what you are doing.

  • Adding pruning threshold is not always better.

  • Hessian regularization happens on the fly (it blocks branches from growing), unlike Loss regularization.
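
A sketch combining both pruning thresholds when growing deeper trees, assuming the dtrain and watchlist objects from the Examples section; the thresholds are illustrative only.

# Sketch: prune deeper trees with a minimum gain and a minimum hessian per split.
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          objective = "binary:logistic",
                          metric = "auc",
                          tree_depth = 10,
                          reg_loss = 1,      # minimum loss reduction to keep a split
                          reg_hessian = 5,   # minimum hessian weight required per split
                          iteration_max = 20)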

dart_rate_drop

Type: numeric. DART booster tree drop rate, defaults to 0.

  • Probability of dropping a tree on each iteration.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Defines the dropping probability of each tree during each DART iteration to regenerate gradient/hessian statistics.

dart_skip_drop

Type: numeric. DART booster tree skip rate, defaults to 0.

  • Probability of skipping any drop on each iteration.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Defines the probability of skipping dropping during each DART iteration to regenerate gradient/hessian statistics.

dart_sampling

Type: character. DART booster sampling distribution, defaults to "uniform". The other choice is "weighted".

  • Uniform weight application for trees.

  • Tips: leave it alone unless you know what you are doing.

  • Use sample_type = "uniform" to set up uniform sampling for dropped trees.

  • You may also use sample_type = "weighted" to drop trees in proportion to their weights, as defined by normalize_type.

  • Smaller/Larger is not always better.

  • Defines the sampling distribution used to select which trees are dropped during each DART iteration.

dart_norm

Type: character. DART booster weight normalization, defaults to "tree". The other choice is "forest".

  • Weight normalization method for trees.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Normalizing the weight of trees differently allows putting an emphasis on the earliest/latest trees built, leading to different tree structures.

dart_min_1

Type: numeric. DART booster drop at least one tree, defaults to 0. The other choice is 1.

  • Minimum of one dropped tree at any iteration.

  • Tips: leave it alone unless you know what you are doing.

  • Smaller/Larger is not always better.

  • Dropping at least one tree at each iteration allows different trees to be built.

...

Other parameters to pass to xgboost's params.

Details

Some xgboost parameters were removed from this function's signature. You may still pass them through ... without any issues, unlike other parameters.

Value

The xgboost model.

Examples

library(Matrix)
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

# Custom objective: binary logistic regression (gradient and hessian of the log loss).
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1/(1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}

# Custom evaluation metric: binary classification error rate on raw scores.
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
  return(list(metric = "error", value = err))
}

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = "binary:logistic",
                          metric = "auc",
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5)

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = logregobj,
                          metric = "auc",
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5)

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = "binary:logistic",
                          metric = evalerror,
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5,
                          maximize = FALSE)

# CANNOT DO THIS: any metric other than the first one is ignored
model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = logregobj,
                          metric = c("rmse", "auc"),
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5)

model <- Laurae.xgb.train(train = dtrain,
                          watchlist = watchlist,
                          verbose = 1,
                          objective = logregobj,
                          metric = evalerror,
                          tree_depth = 2,
                          learn_shrink = 1,
                          learn_threads = 1,
                          iteration_max = 5,
                          maximize = FALSE)

# CAN'T DO THIS
# model <- Laurae.xgb.train(train = dtrain,
#                           watchlist = watchlist,
#                           verbose = 1,
#                           objective = logregobj,
#                           metric = c(evalerror, "auc"),
#                           tree_depth = 2,
#                           learn_shrink = 1,
#                           learn_threads = 1,
#                           iteration_max = 5,
#                           maximize = FALSE)
