XGBTrainer: Extreme Gradient Boosting Trainer

Description Details Public fields Methods Examples

Description

Trains a XGBoost model in R

Details

Trains a Extreme Gradient Boosting Model. XGBoost belongs to a family of boosting algorithms that creates an ensemble of weak learner to learn about data. It is a wrapper for original xgboost R package, you can find the documentation here: http://xgboost.readthedocs.io/en/latest/parameter.html

Public fields

booster

the trainer type, the values are gbtree(default), gblinear, dart:gbtree

objective

specify the learning task. Check the link above for all possible values.

nthread

number of parallel threads used to run, default is to run using all threads available

silent

0 means printing running messages, 1 means silent mode

n_estimators

number of trees to grow, default = 100

learning_rate

Step size shrinkage used in update to prevents overfitting. Lower the learning rate, more time it takes in training, value lies between between 0 and 1. Default = 0.3

gamma

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. Value lies between 0 and infinity, Default = 0

max_depth

the maximum depth of each tree, default = 6

min_child_weight

Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression task, this simply corresponds to minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be. Value lies between 0 and infinity. Default = 1

subsample

Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. and this will prevent overfitting. Subsampling will occur once in every boosting iteration. Value lies between 0 and 1. Default = 1

colsample_bytree

Subsample ratio of columns when constructing each tree. Subsampling will occur once in every boosting iteration. Value lies between 0 and 1. Default = 1

lambda

L2 regularization term on weights. Increasing this value will make model more conservative. Default = 1

alpha

L1 regularization term on weights. Increasing this value will make model more conservative. Default = 0

eval_metric

Evaluation metrics for validation data, a default metric will be assigned according to objective

print_every

print training log after n iterations. Default = 50

feval

custom evaluation function

early_stopping

Used to prevent overfitting, stops model training after this number of iterations if there is no improvement seen

maximize

If feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, it means the larger the evaluation score the better.

custom_objective

custom objective function

save_period

when it is non-NULL, model is saved to disk after every save_period rounds, 0 means save at the end.

save_name

the name or path for periodically saved model file.

xgb_model

a previously built model to continue the training from. Could be either an object of class xgb.Booster, or its raw data, or the name of a file with a previously saved model.

callbacks

a list of callback functions to perform various task during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. User can provide either existing or their own callback methods in order to customize the training process.

verbose

If 0, xgboost will stay silent. If 1, xgboost will print information of performance. If 2, xgboost will print some additional information. Setting verbose > 0 automatically engages the cb.evaluation.log and cb.print.evaluation callback functions.

watchlist

what information should be printed when verbose=1 or verbose=2. Watchlist is used to specify validation set monitoring during training. For example user can specify watchlist=list(validation1=mat1, validation2=mat2) to watch the performance of each round's model on mat1 and mat2

num_class

set number of classes in case of multiclassification problem

weight

a vector indicating the weight for each row of the input.

na_missing

by default is set to NA, which means that NA values should be considered as 'missing' by the algorithm. Sometimes, 0 or other extreme value might be used to represent missing values. This parameter is only used when input is a dense matrix.

feature_names

internal use, stores the feature names for model importance

cv_model

internal use

Methods

Public methods


Method new()

Usage
XGBTrainer$new(
  booster,
  objective,
  nthread,
  silent,
  n_estimators,
  learning_rate,
  gamma,
  max_depth,
  min_child_weight,
  subsample,
  colsample_bytree,
  lambda,
  alpha,
  eval_metric,
  print_every,
  feval,
  early_stopping,
  maximize,
  custom_objective,
  save_period,
  save_name,
  xgb_model,
  callbacks,
  verbose,
  num_class,
  weight,
  na_missing
)
Arguments
booster

the trainer type, the values are gbtree(default), gblinear, dart:gbtree

objective

specify the learning task. Check the link above for all possible values.

nthread

number of parallel threads used to run, default is to run using all threads available

silent

0 means printing running messages, 1 means silent mode

n_estimators

number of trees to grow, default = 100

learning_rate

Step size shrinkage used in update to prevents overfitting. Lower the learning rate, more time it takes in training, value lies between between 0 and 1. Default = 0.3

gamma

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. Value lies between 0 and infinity, Default = 0

max_depth

the maximum depth of each tree, default = 6

min_child_weight

Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression task, this simply corresponds to minimum number of instances needed to be in each node. The larger min_child_weight is, the more conservative the algorithm will be. Value lies between 0 and infinity. Default = 1

subsample

Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. and this will prevent overfitting. Subsampling will occur once in every boosting iteration. Value lies between 0 and 1. Default = 1

colsample_bytree

Subsample ratio of columns when constructing each tree. Subsampling will occur once in every boosting iteration. Value lies between 0 and 1. Default = 1

lambda

L2 regularization term on weights. Increasing this value will make model more conservative. Default = 1

alpha

L1 regularization term on weights. Increasing this value will make model more conservative. Default = 0

eval_metric

Evaluation metrics for validation data, a default metric will be assigned according to objective

print_every

print training log after n iterations. Default = 50

feval

custom evaluation function

early_stopping

Used to prevent overfitting, stops model training after this number of iterations if there is no improvement seen

maximize

If feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, it means the larger the evaluation score the better.

custom_objective

custom objective function

save_period

when it is non-NULL, model is saved to disk after every save_period rounds, 0 means save at the end.

save_name

the name or path for periodically saved model file.

xgb_model

a previously built model to continue the training from. Could be either an object of class xgb.Booster, or its raw data, or the name of a file with a previously saved model.

callbacks

a list of callback functions to perform various task during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. User can provide either existing or their own callback methods in order to customize the training process.

verbose

If 0, xgboost will stay silent. If 1, xgboost will print information of performance. If 2, xgboost will print some additional information. Setting verbose > 0 automatically engages the cb.evaluation.log and cb.print.evaluation callback functions.

num_class

set number of classes in case of multiclassification problem

weight

a vector indicating the weight for each row of the input.

na_missing

by default is set to NA, which means that NA values should be considered as 'missing' by the algorithm. Sometimes, 0 or other extreme value might be used to represent missing values. This parameter is only used when input is a dense matrix.

Details

Create a new 'XGBTrainer' object.

Returns

A 'XGBTrainer' object.

Examples
library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)

Method cross_val()

Usage
XGBTrainer$cross_val(X, y, nfolds = 5, stratified = TRUE, folds = NULL)
Arguments
X

data.frame

y

character, name of target variable

nfolds

integer, number of folds

stratified

logical, whether to use stratified sampling

folds

the list of CV folds' indices - either those passed through the folds parameter or randomly generated.

Details

Trains the xgboost model using cross validation scheme

Returns

NULL, trains a model and saves it in memory

Examples
library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)

# do cross validation to find optimal value for n_estimators
xgb$cross_val(X = df, y = 'Species',nfolds = 3, stratified = TRUE)

Method fit()

Usage
XGBTrainer$fit(X, y, valid = NULL)
Arguments
X

data.frame, training data

y

character, name of target variable

valid

data.frame, validation data

Details

Fits the xgboost model on given data

Returns

NULL, trains a model and keeps it in memory

Examples
library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)
xgb$fit(df, 'Species')

Method predict()

Usage
XGBTrainer$predict(df)
Arguments
df

data.frame, test data set

Details

Returns predicted values for a given test data

Returns

xgboost predictions

Examples
#' library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)
xgb$fit(df, 'Species')

# make predictions
preds <- xgb$predict(as.matrix(iris[,1:4]))

Method show_importance()

Usage
XGBTrainer$show_importance(type = "plot", topn = 10)
Arguments
type

character, could be 'plot' or 'table'

topn

integer, top n features to display

Details

Shows feature importance plot

Returns

a table or a plot of feature importance

Examples
library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)
xgb$fit(df, 'Species')
xgb$show_importance()

Method clone()

The objects of this class are cloneable with this method.

Usage
XGBTrainer$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
## ------------------------------------------------
## Method `XGBTrainer$new`
## ------------------------------------------------

library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)

## ------------------------------------------------
## Method `XGBTrainer$cross_val`
## ------------------------------------------------

library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)

# do cross validation to find optimal value for n_estimators
xgb$cross_val(X = df, y = 'Species',nfolds = 3, stratified = TRUE)

## ------------------------------------------------
## Method `XGBTrainer$fit`
## ------------------------------------------------

library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)
xgb$fit(df, 'Species')

## ------------------------------------------------
## Method `XGBTrainer$predict`
## ------------------------------------------------

#' library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)
xgb$fit(df, 'Species')

# make predictions
preds <- xgb$predict(as.matrix(iris[,1:4]))

## ------------------------------------------------
## Method `XGBTrainer$show_importance`
## ------------------------------------------------

library(data.table)
df <- copy(iris)

# convert characters/factors to numeric
df$Species <- as.numeric(as.factor(df$Species))-1

# initialise model
xgb <- XGBTrainer$new(objective = 'multi:softmax',
                      maximize = FALSE,
                      eval_metric = 'merror',
                      num_class=3,
                      n_estimators = 2)
xgb$fit(df, 'Species')
xgb$show_importance()

superml documentation built on April 28, 2020, 9:05 a.m.