```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```
```r
library(lightgbm.py)
library(mlbench)
```
There are several ways to perform cross validation (CV) with this lightgbm implementation. To start this example, let's create a dataset and split it into training data (`split$train_index`) and validation data (`split$test_index`).
data("PimaIndiansDiabetes2") dataset <- data.table::as.data.table(PimaIndiansDiabetes2) target_col <- "diabetes" id_col <- NULL split <- sklearn_train_test_split( dataset, target_col, split = 0.7, seed = 17, return_only_index = TRUE, stratify = TRUE ) table(dataset[split$train_index, target_col, with = F]) table(dataset[split$test_index, target_col, with = F])
We then instantiate the learner, initialize the training data, and set the learner's basic parameters.
```r
# instantiate the learner
lgb_learner <- LightGBM$new()

# initialize the training data
lgb_learner$init_data(
  dataset = dataset[split$train_index, ],
  target_col = target_col,
  id_col = id_col
)

# set basic parameters
lgb_learner$param_set$values <- list(
  "objective" = "binary",
  "learning_rate" = 0.1,
  "seed" = 17,
  "metric" = "auc"
)
lgb_learner$positive <- "pos"
```
Using the default settings, a 5-fold CV is used to find the optimal number of boosting iterations. `num_boost_round` is set here to specify the maximum number of boosting iterations. `early_stopping_rounds` is set to specify the number of boosting iterations after which training is stopped if the metric does not improve anymore.
```r
lgb_learner$num_boost_round <- 100
lgb_learner$early_stopping_rounds <- 10
# optionally, categorical features could be declared by name or index:
# lgb_learner$categorical_feature <- c("pregnant", "age")
# lgb_learner$categorical_feature <- c(0L, 7L)
lgb_learner$train()
```
The number of folds of the CV can also be specified:
```r
lgb_learner$cv_folds <- 10
```
It is also possible to perform the CV manually, using the `train_cv` function:
```r
lgb_learner$num_boost_round <- 5000
lgb_learner$early_stopping_rounds <- 1000
lgb_learner$train_cv()
```
Please note that in this case, the learner automatically adjusts the parameter `num_boost_round` for the subsequent training step:
```r
lgb_learner$num_boost_round
```
The training can then be performed using the `train` function:
```r
lgb_learner$train()
```
The optimal value of `num_boost_round` can also be found by using a validation dataset. For this purpose, this implementation provides the function `valids`. However, you need to instantiate the learner first, initialize the training data, and provide at least the learner's `objective` parameter, as described above.
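For readers starting at this section, a minimal setup sketch is shown below. It simply repeats the calls demonstrated earlier and assumes the same `dataset` and `split` objects from the beginning of this vignette.

```r
# minimal setup sketch, reusing the dataset and split created above
lgb_learner <- LightGBM$new()
lgb_learner$init_data(
  dataset = dataset[split$train_index, ],
  target_col = target_col,
  id_col = id_col
)
# at least the objective parameter is required
lgb_learner$param_set$values <- list("objective" = "binary")
lgb_learner$positive <- "pos"
```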
Then you can switch off the default CV setting by setting the field `nrounds_by_cv` to `FALSE`:
```r
lgb_learner$nrounds_by_cv <- FALSE
```
Next, you can pass the validation dataset to the `valids` function:
```r
lgb_learner$valids(validset = dataset[split$test_index, ])
```
The validation data and validation label can now be inspected:
```r
head(lgb_learner$valid_data$data)
head(lgb_learner$valid_label)
```
Again, `num_boost_round` and `early_stopping_rounds` are set to specify the limits of the boosting process. Then the `train` function can be executed:
```r
lgb_learner$num_boost_round <- 5000
lgb_learner$early_stopping_rounds <- 1000
lgb_learner$train()
```