knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
dir.create("./png")
library(lightgbm.py)
library(MLmetrics)

Install the package

Make sure the reticulate package (version >= 1.14) is configured properly on your system and points to a Python environment. If it is not, you can, for example, install Miniconda:

reticulate::install_miniconda(
  path = reticulate::miniconda_path(),
  update = TRUE,
  force = FALSE
)
reticulate::py_config()
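
If reticulate discovers several Python installations, you can also point it at a specific conda environment explicitly before loading the package. The environment name "r-reticulate" below is only an example, not a requirement of lightgbm.py:

reticulate::use_condaenv("r-reticulate", required = TRUE)
reticulate::py_config()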

Before you can use the lightgbm.py package with GPU acceleration, you need to compile the lightgbm Python package according to its documentation. On Linux or inside a Docker container, you can compile the GPU version with the following command:

pip install lightgbm --install-option=--gpu
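
Afterwards, you can verify from within R that the lightgbm Python module is importable. This is a quick sanity check using reticulate's standard helper:

reticulate::py_module_available("lightgbm")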

Load the dataset

The data must be provided as a data.table object. To simplify the subsequent steps, the target column name and the ID column name are assigned to the variables target_col and id_col, respectively.

data("iris")
dataset <- data.table::as.data.table(iris)
target_col <- "Species"
id_col <- NULL

To evaluate the model performance, the dataset is split into a training set and a test set with sklearn_train_test_split. This function is a wrapper around Python sklearn's sklearn.model_selection.train_test_split function and ensures stratified sampling for categorical target variables.

split <- sklearn_train_test_split(
  dataset,
  target_col,
  split = 0.7,
  seed = 17,
  return_only_index = TRUE,
  stratify = TRUE
)
table(dataset[split$train_index, target_col, with = FALSE])
table(dataset[split$test_index, target_col, with = FALSE])
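
As an additional sanity check (base R only), you can compare the class proportions of both partitions; with stratify = TRUE they should be nearly identical:

prop.table(table(dataset[split$train_index, ][[target_col]]))
prop.table(table(dataset[split$test_index, ][[target_col]]))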

Instantiate the lightgbm learner

First, the LightGBM class needs to be instantiated and the training data passed to its init_data function:

lgb_learner <- LightGBM$new()
lgb_learner$init_data(
  dataset = dataset[split$train_index, ],
  target_col = target_col,
  id_col = id_col
)

Configure the learner

Next, the learner parameters need to be set. At a minimum, the objective parameter must be provided. Almost all of LightGBM's parameters have been implemented in this package; you can inspect them using the following command:

lgb_learner$param_set
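
The values assignment below follows the paradox::ParamSet API. Assuming param_set is indeed a paradox ParamSet (an assumption based on that usage pattern, not confirmed by the package documentation), you can also list the parameter ids or view the whole set as a table:

# assumes param_set is a paradox::ParamSet
lgb_learner$param_set$ids()                      # parameter names only
data.table::as.data.table(lgb_learner$param_set) # overview with types and ranges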

To use GPU acceleration, the parameter device_type = "gpu" (default: "cpu") needs to be set. According to the LightGBM parameter manual, 'it is recommended to use the smaller max_bin (e.g. 63) to get the better speed up'.

lgb_learner$param_set$values <- list(
  "objective" = "multiclass",
  "learning_rate" = 0.1,
  "seed" = 17L,
  "metric" = "multi_logloss",
  "device_type" = "gpu",
  "max_bin" = 63L
)

Train the learner

The learner is now ready to be trained using its train function. The parameters num_boost_round and early_stopping_rounds can be set here. Please refer to the LightGBM manual for further details on these parameters.

lgb_learner$num_boost_round <- 100
lgb_learner$early_stopping_rounds <- 10
lgb_learner$train()

Evaluate the model performance

The learner's predict function returns the predicted class probabilities, one column per class for a multiclass objective. The predicted class labels are derived from these probabilities below:

predictions <- lgb_learner$predict(newdata = dataset[split$test_index, ])
head(predictions)
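
Since the downstream code treats predictions as a numeric matrix of class probabilities, a quick plausibility check is that every row sums to (approximately) one:

summary(rowSums(predictions))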

In order to calculate the model metrics, the test set's target variable has to be transformed in the same way as the learner transformed the training target. The value mappings are stored in the learner object's value_mapping fields:

# before transformation
head(dataset[split$test_index, get(target_col)])

# use the learner's transform_target method
target_test <- lgb_learner$trans_tar$transform_target(
  vector = dataset[split$test_index, get(target_col)],
  mapping = "dvalid"
)
# after transformation
head(target_test)

lgb_learner$trans_tar$value_mapping_dvalid
# revalue the predictions: assign the integer class codes as column names
colnames(predictions) <- c(0, 1, 2)
# for each row, pick the class with the highest predicted probability
pred_classes <- colnames(predictions)[apply(predictions, 1, which.max)]

Now, several model metrics can be calculated:

MLmetrics::ConfusionMatrix(
  y_true = target_test,
  y_pred = pred_classes
)
MLmetrics::Accuracy(
  y_true = target_test,
  y_pred = pred_classes
)
MLmetrics::MultiLogLoss(
  y_true = target_test,
  y_pred = predictions
)

The variable importances can be calculated and plotted by using the learner's importance function:

imp <- lgb_learner$importance()
imp$raw_values
filename <- "./png/lgb.py_imp_multiclass.png"
grDevices::png(
  filename = filename,
  res = 150,
  height = 1000,
  width = 1500
)
print(imp$plot)
grDevices::dev.off()
knitr::include_graphics(filename)
