knitr::opts_chunk$set(
  collapse = TRUE
  , comment = "#>"
  , warning = FALSE
  , message = FALSE
)
Welcome to the world of LightGBM, a highly efficient gradient boosting implementation (Ke et al. 2017).
library(lightgbm)

# limit number of threads used, to be respectful of CRAN's resources
# when it checks this vignette
data.table::setDTthreads(1L)
setLGBMthreads(2L)
This vignette will guide you through its basic usage. It will show how to build a simple binary classification model based on a subset of the bank
dataset (Moro, Cortez, and Rita 2014). You will use the two input features "age" and "balance" to predict whether a client has subscribed a term deposit.
The dataset looks as follows.
data(bank, package = "lightgbm")

bank[1L:5L, c("y", "age", "balance")]

# Distribution of the response
table(bank$y)
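To get a feel for the two input features before modeling, a quick numeric summary can help. This snippet is an illustrative addition, not part of the original walkthrough; it only uses base R's summary().

# Numeric summary of the two features used below (illustrative addition)
summary(bank[, c("age", "balance")])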
The R package of LightGBM offers two functions to train a model:
- lgb.train(): This is the main training logic. It offers full flexibility but requires a Dataset object created by the lgb.Dataset() function.
- lightgbm(): Simpler, but less flexible. Data can be passed without having to bother with lgb.Dataset().

Using the lightgbm() function

In a first step, you need to convert the data to numeric. Afterwards, you are ready to fit the model with the lightgbm() function.
# Numeric response and feature matrix
y <- as.numeric(bank$y == "yes")
X <- data.matrix(bank[, c("age", "balance")])

# Train
fit <- lightgbm(
  data = X
  , label = y
  , params = list(
    num_leaves = 4L
    , learning_rate = 1.0
    , objective = "binary"
  )
  , nrounds = 10L
  , verbose = -1L
)

# Result
summary(predict(fit, X))
It seems to have worked! And the predictions are indeed probabilities between 0 and 1.
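Because the model was trained with objective = "binary", predict() returns probabilities by default. As a quick, illustrative check (not part of the original walkthrough; the 0.5 cutoff is an assumption), you can turn them into class labels and compute the share of correctly classified training rows:

prob <- predict(fit, X)

# Classify with an (assumed) cutoff of 0.5 and compare against the true labels
pred <- as.integer(prob > 0.5)
mean(pred == y)  # training accuracy, for illustration only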
Using the lgb.train() function

Alternatively, you can go for the more flexible interface lgb.train(). Here, as an additional step, you need to prepare y and X via LightGBM's data API, lgb.Dataset(). Parameters are passed to lgb.train() as a named list.
# Data interface
dtrain <- lgb.Dataset(X, label = y)

# Parameters
params <- list(
  objective = "binary"
  , num_leaves = 4L
  , learning_rate = 1.0
)

# Train
fit <- lgb.train(
  params
  , data = dtrain
  , nrounds = 10L
  , verbose = -1L
)
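The extra flexibility of lgb.train() shows up, for example, when monitoring performance on a validation set via its valids argument. The following sketch is illustrative and assumes a simple random holdout split (the seed and the 80/20 split are arbitrary choices, not from the vignette); lgb.Dataset.create.valid() and lgb.get.eval.result() are exported helpers of the lightgbm package.

set.seed(708L)  # arbitrary seed, only to make the illustration reproducible
idx <- sample(nrow(X), size = floor(0.8 * nrow(X)))

# Training Dataset and a validation Dataset derived from it
dtrain2 <- lgb.Dataset(X[idx, ], label = y[idx])
dvalid <- lgb.Dataset.create.valid(dtrain2, X[-idx, ], label = y[-idx])

fit2 <- lgb.train(
  params
  , data = dtrain2
  , nrounds = 10L
  , valids = list(valid = dvalid)
  , verbose = -1L
)

# Validation log loss recorded per boosting round
lgb.get.eval.result(fit2, "valid", "binary_logloss")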
Try it out! If stuck, visit LightGBM's documentation for more details.
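The cleanup chunk below removes a file called "lightgbm.model". For completeness, here is a minimal sketch of how such a file could be written and read back, using the package's lgb.save() and lgb.load() helpers (this step is illustrative; the vignette itself only performs the cleanup):

# Write the trained booster to disk, then load it again
lgb.save(fit, "lightgbm.model")
fit_reloaded <- lgb.load("lightgbm.model")

# The reloaded model produces the same predictions
summary(predict(fit_reloaded, X))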
# Cleanup
if (file.exists("lightgbm.model")) {
  file.remove("lightgbm.model")
}
References

Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." In Advances in Neural Information Processing Systems 30 (NIPS 2017).
Moro, Sérgio, Paulo Cortez, and Paulo Rita. 2014. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems 62: 22–31.