knitr::opts_chunk$set(collapse = TRUE)
# devtools::load_all()
library(compboost)

required.pcks = c("ggplot2")
dependencies = all(
  unlist(lapply(required.pcks, requireNamespace, quietly = TRUE))
)
compboost was designed to provide a component-wise boosting framework with maximal flexibility. This document gives an introduction to the classes that have to be defined and shows how to access the data generated during the fitting process. Throughout this document we use the C++-style API, which is exposed to R via Rcpp modules [@eddelbuettel2017exposing].
We will have a look at how to define the data objects, the base-learner factories, the loss, the optimizer, and the loggers, and how to train the model and access the fitted results.
To get a deeper understanding of the functionality and how the classes are related, see the C++ documentation of the package.
We use the titanic dataset for binary classification on Survived. First of all, we store the train and test data in two data frames and prevent compboost from crashing by removing all rows containing NAs:
# Store train and test data:
df.train = na.omit(titanic::titanic_train)
df.test = na.omit(titanic::titanic_test)

str(df.train)
In the next step we transform the response to values $y \in \{-1, 1\}$ and do another split on the training dataset:
# Response labels have to be in {-1, 1}:
response = df.train$Survived * 2 - 1

# Train and evaluation split for training:
set.seed(1111)
idx.train = sample(x = seq_len(nrow(df.train)), size = 0.6 * nrow(df.train))
idx.eval = setdiff(seq_len(nrow(df.train)), idx.train)
This split will be used during training to calculate the out of bag risk.
The data classes can only handle matrices. Hence, the user is responsible for providing an appropriate data matrix to a specific base-learner. For instance, the spline base-learner/factory can only handle a matrix with one column, while the polynomial base-learner/factory can handle arbitrary matrices. A linear base-learner with intercept can be achieved by passing a matrix with an intercept column containing just ones and an ordinary data column, as sketched below.
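As a small illustration, such an intercept-augmented data matrix could look as follows. This matrix and the name X.linear are only used for this sketch and do not appear in the remainder of the vignette:

# Sketch: data matrix for a linear base-learner with intercept. The column of
# ones acts as the intercept, the second column holds the feature values:
X.linear = cbind(intercept = 1, age = df.train$Age[idx.train])
head(X.linear)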
In compboost the factories accept two data objects as arguments. The first one is the data source and the second one the data target (which should be an empty data object). The factory then takes the data from the source, transforms it (for instance into a spline basis), and writes the transformed data into the target.
We now want to include the ticket price Fare and the age of the passenger Age by using spline base-learners.
# Fare:
# --------

# Define data source:
data.source.fare = InMemoryData$new(as.matrix(df.train$Fare[idx.train]), "Fare")

# Define data target:
data.target.fare = InMemoryData$new()

# Define spline factory:
spline.factory.fare = BaselearnerPSpline$new(data_source = data.source.fare,
  data_target = data.target.fare, degree = 3, n_knots = 20, penalty = 10,
  differences = 2)

# Age:
# --------

# Define data source:
data.source.age = InMemoryData$new(as.matrix(df.train$Age[idx.train]), "Age")

# Define data target:
data.target.age = InMemoryData$new()

# Define spline factory:
spline.factory.age = BaselearnerPSpline$new(data_source = data.source.age,
  data_target = data.target.age, degree = 3, n_knots = 20, penalty = 10,
  differences = 2)
Remember the work-flow of the factory, which takes the data from the source, transforms it, and writes it into the target. We can access the transformed data of the target by calling the member function getData():
spline.factory.age$getData()[1:10, 1:5]
An idiosyncrasy of the splines here is that they initialize sparse matrices by default, which the data object can handle very well internally. As a consequence, calling getData() on the data target object returns an empty matrix:
data.target.age$getData()
We also want to have out of bag information while fitting the algorithm. Therefore, we have to define additional data objects containing the data sources of the evaluation data:
# Define evaluation data objects:
data.eval.fare = InMemoryData$new(as.matrix(df.train$Fare[idx.eval]), "Fare")
data.eval.age = InMemoryData$new(as.matrix(df.train$Age[idx.eval]), "Age")
Since there isn't an automated transformation of categorical data to an appropriate matrix yet, we have to handle categorical features manually. Here we just use the two features Sex and Pclass:
table(df.train$Sex)
table(df.train$Pclass)
In component-wise boosting we use one hot encoding (dummy encoding) and pass each binary vector as its own source data matrix; this is done for every category of the feature. As base-learner we use a linear base-learner to estimate group-specific means. The advantage of treating every group as its own base-learner is that only the important groups are selected; it is not necessary to update all parameters (means) of the categorical feature simultaneously. This procedure also reduces the selection bias that is inherent in component-wise boosting, see [@hofner2011framework].
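As a quick cross-check of the encoding (illustration only; the loop below builds the dummy columns explicitly), the same binary vectors could also be obtained with base R's model.matrix():

# Sketch: one hot encoding of Sex with one column per category:
dummies.sex = model.matrix(~ Sex - 1, data = df.train)
head(dummies.sex)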
The problem now is that we have to define a data source and target for every group within the categorical features. To avoid copy and pasting, we use a for loop to dynamically store the objects in a list. Note that the S4 setting makes it more difficult to assign objects to a list. Therefore, we assign an empty list before creating and storing the S4 objects:
# Gender:
# --------

# Unique groups:
classes.sex = unique(df.train$Sex)

# Frame for the data and factory:
data.sex.list = list()

data.sex.list[["source"]]  = list()
data.sex.list[["target"]]  = list()
data.sex.list[["test"]]    = list()
data.sex.list[["factory"]] = list()

for (class in classes.sex) {

  # Create dummy variable and feature name:
  class.temp = ifelse(df.train$Sex == class, 1, 0)
  data.name = paste0("Sex.", class)

  # Define data source:
  data.sex.list[["source"]][[data.name]] = InMemoryData$new(
    as.matrix(class.temp[idx.train]), # data
    data.name                         # data identifier
  )

  # Define data target:
  data.sex.list[["target"]][[data.name]] = InMemoryData$new()

  # Define oob data for logging:
  data.sex.list[["test"]][[data.name]] = InMemoryData$new(
    as.matrix(class.temp[idx.eval]), # data
    data.name                        # data identifier
  )

  # Define factory object:
  data.sex.list[["factory"]][[data.name]] = BaselearnerPolynomial$new(
    data_source = data.sex.list[["source"]][[data.name]],
    data_target = data.sex.list[["target"]][[data.name]],
    degree = 1,
    intercept = FALSE
  )
}

# Passenger Class:
# -------------------

# Unique groups:
classes.pclass = unique(df.train$Pclass)

# Frame for the data and factory:
data.pclass.list = list()

data.pclass.list[["source"]]  = list()
data.pclass.list[["target"]]  = list()
data.pclass.list[["test"]]    = list()
data.pclass.list[["factory"]] = list()

for (class in classes.pclass) {

  # Create dummy variable and feature name:
  class.temp = ifelse(df.train$Pclass == class, 1, 0)
  data.name = paste0("Pclass.", class)

  # Define data source:
  data.pclass.list[["source"]][[data.name]] = InMemoryData$new(
    as.matrix(class.temp[idx.train]), # data
    data.name                         # data identifier
  )

  # Define data target:
  data.pclass.list[["target"]][[data.name]] = InMemoryData$new()

  # Define oob data for logging:
  data.pclass.list[["test"]][[data.name]] = InMemoryData$new(
    as.matrix(class.temp[idx.eval]), # data
    data.name                        # data identifier
  )

  # Define factory object:
  data.pclass.list[["factory"]][[data.name]] = BaselearnerPolynomial$new(
    data_source = data.pclass.list[["source"]][[data.name]],
    data_target = data.pclass.list[["target"]][[data.name]],
    degree = 1,
    intercept = FALSE
  )
}
Finally we need to register all the base-learner factories we want to use for modeling:
# Create new factory list:
factory.list = BlearnerFactoryList$new()

# Numeric factories:
factory.list$registerFactory(spline.factory.fare)
factory.list$registerFactory(spline.factory.age)

# Categorical features:
for (lst in data.sex.list[["factory"]]) {
  factory.list$registerFactory(lst)
}
for (lst in data.pclass.list[["factory"]]) {
  factory.list$registerFactory(lst)
}

# Print registered factories:
factory.list
Since we are interested in binary classification we can use the binomial loss. This loss is used during training and determines the pseudo residuals as well as the empirical risk we want to minimize:
loss.bin = LossBinomial$new()
loss.bin
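For orientation, a common parameterization of the binomial loss for $y \in \{-1, 1\}$, consistent with the sigmoid transformation used for prediction later in this document (compboost may use a scaled variant internally), is

$$ L(y, f) = \log\bigl(1 + \exp(-y f)\bigr), \qquad -\frac{\partial L(y, f)}{\partial f} = \frac{y}{1 + \exp(y f)}, $$

where the negative gradient evaluated at the current model defines the pseudo residuals fitted by the base-learners in each iteration.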
Since we are in the boosting context, we use the classical way of selecting the best base-learner within one iteration, namely a greedy search over all registered base-learners implemented as coordinate descent:
used.optimizer = OptimizerCoordinateDescent$new()
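Conceptually, the selection within one iteration works as sketched below: every registered base-learner is fitted to the current pseudo residuals and the one with the smallest sum of squared errors is taken. This is only an R sketch of the idea (the spline penalty is ignored), not compboost's actual implementation, and the function is not used later on:

# Sketch of the greedy selection rule used by coordinate descent:
# 'pseudo.residuals' is the negative gradient at the current prediction,
# 'design.list' a list of design matrices, one per registered factory.
selectBestBaselearner = function (pseudo.residuals, design.list) {
  sse = vapply(design.list, FUN.VALUE = numeric(1), FUN = function (X) {
    beta = solve(crossprod(X), crossprod(X, pseudo.residuals)) # least squares fit
    sum((pseudo.residuals - X %*% beta)^2)                     # sum of squared errors
  })
  which.min(sse) # index of the best fitting base-learner
}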
As mentioned above, we have to define every element of the algorithm ourselves. This also includes the loggers, which can additionally act as stoppers. That means we have to define each logger and decide whether it should also be used as a stopper.
This logger just logs the current iteration and stops if max_iterations
is
reached. In our case we want to stop after 2500 iterations:
log.iterations = LoggerIteration$new(use_as_stopper = TRUE, max_iterations = 2500)
Note that arguments such as max_iterations are only used if the logger is also defined as a stopper. Otherwise, these arguments are ignored.
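For instance, the same iteration logger defined purely for logging and not for stopping would look like this (the max_iterations value is then ignored; this object is not used later on):

# Pure iteration logger without stopping behavior:
log.iterations.only = LoggerIteration$new(use_as_stopper = FALSE, max_iterations = 2500)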
This logger logs the elapsed time. The time unit can be one of microseconds
,
seconds
or minutes
. The logger stops if max_time
is reached:
log.time = LoggerTime$new(use_as_stopper = FALSE, max_time = 120, time_unit = "seconds")
This logger logs the inbag risk by calculating the empirical risk on the training data. Note that it is necessary to specify a loss which is used to calculate the empirical risk. In the most common situation we use the same loss as used for training to display the progress of the fitting:
log.inbag = LoggerInbagRisk$new(use_as_stopper = FALSE, used_loss = loss.bin, eps_for_break = 0.05)
The out of bag risk logger does basically the same as the inbag risk logger but calculates the empirical risk using another data source. Therefore, the new data objects have to be passed as a list of data sources containing the evaluation data:
# List with out of bag data sources:
oob.list = list()

# Numeric features:
oob.list[[1]] = data.eval.fare
oob.list[[2]] = data.eval.age

# Categorical features:
for (lst in data.sex.list[["test"]]) {
  oob.list = c(oob.list, lst)
}
for (lst in data.pclass.list[["test"]]) {
  oob.list = c(oob.list, lst)
}
Finally we create the out of bag risk object by also specifying the
corresponding y
labels:
log.oob = LoggerOobRisk$new(use_as_stopper = FALSE, used_loss = loss.bin, eps_for_break = 0.05, oob_data = oob.list, oob_response = response[idx.eval])
The risk logger in combination with a custom loss can also be used to log
performance measures. We illustrate this procedure using the AUC
measure from
mlr:
# Define custom "loss function":
aucLoss = function (truth, response) {
  # Convert response on f basis to probs using sigmoid:
  probs = 1 / (1 + exp(-response))
  # Calculate AUC:
  mlr:::measureAUC(probabilities = probs, truth = truth, negative = -1, positive = 1)
}

# Define also gradient and constant initialization since they are necessary for
# the custom loss:
gradDummy = function (truth, response) { return (NA) }
constInitDummy = function (truth, response) { return (NA) }

# Define loss:
auc.loss = LossCustom$new(aucLoss, gradDummy, constInitDummy)
Now we can create new inbag and out of bag loggers to log the AUC while fitting the model:
log.inbag.auc = LoggerInbagRisk$new(use_as_stopper = FALSE, used_loss = auc.loss,
  eps_for_break = 0.05)
log.oob.auc = LoggerOobRisk$new(use_as_stopper = FALSE, used_loss = auc.loss,
  eps_for_break = 0.05, oob_data = oob.list, oob_response = response[idx.eval])
This procedure can be used for any other risk measure. For a detailed description of how to extend compboost with custom losses or base-learners, see the extending compboost vignette.
Finally, we need to define a logger list object and register all the loggers we want to track:
# Define new logger list:
logger.list = LoggerList$new()

# Register logger:
logger.list$registerLogger("iteration.logger", log.iterations)
logger.list$registerLogger("time.logger", log.time)
logger.list$registerLogger("inbag.binomial", log.inbag)
logger.list$registerLogger("oob.binomial", log.oob)
logger.list$registerLogger("inbag.auc", log.inbag.auc)
logger.list$registerLogger("oob.auc", log.oob.auc)

logger.list
Now, after defining all objects required by compboost, we can create the compboost object with a learning rate of 0.05 and the stopping rule that the algorithm stops as soon as the first stopper is fulfilled:
# Initialize object:
cboost = Compboost_internal$new(
  response = response[idx.train],
  learning_rate = 0.05,
  stop_if_all_stopper_fulfilled = FALSE,
  factory_list = factory.list,
  loss = loss.bin,
  logger_list = logger.list,
  optimizer = used.optimizer
)

# Train the model (we want to print the trace):
cboost$train(trace = -1)
To get the fitted parameters we can use the getEstimatedParameter()
function:
params = cboost$getEstimatedParameter()
str(params)
It is also possible to get the trace of how the base-learners were selected by calling getSelectedBaselearner():
blearner.trace = cboost$getSelectedBaselearner()

table(blearner.trace)
blearner.trace[1:10]
We want to predict new labels for the out of bag data:
# Get predicted scores and probability (with sigmoid):
scores = cboost$predict(oob.list, response = FALSE)
prob.scores = 1 / (1 + exp(-scores))

# This can also be achieved by calling response = TRUE:
all.equal(prob.scores, cboost$predict(oob.list, response = TRUE))

# Calculate labels with threshold of 0.5:
pred.labels = ifelse(prob.scores > 0.5, 1, -1)

# Calculate confusion matrix:
table(pred = pred.labels, truth = response[idx.eval])
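From this confusion matrix, sensitivity and false positive rate can be computed directly (a small helper sketch; the object names are only used here):

# Sensitivity (true positive rate) and false positive rate from the confusion matrix:
conf.mat = table(pred = pred.labels, truth = response[idx.eval])

sensitivity = conf.mat["1", "1"] / sum(conf.mat[, "1"])
fpr = conf.mat["1", "-1"] / sum(conf.mat[, "-1"])

c(sensitivity = sensitivity, fpr = fpr)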
Looking at the confusion matrix, we have a good sensitivity but a bad false positive rate. Therefore, it is more informative to take a look at the AUC and the ROC curve. To compute both we can proceed as follows:
library(ggplot2)

# True labels as binary vector (0, 1):
labels = (response[idx.eval] + 1) / 2
labels = labels[order(scores, decreasing = TRUE)]

myroc = data.frame(
  TPR = cumsum(labels) / sum(labels),
  FPR = cumsum(!labels) / sum(!labels),
  Labels = labels
)

# AUC:
mlr::measureAUC(probabilities = prob.scores, truth = response[idx.eval],
  negative = -1, positive = 1)

ggplot(data = myroc, aes(x = FPR, y = TPR)) +
  geom_abline(intercept = 0, slope = 1) +
  geom_line(size = 2) +
  ggtitle("ROC Curve")
The fastest way to continue the training is to use the setToIteration() function. If we set the algorithm to an iteration larger than the current maximal iteration, compboost automatically trains the remaining base-learners:
cboost$setToIteration(k = 3000)
cboost
The drawback of using setToIteration() is that the function does not continue logging. The logger data for the second training (from iteration 2501 to 3000) is then just a vector of the iterations.
Additionally, it is possible to continue the training using the continueTraining() function. This function takes a boolean to indicate whether the trace should be printed and another logger list to get more control over the retraining. For instance, we can continue the training for 3 seconds and see how far we get. We also want to reuse the out of bag loggers:
# Define new time logger:
new.time.logger = LoggerTime$new(use_as_stopper = TRUE, max_time = 3,
  time_unit = "seconds")

# Define new logger list and register logger:
new.logger.list = LoggerList$new()

# Define new oob loggers to prevent overwriting the old logger data:
new.oob.log = LoggerOobRisk$new(use_as_stopper = FALSE, used_loss = loss.bin,
  eps_for_break = 0.05, oob_data = oob.list, oob_response = response[idx.eval])
new.oob.auc.log = LoggerOobRisk$new(use_as_stopper = FALSE, used_loss = auc.loss,
  eps_for_break = 0.05, oob_data = oob.list, oob_response = response[idx.eval])

new.logger.list$registerLogger("time", new.time.logger)
new.logger.list$registerLogger("oob.binomial", new.oob.log)
new.logger.list$registerLogger("oob.auc", new.oob.auc.log)

# Continue training:
cboost$continueTraining(trace = 0, logger_list = new.logger.list)
cboost
Note: With setToIteration() it is also possible to set compboost to an iteration smaller than the already trained ones. This comes in handy if one would like to set the algorithm to the iteration corresponding to the minimum of the out of bag risk.
To compare the inbag and the out of bag AUC we first have to get the logger data:
cboost.log = cboost$getLoggerData()
str(cboost.log)
This list contains all the data collected during training and retraining, hence the three list elements. We are interested in the first one:
cboost.log = cboost.log[[1]]
str(cboost.log)
Next we create a data frame which we use for plotting:
auc.data = data.frame(
  iteration = rep(seq_len(nrow(cboost.log$logger.data)), 2),
  risk.type = rep(c("Inbag", "OOB"), each = nrow(cboost.log$logger.data)),
  AUC = c(cboost.log$logger.data[, 2], cboost.log$logger.data[, 4]),
  emp.risk = c(cboost.log$logger.data[, 3], cboost.log$logger.data[, 5])
)

p1 = ggplot(data = auc.data, aes(x = iteration, y = AUC, colour = risk.type)) +
  geom_line(size = 2) +
  ggtitle("AUC per Iteration")
p2 = ggplot(data = auc.data, aes(x = iteration, y = emp.risk, colour = risk.type)) +
  geom_line(size = 2) +
  ggtitle("Empirical Risk per Iteration")

gridExtra::grid.arrange(p1, p2, ncol = 2)
A surprising behavior is shown by the AUC lines: we would expect the out of bag AUC to be lower than the inbag AUC, which is not the case here. Apart from that, the curves show the usual behavior. The out of bag AUC curve rises until approximately 1200 iterations and then decreases, while the out of bag risk falls until approximately 1200 iterations and then starts to increase. These are clear signs that beyond 1200 iterations the algorithm starts to overfit. We should therefore think about using the model at the 1200th iteration.
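Instead of reading the iteration off the plot, the OOB-optimal iteration can also be extracted from the logger data directly. The following sketch assumes the same column ordering as used for plotting above (column 5 holding the out of bag empirical risk):

# Iteration with minimal out of bag risk (column ordering as in the plot above):
oob.risk = cboost.log$logger.data[, 5]
best.iter = which.min(oob.risk)
best.iter

# One could now reset the model to this iteration:
# cboost$setToIteration(k = best.iter)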
One of the key advantages of component-wise boosting is that the resulting model is interpretable. For instance, it is now possible to illustrate the effect of Fare. To do so, we can use the transformData() function of the spline.factory.fare object to create the spline basis for new observations:
params = cboost$getEstimatedParameter()
params.fare = params$Fare_spline_degree_3

x.fare = seq(from = min(df.train$Fare), to = max(df.train$Fare), length.out = 100)
x.basis = spline.factory.fare$transformData(as.matrix(x.fare))
x.response = x.basis %*% params.fare

plot.data = data.frame(x = x.fare, y = as.numeric(x.response))

ggplot(data = plot.data, aes(x = x, y = y)) +
  geom_line(size = 2) +
  xlab("Ticket Costs (Fare)") +
  ylab("Additive Contribution") +
  labs(title = "Effect of Fare on Survival",
    subtitle = "The higher the additive contribution the higher the chance of survival")
Note that this curve results from taking the parameters after `r length(cboost$getSelectedBaselearner())` iterations. That could be way too many and lead to overfitting. Because of the out of bag behavior observed above, we set the model to iteration 1200:
cboost$setToIteration(k = 1200)

params = cboost$getEstimatedParameter()
params.fare = params$Fare_spline_degree_3

x.fare = seq(from = min(df.train$Fare), to = max(df.train$Fare), length.out = 100)
x.basis = spline.factory.fare$transformData(as.matrix(x.fare))
x.response = x.basis %*% params.fare

plot.data = data.frame(x = x.fare, y = as.numeric(x.response))

ggplot(data = plot.data, aes(x = x, y = y)) +
  geom_line(size = 2) +
  xlab("Ticket Costs (Fare)") +
  ylab("Additive Contribution") +
  labs(title = "Effect of Fare on Survival",
    subtitle = "The higher the additive contribution the higher the chance of survival")
Finally, we want to get an idea of how the curve evolves over the iterations. For this we can use ImageMagick to create a gif. The approach is to get the parameter matrix via getParameterMatrix(), select the entries of the spline base-learner corresponding to Fare, and create the gif:
param.mat = cboost$getParameterMatrix()

idx.fare = cboost$getSelectedBaselearner() == "Fare: spline with degree 3"
sum(idx.fare)
Now we create a new directory to store the single frames and create the gif
using ImageMagick
:
idx.fare = which(idx.fare)

# Get index of parameters corresponding to the fare spline learner:
param.idx = grepl(pattern = "Fare", x = param.mat$parameter.names)

# Create new directory for the frames:
dir.create("frames")

for (i in seq_along(idx.fare)) {

  # Get parameters of the i-th iteration:
  params.fare = param.mat$parameter.matrix[idx.fare[i], param.idx]

  # Compute response for every iteration:
  x.response = x.basis %*% params.fare
  plot.data = data.frame(x = x.fare, y = as.numeric(x.response))

  # Create plot:
  gg = ggplot(data = plot.data, aes(x = x, y = y)) +
    geom_line(size = 2) +
    xlab("Ticket Costs (Fare)") +
    ylab("Additive Contribution") +
    ylim(-1, 4) +
    labs(title = "Effect of Fare on Survival",
      subtitle = "The higher the additive contribution the higher the chance of survival")

  # Save ggplot:
  ggsave(paste0("frames/frame", stringr::str_pad(i, 4, pad = "0"), ".png"),
    plot = gg, width = 9, height = 6, dpi = 50)
}

# Convert the frames to gif:
system("magick -delay 1 frames/*.png frames/fare_spline.gif")

# Remove all created pngs:
file.remove(paste0("frames/", list.files(path = "frames", pattern = ".png")))
knitr::include_graphics("figures/fare_spline.gif")
Note that the name of the ImageMagick executable differs between operating systems. On Windows we need magick, while on Linux we have to use convert.
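A small sketch to pick the executable depending on the operating system (assuming ImageMagick is installed and on the PATH):

# Choose the ImageMagick executable depending on the operating system:
magick.exe = if (.Platform$OS.type == "windows") "magick" else "convert"
system(paste(magick.exe, "-delay 1 frames/*.png frames/fare_spline.gif"))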
We know that defining everything using the C++ class style is quite cumbersome, but it best reflects the underlying C++ class system. Additionally, using the class system gives the user maximal flexibility and control over the algorithm. An R API that feels more familiar to most users is in progress and one of the most important next tasks.
All the sample analyses shown here are just meant to give an idea of how compboost can be used to train a component-wise boosting model and how to access the data gained during the fitting process.
Since compboost is at a very early stage, the functionality isn't very comprehensive yet. For instance, there is just one optimizer at the moment and one loss class for binary classification. There is also no multi-class support yet.
If someone wants to use a custom loss or base-learner, we have implemented ways to extend compboost without recompiling the C++ code. For details, see the extending compboost vignette.