library(DALEX)
library(forester)
knitr::opts_chunk$set(echo = FALSE, comment = NA, warning = FALSE, message = FALSE)
# Assumed report parameters, aliased to the short names used in later chunks.
train_output <- params$train_output
metric       <- params$metric

The best models

This is the `r train_output$type` task.

The best model evaluated on the testing set is: `r paste(train_output$score_test$name[[1]], sep="' '", collapse=", ")`, whereas for the validation set it is: `r paste(train_output$score_valid$name[[1]], sep="' '", collapse=", ")`.

The models inside the forester package are trained on the training set, the Bayesian optimization is tuned according to the testing set, and the validation set is never seen during training. The training set should not be used for evaluation, as models always perform best on the data they have already seen (overfitting). The least biased dataset is the validation set; however, we can also use the testing set, e.g. to check whether the models overfit.
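
For orientation, a report like this one is generated from the output of forester's train() function. A minimal, hypothetical usage sketch (the lisbon example dataset and the exact arguments are assumptions, not part of this report):

library(forester)
# Assumed usage: forester splits the data internally into training,
# testing and validation subsets and trains several engines on it.
train_output <- train(data = lisbon, y = 'Price')
train_output$score_valid  # metrics of every model on the validation set
report(train_output)      # renders a report like this one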

The names of the models follow the pattern Engine_TuningMethod_Id, where Engine is the algorithm used to train the model (for example ranger, xgboost, decision_tree, lightgbm or catboost), TuningMethod describes how its hyperparameters were selected, and Id distinguishes models that share the same engine and tuning method.

More details about the dataset are present at the end of the report.

Best models for validation dataset

score_frame   <- train_output$score_valid
score_rounded <- score_frame

# Round the metric columns (columns 5 onward) to four decimal places.
for (i in 5:ncol(score_rounded)) {
  score_rounded[, i] <- round(as.numeric(score_rounded[, i]), 4)
}

# Sort by the user-supplied metric if one was given, otherwise by RMSE.
if (!is.null(metric)) {
  score_rounded <- score_rounded[order(score_rounded[[metric]]), ]
} else {
  score_rounded <- score_rounded[order(score_rounded[, 'rmse']), ]
}

# Keep at most the ten best models and drop two metadata columns.
score_rounded <- head(score_rounded[, -c(3, 4)], 10)
knitr::kable(score_rounded)

\newpage

Model comparison

Residuals

The residual boxplot indicates whether the model fits the data well. Ideally, all residuals are close to 0 relative to the scale of the response variable (the larger the true values, the larger the residuals that can still be considered small). When the box representing the interquartile range (IQR) is centred around 0, the model is well fitted. The dots on the plot represent outliers, i.e. residuals that are much larger than the others. A large number of outliers might indicate that the model does not fit the data well.

Using this plot we can evaluate whether the best models fit the presented task well (residuals close to 0, only a few outliers) and compare them with each other in order to choose the best one. We expect a model with residuals closer to 0 (or with a narrower IQR box) to perform better.
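
As an illustration of what the boxplot summarizes, here is a toy example with a simple linear model on mtcars (not part of the report itself):

# Toy example: residuals of a simple linear model.
model     <- lm(mpg ~ wt + hp, data = mtcars)
residuals <- mtcars$mpg - predict(model, mtcars)
summary(residuals)  # an IQR centred around 0 suggests a good fit
boxplot(residuals, horizontal = TRUE, xlab = 'Residuals')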

if (params$train_output$type == 'regression') {
  plot(params$train_output, models = NULL, type = 'residuals', metric = 'rmse')
}

\newpage

Observed vs prediction

This scatter plot shows the observed values of the dataset on the x axis, while the y axis corresponds to the model's predictions. The line drawn on the plot is the x = y reference line of a perfect fit. We want the individual observations (dots) to lie close to this line; the further an observation is from the line, the worse the quality of that prediction.

A model whose observations lie closer to the line is therefore considered better. This visualization also lets us check whether the model overfits: if the predicted values on the train subplot are close to the line while those on the test subplot are far from it, the model might be overfitted and thus generalize poorly to unseen data.
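
A toy illustration of such a plot, again with a simple linear model (not part of the report itself):

# Toy example: observed vs predicted values with the x = y reference line.
model     <- lm(mpg ~ wt + hp, data = mtcars)
predicted <- predict(model, mtcars)
plot(mtcars$mpg, predicted, xlab = 'Observed', ylab = 'Predicted')
abline(a = 0, b = 1, lty = 2)  # perfect predictions would lie on this line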

if (train_output$type == 'regression') {
  plot(train_output, models = NULL, type = 'train-test-observed-predicted', metric = 'rmse')
}

\newpage

Train vs test plot

This scatter plot tackles the issue of overfitting and compares many models at once. The x axis shows the metric value evaluated on the training dataset, while the y axis shows the same metric on the testing dataset. Model performance is assessed in two ways. Firstly, we want the value on the testing dataset to be as small as possible (lower than for the other models). Secondly, we want to choose a model that lies close to the x = y line, because such a model is not overfitted and generalizes better. In most cases we prefer the less overfitted model, even if its raw performance is slightly worse.
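
A toy illustration of the two quantities compared on this plot, assuming RMSE as the metric (a hypothetical split of mtcars, not part of the report itself):

# Toy example: the same metric computed on the training and testing subsets.
rmse  <- function(obs, pred) sqrt(mean((obs - pred)^2))
set.seed(2)
idx   <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
model <- lm(mpg ~ ., data = train)
c(train_rmse = rmse(train$mpg, predict(model, train)),
  test_rmse  = rmse(test$mpg,  predict(model, test)))
# A test RMSE much larger than the train RMSE places the model far above the x = y line.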

if (train_output$type == 'regression') {
  plot(train_output, models = min(5, nrow(train_output$score_test)), type = 'train-test', metric = 'rmse')
}

\newpage

Plots for the best model

Feature Importance

The final visualization is the feature importance plot, which helps us understand what is happening inside the best model evaluated on the validation set. Feature importance (FI) shows the variables that matter most to the model; the larger the value, the more important the variable. A large FI value for a feature means that randomly permuting the values in that column changes the model's predictions drastically.
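
The idea behind such permutation-based importance can be illustrated with a toy example (a hypothetical linear model, not the model used in this report):

# Toy example: permutation importance measured as the increase in RMSE
# after shuffling a single column.
set.seed(1)
df    <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y  <- 3 * df$x1 + rnorm(100)
model <- lm(y ~ ., data = df)
rmse  <- function(obs, pred) sqrt(mean((obs - pred)^2))
baseline <- rmse(df$y, predict(model, df))
sapply(c('x1', 'x2'), function(col) {
  permuted        <- df
  permuted[[col]] <- sample(permuted[[col]])
  rmse(df$y, predict(model, permuted)) - baseline  # larger increase = more important
})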

# Detect the engine of the best model on the validation set from its name.
engine <- NULL
if (grepl('ranger', params$train_output$score_valid$name[1])) {
  engine <- 'ranger'
} else if (grepl('xgboost', params$train_output$score_valid$name[1])) {
  engine <- 'xgboost'
} else if (grepl('decision_tree', params$train_output$score_valid$name[1])) {
  engine <- 'decision_tree'
} else if (grepl('lightgbm', params$train_output$score_valid$name[1])) {
  engine <- 'lightgbm'
} else if (grepl('catboost', params$train_output$score_valid$name[1])) {
  engine <- 'catboost'
}

if (!is.null(engine) && engine != 'catboost') { # For catboost, DALEX::model_parts() raises an error.
  draw_feature_importance(params$train_output$models_list[[params$train_output$score_valid$name[1]]],
                          params$train_output$valid_data,
                          params$train_output$y)
} else {
  print('Feature importance unavailable for catboost model.')
}

\newpage

Details about data

# Print the summary produced by the data check, line by line.
checked_data <- check_data(train_output$data, train_output$y, verbose = FALSE)
for (i in seq_along(checked_data$str)) {
  cat(checked_data$str[i], '\n', sep = '')
}

