CascadeForest: Cascade Forest implementation in R


Description

This function attempts to replicate Cascade Forest using xgboost. It chains Complete-Random Tree Forests in a directed acyclic graph, as Neural Networks do, but only for simple graphs (i.e. each layer is trained on the previous layer's output data). You can specify your learning objective using objective and the metric to monitor using eval_metric. You can plug in custom objectives instead of the objectives provided by xgboost. As with any uncalibrated machine learning method, this method produces uncalibrated outputs. Therefore, the usage of scale-dependent metrics is discouraged (please use scale-invariant metrics, such as Accuracy, AUC, R-squared, Spearman correlation...).

Usage

CascadeForest(training_data, validation_data, training_labels,
  validation_labels, folds, boosting = FALSE, nthread = 1, cascade_lr = 1,
  training_start = NULL, validation_start = NULL, cascade_forests = rep(4,
  5), cascade_trees = 500, cascade_rf = 2,
  cascade_seeds = 1:length(cascade_forests), objective = "reg:linear",
  eval_metric = Laurae::df_rmse, multi_class = FALSE, early_stopping = 2,
  maximize = FALSE, verbose = TRUE, low_memory = FALSE,
  essentials = FALSE, garbage = FALSE, work_dir = NULL,
  fail_safe = 65536)

Arguments

training_data

Type: data.table. The training data. Columns are added to it during training when low_memory == TRUE, so you may want to clean it up if you interrupt training in that mode.

validation_data

Type: data.table. The validation data used to check metric performance. Set to NULL if you want to use out-of-fold validation data instead of a custom validation data set. Columns are added to it during training when low_memory == TRUE, so you may want to clean it up if you interrupt training in that mode.

training_labels

Type: numeric vector. The training labels.

validation_labels

Type: numeric vector. The validation labels.

folds

Type: list. The folds as list for cross-validation.
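
If Laurae::kfold is not available, a folds list can be built by hand; a minimal sketch (assuming each list element holds the held-out row indices of one fold) follows:

# Minimal sketch of a 5-fold list built by hand; each element is assumed to
# hold the held-out row indices of one fold, like Laurae::kfold would produce.
set.seed(42)
n <- 1000                       # hypothetical number of training rows
shuffled <- sample(seq_len(n))
folds <- split(shuffled, rep(1:5, length.out = n))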

boosting

Type: logical. Whether to perform boosting or not for training. It may converge faster, but may overfit faster and therefore needs control via cascade_lr. Defaults to FALSE.

nthread

Type: numeric. The number of threads used for multithreading. 1 means singlethreaded (uses only one core). Higher values may mean faster training if the memory overhead is not too large. Defaults to 1.

cascade_lr

Type: numeric vector or numeric. The shrinkage applied to each tree in a layer to avoid overfitting. You may specify a vector to change the learning rate per layer, such as c(0.4, 0.3, 0.2, 0.1, 0.05), so you can perform boosting afterwards. Defaults to 1.

training_start

Type: numeric vector. The initial training prediction labels. Set to NULL if you do not know what you are doing. Defaults to NULL.

validation_start

Type: numeric vector. The initial validation prediction labels. Set to NULL if you do not know what you are doing. Defaults to NULL.

cascade_forests

Type: numeric vector (mandatory). The number of forest models per layer in the Cascade Forest architecture. A value of 0 means the previous layer size is repeated indefinitely until training stops via early_stopping convergence; see the sketch below. For instance, to activate infinite training on the default parameters, use c(rep(4, 5), 0). Defaults to rep(4, 5).
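
A hedged configuration sketch of that unbounded mode (bounded in practice by early_stopping and fail_safe):

# Hypothetical architecture: five layers of 4 forests, then the last layer
# size (4) repeated indefinitely until early stopping or fail_safe triggers.
my_architecture <- c(rep(4, 5), 0)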

cascade_trees

Type: numeric vector or numeric. The number of trees per forest model per layer in the Cascade Forest architecture. You may specify a vector to change the number of trees per layer. Defaults to 500.

cascade_rf

Type: numeric vector or numeric. The number of Random Forest models per layer in the Cascade Forest architecture. You may specify a vector to change the number of Random Forests per layer, such as c(1, 1, 2, 3, 5). Defaults to 2.

cascade_seeds

Type: numeric vector or numeric. Random seed for reproducibility, per layer. Do not set it to a value that is identical throughout the architecture, otherwise you will train on the same features over and over! When a single value is used as the seed, it is automatically incremented by 1 each time the architecture advances one layer. Defaults to 1:length(cascade_forests).

objective

Type: character or function. The objective which drives the boosting loss, either a built-in xgboost objective string or a custom objective function. See xgboost::xgb.train. Defaults to "reg:linear".
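
As a hedged illustration (the function name is mine, not part of this package), a custom xgboost objective is a function of the current predictions and the xgb.DMatrix that returns the gradient and hessian of the loss; the sketch below re-implements squared error under that convention:

# Minimal sketch of a custom xgboost objective (squared error), usable in
# place of the "reg:linear" string.
squared_error_obj <- function(preds, dtrain) {
  labels <- xgboost::getinfo(dtrain, "label")
  grad <- preds - labels          # first derivative of 0.5 * (preds - labels)^2
  hess <- rep(1, length(labels))  # second derivative
  list(grad = grad, hess = hess)
}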

eval_metric

Type: function. The function which evaluates the boosting loss. It must take two arguments, in this order: preds, labels (they may be named differently), and return a single metric value. Defaults to Laurae::df_rmse.
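
For instance, a minimal evaluation function matching that signature could look like the sketch below (RMSE shown; Laurae::df_rmse presumably computes something similar):

# Minimal sketch of a custom evaluation metric: predictions first, labels second.
my_rmse <- function(preds, labels) {
  sqrt(mean((preds - labels)^2))  # lower is better, so pair with maximize = FALSE
}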

multi_class

Type: numeric. Defines the number of classes internally, so that the multiclass-specific routines are used when needed. Defaults to 2, which covers regression and binary classification.

early_stopping

Type: numeric. Defines how many architecture layers without improvement are tolerated before stopping early; the number of consecutive failures needed to stop is this value plus one (for instance, a value of 2 means training stops after 3 failures to improve). 0 means stop instantly at the first failure to improve. -1 means no early stopping. Requires validation_data to be able to stop early. Defaults to 2.

maximize

Type: logical. Whether to maximize the loss evaluation metric or not. Defaults to FALSE.

verbose

Type: logical. Whether to print training evaluation. Defaults to TRUE.

low_memory

Type: logical. Whether to perform the data.table transformations in place to lower memory usage. Defaults to FALSE.

essentials

Type: logical. Whether to store intermediary predictions or not. Set it to TRUE if you encounter memory issues. Defaults to FALSE.

garbage

Type: logical. Whether to perform garbage collection regularly. Defaults to FALSE.

work_dir

Type: character, allowing concatenation with another character text (e.g. "dev/tools/save_in_this_folder/" = add a slash, or "dev/tools/save_here/prefix_" = do not add a slash). The working directory used to store models. If you provide a working directory, the models are saved inside that directory (and any other models under the same names get wiped). This severely lowers memory usage, as the models are no longer kept in memory. Combined with garbage == TRUE, you achieve the lowest possible memory usage in this Deep Forest implementation. Defaults to NULL, which means models are stored in memory.

fail_safe

Type: numeric. In case of infinite training (cascade_forests's last value equal to 0), this limits the number of training iterations. Defaults to 65536.

Details

For implementation details of Cascade Forest / Complete-Random Tree Forest / Multi-Grained Scanning / Deep Forest, check this: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-283942390 by Laurae.

Cascade Forests aim at what Neural Networks do: structuring the model in multiple layers. Cascade Forests exploit the stacking ensemble method to train these layers. Using the randomness of Random Forests and Complete-Random Tree Forests, a Cascade Forest aims to outperform simple Convolutional Neural Networks (CNNs). The computational cost, however, is massive and should be taken into account before training a large (and potentially unbounded) Cascade Forest. Due to their nature and to stacking ensemble properties, Cascade Forests have a hard time overfitting.

Putting a Cascade Forest on top of a Multi-Grained Scanning model results in a gcForest.

Laurae recommends using xgboost or LightGBM on top of gcForest or Cascade Forest. See the rationale here: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-284689795.
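
A rough sketch of that recommendation (not part of this package's API; all names and data below are synthetic placeholders, with oof_pred standing in for the cascade's out-of-fold predictions):

# Hedged illustration: append the cascade's out-of-fold predictions to the
# original features and fit an xgboost booster on top of them.
library(xgboost)
set.seed(1)
train_features <- matrix(rnorm(1000 * 10), nrow = 1000)  # placeholder features
train_labels <- rbinom(1000, 1, 0.5)                     # placeholder binary labels
oof_pred <- runif(1000)                                  # placeholder cascade predictions
stacked <- cbind(train_features, cascade_pred = oof_pred)
booster <- xgboost(data = stacked, label = train_labels, nrounds = 50,
                   objective = "binary:logistic", verbose = 0)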

Value

A data.table based on target.

Examples

## Not run: 
# Load libraries
library(data.table)
library(Matrix)
library(xgboost)

# Create data
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
agaricus_data_train <- data.table(as.matrix(agaricus.train$data))
agaricus_data_test <- data.table(as.matrix(agaricus.test$data))
agaricus_label_train <- agaricus.train$label
agaricus_label_test <- agaricus.test$label
folds <- Laurae::kfold(agaricus_label_train, 5)

# Train a model (binary classification)
model <- CascadeForest(training_data = agaricus_data_train, # Training data
                       validation_data = agaricus_data_test, # Validation data
                       training_labels = agaricus_label_train, # Training labels
                       validation_labels = agaricus_label_test, # Validation labels
                       folds = folds, # Folds for cross-validation
                       boosting = FALSE, # Do not touch this unless you are expert
                       nthread = 1, # Change this to use more threads
                       cascade_lr = 1, # Do not touch this unless you are expert
                       training_start = NULL, # Do not touch this unless you are expert
                       validation_start = NULL, # Do not touch this unless you are expert
                       cascade_forests = rep(4, 5), # Number of forest models
                       cascade_trees = 10, # Number of trees per forest
                       cascade_rf = 2, # Number of Random Forest in models
                       cascade_seeds = 1:5, # Seed per layer
                       objective = "binary:logistic",
                       eval_metric = Laurae::df_logloss,
                       multi_class = 2, # Modify this for multiclass problems
                       early_stopping = 2, # stop after 2 bad combos of forests
                       maximize = FALSE, # not a maximization task
                       verbose = TRUE, # print information during training
                       low_memory = FALSE)

# Attempt to perform fake multiclass problem
agaricus_label_train[1:100] <- 2

# Train a model (multiclass classification)
model <- CascadeForest(training_data = agaricus_data_train, # Training data
                       validation_data = agaricus_data_test, # Validation data
                       training_labels = agaricus_label_train, # Training labels
                       validation_labels = agaricus_label_test, # Validation labels
                       folds = folds, # Folds for cross-validation
                       boosting = FALSE, # Do not touch this unless you are expert
                       nthread = 1, # Change this to use more threads
                       cascade_lr = 1, # Do not touch this unless you are expert
                       training_start = NULL, # Do not touch this unless you are expert
                       validation_start = NULL, # Do not touch this unless you are expert
                       cascade_forests = rep(4, 5), # Number of forest models
                       cascade_trees = 10, # Number of trees per forest
                       cascade_rf = 2, # Number of Random Forest in models
                       cascade_seeds = 1:5, # Seed per layer
                       objective = "multi:softprob",
                       eval_metric = Laurae::df_logloss,
                       multi_class = 3, # Modify this for multiclass problems
                       early_stopping = 2, # stop after 2 bad combos of forests
                       maximize = FALSE, # not a maximization task
                       verbose = TRUE, # print information during training
                       low_memory = FALSE)

## End(Not run)
