CRTreeForest: Complete-Random Tree Forest implementation in R


Description

This function attempts to replicate Complete-Random Tree Forests using xgboost. It runs Random Forest n_forest times, each time with n_trees trees. Specify the learning objective with objective and the metric to evaluate with eval_metric. Custom objectives can be plugged in instead of the objectives provided by xgboost. As with any uncalibrated machine learning method, the outputs of this method are uncalibrated. Scale-dependent metrics are therefore discouraged; prefer scale-invariant metrics such as Accuracy, AUC, R-squared, or Spearman correlation.
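The difference between scale-dependent and scale-invariant metrics can be illustrated with a minimal base R sketch. The helper names rmse and spearman and the toy vectors below are illustrative assumptions, not part of Laurae; both helpers follow the (preds, labels) argument order that eval_metric expects.

```r
# Illustrative helpers (not part of Laurae), using the (preds, labels) order.
rmse <- function(preds, labels) sqrt(mean((preds - labels)^2))
spearman <- function(preds, labels) cor(preds, labels, method = "spearman")

# Toy data: predictions that rank the labels correctly.
labels <- c(1.0, 2.0, 3.0, 4.0, 5.0)
preds <- c(1.2, 1.9, 3.4, 3.8, 5.1)

# Same ordering, different scale (mimics an uncalibrated output).
rescaled <- 10 * preds + 3

# RMSE changes with the scale of the predictions...
rmse(preds, labels)
rmse(rescaled, labels)

# ...while Spearman correlation only depends on the ordering.
spearman(preds, labels)
spearman(rescaled, labels)
```

A rank-based metric gives the same score before and after rescaling, which is why it is the safer choice when outputs are not calibrated.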

Usage

CRTreeForest(training_data, validation_data, training_labels, validation_labels,
  folds, nthread = 1, lr = 1, training_start = NULL,
  validation_start = NULL, n_forest = 5, n_trees = 1000,
  random_forest = 0, seed = 0, objective = "reg:linear",
  eval_metric = Laurae::df_rmse, return_list = TRUE, multi_class = 2,
  verbose = " ", garbage = FALSE, work_dir = NULL)

Arguments

training_data

Type: data.table. The training data.

validation_data

Type: data.table. The validation data with labels to check for metric performance. Set to NULL if you want to use out of fold validation data instead of a custom validation data set.

training_labels

Type: numeric vector. The training labels.

validation_labels

Type: numeric vector. The validation labels.

folds

Type: list. The folds as list for cross-validation.

nthread

Type: numeric. The number of threads used for multithreading. 1 means single-threaded (uses only one core). Higher values may speed up training if the memory overhead is not too large. Defaults to 1.

lr

Type: numeric. The shrinkage applied to each tree to avoid overfitting. Defaults to 1, which means no adjustment.

training_start

Type: numeric vector. The initial training prediction labels. Set to NULL if you do not know what you are doing. Defaults to NULL.

validation_start

Type: numeric vector. The initial validation prediction labels. Set to NULL if you do not know what you are doing. Defaults to NULL.

n_forest

Type: numeric. The number of forest models to create for the Complete-Random Tree Forest. Defaults to 5.

n_trees

Type: numeric. The number of trees per forest model to create for the Complete-Random Tree Forest. Defaults to 1000.

random_forest

Type: numeric. The number of Random Forests in the forest. Defaults to 0.

seed

Type: numeric. Random seed for reproducibility. Defaults to 0.

objective

Type: character or function. The objective (loss) that drives boosting. See xgboost::xgb.train. Defaults to "reg:linear".

eval_metric

Type: function. The function which evaluates boosting loss. Must take two arguments in the following order: preds, labels (they may be named differently), and return a metric value. Defaults to Laurae::df_rmse.

return_list

Type: logical. Whether lists should be returned instead of concatenated frames for predictions. Defaults to TRUE.

multi_class

Type: numeric. The number of classes, used internally to select multiclass-specific routines when return_list == FALSE. Defaults to 2, which covers regression and binary classification.

verbose

Type: character. Whether to print training evaluation. Use "" for no printing (double quotes without space between quotes). Defaults to " " (double quotes with space between quotes).

garbage

Type: logical. Whether to perform garbage collect regularly. Defaults to FALSE.

work_dir

Type: character, allowing concatenation with another character text (ex: "dev/tools/save_in_this_folder/" = add slash, or "dev/tools/save_here/prefix_" = don't add slash). The working directory to store models. If you provide a working directory, the models are saved inside that directory (and any existing models with the same names are overwritten). This severely lowers memory usage, as the models are no longer kept in memory. Combined with garbage == TRUE, this achieves the lowest possible memory usage in this Deep Forest implementation. Defaults to NULL, which means models are stored in memory.
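The argument shapes described above can be sketched in base R. This is a minimal sketch under stated assumptions: make_folds, accuracy_metric, and auc_metric are hypothetical helpers, not part of Laurae; the folds list is assumed to hold one vector of held-out row indices per fold (the examples use Laurae::kfold to build it); and the file name passed to paste0 is made up to show how a work_dir prefix concatenates.

```r
set.seed(0)
labels <- c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0)

# folds: a list with one vector of row indices per fold (assumed shape).
make_folds <- function(n, k) {
  assignment <- sample(rep_len(seq_len(k), n))  # shuffled fold ids
  split(seq_len(n), assignment)                 # list of index vectors
}
folds <- make_folds(length(labels), k = 5)

# eval_metric: any function(preds, labels) returning a single number.
# Accuracy at a 0.5 threshold (binary labels coded 0/1).
accuracy_metric <- function(preds, labels) {
  mean((preds > 0.5) == (labels == 1))
}

# Rank-based AUC (Mann-Whitney statistic), also scale-invariant.
auc_metric <- function(preds, labels) {
  ranks <- rank(preds)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# work_dir: used as a plain prefix, so a trailing slash means "folder"
# and no trailing slash means "file name prefix".
model_name <- "model_1"  # hypothetical file name, for illustration only
paste0("dev/tools/save_in_this_folder/", model_name)
paste0("dev/tools/save_here/prefix_", model_name)
```

Either metric function could be passed as eval_metric in place of Laurae::df_rmse, since both take preds first and labels second.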

Details

For implementation details of Cascade Forest / Complete-Random Tree Forest / Multi-Grained Scanning / Deep Forest, check this: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-283942390 by Laurae.

This function actually creates one layer of a Cascade Forest. That layer is composed of two possible elements: Complete-Random Tree Forests (using PFO mode: Probability Averaging + Full Height + Original training samples) and Random Forests. You may choose between them.
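The layer composition and the probability averaging step can be sketched as follows. This is a rough illustration, not the package's internal code: layer_layout and average_probabilities are hypothetical helpers, the sketch assumes random_forest counts how many of the n_forest models are Random Forests (the rest being Complete-Random Tree Forests), and the order of the forests in the layer is arbitrary here.

```r
# Hypothetical helper (not part of Laurae): label each forest in one layer.
# Assumes random_forest of the n_forest models are Random Forests.
layer_layout <- function(n_forest = 5, random_forest = 0) {
  stopifnot(random_forest >= 0, random_forest <= n_forest)
  c(rep("random_forest", random_forest),
    rep("complete_random", n_forest - random_forest))
}

# Probability averaging: each tree outputs class probabilities,
# and the forest output is the element-wise mean over trees.
# tree_probs: list of (n_rows x n_classes) matrices, one per tree.
average_probabilities <- function(tree_probs) {
  Reduce(`+`, tree_probs) / length(tree_probs)
}

# Layout matching n_forest = 5, random_forest = 2.
layout <- layer_layout(n_forest = 5, random_forest = 2)

# Two toy trees, two rows, two classes.
tree_a <- matrix(c(0.8, 0.2,
                   0.4, 0.6), nrow = 2, byrow = TRUE)
tree_b <- matrix(c(0.6, 0.4,
                   0.2, 0.8), nrow = 2, byrow = TRUE)
forest_probs <- average_probabilities(list(tree_a, tree_b))
```

With these toy inputs, each row of forest_probs still sums to 1, so the averaged output remains a valid probability vector per sample.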

Complete-Random Tree Forests in PFO mode are the best random learners inside the Complete-Random Tree Forest families (at least 50

Laurae recommends using xgboost or LightGBM on top of gcForest or Cascade Forest. See the rationale here: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-284689795.

Value

A data.table based on target.

Examples

## Not run: 
# Load libraries
library(data.table)
library(Matrix)
library(xgboost)

# Create data
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
agaricus_data_train <- data.table(as.matrix(agaricus.train$data))
agaricus_data_test <- data.table(as.matrix(agaricus.test$data))
agaricus_label_train <- agaricus.train$label
agaricus_label_test <- agaricus.test$label
folds <- Laurae::kfold(agaricus_label_train, 5)

# Train a model (binary classification)
model <- CRTreeForest(training_data = agaricus_data_train, # Training data
                      validation_data = agaricus_data_test, # Validation data
                      training_labels = agaricus_label_train, # Training labels
                      validation_labels = agaricus_label_test, # Validation labels
                      folds = folds, # Folds for cross-validation
                      nthread = 1, # Change this to use more threads
                      lr = 1, # Do not touch this unless you are an expert
                      training_start = NULL, # Do not touch this unless you are an expert
                      validation_start = NULL, # Do not touch this unless you are an expert
                      n_forest = 5, # Number of forest models
                      n_trees = 10, # Number of trees per forest
                      random_forest = 2, # We want only 2 Random Forests
                      seed = 0,
                      objective = "binary:logistic",
                      eval_metric = Laurae::df_logloss,
                      return_list = TRUE, # Set this to FALSE for a data.table output
                      multi_class = 2, # Modify this for multiclass problems
                      verbose = " ")

# Create a fake multiclass problem by relabeling the first 100 rows as class 2
agaricus_label_train[1:100] <- 2

# Train a model (multiclass classification)
model <- CRTreeForest(training_data = agaricus_data_train, # Training data
                      validation_data = agaricus_data_test, # Validation data
                      training_labels = agaricus_label_train, # Training labels
                      validation_labels = agaricus_label_test, # Validation labels
                      folds = folds, # Folds for cross-validation
                      nthread = 1, # Change this to use more threads
                      lr = 1, # Do not touch this unless you are an expert
                      training_start = NULL, # Do not touch this unless you are an expert
                      validation_start = NULL, # Do not touch this unless you are an expert
                      n_forest = 5, # Number of forest models
                      n_trees = 10, # Number of trees per forest
                      random_forest = 2, # We want only 2 Random Forests
                      seed = 0,
                      objective = "multi:softprob",
                      eval_metric = Laurae::df_logloss,
                      return_list = TRUE, # Set this to FALSE for a data.table output
                      multi_class = 3, # Modify this for multiclass problems
                      verbose = " ")

## End(Not run)

Laurae2/Laurae documentation built on May 8, 2019, 7:59 p.m.