CascadeForest_pred: Cascade Forest Predictor implementation in R

Description Usage Arguments Details Value Examples

Description

This function predicts from a Cascade Forest model using xgboost.

Usage

CascadeForest_pred(model, data, folds = NULL, layer = NULL,
  prediction = TRUE, multi_class = NULL, data_start = NULL,
  return_list = FALSE, low_memory = FALSE)

Arguments

model

Type: list. A model trained by CascadeForest.

data

Type: data.table. The data to predict on. If you pass the training data without the training folds, predictions are made in-sample and will overfit; use the train_preds list (or supply folds) instead.

folds

Type: list. The cross-validation folds, as a list, when predicting on the training data. Otherwise, leave NULL. Defaults to NULL.
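A minimal sketch of out-of-fold prediction on the training data, assuming a model and data prepared as in the Examples below (the folds must be the same ones used during training):

```r
# Rebuild the folds used for training (as in the Examples below)
folds <- Laurae::kfold(agaricus_label_train, 5)

# Passing the training data together with its folds yields
# out-of-fold predictions instead of overfitted in-sample ones
oof_preds <- CascadeForest_pred(model,
                                agaricus_data_train,
                                folds = folds)
```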

layer

Type: numeric. The layer to predict from. If not provided (NULL), the last layer of the model is used. Defaults to NULL.
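A minimal sketch of selecting a layer, assuming a model and test data prepared as in the Examples below:

```r
# Predict from the first cascade layer only
preds_layer1 <- CascadeForest_pred(model,
                                   agaricus_data_test,
                                   layer = 1)

# layer = NULL (the default) predicts from the last layer
preds_last <- CascadeForest_pred(model,
                                 agaricus_data_test)
```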

prediction

Type: logical. Whether to average the predictions of the forest ensemble. Set it to FALSE for debugging / feature engineering. Setting it to TRUE overrides return_list. Defaults to TRUE.

multi_class

Type: numeric. The number of classes. Set to 2 for binary classification or regression. Set to NULL to let the function guess by reading the model. Defaults to NULL.

data_start

Type: numeric vector. The initial prediction labels. Leave as NULL unless you know what you are doing. Defaults to NULL.

return_list

Type: logical. Whether to return lists instead of concatenated frames for predictions. Ignored when prediction = TRUE. Defaults to FALSE.
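A minimal sketch of retrieving per-forest predictions as a list, assuming a model and test data prepared as in the Examples below (prediction must be FALSE for return_list to take effect):

```r
# Keep one element per forest instead of averaging them
pred_frames <- CascadeForest_pred(model,
                                  agaricus_data_test,
                                  prediction = FALSE,  # do not average forests
                                  return_list = TRUE)  # return a list

# Inspect the top-level structure of the returned list
str(pred_frames, max.level = 1)
```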

low_memory

Type: logical. Whether to perform the data.table transformations in place to lower memory usage. Defaults to FALSE.

Details

For implementation details of Cascade Forest / Complete-Random Tree Forest / Multi-Grained Scanning / Deep Forest, check this: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-283942390 by Laurae.

Value

A data.table or a list based on data predicted using model.

Examples

## Not run: 
# Load libraries
library(data.table)
library(Matrix)
library(xgboost)

# Create data
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
agaricus_data_train <- data.table(as.matrix(agaricus.train$data))
agaricus_data_test <- data.table(as.matrix(agaricus.test$data))
agaricus_label_train <- agaricus.train$label
agaricus_label_test <- agaricus.test$label
folds <- Laurae::kfold(agaricus_label_train, 5)

# Train a model (binary classification)
model <- CascadeForest(training_data = agaricus_data_train, # Training data
                       validation_data = agaricus_data_test, # Validation data
                       training_labels = agaricus_label_train, # Training labels
                       validation_labels = agaricus_label_test, # Validation labels
                       folds = folds, # Folds for cross-validation
                       boosting = FALSE, # Do not touch this unless you are expert
                       nthread = 1, # Change this to use more threads
                       cascade_lr = 1, # Do not touch this unless you are expert
                       training_start = NULL, # Do not touch this unless you are expert
                       validation_start = NULL, # Do not touch this unless you are expert
                       cascade_forests = rep(4, 5), # Number of forest models
                       cascade_trees = 10, # Number of trees per forest
                       cascade_rf = 2, # Number of Random Forest in models
                       cascade_seeds = 0, # Seed per layer
                       objective = "binary:logistic",
                       eval_metric = Laurae::df_logloss,
                       multi_class = 2, # Modify this for multiclass problems
                       early_stopping = 2, # stop after 2 bad combos of forests
                       maximize = FALSE, # not a maximization task
                       verbose = TRUE, # print information during training
                       low_memory = FALSE)

# Predict from model
new_preds <- CascadeForest_pred(model, agaricus_data_test, prediction = FALSE)

# We can check that the predictions are identical; all checks return TRUE
all.equal(model$train_means, CascadeForest_pred(model,
                                                agaricus_data_train,
                                                folds = folds))
all.equal(model$valid_means, CascadeForest_pred(model,
                                                agaricus_data_test))

# Attempt to perform fake multiclass problem
agaricus_label_train[1:100] <- 2

# Train a model (multiclass classification)
model <- CascadeForest(training_data = agaricus_data_train, # Training data
                       validation_data = agaricus_data_test, # Validation data
                       training_labels = agaricus_label_train, # Training labels
                       validation_labels = agaricus_label_test, # Validation labels
                       folds = folds, # Folds for cross-validation
                       boosting = FALSE, # Do not touch this unless you are expert
                       nthread = 1, # Change this to use more threads
                       cascade_lr = 1, # Do not touch this unless you are expert
                       training_start = NULL, # Do not touch this unless you are expert
                       validation_start = NULL, # Do not touch this unless you are expert
                       cascade_forests = rep(4, 5), # Number of forest models
                       cascade_trees = 10, # Number of trees per forest
                       cascade_rf = 2, # Number of Random Forest in models
                       cascade_seeds = 0, # Seed per layer
                       objective = "multi:softprob",
                       eval_metric = Laurae::df_logloss,
                       multi_class = 3, # Modify this for multiclass problems
                       early_stopping = 2, # stop after 2 bad combos of forests
                       maximize = FALSE, # not a maximization task
                       verbose = TRUE, # print information during training
                       low_memory = FALSE)

# Predict from model for multiclass problems
new_preds <- CascadeForest_pred(model, agaricus_data_test, prediction = FALSE)

# We can check that the predictions are identical; all checks return TRUE
all.equal(model$train_means, CascadeForest_pred(model,
                                                agaricus_data_train,
                                                folds = folds))
all.equal(model$valid_means, CascadeForest_pred(model,
                                                agaricus_data_test))

## End(Not run)

Laurae2/Laurae documentation built on May 8, 2019, 7:59 p.m.