CRTreeForest_pred: Complete-Random Tree Forest Predictor implementation in R


Description

This function attempts to predict from Complete-Random Tree Forests using xgboost. Predictions are deferred to CRTreeForest_pred_internals.

Usage

CRTreeForest_pred(model, data, folds = NULL, prediction = FALSE,
  multi_class = NULL, data_start = NULL, return_list = TRUE,
  work_dir = NULL)

Arguments

model

Type: list. A model trained by CRTreeForest.

data

Type: data.table. The data to predict on. If you pass the training data, it is predicted as if it were out of fold, which will overfit (use the train_preds list stored in the model instead).

folds

Type: list. The cross-validation folds, as a list, to use when predicting on the training data. Otherwise, leave as NULL. Defaults to NULL.

prediction

Type: logical. Whether to average the predictions of the forest ensemble into a single prediction. Keep it at FALSE for debugging / feature engineering. Setting it to TRUE overrides return_list. Defaults to FALSE.

multi_class

Type: numeric. The number of classes. Set to 2 for binary classification or regression. Set to NULL to let the function infer it from the model. Defaults to NULL.

data_start

Type: numeric vector. The initial prediction values. Leave as NULL unless you know what you are doing. Defaults to NULL.

return_list

Type: logical. Whether predictions should be returned as a list instead of a concatenated data.table. Defaults to TRUE.

work_dir

Type: character, without a trailing slash (e.g. "dev/tools/save_in_this_folder"). The working directory where the models are stored, if external model files are used instead of keeping the models in memory. Defaults to NULL, which means the models are kept in memory. When available, the working directory is detected automatically from the model; a usage sketch follows this list.
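
The Examples section keeps all models in memory. As a rough sketch (not taken from the package documentation), if the forest was instead trained with its models stored as external files, the prediction call would point work_dir at that folder; the folder name and the new_data object below are purely illustrative, and work_dir can usually be omitted since it is detected from the model when available:

new_preds <- CRTreeForest_pred(model,
                               new_data, # a data.table with the same columns as the training data
                               return_list = FALSE,
                               work_dir = "dev/tools/save_in_this_folder") # no trailing slash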

Details

For implementation details of Cascade Forest / Complete-Random Tree Forest / Multi-Grained Scanning / Deep Forest, see this explanation by Laurae: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-283942390.

Value

A data.table or a list of predictions on data made using model, depending on return_list and prediction.
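
As a quick illustration of the possible output shapes (assuming a model trained as in the Examples below and a data.table new_data to score):

CRTreeForest_pred(model, new_data)                      # list of predictions (return_list = TRUE)
CRTreeForest_pred(model, new_data, return_list = FALSE) # concatenated data.table of predictions
CRTreeForest_pred(model, new_data, prediction = TRUE)   # averaged ensemble prediction, overrides return_list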

Examples

## Not run: 
# Load libraries
library(data.table)
library(Matrix)
library(xgboost)

# Create data
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
agaricus_data_train <- data.table(as.matrix(agaricus.train$data))
agaricus_data_test <- data.table(as.matrix(agaricus.test$data))
agaricus_label_train <- agaricus.train$label
agaricus_label_test <- agaricus.test$label
folds <- Laurae::kfold(agaricus_label_train, 5)

# Train a model (binary classification)
model <- CRTreeForest(training_data = agaricus_data_train, # Training data
                      validation_data = agaricus_data_test, # Validation data
                      training_labels = agaricus_label_train, # Training labels
                      validation_labels = agaricus_label_test, # Validation labels
                      folds = folds, # Folds for cross-validation
                      nthread = 1, # Change this to use more threads
                      lr = 1, # Do not touch this unless you are an expert
                      training_start = NULL, # Do not touch this unless you are an expert
                      validation_start = NULL, # Do not touch this unless you are an expert
                      n_forest = 5, # Number of forest models
                      n_trees = 10, # Number of trees per forest
                      random_forest = 2, # We want only 2 random forests
                      seed = 0,
                      objective = "binary:logistic",
                      eval_metric = Laurae::df_logloss,
                      return_list = TRUE, # Set this to FALSE for a data.table output
                      multi_class = 2, # Modify this for multiclass problems
                      verbose = " ")

# Predict from model
new_preds <- CRTreeForest_pred(model, agaricus_data_test, return_list = FALSE)

# We can check that the predictions match those stored in the model: all checks return TRUE
all.equal(model$train_preds, CRTreeForest_pred(model, agaricus_data_train, folds = folds))
all.equal(model$valid_preds, CRTreeForest_pred(model, agaricus_data_test))
all.equal(model$train_means, CRTreeForest_pred(model,
                                               agaricus_data_train,
                                               folds = folds,
                                               return_list = FALSE,
                                               prediction = TRUE))
all.equal(model$valid_means, CRTreeForest_pred(model,
                                               agaricus_data_test,
                                               return_list = FALSE,
                                               prediction = TRUE))

# Turn the labels into a fake multiclass problem
agaricus_label_train[1:100] <- 2

# Train a model (multiclass classification)
model <- CRTreeForest(training_data = agaricus_data_train, # Training data
                      validation_data = agaricus_data_test, # Validation data
                      training_labels = agaricus_label_train, # Training labels
                      validation_labels = agaricus_label_test, # Validation labels
                      folds = folds, # Folds for cross-validation
                      nthread = 1, # Change this to use more threads
                      lr = 1, # Do not touch this unless you are an expert
                      training_start = NULL, # Do not touch this unless you are an expert
                      validation_start = NULL, # Do not touch this unless you are an expert
                      n_forest = 5, # Number of forest models
                      n_trees = 10, # Number of trees per forest
                      random_forest = 2, # We want only 2 random forests
                      seed = 0,
                      objective = "multi:softprob",
                      eval_metric = Laurae::df_logloss,
                      return_list = TRUE, # Set this to FALSE for a data.table output
                      multi_class = 3, # Modify this for multiclass problems
                      verbose = " ")

# Predict from the model for multiclass problems
new_preds <- CRTreeForest_pred(model, agaricus_data_test, return_list = FALSE)

# We can check that the predictions match those stored in the model: all checks return TRUE
all.equal(model$train_preds, CRTreeForest_pred(model, agaricus_data_train, folds = folds))
all.equal(model$valid_preds, CRTreeForest_pred(model, agaricus_data_test))
all.equal(model$train_means, CRTreeForest_pred(model,
                                               agaricus_data_train,
                                               folds = folds,
                                               return_list = FALSE,
                                               prediction = TRUE))
all.equal(model$valid_means, CRTreeForest_pred(model,
                                               agaricus_data_test,
                                               return_list = FALSE,
                                               prediction = TRUE))

## End(Not run)
