Description
This function attempts to replicate Multi-Grained Scanning using xgboost. It performs Random Forest n_forest times, with n_trees trees per forest, on your data, using a sliding window to create features. You can specify your learning objective using objective and the evaluation metric using eval_metric. You can also plug in custom objectives instead of the objectives provided by xgboost. As with any uncalibrated machine learning method, the outputs are uncalibrated. Therefore, the usage of scale-dependent metrics is discouraged (please use scale-invariant metrics, such as Accuracy, AUC, R-squared, Spearman correlation...).
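Since custom objectives are supported, here is a hedged sketch of one, written in xgboost's standard custom-objective format (a function of (preds, dtrain) returning the gradient and hessian). Whether MGScanning forwards the function to xgboost verbatim is an assumption here, not something this page documents:

library(xgboost)

# Sketch of a custom logistic-loss objective in xgboost's format.
# Assumption: MGScanning passes this through to xgboost unchanged.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))  # sigmoid: raw margins -> probabilities
  grad <- preds - labels          # first-order gradient of the logloss
  hess <- preds * (1 - preds)     # second-order gradient (hessian)
  list(grad = grad, hess = hess)
}

# Then pass objective = logregobj instead of objective = "binary:logistic".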
Usage

MGScanning(data, labels, folds, dimensions = 1, depth = 10, stride = 1,
           nthread = 1, lr = 1, training_start = NULL, validation_start = NULL,
           n_forest = 2, n_trees = 30, random_forest = 1, seed = 0,
           objective = "reg:linear", eval_metric = Laurae::df_rmse,
           multi_class = 2, verbose = TRUE, garbage = FALSE, work_dir = NULL)
Arguments

data
Type: data.table (for dimensions = 1) or list of matrices (for dimensions = 2). The training data.

labels
Type: numeric vector. The training labels.

folds
Type: list. The folds as list for cross-validation.

dimensions
Type: numeric. The number of dimensions of the data. Only 1 and 2 are supported. Defaults to 1.

depth
Type: numeric. The size of the sliding window applied. Use a vector of size 2 when using two dimensions (row, col). Do not make it larger than the number of features (or the matrix dimensions when dimensions = 2). Defaults to 10.

stride
Type: numeric. The stride (sliding steps) applied to each sliding window. Use a vector of size 2 when using two dimensions (row, col). Defaults to 1.

nthread
Type: numeric. The number of threads used for multithreading. 1 means single-threaded (uses only one core). Higher values may mean faster training if the memory overhead is not too large. Defaults to 1.

lr
Type: numeric. The shrinkage applied to each tree to avoid overfitting. Defaults to 1.

training_start
Type: numeric vector. The initial training prediction labels. Set to NULL if you do not know what you are doing. Defaults to NULL.

validation_start
Type: numeric vector. The initial validation prediction labels. Set to NULL if you do not know what you are doing. Defaults to NULL.

n_forest
Type: numeric. The number of forest models to create for the Complete-Random Tree Forest. Defaults to 2.

n_trees
Type: numeric. The number of trees per forest model to create for the Complete-Random Tree Forest. Defaults to 30.

random_forest
Type: numeric. The number of Random Forest models in the forest. Defaults to 1.

seed
Type: numeric. Random seed for reproducibility. Defaults to 0.

objective
Type: character or function. The objective to optimize: either one of xgboost's built-in objectives (as a character) or a custom objective function. Defaults to "reg:linear".

eval_metric
Type: function. The function which evaluates the predictions against the labels. Defaults to Laurae::df_rmse.

multi_class
Type: numeric. Defines internally whether you are doing multiclass classification or not, so that specific multiclass routines are used (for instance with objective = "multi:softprob"). Set it to the number of classes for multiclass problems. Defaults to 2, which covers regression and binary classification.

verbose
Type: logical. Whether to print the training evaluation. Use FALSE to disable printing. Defaults to TRUE.

garbage
Type: logical. Whether to perform garbage collection regularly. Defaults to FALSE.

work_dir
Type: character, allowing concatenation with another character text (e.g. "dev/tools/save_in_this_folder/" = add slash, or "dev/tools/save_here/prefix_" = don't add slash). The working directory used to store models. If you provide a working directory, the models will be saved inside that directory (and any other models will get wiped if they are under the same names). This severely lowers memory usage, as the models are no longer kept in memory. Combined with garbage = TRUE, it keeps memory usage to a minimum. Defaults to NULL.
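A custom eval_metric can be supplied in place of Laurae::df_rmse. The exact calling convention is not documented on this page, so the following is a sketch under the assumption that the metric receives the predictions and the labels and returns a single numeric score:

# Hedged sketch of a custom evaluation metric; the (preds, labels)
# signature mirrors what Laurae::df_rmse appears to use, but it is an
# assumption, not a documented contract.
df_mae <- function(preds, labels) {
  mean(abs(preds - labels))  # mean absolute error, lower is better
}

# Usage: MGScanning(..., eval_metric = df_mae)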
Details

For implementation details of Cascade Forest / Complete-Random Tree Forest / Multi-Grained Scanning / Deep Forest, see this explanation by Laurae: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-283942390.

Multi-Grained Scanning attempts to perform a sort of specialized convolution using the stacking ensemble method of Cascade Forests. It does so by using one layer of a Cascade Forest, which can be trained manually using CRTreeForest. depth defines how wide a slice of the features each training iteration sees. The window slides down/right by stride every time a training finishes, in order to learn something else. A low stride allows fine-grained training, while a larger stride goes over the data faster. The same holds for depth: a small value increases randomness, while a large value decreases it (if a powerful feature is present in nearly all the windows, every forest ends up learning the same thing and the generated features become redundant).
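To get a feel for depth and stride, the number of windows scanned follows the usual convolution-style count, assuming each window must fit entirely inside the data (the exact boundary handling inside MGScanning is an assumption here). The agaricus data used in the Examples below has 126 feature columns:

# Windows scanned over n_features columns for a given depth and stride.
n_windows <- function(n_features, depth, stride) {
  floor((n_features - depth) / stride) + 1
}

n_windows(126, 10, 20)  # 6 coarse windows (the FAST example below)
n_windows(126, 10, 1)   # 117 fine-grained windows (the SLOW example below)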
Using Multi-Grained Scanning before a Cascade Forest results in a gcForest.
Laurae recommends using xgboost or LightGBM on top of gcForest or Cascade Forest. See the rationale here: https://github.com/Microsoft/LightGBM/issues/331#issuecomment-284689795.
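Following that recommendation, here is a hedged sketch of the stacking step: append the out-of-fold Multi-Grained Scanning features (the preds element of the returned model, as used in the Examples below) to the raw features, then train an xgboost booster on top. The objects agaricus_data_train, agaricus_label_train, and model come from the Examples; the booster parameters are illustrative only:

library(xgboost)

# Stack the scanned features onto the original data, then boost on top.
stacked <- cbind(as.matrix(agaricus_data_train), as.matrix(model$preds))
dtrain  <- xgb.DMatrix(data = stacked, label = agaricus_label_train)
booster <- xgb.train(params = list(objective = "binary:logistic",
                                   eta = 0.1,
                                   max_depth = 6),
                     data = dtrain,
                     nrounds = 100)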
Value

A data.table based on target. The out-of-fold predictions are available through the preds element of the returned model (see the Examples).
Examples

## Not run:
# Load libraries
library(data.table)
library(Matrix)
library(xgboost)
# Create data
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
agaricus_data_train <- data.table(as.matrix(agaricus.train$data))
agaricus_data_test <- data.table(as.matrix(agaricus.test$data))
agaricus_label_train <- agaricus.train$label
agaricus_label_test <- agaricus.test$label
folds <- Laurae::kfold(agaricus_label_train, 5)
# Train a model (binary classification) - FAST VERSION
model <- MGScanning(data = agaricus_data_train, # Training data
labels = agaricus_label_train, # Training labels
folds = folds, # Folds for cross-validation
dimensions = 1, # Change this for 2 dimensions if needed
depth = 10, # Change this to change the sliding window size
stride = 20, # Change this to change the sliding window speed
nthread = 1, # Change this to use more threads
lr = 1, # Do not touch this unless you are expert
training_start = NULL, # Do not touch this unless you are expert
validation_start = NULL, # Do not touch this unless you are expert
n_forest = 2, # Number of forest models
n_trees = 30, # Number of trees per forest
random_forest = 1, # We want only 1 Random Forest
seed = 0,
objective = "binary:logistic",
eval_metric = Laurae::df_logloss,
multi_class = 2, # Modify this for multiclass problems
verbose = TRUE)
# Train a model (binary classification) - SLOW
model <- MGScanning(data = agaricus_data_train, # Training data
labels = agaricus_label_train, # Training labels
folds = folds, # Folds for cross-validation
dimensions = 1, # Change this for 2 dimensions if needed
depth = 10, # Change this to change the sliding window size
stride = 1, # Change this to change the sliding window speed
nthread = 1, # Change this to use more threads
lr = 1, # Do not touch this unless you are expert
training_start = NULL, # Do not touch this unless you are expert
validation_start = NULL, # Do not touch this unless you are expert
n_forest = 2, # Number of forest models
n_trees = 30, # Number of trees per forest
random_forest = 1, # We want only 1 Random Forest
seed = 0,
objective = "binary:logistic",
eval_metric = Laurae::df_logloss,
multi_class = 2, # Modify this for multiclass problems
verbose = TRUE)
# Create predictions
data_predictions <- model$preds
# Example on fake pictures (matrices) and multiclass problem
# Generate fake images
new_data <- list(matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20),
matrix(rnorm(n = 400), ncol = 20, nrow = 20))
# Generate fake labels
new_labels <- c(2, 1, 0, 2, 1, 0, 2, 1, 0, 0)
# Train a model (multiclass problem)
model <- MGScanning(data = new_data, # Training data
labels = new_labels, # Training labels
folds = list(1:3, 4:6, 7:10), # Folds for cross-validation
dimensions = 2,
depth = 10,
stride = 1,
nthread = 1, # Change this to use more threads
lr = 1, # Do not touch this unless you are expert
training_start = NULL, # Do not touch this unless you are expert
validation_start = NULL, # Do not touch this unless you are expert
n_forest = 2, # Number of forest models
n_trees = 10, # Number of trees per forest
random_forest = 1, # We want only 1 Random Forest
seed = 0,
objective = "multi:softprob",
eval_metric = Laurae::df_logloss,
multi_class = 3, # Modify this for multiclass problems
verbose = TRUE)
# Matrix output is 10x600
dim(model$preds)
## End(Not run)