Description
This function allows you to cross-validate a LightGBM model.
It is recommended to have your x_train and x_val sets as data.table, and to use the development version of data.table. To install the development version of data.table, run in your R console:

install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")

The speed increase when creating the train and test files can exceed 1,000x over write.table in certain cases. To store evaluation metrics throughout the training, you MUST run this function with verbose = FALSE.
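As an illustration of that speed claim, a minimal sketch (timings are machine-dependent) comparing write.table with data.table's fwrite, the fast writer shipped in the development version at the time:

library(data.table)
DT <- data.table(x = runif(1e6), y = runif(1e6))
system.time(write.table(DT, "slow.csv", sep = ",", row.names = FALSE)) # baseline writer
system.time(fwrite(DT, "fast.csv")) # typically orders of magnitude faster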
Usage

lgbm.cv(y_train, x_train, bias_train = NA, x_test = NA,
SVMLight = is(x_train, "dgCMatrix"), data_has_label = TRUE,
NA_value = "nan", lgbm_path = "path/to/LightGBM.exe",
workingdir = getwd(), train_name = paste0("lgbm_train", ifelse(SVMLight,
".svm", ".csv")), val_name = paste0("lgbm_val", ifelse(SVMLight, ".svm",
".csv")), test_name = paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")),
init_score = ifelse(is.na(bias_train), NA, paste(train_name, ".weight", sep
= "")), files_exist = FALSE, save_binary = FALSE,
train_conf = "lgbm_train.conf", pred_conf = "lgbm_pred.conf",
test_conf = "lgbm_test.conf", validation = TRUE, unicity = FALSE,
folds = 5, folds_weight = NA, stratified = TRUE, fold_seed = 0,
fold_cleaning = 50, predictions = TRUE, predict_leaf_index = FALSE,
separate_val = TRUE, separate_tests = TRUE,
output_preds = "lgbm_predict.txt", test_preds = "lgbm_predict_test.txt",
verbose = TRUE, log_name = "lgbm_log.txt", full_quiet = FALSE,
full_console = FALSE, importance = FALSE,
output_model = "lgbm_model.txt", input_model = NA, num_threads = 2,
histogram_pool_size = -1, is_sparse = TRUE, two_round = FALSE,
application = "regression", learning_rate = 0.1, num_iterations = 10,
early_stopping_rounds = NA, num_leaves = 127, min_data_in_leaf = 100,
min_sum_hessian_in_leaf = 10, max_bin = 255, feature_fraction = 1,
feature_fraction_seed = 2, bagging_fraction = 1, bagging_freq = 0,
bagging_seed = 3, is_sigmoid = TRUE, sigmoid = 1,
is_unbalance = FALSE, max_position = 20, label_gain = c(0, 1, 3, 7, 15,
31, 63), metric = "l2", metric_freq = 1, is_training_metric = FALSE,
ndcg_at = c(1, 2, 3, 4, 5), tree_learner = "serial",
is_pre_partition = FALSE, data_random_seed = 1, num_machines = 1,
local_listen_port = 12400, time_out = 120, machine_list_file = "")
Arguments
y_train
Type: vector. The training labels.

x_train
Type: data.table (preferred), data.frame, or dgCMatrix (with SVMLight = TRUE). The training data.

bias_train
Type: numeric or vector of numerics. The initial weights of the training data. If a numeric is provided, the weights are identical for all training samples; otherwise, the vector is used as weights. Defaults to NA.

x_test
Type: data.table (preferred), data.frame, or dgCMatrix (with SVMLight = TRUE). The testing data, if any. Defaults to NA.

SVMLight
Type: boolean. Whether the input is a dgCMatrix to be output to SVMLight format. Defaults to is(x_train, "dgCMatrix").

data_has_label
Type: boolean. Whether the data has labels or not. Do not modify this. Defaults to TRUE.

NA_value
Type: numeric or character. What value replaces NAs. Defaults to "nan".

lgbm_path
Type: character. The full path to the LightGBM executable, including the executable name and file extension. Defaults to "path/to/LightGBM.exe".

workingdir
Type: character. The working directory used for LightGBM. Defaults to getwd().

train_name
Type: character. The name of the default training data file for the model. Defaults to paste0("lgbm_train", ifelse(SVMLight, ".svm", ".csv")).

val_name
Type: character. The name of the default validation data file for the model. Defaults to paste0("lgbm_val", ifelse(SVMLight, ".svm", ".csv")).

test_name
Type: character. The name of the testing data file for the model. Defaults to paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")).

init_score
Type: character. The file name of initial (bias) training scores to start training LightGBM. Defaults to paste0(train_name, ".weight") when bias_train is provided, NA otherwise.

files_exist
Type: boolean. Whether the training (and testing) files already exist. Existing files are overwritten when this is FALSE. Defaults to FALSE.

save_binary
Type: boolean. Whether data should be saved as binary files for faster loading. The binary file name is derived automatically from train_name. Defaults to FALSE.

train_conf
Type: character. The name of the training configuration file for the model. Defaults to "lgbm_train.conf".

pred_conf
Type: character. The name of the prediction configuration file for the model. Defaults to "lgbm_pred.conf".

test_conf
Type: character. The name of the testing prediction configuration file for the model. Defaults to "lgbm_test.conf".

validation
Type: boolean. Whether LightGBM performs validation during training by outputting metrics for the validation data. Defaults to TRUE.

unicity
Type: boolean. Whether to overwrite each train/validation file. If not, a tag is added to each file. Defaults to FALSE.

folds
Type: integer, vector of two integers, vector of integers, or list. If an integer is supplied, performs cross-validation with that many folds; a vector or list defines the folds explicitly (see the sketch after this table). Defaults to 5.

folds_weight
Type: vector of numerics. The weights assigned to each fold. If no weight is supplied (NA), the folds are weighted automatically. Defaults to NA.

stratified
Type: boolean. Whether the folds should be stratified (keep the same label proportions) or not. Defaults to TRUE.

fold_seed
Type: integer or vector of integers. The seed for the random number generator. If a vector of integers is provided, its length should be at least the number of folds, and each fold uses its own seed. Defaults to 0.

fold_cleaning
Type: integer. When using cross-validation, data must be subsampled. This parameter controls how aggressively RAM usage is traded against speed: the lower the value, the more aggressively memory usage is kept low. Defaults to 50.

predictions
Type: boolean. Whether cross-validated predictions should be returned. Defaults to TRUE.

predict_leaf_index
Type: boolean. When predictions is TRUE, whether LightGBM should predict leaf indexes. Defaults to FALSE.

separate_val
Type: boolean. Whether out-of-fold predictions should be returned separately, as raw as possible (a list with the predictions, and another list with the averaged predictions). Defaults to TRUE.

separate_tests
Type: boolean. Whether weighted testing predictions should be returned separately, as raw as possible (a list with the predictions, and another list with the averaged predictions). Defaults to TRUE.

output_preds
Type: character. The file name of the prediction results for the model. Defaults to "lgbm_predict.txt".

test_preds
Type: character. The file name of the testing prediction results for the model. Defaults to "lgbm_predict_test.txt".

verbose
Type: boolean/integer. Whether to print many debug messages in the console. 0 is FALSE and 1 is TRUE. Defaults to TRUE.

log_name
Type: character. The logging (sink) file to output to (like "log.txt"). Defaults to "lgbm_log.txt".

full_quiet
Type: boolean. Whether file writing is quiet or not. Defaults to FALSE.

full_console
Type: boolean. Whether a dedicated console should be visible. Defaults to FALSE.

importance
Type: boolean. Whether LightGBM should compute feature importance. Defaults to FALSE.

output_model
Type: character. The file name of the output model. Defaults to "lgbm_model.txt".

input_model
Type: character. The file name of the input model. You MUST use an output_model different from input_model if you supply one. Defaults to NA.

num_threads
Type: integer. The number of threads to run for LightGBM. It is recommended not to set it higher than the number of physical cores in your computer. Defaults to 2.

histogram_pool_size
Type: integer. The maximum cache size (in MB) allocated for LightGBM histogram sketching. Values below 0 mean no limit. Defaults to -1.

is_sparse
Type: boolean. Whether sparse optimization is enabled. Do not set this to FALSE. Defaults to TRUE.

two_round
Type: boolean. LightGBM maps the data file to memory and loads features from memory to maximize speed. If the data is too large to fit in memory, use TRUE. Defaults to FALSE.

application
Type: character. The label application to learn. Must be either "regression", "binary", or "lambdarank". Defaults to "regression".

learning_rate
Type: numeric. The shrinkage rate applied to each iteration. Lower values slow down overfitting, while higher values speed it up. Defaults to 0.1.

num_iterations
Type: integer. The number of boosting iterations LightGBM will perform. Defaults to 10.

early_stopping_rounds
Type: integer. The number of consecutive boosting iterations without improvement on the best validation metric required for LightGBM to stop automatically. Defaults to NA.

num_leaves
Type: integer. The number of leaves in one tree. Defaults to 127.

min_data_in_leaf
Type: integer. Minimum number of data points in one leaf. Higher values potentially decrease overfitting. Defaults to 100.

min_sum_hessian_in_leaf
Type: numeric. Minimum sum of Hessians in one leaf to allow a split. Higher values potentially decrease overfitting. Defaults to 10.

max_bin
Type: integer. The maximum number of bins created per feature. Lower values potentially decrease overfitting. Defaults to 255.

feature_fraction
Type: numeric (0, 1). Column subsampling percentage. For instance, 0.5 means selecting 50% of features randomly for each iteration. Lower values potentially decrease overfitting, while training faster. Defaults to 1.

feature_fraction_seed
Type: integer. Random starting seed for the column subsampling (feature_fraction). Defaults to 2.

bagging_fraction
Type: numeric (0, 1). Row subsampling percentage. For instance, 0.5 means selecting 50% of rows randomly for each iteration. Lower values potentially decrease overfitting, while training faster. Defaults to 1.

bagging_freq
Type: integer. The frequency of row subsampling (bagging_fraction). Defaults to 0.

bagging_seed
Type: integer. Random starting seed for the row subsampling (bagging_fraction). Defaults to 3.

is_sigmoid
Type: boolean. Whether to use a sigmoid transformation of raw predictions. Defaults to TRUE.

sigmoid
Type: numeric. "The sigmoid parameter". Defaults to 1.

is_unbalance
Type: boolean. For binary classification, setting this to TRUE might be useful when the training data is unbalanced. Defaults to FALSE.

max_position
Type: integer. For lambdarank, optimize NDCG for that specific value. Defaults to 20.

label_gain
Type: vector of integers. For lambdarank, the relevant gain for labels. Defaults to c(0, 1, 3, 7, 15, 31, 63).

metric
Type: character, or vector of characters. The metric to optimize. There are 6 available: "l1", "l2", "ndcg", "auc", "binary_logloss", and "binary_error". Defaults to "l2".

metric_freq
Type: integer. The frequency at which to report the metric(s). Defaults to 1.

is_training_metric
Type: boolean. Whether to report the training metric in addition to the validation metric. Defaults to FALSE.

ndcg_at
Type: vector of integers. Evaluate the NDCG metric at these values. Defaults to c(1, 2, 3, 4, 5).

tree_learner
Type: character. The type of learner to use, among "serial", "feature", and "data". Defaults to "serial".

is_pre_partition
Type: boolean. Whether data is pre-partitioned for parallel learning. Defaults to FALSE.

data_random_seed
Type: integer. Random starting seed for the parallel learner. Defaults to 1.

num_machines
Type: integer. When using parallel learning, the number of machines to use. Defaults to 1.

local_listen_port
Type: integer. The TCP listening port for the local machines. Allow this port through the firewall before training. Defaults to 12400.

time_out
Type: integer. The socket time-out in minutes. Defaults to 120.

machine_list_file
Type: character. The file that contains the machine list for parallel learning. Each line in that file must contain one IP and one port for one machine, separated by a space instead of a colon (:). Defaults to "".
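Because folds accepts several formats, here is a hedged sketch of the two most common ones; the list format is assumed to hold row indices per fold, the usual R cross-validation convention:

# Integer: request a plain 5-fold cross-validation.
folds <- 5
# List: hypothetical explicit row indices, one element per fold.
set.seed(0)
idx <- sample(100) # assuming 100 training rows
folds <- split(idx, rep(1:5, length.out = 100))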
Details

The most important parameters are lgbm_path and workingdir: they set up where LightGBM is located and where temporary files are stored. lgbm_path is the full path to the LightGBM executable, including the executable name and file extension (like C:/Laurae/LightGBM/windows/x64/Release/LightGBM.exe). workingdir is the working directory for LightGBM's temporary files: the function creates the many files LightGBM needs to work (defined by output_model, output_preds, train_conf, train_name, val_name, and pred_conf).

train_conf, train_name, and val_name define respectively the configuration file name, the train file name, and the validation file name. They are created under these names when files_exist is set to FALSE.

unicity defines whether to create separate files (if TRUE) or to save space by writing over the same file (if FALSE). Predicting does not work with FALSE. Files take the names you provided (or the defaults), with "_X" added to the file name before the file extension when unicity = FALSE.
Once you have filled these variables (and if they are appropriate), you should fill y_train and x_train. If you need model validation, also fill y_val and x_val. y is your label (a vector), while x is your data.table (preferred), data.frame, or matrix.

Then you are free to choose whatever you want, from hyperparameters to verbosity control.

To get the metric tables, you MUST use verbose = FALSE. They cannot be fetched otherwise; sink() does not work around this.

If for some reason you lose the ability to print in the console, run sink() in the console several times until you get an error.
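A minimal sketch of that recovery step using base R's sink.number(), which reports how many output diversions are active, instead of waiting for an error:

# Close every output diversion left open by an interrupted run.
while (sink.number() > 0) sink()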
Value

A list of LightGBM models whose structure is defined in the lgbm.train documentation under Value. Returns a list of character variables if LightGBM is not found under lgbm_path. In addition, weighted out-of-fold predictions (Validation) are provided if predictions is set to TRUE, weighted averaged testing predictions (Testing) are provided if predictions is set to TRUE with a testing set, and weights (Weights) are provided if predictions is set to TRUE. Also, aggregated feature importance is provided if importance is set to TRUE.
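Assuming the element names listed above (Validation, Testing, Weights), a hedged sketch of reading the cross-validated output after a successful run:

# Hypothetical accessors, based only on the element names given in this section.
# trained <- lgbm.cv(...)          # see the Examples section below
# preds_oof <- trained$Validation  # weighted out-of-fold predictions
# preds_test <- trained$Testing    # weighted averaged testing predictions
# fold_weights <- trained$Weights  # weights used to average the folds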
Examples

## Not run:
# 5-fold cross-validated LightGBM, on very simple data.
library(Laurae)
library(stringi)
library(Matrix)
library(data.table)
remove(list = ls()) # WARNING: CLEANS EVERYTHING IN THE ENVIRONMENT
setwd("C:/LightGBM/temp") # DIRECTORY FOR TEMP FILES
DT <- data.table(Split1 = c(rep(0, 50), rep(1, 50)),
Split2 = rep(c(rep(0, 25), rep(0.5, 25)), 2))
DT$Split3 <- rep(c(rep(0, 10), rep(0.25, 15)), 4)
DT$Split4 <- rep(c(rep(0, 5), rep(0.1, 5), rep(0, 5), rep(0.1, 10)), 4)
DT$Split5 <- rep(c(rep(0, 5), rep(0.05, 5), rep(0, 10), rep(0.05, 5)), 4)
label <- as.numeric((DT$Split2 == 0) & (DT$Split1 == 0) & (DT$Split3 == 0))
trained <- lgbm.cv(y_train = label,
x_train = DT,
bias_train = NA,
folds = 5,
unicity = TRUE,
application = "binary",
num_iterations = 1,
early_stopping_rounds = 1,
learning_rate = 5,
num_leaves = 16,
min_data_in_leaf = 1,
min_sum_hessian_in_leaf = 1,
tree_learner = "serial",
num_threads = 1,
lgbm_path = "C:/LightGBM/windows/x64/Release/lightgbm.exe",
workingdir = getwd(),
validation = FALSE,
files_exist = FALSE,
verbose = TRUE,
is_training_metric = TRUE,
save_binary = TRUE,
metric = "binary_logloss")
str(trained)
## End(Not run)
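As noted in Details, metric tables are only stored when verbose = FALSE; a hedged variation of the call above (same hypothetical paths and toy data) that enables them:

## Not run:
trained_quiet <- lgbm.cv(y_train = label,
                         x_train = DT,
                         folds = 5,
                         application = "binary",
                         num_iterations = 10,
                         learning_rate = 0.1,
                         num_threads = 1,
                         lgbm_path = "C:/LightGBM/windows/x64/Release/lightgbm.exe",
                         workingdir = getwd(),
                         verbose = FALSE, # required to store the evaluation metrics
                         metric = "binary_logloss")
## End(Not run)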