lgbm.cv.prep: LightGBM Cross-Validated Model Preparation
In Laurae2/Laurae: Advanced High Performance Data Science Toolbox for R

This function prepares the files needed to cross-validate a LightGBM model. It is recommended to supply x_train and x_test as data.table (or data.frame) objects and to use the development version of data.table. To install it, run in your R console: install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table"). SVMLight conversion requires Laurae's sparsity package, which can be installed with devtools::install_github("Laurae2/sparsity"). The SVMLight file extension used is .svm. Weights and groups are not handled.

lgbm.cv.prep(y_train, x_train, x_test = NA, SVMLight = is(x_train,
  "dgCMatrix"), data_has_label = FALSE, NA_value = "nan",
  workingdir = getwd(), train_all = FALSE, test_all = FALSE,
  cv_all = TRUE, train_name = paste0("lgbm_train", ifelse(SVMLight, ".svm",
  ".csv")), val_name = paste0("lgbm_val", ifelse(SVMLight, ".svm", ".csv")),
  test_name = paste0("lgbm_test", ifelse(SVMLight, ".svm", ".csv")),
  verbose = TRUE, folds = 5, folds_weight = NA, stratified = TRUE,
  fold_seed = 0, fold_cleaning = 50)
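The file-name defaults in the signature above depend on `SVMLight`. As a minimal base-R sketch of how those default expressions evaluate (the helper `default_name` is hypothetical and not part of the package):

```r
# Hypothetical helper mirroring the default expressions in the usage signature.
default_name <- function(prefix, SVMLight) {
  paste0(prefix, ifelse(SVMLight, ".svm", ".csv"))
}

default_name("lgbm_train", SVMLight = FALSE)  # "lgbm_train.csv"
default_name("lgbm_val", SVMLight = TRUE)     # "lgbm_val.svm"
```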

`y_train`	Type: vector. The training labels.
`x_train`	Type: data.table or dgCMatrix (with `SVMLight = TRUE`). The training features.
`x_test`	Type: data.table or dgCMatrix (with `SVMLight = TRUE`). The testing features, if necessary. Not providing a data.frame or a matrix results in at least 3x memory usage. Defaults to `NA`.
`SVMLight`	Type: boolean. Whether the input is a dgCMatrix to be output to SVMLight format. Setting this to `TRUE` requires labels to be provided separately (in `y_train`), and headers will be ignored, which is the default behavior of the SVMLight format. Defaults to `is(x_train, "dgCMatrix")`, i.e. `TRUE` only when `x_train` is a dgCMatrix.
`data_has_label`	Type: boolean. Whether the data has labels or not. Do not modify this. Defaults to `FALSE`.
`NA_value`	Type: numeric or character. The value that replaces NAs. Use `"na"` if you want to specify "missing". Using anything else is not recommended, not even a numeric value out of bounds (like `-999` if all your values are greater than `-999`). You should change the default only if the missing values have a real numeric meaning. Defaults to `"nan"`.
`workingdir`	Type: character. The working directory used for LightGBM. Defaults to `getwd()`.
`train_all`	Type: boolean. Whether the full train data should be exported to the requested format for usage with `lgbm.train`. Defaults to `FALSE`.
`test_all`	Type: boolean. Whether the full test data should be exported to the requested format for usage with `lgbm.train`. Defaults to `FALSE`.
`cv_all`	Type: boolean. Whether the full cross-validation data should be exported to the requested format for usage with `lgbm.cv`. Defaults to `TRUE`.
`train_name`	Type: character. The name of the default training data file for the model. Defaults to `paste0('lgbm_train', ifelse(SVMLight, '.svm', '.csv'))`.
`val_name`	Type: character. The name of the default validation data file for the model. Defaults to `paste0('lgbm_val', ifelse(SVMLight, '.svm', '.csv'))`.
`test_name`	Type: character. The name of the testing data file for the model. Defaults to `paste0('lgbm_test', ifelse(SVMLight, '.svm', '.csv'))`.
`verbose`	Type: boolean. Whether `fwrite` data is output. Defaults to `TRUE`.
`folds`	Type: integer, vector of two integers, vector of integers, or list. If an integer is supplied, performs a `folds`-fold cross-validation. If a vector of two integers is supplied, performs a `folds[1]`-fold cross-validation repeated `folds[2]` times. If a vector of more than two integers is supplied, it must have the same length as the training data, and each value gives the fold of the corresponding observation. Otherwise (if a list is supplied), each element of the list must refer to a fold, and the folds are processed sequentially. Defaults to `5`.
`folds_weight`	Type: vector of numerics. The weights assigned to each fold. If no weight is supplied (`NA`), the weights are automatically set to `rep(1/length(folds))` for an average (does not mix well with folds of different sizes). When the folds are created automatically by supplying `folds` as a vector of two integers, the weights are computed automatically. Defaults to `NA`.
`stratified`	Type: boolean. Whether the folds should be stratified (keep the same label proportions) or not. Defaults to `TRUE`.
`fold_seed`	Type: integer or vector of integers. The seed for the random number generator. If a vector of integers is provided, its length should be at least longer than `n`. Otherwise (if a single integer is supplied), each fold starts with the provided seed, and 1 is added to the seed for every repeat. Defaults to `0`.
`fold_cleaning`	Type: integer. When using cross-validation, data must be subsampled. This parameter controls how aggressive RAM usage should be against speed. The lower this value, the more aggressive the method to keep memory usage as low as possible. Defaults to `50`.
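The interplay of `folds`, `stratified`, and `fold_seed` described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `make_folds` and `make_repeated_folds` are hypothetical helpers, labels are assumed to be discrete, and each fold is represented by its validation row indices.

```r
# Hypothetical sketch: build k validation folds, optionally stratified by label.
make_folds <- function(y, k = 5, stratified = TRUE, seed = 0) {
  set.seed(seed)
  n <- length(y)
  fold_id <- integer(n)
  # Stratification: assign folds within each label group separately.
  groups <- if (stratified) split(seq_len(n), y) else list(seq_len(n))
  for (idx in groups) {
    # Shuffle the group's rows, then deal them out round-robin across folds.
    shuffled <- idx[sample.int(length(idx))]
    fold_id[shuffled] <- rep_len(seq_len(k), length(shuffled))
  }
  # One list element per fold, holding the validation row indices.
  unname(split(seq_len(n), fold_id))
}

# Repeated CV, as in folds = c(k, repeats): the seed grows by 1 per repeat.
make_repeated_folds <- function(y, folds = c(5, 2), seed = 0) {
  out <- list()
  for (r in seq_len(folds[2])) {
    out <- c(out, make_folds(y, k = folds[1], seed = seed + r - 1))
  }
  out
}

y <- rep(c(0, 1), times = c(80, 20))   # 20% positive labels
folds <- make_folds(y, k = 5)
lengths(folds)                          # 20 rows in each of the 5 folds
sapply(folds, function(i) mean(y[i]))   # 0.2 in every fold

repeated <- make_repeated_folds(y, folds = c(5, 2))
length(repeated)                        # 10 folds: 5 folds x 2 repeats
```

With stratification, every fold keeps the overall label proportion; with `stratified = FALSE` the same helper shuffles all rows as a single group.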

Returns a list containing the folds and folds_weight elements if cv_all = TRUE. All files are written and ready to use with lgbm.cv and files_exist = TRUE. If train_all is used, the files are also ready to be used with lgbm.train and files_exist = TRUE. Returns "Success" if cv_all = FALSE and the code does not error midway.
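To make the role of the returned elements concrete, here is a minimal sketch of how fold weights could combine per-fold scores. The list below only mimics the assumed shape of the return value (element names taken from the text above); the scores are made-up illustration values, and the size-proportional weighting is an alternative shown for comparison, not something the package is documented to do.

```r
# Assumed shape of the returned list; the fold contents are illustrative.
prepared <- list(
  folds = list(1:6, 7:10, 11:12),   # validation rows per fold (unequal sizes)
  folds_weight = rep(1 / 3, 3)      # equal weights, one per fold
)

# Hypothetical per-fold validation scores.
fold_scores <- c(0.80, 0.90, 0.70)

# Equal-weight average, ignoring fold sizes.
cv_score <- sum(fold_scores * prepared$folds_weight)   # 0.8

# Alternative: weight each fold by its share of the rows.
size_weight <- lengths(prepared$folds) / sum(lengths(prepared$folds))
cv_score_sized <- sum(fold_scores * size_weight)
```

The two averages differ as soon as the folds have different sizes, which is why equal weights do not mix well with unevenly sized folds.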

## Not run: 
# Prepare files for cross-validation.
prepared.cv <- lgbm.cv.prep(y_train = targets,
                            x_train = data[1:1500, ],
                            workingdir = file.path(getwd(), "temp"),
                            train_name = 'lgbm_train.csv',
                            val_name = 'lgbm_val.csv',
                            folds = 3)

## End(Not run)
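As a further illustration of what the NA_value argument controls, the following base-R sketch writes a small data frame to CSV with missing cells replaced by a marker string. It uses `write.table` instead of `fwrite` to stay dependency-free, and writing without a header row is an assumption made for this example, not documented package behavior.

```r
# Toy data with missing values.
x <- data.frame(a = c(1.5, NA, 3), b = c(NA, 2, 4))

# Write to a temporary CSV, replacing NA with the marker "nan".
path <- tempfile(fileext = ".csv")
write.table(x, path, sep = ",", na = "nan",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

readLines(path)  # "1.5,nan" "nan,2" "3,4"
```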