xgb.ncv: xgboost repeated cross-validation (Repeated k-fold)
In Laurae2/Laurae: Advanced High Performance Data Science Toolbox for R


This function runs a repeated k-fold cross-validation with xgboost, returning out-of-fold predictions and, optionally, the predictions of each fold model on external data. It currently supports only single-column predictions, i.e. binary classification and regression. Verbose output is always on and cannot be disabled; if you need a silent version, remove the verbose messages from the source and recompile the package. The function also forces a sink: if you interrupt it (or if xgboost stops prematurely), run sink() afterwards, otherwise no further messages are printed to your R console.
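The sink recovery described above can be sketched in base R. This is only an illustration: the interruption is simulated with `stop()`, and `depth`, `log_file` and `status` are hypothetical names that are not part of the package.

```r
# Remember how many output diversions were open before we start.
depth <- sink.number()

# Simulate a function that opens a sink and is interrupted before closing it.
log_file <- tempfile(fileext = ".txt")
status <- tryCatch({
  sink(log_file)
  cat("captured by the sink\n")
  stop("simulated interruption")
}, error = function(e) "interrupted")

# The sink is still open at this point, so console output would be lost.
open_after_error <- sink.number() - depth

# Close every diversion opened since the start to restore console output.
while (sink.number() > depth) sink()

captured <- readLines(log_file)
```

In an interactive session, repeatedly calling `sink()` until `sink.number()` returns `0` has the same effect.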

xgb.ncv(data, label, extra_data = NA, out_of_fold = TRUE, nfolds = 5,
  ntimes = 3, nthread = 2, seed = 11111, verbose = 1,
  print_every_n = 1, sinkfile = "debug.txt", booster = "gbtree",
  eta = 0.3, max_depth = 6, min_child_weight = 1, gamma = 0,
  subsample = 1, colsample_bytree = 1, num_parallel_tree = 1,
  maximum_rounds = 1e+05, objective = "binary:logistic",
  eval_metric = "logloss", maximize = FALSE, early_stopping_rounds = 50)

`data`	The data as a matrix or sparse matrix.
`label`	The label associated with the data.
`extra_data`	The data you want to predict on using the fold models.
`out_of_fold`	Should we predict out of fold? (this includes both `data` and `extra_data`). Defaults to `TRUE`.
`nfolds`	How many folds should we use for the validation? The greater the better, but computation time increases linearly with `nfolds` (multiplied by `ntimes`). Defaults to `5`.
`ntimes`	How many times should the k-fold cross-validation be repeated? The greater, the more stable the results, but computation time increases linearly with `ntimes` (multiplied by `nfolds`). Defaults to `3`.
`nthread`	How many threads to run for xgboost? Defaults to `2`.
`seed`	Which seed should we use globally for all commands dependent on a random seed? Defaults to `11111`.
`verbose`	Should we print verbose data in xgboost? xgboost messages are redirected to the sink file in any case. Defaults to `1`.
`print_every_n`	Every how many iterations should we print verbose data? xgboost messages are redirected to the sink file in any case. Defaults to `1`.
`sinkfile`	What file name to give to the sink? This is where printed messages of xgboost will be stored. Defaults to `"debug.txt"`.
`booster`	What xgboost booster to use? Defaults to `"gbtree"` and must not be changed (does NOT work otherwise).
`eta`	The shrinkage in xgboost. Lower is generally better, but computation time increases exponentially as it gets lower. Defaults to `0.3`.
`max_depth`	The maximum depth of each tree in xgboost. Defaults to `6`.
`min_child_weight`	The minimum hessian weight needed in a child node. Defaults to `1`.
`gamma`	The minimum loss reduction needed in a child node. Defaults to `0`.
`subsample`	The sampling ratio of observations during each iteration. Use `0.632` to simulate Random Forests. Defaults to `1`.
`colsample_bytree`	The sampling ratio of features during each iteration. Defaults to `1`.
`num_parallel_tree`	How many trees to grow per iteration? A number higher than `1` simulates boosted Random Forests. Defaults to `1`.
`maximum_rounds`	How many rounds until giving up boosting if not stopped early? Defaults to `100000`.
`objective`	The objective function. Defaults to `"binary:logistic"`.
`eval_metric`	The evaluation metric. Defaults to `"logloss"`.
`maximize`	Should we maximize the evaluation metric? Defaults to `FALSE`.
`early_stopping_rounds`	How many rounds without improvement of the evaluation metric (in the direction set by `maximize`) are allowed before boosting is stopped on a fold? Defaults to `50`.
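
To see why computation time scales with `nfolds` times `ntimes`, here is a minimal base R sketch of repeated k-fold splitting. It is an illustration only, not the package's internal implementation; `make_repeated_folds` is a hypothetical helper.

```r
# Hypothetical helper: returns one list of held-out row indices per repetition.
make_repeated_folds <- function(n, nfolds = 5, ntimes = 3, seed = 11111) {
  set.seed(seed)
  lapply(seq_len(ntimes), function(i) {
    # Shuffle the row indices, then deal them into nfolds groups of near-equal size.
    split(sample(n), rep(seq_len(nfolds), length.out = n))
  })
}

folds <- make_repeated_folds(n = 103, nfolds = 5, ntimes = 3)

# One model is trained per fold and per repetition.
total_models <- length(folds) * length(folds[[1]])
```

With the defaults (`nfolds = 5`, `ntimes = 3`) this means 15 models, each boosted until early stopping triggers or `maximum_rounds` is reached.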

A list with two to four elements: "scores" for the scored folds (data.frame), "folds" for the fold IDs (list), "preds" for out-of-fold predictions (data.frame), and "extra" for extra data predictions per fold (data.frame).
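
A common follow-up is to blend the per-fold predictions on the extra data into a single prediction. The sketch below assumes that "extra" holds one column per fold model and one row per observation; that layout is an assumption, not something this page confirms, and `extra_mock` is made-up data standing in for the real element.

```r
# Mock stand-in for the "extra" element (assumed layout: one column per fold model).
extra_mock <- data.frame(
  fold_1 = c(0.2, 0.8, 0.5),
  fold_2 = c(0.4, 0.6, 0.5),
  fold_3 = c(0.3, 0.7, 0.5)
)

# Average the fold models row by row to get one blended prediction per observation.
blended <- rowMeans(extra_mock)
```

Check the actual column layout with `str()` on the returned list before applying this to real output.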

# Start from your xgb.cv call: replace data with the initial matrix, insert the label,
# set ntimes to the value you want, and change the sinkfile.
# Unlist params if needed, and add the seed as a parameter.
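
A hedged sketch of such a call on synthetic data follows. It assumes the Laurae and xgboost packages are installed and that a plain numeric matrix and a 0/1 label are accepted as described in the arguments; the data, the small `maximum_rounds` and the temporary sink file are illustrative choices only. The call is skipped when the packages are missing.

```r
# Synthetic binary classification data (illustrative only).
set.seed(11111)
n <- 200
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
y <- as.numeric(x[, 1] + rnorm(n) > 0)

if (requireNamespace("Laurae", quietly = TRUE) &&
    requireNamespace("xgboost", quietly = TRUE)) {
  suppressPackageStartupMessages(library(xgboost))
  depth <- sink.number()
  cv <- tryCatch(
    Laurae::xgb.ncv(data = x, label = y,
                    nfolds = 5, ntimes = 3, nthread = 2, seed = 11111,
                    sinkfile = tempfile(fileext = ".txt"),
                    maximum_rounds = 50, early_stopping_rounds = 5,
                    objective = "binary:logistic", eval_metric = "logloss"),
    # Restore console output even if the call is interrupted.
    finally = while (sink.number() > depth) sink()
  )
  str(cv$scores)
}
```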

Laurae2/Laurae documentation built on May 8, 2019, 7:59 p.m.