SL.xgboost_cv: XGBoost SuperLearner wrapper with internal cross-validation...

Description Usage Arguments Details

View source: R/sl_xgboost_cv.R

Description

Supports the Extreme Gradient Boosting (XGBoost) package, a variant of gradient boosted machines (GBM), as a SuperLearner wrapper. Conducts internal cross-validation and stops adding trees once performance plateaus.

Usage

SL.xgboost_cv(
  Y,
  X,
  newX,
  family,
  obsWeights,
  id,
  ntrees = 5000L,
  early_stopping_rounds = 200L,
  nfold = 5L,
  max_depth = 4L,
  shrinkage = 0.1,
  minobspernode = 10L,
  subsample = 0.7,
  colsample_bytree = 0.8,
  gamma = 5,
  stratified = family$family == "binomial",
  eval_metric = ifelse(family$family == "binomial", "auc", "rmse"),
  print_every_n = 400L,
  nthread = getOption("sl.cores", 1L),
  verbose = 0,
  save_period = NULL,
  ...
)
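
A minimal sketch of calling the wrapper directly on simulated data. In normal use SuperLearner supplies Y, X, newX, family, obsWeights, and id itself; the $pred element used below assumes the standard SuperLearner wrapper convention of returning a list with pred and fit.

library(ck37r)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1 - X$x2))

# Small ntrees and early_stopping_rounds keep this sketch quick;
# the defaults (5000 trees, 200 rounds) are intended for real fits.
result <- SL.xgboost_cv(Y = Y, X = X, newX = X, family = binomial(),
                        obsWeights = rep(1, n), id = seq_len(n),
                        ntrees = 500L, early_stopping_rounds = 20L)
head(result$pred)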

Arguments

Y

Outcome variable

X

Covariate dataframe

newX

Optional dataframe on which to predict the outcome

family

"gaussian" for regression, "binomial" for binary classification, "multinomial" for multiple classification (not yet supported).

obsWeights

Optional observation-level weights (supported but not tested)

id

Optional id to group observations from the same unit (not used currently).

ntrees

Maximum number of trees to fit; with early stopping this acts as an upper bound. Too few trees may underfit and too many may overfit, depending also on the shrinkage.

early_stopping_rounds

Stop training if the evaluation metric has not improved within this many rounds.

nfold

Number of internal cross-validation folds.

max_depth

How deep each tree can be. A depth of 1 allows no interactions (decision stumps).

shrinkage

Learning rate: how much to shrink each tree's contribution, in order to reduce overfitting.

minobspernode

Minimum number of observations allowed in a tree node; nodes with fewer observations are not split further.

subsample

Fraction of observations sampled for each tree, to reduce correlation between trees.

colsample_bytree

Fraction of columns sampled for each tree, to reduce correlation between trees.

gamma

Minimum loss reduction required to split a node; higher values result in less complex trees.

stratified

Whether stratified sampling should be used for the internal cross-validation folds with binary outcomes; defaults to TRUE for binomial families.

eval_metric

Metric to use for early stopping; defaults to AUC for classification and RMSE for regression.

print_every_n

Print estimation status every n rounds.

nthread

How many threads (cores) XGBoost should use. Generally keep this at 1 so that XGBoost does not compete with SuperLearner's own parallelization; see the sketch after this argument list.

verbose

Verbosity of XGBoost fitting.

save_period

How often (in tree iterations) to save the current model to disk during training. If NULL, the model is not saved; if 0, it is saved once at the end.

...

Any remaining arguments (not used).
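
As noted under nthread, the default thread count comes from getOption("sl.cores", 1L). A brief sketch of the two common settings (snowSuperLearner and mcSuperLearner are the SuperLearner package's parallel front-ends):

# If SuperLearner itself runs sequentially, XGBoost can safely use more threads:
options(sl.cores = 4L)

# If parallelizing at the SuperLearner level (e.g. via snowSuperLearner or
# mcSuperLearner), keep XGBoost single-threaded to avoid competing for cores:
options(sl.cores = 1L)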

Details

The performance of XGBoost, like GBM, is sensitive to its configuration settings. It is therefore best to create multiple configurations using create.SL.xgboost and allow SuperLearner to weight them based on cross-validated performance.
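
A sketch of that workflow. create.SL.xgboost() is from the SuperLearner package; it generates one wrapper function per hyperparameter combination in the global environment and (in current versions) returns their names in $names. The tuning grid below is purely illustrative.

library(SuperLearner)
library(ck37r)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1 - X$x2))

# Generate one SL.xgboost wrapper per combination of hyperparameters.
xgb_grid <- create.SL.xgboost(tune = list(ntrees = 1000L,
                                          max_depth = c(2L, 4L),
                                          shrinkage = c(0.05, 0.1),
                                          minobspernode = 10L))

# Let SuperLearner weight the configurations (plus SL.xgboost_cv itself)
# by cross-validated performance.
sl <- SuperLearner(Y = Y, X = X, family = binomial(),
                   SL.library = c("SL.mean", xgb_grid$names, "SL.xgboost_cv"))
sl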

If you run into errors, please first try installing the latest version of XGBoost from CRAN.

