xgbm: Wrapper to xgb.cv and xgb.train
In harrysouthworth/mhdm: Medical History Data Mining

Wrapper to xgb.cv and xgb.train

xgbm(
  formula,
  data,
  n.trees = 100,
  interaction.depth = 6,
  learning.rate = 0.1,
  weight = NULL,
  bag.fraction = 0.5,
  col.fraction = 1,
  cv.folds = 10,
  cv.class.stratify = FALSE,
  n.cores = NULL,
  n.minobsinnode = 3,
  leaf.penalty = 0,
  weight.penalty.L1 = 0,
  weight.penalty.L2 = 0,
  early.stopping.trees = 100,
  distribution = NULL,
  quant = 0.5,
  event = NULL,
  verbose = 100,
  fail.if.not.converged = TRUE
)

`formula, data`	Formula and data.frame from which to create the model.matrix and response.
`n.trees`	Number of trees to build.
`interaction.depth, learning.rate`	Interaction depth (6) and learning rate (0.1).
`bag.fraction`	Proportion of data in each tree (0.5). If `bag.fraction = 1`, the whole data set is used as-is.
`col.fraction`	Proportion of columns of features to use. Defaults to `col.fraction = 1` and the only reason you'd want to reduce it is due to memory or processing time issues. This is passed as `colsample_bynode` for no reason other than that's what random forest does. I've no idea if one method is better than another.
`cv.folds`	Number of cross-validation folds (10).
`cv.class.stratify`	Whether to stratify cross-validation by the response values (FALSE).
`n.cores`	Number of cores to use (1).
`n.minobsinnode`	Minimum number of observations allowed in a tree node (3).
`leaf.penalty`	Penalty factor for the total number of leaves in trees (0).
`weight.penalty.L1, weight.penalty.L2`	L1 and L2 penalties for leaf weights (both default to 0, as in the usage above).
`early.stopping.trees`	Passed through as `early_stopping_rounds` and defaults to 100.
`distribution`	The only values allowed are "gaussian", "huber" (which uses pseudo-huber loss), "binomial", "multinomial", "poisson", "quantile" or "coxph". Others should be added as the (my) need arises. There is no default because experience suggests that a default leads too easily to mistakes.
`quant`	Quantile to be modelled when `distribution = "quantile"`. Defaults to `quant = 0.5`.
`event`	Only used when `distribution = "coxph"`. Indicates if the observational unit has an event (1) or is censored (0).
`verbose`	Control printing (100). Use `verbose = 0` for no printing.
`fail.if.not.converged`	Whether to fail if the model has not converged. Defaults to TRUE.
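
A minimal sketch of a call, using only argument names from the usage above. It assumes the mhdm package and xgboost are installed and that `xgbm` is exported; the structure of the returned object is not described on this page, so the sketch only prints its class. Setting `fail.if.not.converged = FALSE` is a guess at what keeps a small illustrative run from stopping, based on the argument name alone.

```r
# Illustrative data: a few numeric columns from the built-in mtcars data set.
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Guarded so the block does nothing if mhdm is not installed.
if (requireNamespace("mhdm", quietly = TRUE)) {
  fit <- mhdm::xgbm(
    mpg ~ wt + hp + disp,
    data = dat,
    distribution = "gaussian",      # no default, so it must always be given
    n.trees = 200,
    cv.folds = 5,
    verbose = 0,
    fail.if.not.converged = FALSE   # assumed to relax the convergence check
  )
  print(class(fit))                 # return structure is not documented here
}
```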

The function takes on the job of turning the data and formula into the favoured stuff of xgboost and applies sensible metrics given the distribution: that is, it does maximum likelihood when a likelihood function is available (everything but quantile regression).
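
As a rough illustration of the formula-and-data step, the following base-R sketch builds a numeric feature matrix and a response vector. It is not the package's code: the data are made up, and dropping the intercept column is a choice made for this sketch only.

```r
# Made-up data with one numeric and one factor predictor.
dat <- data.frame(
  y   = c(1.2, 0.7, 3.1, 2.4, 1.9),
  age = c(34, 51, 29, 62, 45),
  sex = factor(c("F", "M", "M", "F", "M"))
)
fo <- y ~ age + sex

mf <- model.frame(fo, data = dat)   # evaluate the formula against the data
y  <- model.response(mf)            # response vector
X  <- model.matrix(fo, data = mf)   # design matrix, factors dummy-coded

# Drop the intercept column (a choice for this sketch, not known xgbm behaviour).
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

dim(X)
colnames(X)
```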

The response (and any other) variable should be transformed prior to using xgbm, if necessary. An apparent bug, somewhere or other, means that if the transformation is done via the formula, relative influence goes wrong.
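
A small sketch of the recommended workflow, with made-up data: transformed variables are stored as ordinary columns and the formula refers to those columns. The last two lines only show how the variable names differ between the two approaches; they are not a diagnosis of the bug.

```r
dat <- data.frame(y = c(2.5, 7.1, 1.3, 12.8), x = c(1, 4, 9, 16))

# Transform first, storing the results as plain columns ...
dat$log_y  <- log(dat$y)
dat$sqrt_x <- sqrt(dat$x)

# ... and refer to those columns in the formula.
fo <- log_y ~ sqrt_x

# For comparison, the variable names each approach produces:
names(model.frame(fo, data = dat))
names(model.frame(log(y) ~ sqrt(x), data = dat))
```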

For `distribution = "coxph"`, you need to use the `event` argument.
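
The sketch below shows the kind of data this implies: a time column plus a 0/1 event indicator. The data are made up. The label encoding shown is the convention of xgboost's own `survival:cox` objective (negative values are treated as right censored); whether xgbm does exactly this internally, and whether `event` is passed as a vector as in the commented call, are assumptions.

```r
# Made-up survival data: follow-up time and event indicator (1 = event, 0 = censored).
surv <- data.frame(
  time  = c(5, 12, 7, 20, 3),
  event = c(1, 0, 1, 0, 1),
  age   = c(61, 48, 70, 55, 66)
)

# The indicator should contain only 0 and 1.
stopifnot(all(surv$event %in% c(0, 1)))

# xgboost's survival:cox objective reads negative labels as right censored.
label <- ifelse(surv$event == 1, surv$time, -surv$time)
label

# Hypothetical call (form of the event argument is assumed):
# xgbm(time ~ age, data = surv, event = surv$event, distribution = "coxph")
```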

For `distribution = "poisson"`, if you need an offset, divide the response by the exposure and pass exposure in using the `weight` argument.
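
Why this works can be checked with an ordinary Poisson GLM, where modelling the count with an offset of log(exposure) gives the same likelihood (up to a constant) as modelling the rate with exposure as the weight. The sketch below uses simulated data and base R's `glm` purely to illustrate that equivalence; it does not call xgbm.

```r
set.seed(1)
n <- 200
d <- data.frame(x = runif(n), exposure = runif(n, 0.5, 5))
d$count <- rpois(n, lambda = d$exposure * exp(0.3 + 0.8 * d$x))

# Rate response: count divided by exposure.
d$rate <- d$count / d$exposure

# Offset formulation: counts with log(exposure) as offset.
fit_offset <- glm(count ~ x + offset(log(exposure)), family = poisson, data = d)

# Weight formulation: rates with exposure as weights.
# glm warns about non-integer responses here; the fit itself is still valid.
fit_weight <- suppressWarnings(
  glm(rate ~ x, family = poisson, weights = exposure, data = d)
)

# The two sets of coefficients should agree up to numerical tolerance.
all.equal(coef(fit_offset), coef(fit_weight), tolerance = 1e-5)
```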

Some of the code is quite inefficient and probably annoying. One of the reasons is that it's best to fail quickly rather than wait for a lot of processing to be done and then fail.
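
To illustrate the fail-quickly idea, here is a sketch of cheap argument checks that could run before any model fitting starts. The function `check_args` is hypothetical and not part of the package; the allowed distributions come from the argument list above, while the requirement that `quant` lies strictly between 0 and 1 is an assumption.

```r
# Hypothetical helper: validate arguments up front, before any expensive work.
check_args <- function(distribution = NULL, quant = 0.5, event = NULL) {
  allowed <- c("gaussian", "huber", "binomial", "multinomial",
               "poisson", "quantile", "coxph")

  if (is.null(distribution)) {
    stop("distribution must be specified; there is no default")
  }
  if (length(distribution) != 1 || !(distribution %in% allowed)) {
    stop("distribution must be one of: ", paste(allowed, collapse = ", "))
  }
  if (distribution == "coxph" && is.null(event)) {
    stop("distribution = \"coxph\" requires the event argument")
  }
  # Assumed constraint: a quantile strictly between 0 and 1.
  if (distribution == "quantile" && (quant <= 0 || quant >= 1)) {
    stop("quant must lie strictly between 0 and 1")
  }
  invisible(TRUE)
}

check_args("gaussian")                    # passes silently
try(check_args("coxph"), silent = TRUE)   # fails at once: no event given
```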

harrysouthworth/mhdm documentation built on Feb. 4, 2022, 12:25 a.m.