xgbm: Wrapper to xgb.cv and xgb.train

View source: R/xgbm.R

Description

Wrapper to xgb.cv and xgb.train

Usage

xgbm(
  formula,
  data,
  n.trees = 100,
  interaction.depth = 6,
  learning.rate = 0.1,
  weight = NULL,
  bag.fraction = 0.5,
  col.fraction = 1,
  cv.folds = 10,
  cv.class.stratify = FALSE,
  n.cores = NULL,
  n.minobsinnode = 3,
  leaf.penalty = 0,
  weight.penalty.L1 = 0,
  weight.penalty.L2 = 0,
  early.stopping.trees = 100,
  distribution = NULL,
  quant = 0.5,
  event = NULL,
  verbose = 100,
  fail.if.not.converged = TRUE
)

Arguments

formula, data

Formula and data.frame from which to create the model.matrix and response.

n.trees

Number of trees to build.

interaction.depth, learning.rate

Interaction depth (6) and learning rate (0.1).

bag.fraction

Proportion of data in each tree (0.5). If bag.fraction = 1, the whole data set is used as-is.

col.fraction

Proportion of feature columns to use. Defaults to col.fraction = 1, and the only reason you'd want to reduce it is memory or processing time. It is passed as colsample_bynode for no reason other than that's what random forest does. I've no idea if one method is better than another.

cv.folds

Number of cross-validation folds (10).

cv.class.stratify

Whether to stratify cross-validation by the response values (FALSE).

n.cores

Number of cores to use. With the default, n.cores = NULL, 1 core is used.

n.minobsinnode

Minimum number of observations allowed in a tree node (3).

leaf.penalty

Penalty factor for the total number of leaves in trees (0).

weight.penalty.L1, weight.penalty.L2

L1 and L2 penalties for leaf weights (both default to 0).

early.stopping.trees

Passed through as early_stopping_rounds and defaults to 100.

distribution

The only values allowed are "gaussian", "huber" (which uses pseudo-huber loss), "binomial", "multinomial", "poisson", "quantile" or "coxph". Others should be added as the (my) need arises. There is no default because experience suggests that having one leads too easily to mistakes.

quant

Quantile to be modelled when distribution = "quantile". Defaults to quant = 0.5.

event

Only used when distribution = "coxph". Indicates if the observational unit has an event (1) or is censored (0).

verbose

Control printing (100). Use verbose = 0 for no printing.

fail.if.not.converged

Whether to fail if the model has not converged. Defaults to TRUE.

Details

The function takes on the job of turning the formula and data into the model matrix and response that xgboost wants, and applies a sensible metric given the distribution: that is, it does maximum likelihood when a likelihood function is available (everything but quantile regression).
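
The first step described above, going from a formula and data.frame to a model matrix and response, can be sketched with base R alone. This is a minimal illustration on the built-in mtcars data, not the wrapper's actual code: the object names are made up for the example, and dropping the intercept column is an assumption about what a tree model would want.

```r
# Sketch only: split a formula and data.frame into the numeric feature
# matrix and response vector that xgboost works with.
# Not taken from xgbm's source; all names here are illustrative.
fo <- mpg ~ wt + hp + factor(cyl)

mf <- model.frame(fo, data = mtcars)            # evaluate the formula's variables
y  <- model.response(mf)                        # response vector (mpg)
X  <- model.matrix(attr(mf, "terms"), data = mf)  # design matrix, factors expanded

# Assumption: drop the intercept column, since trees have no use for it.
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

colnames(X)   # one column per numeric term or factor contrast
nrow(X)       # one row per observation in mtcars
```

A matrix like X and a vector like y are the kind of inputs that xgboost's own data structures are built from.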

The response (and any other) variable should be transformed prior to using xgbm, if necessary. An apparent bug, somewhere or other, means that if the transformation is done via the formula, relative influence goes wrong.
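
In practice that means creating the transformed column in the data first and naming it in the formula. The sketch below uses the built-in mtcars data; the column name log.mpg is made up for the example. The xgbm call is guarded because it needs the mhdm package, and setting fail.if.not.converged = FALSE is only a guess at keeping a small demo run from stopping early.

```r
# Sketch: do the transformation in the data, not in the formula.
dat <- mtcars
dat$log.mpg <- log(dat$mpg)        # transform up front

fo <- log.mpg ~ wt + hp + disp     # rather than log(mpg) ~ wt + hp + disp

# Only runs if mhdm is installed; argument values are illustrative.
if (requireNamespace("mhdm", quietly = TRUE)) {
  fit <- mhdm::xgbm(fo, data = dat,
                    distribution = "gaussian",
                    cv.folds = 5,
                    fail.if.not.converged = FALSE)  # assumption, see above
}
```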

For distribution = "coxph", you need to use the event argument.

For distribution = "poisson", if you need an offset, divide the response by the exposure and pass exposure in using the weight argument.
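
The reasoning behind this is that the Poisson log-likelihood for the rate y/exposure, weighted by exposure, has the same mean-dependent terms as the log-likelihood for the count y with an offset of log(exposure). A quick check of that equivalence on simulated data follows, using base R's glm as a stand-in rather than xgboost; the data and object names are made up for the example.

```r
# Sketch: rate-with-weights and count-with-offset give the same Poisson fit.
set.seed(1)
n        <- 500
x        <- runif(n)
exposure <- runif(n, 0.5, 5)
y        <- rpois(n, lambda = exposure * exp(0.3 + 0.8 * x))

# Usual approach: counts with a log-exposure offset.
fit.offset <- glm(y ~ x + offset(log(exposure)), family = poisson)

# Approach described above: model the rate, weight by exposure.
# glm warns about non-integer responses; the estimates are unaffected.
rate <- y / exposure
fit.weight <- suppressWarnings(
  glm(rate ~ x, family = poisson, weights = exposure)
)

# Compare the two sets of coefficients.
all.equal(coef(fit.offset), coef(fit.weight), tolerance = 1e-5)
```

In xgbm terms, the rate would be the response in the formula and the exposure would go to weight. The exact form weight expects is not stated here.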

Some of the code is quite inefficient and probably annoying. One of the reasons is that it's best to fail quickly rather than wait for a lot of processing to be done and then fail.


harrysouthworth/mhdm documentation built on Feb. 4, 2022, 12:25 a.m.