s_XRF: XGBoost Random Forest Classification and Regression (C, R)

View source: R/s_XRF.R

s_XRF R Documentation

XGBoost Random Forest Classification and Regression (C, R)

Description

Tune hyperparameters using grid search and resampling, train a final model, and validate it.

Usage

s_XRF(
  x,
  y = NULL,
  x.test = NULL,
  y.test = NULL,
  x.name = NULL,
  y.name = NULL,
  num_parallel_tree = 1000,
  booster = c("gbtree", "gblinear", "dart"),
  missing = NA,
  nrounds = 1,
  weights = NULL,
  ifw = TRUE,
  ifw.type = 2,
  upsample = FALSE,
  downsample = FALSE,
  resample.seed = NULL,
  obj = NULL,
  feval = NULL,
  xgb.verbose = NULL,
  print_every_n = 100L,
  early_stopping_rounds = 50L,
  eta = 1,
  gamma = 0,
  max_depth = 12,
  min_child_weight = 1,
  max_delta_step = 0,
  subsample = 0.75,
  colsample_bytree = 1,
  colsample_bylevel = 1,
  lambda = 0,
  alpha = 0,
  tree_method = "auto",
  sketch_eps = 0.03,
  base_score = NULL,
  objective = NULL,
  sample_type = "uniform",
  normalize_type = "forest",
  rate_drop = 0,
  one_drop = 0,
  skip_drop = 0,
  .gs = FALSE,
  grid.resample.params = setup.resample("kfold", 5),
  gridsearch.type = "exhaustive",
  metric = NULL,
  maximize = NULL,
  importance = TRUE,
  print.plot = FALSE,
  plot.fitted = NULL,
  plot.predicted = NULL,
  plot.theme = rtTheme,
  question = NULL,
  verbose = TRUE,
  grid.verbose = FALSE,
  trace = 0,
  save.gridrun = FALSE,
  n.cores = 1,
  nthread = rtCores,
  outdir = NULL,
  save.mod = ifelse(!is.null(outdir), TRUE, FALSE),
  ...
)

Arguments

x

Numeric vector or matrix / data frame of features, i.e. independent variables

y

Numeric vector of outcome, i.e. dependent variable

x.test

Numeric vector or matrix / data frame of testing set features. Columns must correspond to columns in x

y.test

Numeric vector of testing set outcome

x.name

Character: Name for feature set

y.name

Character: Name for outcome

num_parallel_tree

Integer: Number of trees to grow

booster

Character: Booster to use. Options: "gbtree", "gblinear", "dart"

missing

String or Numeric: Which values to consider as missing. Default = NA

nrounds

Integer: Maximum number of rounds to run. Can be set to a high number as early stopping will limit nrounds by monitoring inner CV error

weights

Numeric vector: Weights for cases. For classification, weights takes precedence over ifw: if weights are provided, ifw is not used. Leave NULL if setting ifw = TRUE.

ifw

Logical: If TRUE, apply inverse frequency weighting (for Classification only). Note: If weights are provided, ifw is not used.

ifw.type

Integer 0, 1, 2.
1: class.weights as in 0, divided by min(class.weights).
2: class.weights as in 0, divided by max(class.weights).
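The two normalizations can be illustrated with a small base R sketch. The definition of type 0 is not given on this page, so the sketch assumes type 0 weights are inverse class frequencies (n / n_class); that choice, and the helper name ifw_sketch, are illustrative assumptions and not rtemis code.

```r
# Hypothetical helper for illustration only; not part of rtemis.
ifw_sketch <- function(y, type = 2) {
  freq <- table(y)
  # ASSUMPTION: type 0 weights are inverse class frequencies
  w0 <- as.numeric(sum(freq) / freq)
  names(w0) <- names(freq)
  switch(as.character(type),
    "0" = w0,
    "1" = w0 / min(w0), # smallest class weight becomes 1
    "2" = w0 / max(w0), # largest class weight becomes 1
    stop("type must be 0, 1, or 2")
  )
}

y <- factor(c(rep("a", 8), rep("b", 2)))
w1 <- ifw_sketch(y, type = 1) # a = 1,    b = 4
w2 <- ifw_sketch(y, type = 2) # a = 0.25, b = 1
```

Under this assumption, types 1 and 2 keep the same ratio between class weights and differ only in scale.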

upsample

Logical: If TRUE, upsample cases to balance outcome classes (for Classification only). Note: upsample will randomly sample with replacement if the length of the majority class is more than double the length of the class you are upsampling, thereby introducing randomness

downsample

Logical: If TRUE, downsample majority class to match size of minority class

resample.seed

Integer: If provided, will be used to set the seed during upsampling. Default = NULL (random seed)
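A minimal base R sketch of seeded upsampling, for intuition only. It is a simplification that always draws the extra cases with replacement, and the helper name upsample_sketch is hypothetical; the rtemis implementation may differ.

```r
# Hypothetical helper for illustration only; not part of rtemis.
# Returns row indices in which every class has as many cases as the largest class.
upsample_sketch <- function(y, seed = NULL) {
  if (!is.null(seed)) set.seed(seed) # fixed seed makes the draw reproducible
  idx_by_class <- split(seq_along(y), y)
  n_max <- max(lengths(idx_by_class))
  idx <- lapply(idx_by_class, function(i) {
    n_extra <- n_max - length(i)
    if (n_extra == 0) {
      i
    } else {
      # sample.int avoids the sample() pitfall when a class has a single case
      c(i, i[sample.int(length(i), n_extra, replace = TRUE)])
    }
  })
  unlist(idx, use.names = FALSE)
}

y <- factor(c(rep("neg", 7), rep("pos", 3)))
idx <- upsample_sketch(y, seed = 42)
balanced <- table(y[idx]) # both classes now have 7 cases
```

Calling the helper twice with the same seed returns the same indices, which is the behavior resample.seed is meant to provide.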

obj

Function: Custom objective function. See ?xgboost::xgboost

feval

Function: Custom evaluation function. See ?xgboost::xgboost

xgb.verbose

Integer: Verbose level for XGB learners used for tuning.

print_every_n

Integer: Print evaluation metrics every this many iterations

early_stopping_rounds

Integer: Training on resamples of x.train (tuning) will stop if performance does not improve for this many rounds
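The stopping rule can be sketched in base R on a vector of per-round validation errors. This is an illustration of the general idea, assuming lower error is better; the helper name early_stop_round and the error values are made up, and XGBoost's own bookkeeping may differ in detail.

```r
# Hypothetical helper for illustration only.
# Stops once `patience` consecutive rounds pass without a new best error.
early_stop_round <- function(errors, patience) {
  best <- Inf
  best_iter <- 0L
  for (i in seq_along(errors)) {
    if (errors[i] < best) {
      best <- errors[i]
      best_iter <- i
    } else if (i - best_iter >= patience) {
      return(list(stopped_at = i, best_iter = best_iter))
    }
  }
  # Ran out of rounds before the patience limit was reached
  list(stopped_at = length(errors), best_iter = best_iter)
}

errs <- c(0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38)
res <- early_stop_round(errs, patience = 3)
# res$best_iter is 3 and res$stopped_at is 6
```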

eta

[gS] Numeric (0, 1): Learning rate.

gamma

[gS] Numeric: Minimum loss reduction required to make further partition

max_depth

[gS] Integer: Maximum tree depth.

min_child_weight

[gS] Numeric: Minimum sum of instance weight needed in a child.

max_delta_step

[gS] Numeric: Maximum delta step we allow each leaf output to be. 0 means no constraint. 1-10 may help control the update, especially with imbalanced outcomes.

subsample

[gS] Numeric: subsample ratio of the training instance

colsample_bytree

[gS] Numeric: subsample ratio of columns when constructing each tree

colsample_bylevel

[gS] Numeric: subsample ratio of columns for each depth level when constructing a tree

lambda

[gS] L2 regularization on weights

alpha

[gS] L1 regularization on weights

tree_method

[gS] XGBoost tree construction algorithm

sketch_eps

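[gS] Numeric (0, 1): Approximation factor for the approximate tree construction method; smaller values produce more candidate split points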

base_score

Numeric: The mean outcome response (Defaults to mean)

objective

(Default = NULL)

sample_type

Character. Default = "uniform"

normalize_type

Character. Default = "forest"

rate_drop

[gS] Numeric: Dropout rate for dart booster.

one_drop

[gS] Integer 0, 1: When this flag is enabled, at least one tree is always dropped during the dropout.

skip_drop

[gS] Numeric [0, 1]: Probability of skipping the dropout procedure during a boosting iteration. If a dropout is skipped, new trees are added in the same manner as gbtree. Non-zero skip_drop has higher priority than rate_drop or one_drop.

.gs

Internal use only

grid.resample.params

List: Output of setup.resample defining grid search parameters.

gridsearch.type

Character: Type of grid search to perform: "exhaustive" or "randomized".

metric

Character: Metric to minimize during grid search, or to maximize if maximize = TRUE. Default = NULL, which results in "Balanced Accuracy" for Classification, "MSE" for Regression, and "Coherence" for Survival Analysis.

maximize

Logical: If TRUE, metric will be maximized if grid search is run.

importance

Logical: If TRUE, calculate variable importance.

print.plot

Logical: if TRUE, produce plot using mplot3. Takes precedence over plot.fitted and plot.predicted.

plot.fitted

Logical: if TRUE, plot True (y) vs Fitted

plot.predicted

Logical: if TRUE, plot True (y.test) vs Predicted. Requires x.test and y.test

plot.theme

Character: "zero", "dark", "box", "darkbox"

question

Character: the question you are attempting to answer with this model, in plain language.

verbose

Logical: If TRUE, print summary to screen.

grid.verbose

Logical: Passed to gridSearchLearn

trace

Integer: If higher than 0, will print more information to the console.

save.gridrun

Logical: If TRUE, save grid search models.

n.cores

Integer: Number of cores to use.

nthread

Integer: Number of threads for xgboost using OpenMP. Parallelize either the resamples (using n.cores) or the xgboost execution (using this setting), not both. At the time of writing, parallelization via this parameter causes a linear booster to fail most of the time. Therefore, the default is rtCores for 'gbtree' and 1 for 'gblinear'

outdir

Path to output directory. If defined, will save Predicted vs. True plot, if available, as well as full model output, if save.mod is TRUE

save.mod

Logical: If TRUE, save all output to an RDS file in outdir. save.mod is TRUE by default if an outdir is defined. If set to TRUE and no outdir is defined, outdir defaults to paste0("./s.", mod.name)

...

Additional arguments
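The defaults under Usage (nrounds = 1, eta = 1, subsample = 0.75, lambda = 0, and a large num_parallel_tree) correspond to XGBoost's random forest configuration: a single boosting round grows num_parallel_tree trees with no shrinkage. The sketch below passes a similar parameter list directly to xgboost, with fewer trees for speed. It is not the rtemis implementation; the simulated data and the objective are assumptions made for the example, and the fit runs only if xgboost is installed.

```r
# Parameter list mirroring the s_XRF defaults (num_parallel_tree reduced for speed)
params <- list(
  booster = "gbtree",
  objective = "reg:squarederror", # ASSUMPTION: regression on simulated data
  eta = 1,                        # no shrinkage
  max_depth = 12,
  subsample = 0.75,               # each tree sees a random 75% of cases
  lambda = 0,                     # no L2 penalty
  num_parallel_tree = 50          # trees grown in the single round
)
nrounds <- 1 # one boosting round, so the model is a forest, not a boosted sequence

# Simulated data, for illustration only
set.seed(2024)
x <- matrix(rnorm(200 * 4), ncol = 4)
y <- x[, 1] + x[, 2]^2 + rnorm(200, sd = 0.1)

if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = x, label = y)
  fit <- xgboost::xgb.train(
    params = params, data = dtrain, nrounds = nrounds, verbose = 0
  )
  fitted <- predict(fit, dtrain) # one fitted value per training case
}
```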

Details

[gS]: indicates parameter will be autotuned by grid search if multiple values are passed. Learn more about XGBoost's parameters here: http://xgboost.readthedocs.io/en/latest/parameter.html
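As a rough illustration of what passing multiple values implies, the base R sketch below enumerates the combinations an exhaustive search would have to evaluate. The parameter values are arbitrary examples, and the subset size used for the randomized case is an assumption made for this sketch, not an rtemis default.

```r
# Not run: multiple values for [gS] parameters would trigger grid search, e.g.
# s_XRF(x, y, max_depth = c(6, 12), subsample = c(0.5, 0.75, 1),
#       colsample_bytree = c(0.5, 1))

# Every combination of the candidate values
grid <- expand.grid(
  max_depth = c(6, 12),
  subsample = c(0.5, 0.75, 1),
  colsample_bytree = c(0.5, 1)
)

# With 5-fold resampling, as in the default setup.resample("kfold", 5),
# each combination is fit once per fold
n_folds <- 5
n_fits <- nrow(grid) * n_folds # 12 combinations * 5 folds = 60 fits

# A randomized search evaluates only a random subset of the rows
set.seed(1)
random_grid <- grid[sample(nrow(grid), 6), ] # subset size chosen arbitrarily
```

The number of fits grows multiplicatively with each tuned parameter, which is the main reason to keep candidate lists short or to use the randomized search.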

Value

rtMod object

Author(s)

E.D. Gennatas

See Also

train_cv for external cross-validation

Other Supervised Learning: s_AdaBoost(), s_AddTree(), s_BART(), s_BRUTO(), s_BayesGLM(), s_C50(), s_CART(), s_CTree(), s_EVTree(), s_GAM(), s_GBM(), s_GLM(), s_GLMNET(), s_GLMTree(), s_GLS(), s_H2ODL(), s_H2OGBM(), s_H2ORF(), s_HAL(), s_Isotonic(), s_KNN(), s_LDA(), s_LM(), s_LMTree(), s_LightCART(), s_LightGBM(), s_MARS(), s_MLRF(), s_NBayes(), s_NLA(), s_NLS(), s_NW(), s_PPR(), s_PolyMARS(), s_QDA(), s_QRNN(), s_RF(), s_RFSRC(), s_Ranger(), s_SDA(), s_SGD(), s_SPLS(), s_SVM(), s_TFN(), s_XGBoost()

Other Tree-based methods: s_AdaBoost(), s_AddTree(), s_BART(), s_C50(), s_CART(), s_CTree(), s_EVTree(), s_GBM(), s_GLMTree(), s_H2OGBM(), s_H2ORF(), s_LMTree(), s_LightCART(), s_LightGBM(), s_MLRF(), s_RF(), s_RFSRC(), s_Ranger(), s_XGBoost()


egenn/rtemis documentation built on Dec. 17, 2024, 6:16 p.m.