xgb.cv_pam: Runs xgb.train.ped on cross-validation sets


View source: R/xgboost-fit.R

Description

Runs xgb.train.ped on cross-validation sets

Usage

xgb.cv_pam(
  params = list(),
  data,
  nrounds,
  nfold = 4,
  cv_indices,
  ped_params = list(),
  nthread = 1L,
  verbose = FALSE,
  print_every_n = 1L,
  early_stopping_rounds = NULL,
  ...
)

Arguments

params

The list of parameters. The complete list is available in the online documentation. A shorter summary follows:

1. General Parameters

  • booster which booster to use, can be gbtree or gblinear. Default: gbtree.

2. Booster Parameters

2.1. Parameters for Tree Booster

  • eta controls the learning rate: the contribution of each tree is scaled by a factor 0 < eta < 1 when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower eta requires a larger nrounds: the model is more robust to overfitting but slower to compute. Default: 0.3

  • gamma minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.

  • max_depth maximum depth of a tree. Default: 6

  • min_child_weight minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1

  • subsample subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly samples half of the data instances to grow trees, which helps prevent overfitting. It also shortens computation because less data is analysed. It is advised to use this parameter together with eta and to increase nrounds. Default: 1

  • colsample_bytree subsample ratio of columns when constructing each tree. Default: 1

  • num_parallel_tree Experimental parameter. Number of trees to grow per round. Useful for testing Random Forest through xgboost (set colsample_bytree < 1, subsample < 1 and nrounds = 1 accordingly). Default: 1

  • monotone_constraints A numeric vector consisting of 1, 0 and -1, with length equal to the number of features in the training data. 1 is increasing, -1 is decreasing and 0 is no constraint.

  • interaction_constraints A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from 0 (0 references the first column). Leave argument unspecified for no interaction constraints.
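
As a minimal sketch of how the tree booster parameters above could be assembled, the following builds a params list in base R. The feature names and the parameter values are hypothetical and chosen only for illustration; the point is that monotone_constraints needs one entry per feature in column order, and that interaction_constraints uses 0-based feature indices, so R's 1-based positions have to be shifted.

```r
# Hypothetical feature names; replace with the columns of your own data.
feature_names <- c("age", "dose", "weight", "sex")

# monotone_constraints: one entry per feature, in column order.
# 1 = increasing, -1 = decreasing, 0 = no constraint.
monotone <- c(age = 1, dose = -1, weight = 0, sex = 0)[feature_names]

# interaction_constraints: feature indices start at 0, so subtract 1
# from R's 1-based column positions.
to_zero_based <- function(x) match(x, feature_names) - 1L
interactions <- list(
  to_zero_based(c("age", "dose")),   # age and dose may interact
  to_zero_based(c("weight", "sex"))  # weight and sex may interact
)

params <- list(
  booster = "gbtree",
  eta = 0.05,            # illustrative: small eta, so expect a larger nrounds
  max_depth = 3,
  min_child_weight = 1,
  subsample = 0.8,
  colsample_bytree = 0.8,
  monotone_constraints = unname(monotone),
  interaction_constraints = interactions
)
```

A list like this would be passed as the params argument; whether every entry is forwarded unchanged by xgb.cv_pam is not stated on this page.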

2.2. Parameters for Linear Booster

  • lambda L2 regularization term on weights. Default: 0

  • lambda_bias L2 regularization term on bias. Default: 0

  • alpha L1 regularization term on weights. There is no L1 regularization on the bias because it is not important. Default: 0

3. Task Parameters

  • objective specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:

    • reg:squarederror regression with squared loss (Default).

    • reg:squaredlogerror regression with squared log loss 1/2 * (log(pred + 1) - log(label + 1))^2. All inputs are required to be greater than -1. Also, see metric rmsle for a possible issue with this objective.

    • reg:logistic logistic regression.

    • reg:pseudohubererror regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.

    • binary:logistic logistic regression for binary classification. Outputs a probability.

    • binary:logitraw logistic regression for binary classification. Outputs the score before the logistic transformation.

    • binary:hinge hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.

    • count:poisson Poisson regression for count data. Outputs the mean of the Poisson distribution. max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization).

    • survival:cox Cox regression for right-censored survival time data (negative values are considered right-censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).

    • survival:aft accelerated failure time model for censored survival time data. See Survival Analysis with Accelerated Failure Time for details.

    • aft_loss_distribution probability density function used by survival:aft and the aft-nloglik metric.

    • multi:softmax set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to num_class - 1.

    • multi:softprob same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to an ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.

    • rank:pairwise set xgboost to do a ranking task by minimizing the pairwise loss.

    • rank:ndcg use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized.

    • rank:map use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized.

    • reg:gamma gamma regression with log-link. Output is the mean of the gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.

    • reg:tweedie Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.

  • base_score the initial prediction score of all instances, global bias. Default: 0.5

  • eval_metric evaluation metrics for validation data. Users can pass a self-defined function to it. Default: the metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). A list is provided in the details section.
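
To make the squared log loss formula of reg:squaredlogerror concrete, and to illustrate what a self-defined objective has to supply, here is a small sketch in base R. The function names (sle_loss, sle_grad, sle_hess, custom_sle) are hypothetical. The derivatives are obtained by differentiating the formula above with respect to the prediction; the built-in objective may apply additional safeguards that are not shown here. The wrapper assumes the usual xgboost R convention that a custom objective receives the predictions and the training xgb.DMatrix and returns a list with grad and hess; whether xgb.cv_pam forwards such a function unchanged is not stated on this page.

```r
# Squared log loss per observation: 1/2 * (log(pred + 1) - log(label + 1))^2
sle_loss <- function(pred, label) 0.5 * (log1p(pred) - log1p(label))^2

# First derivative of the loss with respect to pred.
sle_grad <- function(pred, label) (log1p(pred) - log1p(label)) / (pred + 1)

# Second derivative of the loss with respect to pred.
sle_hess <- function(pred, label) {
  (1 - log1p(pred) + log1p(label)) / (pred + 1)^2
}

# Hypothetical wrapper in the shape of a self-defined objective.
# Assumption: labels are read from the DMatrix with xgboost::getinfo().
custom_sle <- function(preds, dtrain) {
  labels <- xgboost::getinfo(dtrain, "label")
  list(grad = sle_grad(preds, labels), hess = sle_hess(preds, labels))
}

# Loss is zero when prediction and label agree.
sle_loss(3, 3)
```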

data

Training dataset. xgb.train accepts only an xgb.DMatrix as the input. xgboost, in addition, also accepts matrix, dgCMatrix, or the name of a local data file.

nrounds

Maximum number of boosting iterations.

nfold

Number of cross-validation folds.
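
The cv_indices argument appears in the usage but is not described on this page, so its expected format is unknown here. Purely as an illustration of how nfold folds might be constructed by hand, the sketch below assigns whole subjects to folds and returns one vector of row indices per fold. The helper make_folds, the ids vector, and the list-of-index-vectors format are all assumptions, not the package's documented interface.

```r
# Hypothetical helper: split rows into nfold folds, keeping all rows of
# one subject in the same fold.
make_folds <- function(ids, nfold = 4, seed = 1) {
  set.seed(seed)
  unique_ids <- unique(ids)
  # Balanced fold labels, randomly permuted across subjects.
  fold_of_id <- sample(rep_len(seq_len(nfold), length(unique_ids)))
  lapply(seq_len(nfold), function(k) {
    which(ids %in% unique_ids[fold_of_id == k])
  })
}

# Illustrative data: 10 subjects with a varying number of rows each.
ids <- rep(1:10, times = c(3, 1, 2, 4, 1, 2, 3, 1, 2, 1))
folds <- make_folds(ids, nfold = 4)
lengths(folds)  # number of rows in each fold
```

Grouping by subject matters whenever a subject contributes several rows, since splitting those rows across folds would let information leak from training to validation data.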

ped_params

List of parameters used to transform data into PED format.
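
For readers unfamiliar with the PED (piece-wise exponential data) format, the following base R sketch shows the general idea of the transformation: follow-up time is cut into intervals, and each subject contributes one row per interval in which they are still at risk, with an event indicator for that interval and the log of the time spent at risk as an offset. The function to_ped and the column names are illustrative only; the transformation actually used by the package, and the entries it expects in ped_params, may differ.

```r
# Hypothetical helper: expand survival data into piece-wise exponential
# format.
#   data: data.frame with columns id, time, status (1 = event, 0 = censored)
#   cut:  increasing interval break points, starting at 0
to_ped <- function(data, cut) {
  rows <- lapply(seq_len(nrow(data)), function(i) {
    t_i <- data$time[i]
    tstart <- cut[-length(cut)]
    tend <- cut[-1]
    # Keep only intervals the subject enters.
    at_risk <- tstart < t_i
    tstart <- tstart[at_risk]
    tend <- tend[at_risk]
    # Time spent at risk within each interval.
    exposure <- pmin(tend, t_i) - tstart
    data.frame(
      id = rep(data$id[i], length(tstart)),
      tstart = tstart,
      tend = tend,
      # Event only in the interval that contains the observed event time.
      ped_status = as.integer(data$status[i] == 1 & t_i <= tend),
      offset = log(exposure)
    )
  })
  do.call(rbind, rows)
}

# Illustrative data: subject 1 has an event at 2.5, subject 2 is censored
# at 1.0.
surv <- data.frame(id = 1:2, time = c(2.5, 1.0), status = c(1, 0))
ped <- to_ped(surv, cut = c(0, 1, 2, 3))
ped
```

Subjects followed beyond the last break point are treated as censored at that point, so they never receive an event indicator.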

verbose

If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the cb.print.evaluation(period=1) callback function.

print_every_n

Print each n-th iteration evaluation messages when verbose>0. Default is 1 which means all messages are printed. This parameter is passed to the cb.print.evaluation callback.

early_stopping_rounds

If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.
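
The stopping rule described above can be sketched in a few lines of base R. This is an illustration of the logic only, not the cb.early.stop callback itself; the function early_stop_round and the validation losses are made up for the example, and a metric that should be minimized is assumed by default.

```r
# Hypothetical helper: given a metric per boosting round, find the round at
# which training would stop if no improvement is seen for k rounds.
early_stop_round <- function(metric, k = NULL, maximize = FALSE) {
  if (is.null(k)) k <- Inf  # NULL: early stopping is never triggered
  best <- if (maximize) -Inf else Inf
  best_iter <- 0L
  for (i in seq_along(metric)) {
    improved <- if (maximize) metric[i] > best else metric[i] < best
    if (improved) {
      best <- metric[i]
      best_iter <- i
    } else if (i - best_iter >= k) {
      # k rounds have passed without improvement.
      return(list(stopped_at = i, best_iteration = best_iter))
    }
  }
  list(stopped_at = length(metric), best_iteration = best_iter)
}

# Illustrative validation losses: the best value before stopping is at
# round 3.
val_loss <- c(0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50)
res <- early_stop_round(val_loss, k = 3)
res$stopped_at      # 6
res$best_iteration  # 3
```

Note that the improvement at round 7 is never reached with k = 3, which is the usual trade-off: a small k saves computation but can stop before a later improvement.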

...

Other parameters to pass to params.


adibender/pem.xgb documentation built on Sept. 10, 2021, 7:24 p.m.