| xgboost | R Documentation |
Fits an XGBoost model (boosted decision tree ensemble) to given x/y data.
See the tutorial Introduction to Boosted Trees for a longer explanation of what XGBoost does, and the rest of the XGBoost Tutorials for further explanations of XGBoost's features and usage.
This function is intended to provide a user-friendly interface for XGBoost that follows R's conventions for model fitting and predictions, but which doesn't expose all of the possible functionalities of the core XGBoost library.
See xgb.train() for a more flexible low-level alternative which is similar across different
language bindings of XGBoost and which exposes additional functionalities such as training on
external memory data and learning-to-rank objectives.
See also the migration guide if coming from a previous version of XGBoost in the 1.x series.
By default, most of the parameters here have a value of NULL, which signals XGBoost to use its
default value. Default values are automatically determined by the XGBoost core library, and are
subject to change over XGBoost library versions. Some of them might differ according to the
booster type (e.g. defaults for regularization are different for linear and tree-based boosters).
See xgb.params() and the online documentation
for more details about parameters - but note that some of the parameters are not supported in
the xgboost() interface.
xgboost(
x,
y,
objective = NULL,
nrounds = 100L,
max_depth = NULL,
learning_rate = NULL,
min_child_weight = NULL,
min_split_loss = NULL,
reg_lambda = NULL,
weights = NULL,
verbosity = if (is.null(eval_set)) 0L else 1L,
monitor_training = verbosity > 0,
eval_set = NULL,
early_stopping_rounds = NULL,
print_every_n = 1L,
eval_metric = NULL,
nthreads = parallel::detectCores(),
seed = 0L,
base_margin = NULL,
monotone_constraints = NULL,
interaction_constraints = NULL,
reg_alpha = NULL,
max_bin = NULL,
max_leaves = NULL,
booster = NULL,
subsample = NULL,
sampling_method = NULL,
feature_weights = NULL,
colsample_bytree = NULL,
colsample_bylevel = NULL,
colsample_bynode = NULL,
tree_method = NULL,
max_delta_step = NULL,
scale_pos_weight = NULL,
updater = NULL,
grow_policy = NULL,
num_parallel_tree = NULL,
multi_strategy = NULL,
base_score = NULL,
seed_per_iteration = NULL,
device = NULL,
disable_default_eval_metric = NULL,
use_rmm = NULL,
max_cached_hist_node = NULL,
extmem_single_page = NULL,
max_cat_to_onehot = NULL,
max_cat_threshold = NULL,
sample_type = NULL,
normalize_type = NULL,
rate_drop = NULL,
one_drop = NULL,
skip_drop = NULL,
feature_selector = NULL,
top_k = NULL,
tweedie_variance_power = NULL,
huber_slope = NULL,
quantile_alpha = NULL,
aft_loss_distribution = NULL,
...
)
x |
The features / covariates. Can be passed as a numeric matrix, a data.frame, or a sparse matrix from the 'Matrix' package.
Note that categorical features are only supported when 'x' is a data.frame with factor columns. |
y |
The response variable. Allowed values are numeric vectors (regression), factor vectors (classification), logical vectors (binary classification), numeric matrices (multi-target regression), and 'Surv' objects from the 'survival' package (survival regression).
For binary classification, the last factor level of 'y' is used as the positive class. |
objective |
Optimization objective to minimize based on the supplied data, to be passed
by name as a string / character (e.g. "reg:squarederror"). If NULL (the default), it will be determined automatically from the type of 'y'.
Note that not all of the objectives offered by the core XGBoost library are supported by this interface; for unsupported objectives, such as the learning-to-rank objectives, use xgb.train() instead. |
nrounds |
Number of boosting iterations / rounds. Note that the default number of boosting rounds is not automatically tuned, and different problems will have vastly different optimal numbers of boosting rounds. |
max_depth |
(for Tree Booster) (default=6, type=int32)
Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree. range: [0, Inf) |
learning_rate |
(alias: 'eta') (for Tree Booster) (default=0.3)
Step size shrinkage used in updates to prevent overfitting. After each boosting step, the learning rate shrinks the feature weights to make the boosting process more conservative. range: [0, 1] |
min_child_weight |
(for Tree Booster) (default=1)
Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than 'min_child_weight', then the building process will give up further partitioning. The larger 'min_child_weight' is, the more conservative the algorithm will be. range: [0, Inf) |
min_split_loss |
(for Tree Booster) (default=0, alias: 'gamma') Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger 'min_split_loss' is, the more conservative the algorithm will be. range: [0, Inf) |
reg_lambda |
(alias: 'lambda')
L2 regularization term on weights. Increasing this value will make the model more conservative. The default differs by booster type: 1 for the tree booster and 0 for the linear booster. range: [0, Inf) |
weights |
Sample weights for each row in 'x' and 'y'. If not NULL, should be passed as a numeric vector with length matching the number of rows in 'x'. If NULL (the default), all rows will have the same weight. |
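As a minimal sketch of passing per-row weights (the weighting rule here is made up purely for illustration):
data(mtcars)
# Hypothetical example: give heavier cars twice the weight of the rest
w <- ifelse(mtcars$wt > 3, 2, 1)
model_weighted <- xgboost(mtcars[, -1], mtcars$mpg, weights = w, nthreads = 1, nrounds = 3)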
verbosity |
Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), and 3 (debug). |
monitor_training |
Whether to monitor objective optimization progress on the input data. Note that the same 'x' and 'y' data are used for both model fitting and evaluation. |
eval_set |
Subset of the data to use as an evaluation set. Can be passed either as a fraction of the rows in 'x' (a number greater than zero and less than one) or as a vector of row indices.
If passed, this subset of the data will be excluded from the training procedure, and the
evaluation metric(s) supplied under 'eval_metric' will be calculated on it after each boosting round. If passing a fraction, in classification problems, the evaluation set will be chosen in such a way that at least one observation of each class will be kept in the training data. For more elaborate evaluation variants (e.g. custom metrics, multiple evaluation sets, etc.),
one might want to use xgb.train() instead. |
early_stopping_rounds |
Number of boosting rounds after which training will be stopped
if there is no improvement in performance (as measured by the last metric passed under
'eval_metric') on the evaluation data from 'eval_set'. If NULL (the default), early stopping will not be used. |
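A minimal sketch combining 'eval_set' with early stopping (the fraction, metric, and round counts are arbitrary choices for illustration):
data(iris)
model_es <- xgboost(
  iris[, -5], iris$Species,
  eval_set = 0.2,               # hold out 20% of rows for evaluation
  eval_metric = "mlogloss",
  early_stopping_rounds = 5,    # stop if no improvement for 5 rounds
  nthreads = 1,
  nrounds = 100
)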
print_every_n |
When evaluation or training metrics are being printed (i.e. when passing 'eval_set' and/or 'monitor_training' = TRUE), print their values on every n-th boosting round instead of on every round. Only has an effect when passing 'verbosity' > 0. |
eval_metric |
(default according to objective)
Evaluation metric(s) to calculate on the data from 'eval_set' and/or on the training data (when 'monitor_training' = TRUE). Can be passed as a character vector with more than one metric name (e.g. c("auc", "logloss")). If NULL, a default metric is chosen according to the objective. |
nthreads |
Number of parallel threads to use. If passing zero, will use all CPU threads. |
seed |
Seed to use for random number generation. If passing NULL, a random seed will be generated through R's own random number generator. |
base_margin |
Base margin used for boosting from an existing model. If passed, the gradient boosting procedure will start from the scores that are provided here - for example, one can pass the raw scores from a previous model, or some per-observation offset, or similar. Should be either a numeric vector or numeric matrix (for multi-class and multi-target objectives)
with the same number of rows as 'x'. Note that, if it contains more than one column, the columns will not be matched by name to
the corresponding classes / targets of 'y', so the column order matters. If NULL (the default), boosting will start from the model's base score instead. |
monotone_constraints |
Optional monotonicity constraints for features. Can be passed either as a named list or named vector (when 'x' has column names), with names matching the column names of 'x', or as an unnamed vector with one entry per column of 'x'. A value of +1 imposes an increasing constraint on the corresponding feature, a value of -1 a decreasing constraint, and a value of 0 (the default) no constraint. See the tutorial Monotonic Constraints for a more detailed explanation. |
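A minimal sketch, assuming the named-vector form can be used to constrain a subset of columns; the constraint directions are chosen only for illustration:
data(mtcars)
model_mono <- xgboost(
  mtcars[, -1], mtcars$mpg,
  monotone_constraints = c(wt = -1, hp = -1),  # mpg assumed non-increasing in weight and horsepower
  nthreads = 1,
  nrounds = 5
)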
interaction_constraints |
Constraints for interaction representing permitted interactions.
The constraints must be specified in the form of a list of vectors referencing columns in the
data, e.g. list(c("var1", "var2"), c("var3", "var4", "var5")), where the features within each vector are allowed to interact with each other but not with features from other groups. See the tutorial Feature Interaction Constraints for more information. |
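A minimal sketch, assuming columns can be referenced by name here (indices may be used instead); the grouping is arbitrary and only for illustration:
data(mtcars)
model_ic <- xgboost(
  mtcars[, -1], mtcars$mpg,
  # features within each group may interact, but not across groups
  interaction_constraints = list(c("cyl", "disp", "hp"), c("wt", "qsec")),
  nthreads = 1,
  nrounds = 5
)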
reg_alpha |
(alias: 'alpha') (default=0)
L1 regularization term on weights. Increasing this value will make the model more conservative. range: [0, Inf) |
max_bin |
(for Tree Booster) (default=256, type=int32)
Maximum number of discrete bins to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time. Only used when 'tree_method' is set to "hist" or "approx". |
max_leaves |
(for Tree Booster) (default=0, type=int32)
Maximum number of nodes to be added. Not used by the "exact" tree method. |
booster |
(default= "gbtree") Which booster to use. Can be "gbtree", "gblinear" or "dart"; "gbtree" and "dart" use tree-based models while "gblinear" uses linear functions. |
subsample |
(for Tree Booster) (default=1) Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, which helps prevent overfitting. Subsampling will occur once in every boosting iteration. range: (0, 1] |
sampling_method |
(for Tree Booster) (default= "uniform")
The method used to sample the training instances. With "uniform", each training instance has an equal probability of being selected (typically set 'subsample' >= 0.5 for good results). With "gradient_based", the selection probability for each instance is proportional to the regularized absolute value of its gradient, which allows 'subsample' to be set as low as 0.1 without loss of accuracy; this method is only supported when 'tree_method' is "hist" and the device is CUDA. |
feature_weights |
Feature weights for column sampling. Can be passed either as a vector with length matching the number of columns in 'x', or as a named vector with names matching the column names of 'x'. If NULL (the default), all features will have the same weight. |
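A minimal sketch using an unnamed vector with one weight per column of 'x', combined with 'colsample_bynode' so that the weights actually influence sampling; the weight values are arbitrary:
data(mtcars)
x <- mtcars[, -1]
fw <- rep(1, ncol(x))
fw[colnames(x) %in% c("wt", "hp")] <- 5  # make these columns more likely to be sampled
model_fw <- xgboost(
  x, mtcars$mpg,
  feature_weights = fw,
  colsample_bynode = 0.5,
  nthreads = 1,
  nrounds = 5
)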
colsample_bytree, colsample_bylevel, colsample_bynode |
(for Tree Booster) (default=1) This is a family of parameters for subsampling of columns.
These parameters can be set jointly: 'colsample_bytree' is the fraction of columns sampled when constructing each tree, 'colsample_bylevel' the fraction sampled at each depth level within a tree, and 'colsample_bynode' the fraction sampled at each split. They work cumulatively and all have a range of (0, 1]. |
tree_method |
(for Tree Booster) (default= "auto") The tree construction algorithm used in XGBoost. Choices:
"auto" (same as "hist"), "exact" (exact greedy algorithm, enumerating all split candidates), "approx" (approximate greedy algorithm using quantile sketch and gradient histogram), and "hist" (faster histogram-optimized approximate algorithm). |
max_delta_step |
(for Tree Booster) (default=0) Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no constraint. If it is set to a positive value, it can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when a class is extremely imbalanced. Setting it to a value of 1-10 might help control the update. range: [0, Inf) |
scale_pos_weight |
(for Tree Booster) (default=1)
Controls the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances). |
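A minimal sketch of computing that ratio for a binary target (assuming, as described under 'y', that the last factor level is the positive class):
data(ToothGrowth)
y <- ToothGrowth$supp                      # factor with two levels
pos <- levels(y)[nlevels(y)]               # last level = positive class
spw <- sum(y != pos) / sum(y == pos)       # negatives / positives
model_spw <- xgboost(
  ToothGrowth[, c("len", "dose")], y,
  scale_pos_weight = spw,
  nthreads = 1,
  nrounds = 5
)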
updater |
(for Linear Booster) (default= "shotgun")
Choice of algorithm to fit the linear model: "shotgun" (parallel coordinate descent, which may produce a non-deterministic solution) or "coord_descent" (ordinary coordinate descent, deterministic). |
grow_policy |
(for Tree Booster) (default= "depthwise")
Controls the way new nodes are added to the tree. Supported only when 'tree_method' is "hist" or "approx". Choices: "depthwise" (split at nodes closest to the root) and "lossguide" (split at nodes with the highest loss change). |
num_parallel_tree |
(for Tree Booster) (default=1) Number of parallel trees constructed during each iteration. This option is used to support boosted random forest. |
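As a sketch of the boosted random forest use case: a forest-like model can be approximated with a single boosting round, row subsampling, and per-node column sampling. The specific values below are illustrative, not recommendations:
data(mtcars)
model_rf <- xgboost(
  mtcars[, -1], mtcars$mpg,
  nrounds = 1,               # a single boosting round...
  num_parallel_tree = 50,    # ...containing many trees grown in parallel
  subsample = 0.8,
  colsample_bynode = 0.5,
  learning_rate = 1,
  nthreads = 1
)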
multi_strategy |
(for Tree Booster) (default = "one_output_per_tree")
The strategy used for training multi-target models, including multi-target regression and multi-class classification: "one_output_per_tree" builds one model per target, while "multi_output_tree" uses multi-target trees. Version added: 2.0.0 Note: this parameter is a work in progress. |
base_score |
The initial prediction score of all instances (global bias). If NULL (the default), it is estimated automatically for selected objectives before training. For a sufficient number of boosting iterations, changing this value will not have much effect. |
seed_per_iteration |
(default= FALSE)
Whether to seed the pseudo-random number generator deterministically via the iteration number. |
device |
(default= "cpu")
Device for XGBoost to run on. Accepted values are "cpu", "cuda" (the default GPU), and "cuda:<ordinal>" (a specific GPU, e.g. "cuda:0").
For more information about GPU acceleration, see XGBoost GPU Support. In distributed environments, ordinal selection is handled by distributed frameworks instead of XGBoost; as a result, "cuda:<ordinal>" should not be used there and "cuda" should be used instead. Version added: 2.0.0 Note: if XGBoost was installed from CRAN, it won't have GPU support enabled, thus only "cpu" will be available. |
disable_default_eval_metric |
(default= FALSE) Flag to disable the default metric for the objective. Set to 1 or TRUE to disable it. |
use_rmm |
Whether to use RAPIDS Memory Manager (RMM) to allocate cache GPU
memory. The primary memory is always allocated on the RMM pool when XGBoost is built
(compiled) with the RMM plugin enabled. Valid values are TRUE and FALSE. |
max_cached_hist_node |
(for Non-Exact Tree Methods) (default = 65536)
Maximum number of cached nodes for histograms. This can be used with the "hist" and "approx" tree methods. Version added: 2.0.0 |
extmem_single_page |
(for Non-Exact Tree Methods) (default = FALSE) Version added: 3.0.0 Whether the GPU-based "hist" tree method should, when training from external-memory data, concatenate the data into a single page instead of fetching it on demand. |
max_cat_to_onehot |
(for Non-Exact Tree Methods) A threshold for deciding whether XGBoost should use one-hot encoding based splits for categorical data. When the number of categories is smaller than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Version added: 1.6.0 |
max_cat_threshold |
(for Non-Exact Tree Methods) Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Version added: 1.7.0 |
sample_type |
(for Dart Booster) (default= "uniform")
Type of sampling algorithm: "uniform" (dropped trees are selected uniformly) or "weighted" (dropped trees are selected in proportion to weight). |
normalize_type |
(for Dart Booster) (default= "tree")
Type of normalization algorithm: "tree" (new trees have the same weight as each of the dropped trees) or "forest" (new trees have the same weight as the sum of the dropped trees). |
rate_drop |
(for Dart Booster) (default=0.0) Dropout rate (the fraction of previous trees to drop during the dropout). range: [0.0, 1.0] |
one_drop |
(for Dart Booster) (default=0) When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper). |
skip_drop |
(for Dart Booster) (default=0.0) Probability of skipping the dropout procedure during a boosting iteration.
range: [0.0, 1.0] |
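A minimal sketch of a DART configuration tying the dropout parameters together; the values are chosen arbitrarily for illustration:
data(mtcars)
model_dart <- xgboost(
  mtcars[, -1], mtcars$mpg,
  booster = "dart",
  rate_drop = 0.1,   # drop ~10% of previous trees at each round
  skip_drop = 0.5,   # but skip the dropout half of the time
  one_drop = 1,      # when dropping, drop at least one tree
  nthreads = 1,
  nrounds = 10
)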
feature_selector |
(for Linear Booster) (default= "cyclic")
Feature selection and ordering method. Choices: "cyclic", "shuffle", "random", "greedy", "thrifty". |
top_k |
(for Linear Booster) (default=0)
The number of top features to select when using the "greedy" or "thrifty" feature selector. A value of 0 means using all features. |
tweedie_variance_power |
(for Tweedie Regression ("reg:tweedie")) (default=1.5)
Parameter that controls the variance of the Tweedie distribution, var(y) ~ E(y)^tweedie_variance_power. Values closer to 2 shift towards a gamma distribution, values closer to 1 towards a Poisson distribution. range: (1, 2) |
huber_slope |
(for using Pseudo-Huber loss ("reg:pseudohubererror")) (default=1.0) A parameter used by the Pseudo-Huber loss to define its delta term. |
quantile_alpha |
(for using Quantile Loss ("reg:quantileerror")) A scalar or a vector of targeted quantiles, each between zero and one. Version added: 2.0.0 |
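A minimal sketch of quantile regression with multiple target quantiles, assuming a vector of quantiles is accepted as described above:
data(mtcars)
model_quant <- xgboost(
  mtcars[, -1], mtcars$mpg,
  objective = "reg:quantileerror",
  quantile_alpha = c(0.1, 0.5, 0.9),  # predict the 10th, 50th and 90th percentiles
  nthreads = 1,
  nrounds = 5
)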
aft_loss_distribution |
(for using AFT Survival Loss ("survival:aft") and the Negative Log Likelihood of AFT metric ("aft-nloglik")) Probability density function used by the AFT loss: "normal", "logistic", or "extreme". |
... |
Not used. Some arguments that were part of this function in previous XGBoost versions are currently deprecated or have been renamed. If a deprecated or renamed argument is passed, the function will throw a warning (by default) and use its current equivalent instead. This warning will become an error if using the 'strict mode' option. If some additional argument is passed that is neither a current function argument nor a deprecated or renamed argument, a warning or error will be thrown depending on the 'strict mode' option. Important: '...' will be removed in a future version, and the current deprecation warnings will then become errors; please use only arguments that form part of the function signature. |
For package authors using 'xgboost' as a dependency, it is highly recommended to use
xgb.train() in package code instead of xgboost(), since it has a more stable interface
and performs fewer data conversions and copies along the way.
A model object, inheriting from both xgboost and xgb.Booster. Compared to the regular
xgb.Booster model class produced by xgb.train(), this xgboost class will have an
additional attribute metadata containing information which is used for formatting prediction
outputs, such as class names for classification problems.
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
data(mtcars)
# Fit a small regression model on the mtcars data
model_regression <- xgboost(mtcars[, -1], mtcars$mpg, nthreads = 1, nrounds = 3)
predict(model_regression, mtcars, validate_features = TRUE)
# Task objective is determined automatically according to the type of 'y'
data(iris)
model_classif <- xgboost(iris[, -5], iris$Species, nthreads = 1, nrounds = 5)
predict(model_classif, iris[1:10,])
predict(model_classif, iris[1:10,], type = "class")
# Can nevertheless choose a non-default objective if needed
model_poisson <- xgboost(
mtcars[, -1], mtcars$mpg,
objective = "count:poisson",
nthreads = 1,
nrounds = 3
)
# Can calculate evaluation metrics during boosting rounds
data(ToothGrowth)
xgboost(
ToothGrowth[, c("len", "dose")],
ToothGrowth$supp,
eval_metric = c("auc", "logloss"),
eval_set = 0.2,
monitor_training = TRUE,
verbosity = 1,
nthreads = 1,
nrounds = 3
)