knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rlang_trace_top_env = rlang::current_env())

library(autostats)
library(workflows)
library(dplyr)
library(tune)
library(rsample)
library(hardhat)
autostats provides convenient wrappers for modeling, visualizing, and predicting using a tidy workflow. The emphasis is on rapid iteration and quick results through an intuitive interface based on the tibble and tidy_formula.
Set up the iris data set for modeling. Create dummies and any new columns before making the formula. This way the same formula can be used throughout the modeling and prediction process.
set.seed(34)

iris %>%
  dplyr::as_tibble() %>%
  framecleaner::create_dummies(remove_first_dummy = TRUE) -> iris1

iris1 %>%
  tidy_formula(target = Petal.Length) -> petal_form

petal_form
Use the rsample package to split into train and validation sets.
iris1 %>%
  rsample::initial_split() -> iris_split

iris_split %>%
  rsample::analysis() -> iris_train

iris_split %>%
  rsample::assessment() -> iris_val

iris_split
Fit models to the training set using the formula to predict Petal.Length. Variable importance, measured by gain, can be visualized for each xgboost model.
auto_tune_xgboost returns a workflow object with tuned parameters, and requires some postprocessing to get a trained xgb.Booster object like the one returned by tidy_xgboost. Tuning iterations are kept low here just so the vignette builds quickly; the default is n_iter = 100.
iris_train %>%
  auto_tune_xgboost(formula = petal_form, n_iter = 7L, tune_method = "bayes") -> xgb_tuned_bayes

xgb_tuned_bayes %>%
  parsnip::fit(iris_train) %>%
  hardhat::extract_fit_engine() -> xgb_tuned_fit_bayes

xgb_tuned_fit_bayes %>%
  visualize_model()
xgboost can also be tuned using a grid that is created internally with dials::grid_max_entropy. The n_iter parameter is passed to grid_size. Parallelization is highly effective with this method, so the default argument parallel = TRUE is recommended.
iris_train %>%
  auto_tune_xgboost(formula = petal_form, n_iter = 5L, trees = 20L, loss_reduction = 2,
                    mtry = .5, tune_method = "grid", parallel = FALSE) -> xgb_tuned_grid

xgb_tuned_grid %>%
  parsnip::fit(iris_train) %>%
  parsnip::extract_fit_engine() -> xgb_tuned_fit_grid

xgb_tuned_fit_grid %>%
  visualize_model()
iris_train %>%
  tidy_xgboost(formula = petal_form) -> xgb_base

iris_train %>%
  tidy_xgboost(petal_form,
               trees = 250L,
               tree_depth = 3L,
               sample_size = .5,
               mtry = .5,
               min_n = 2) -> xgb_opt
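Both the base and the optimized fit can be passed to visualize_model to compare variable importance, as was done for the tuned models above (shown here as an optional check):

xgb_base %>%
  visualize_model()

xgb_opt %>%
  visualize_model()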
Predictions are iteratively added to the validation data frame. The name of each prediction column is created automatically from the model's name and the prediction target.
xgb_base %>%
  tidy_predict(newdata = iris_val, form = petal_form) -> iris_val2

xgb_opt %>%
  tidy_predict(newdata = iris_val2, petal_form) -> iris_val3

iris_val3 %>%
  names()
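For reference, a single prediction column could be checked by hand with yardstick. This is only an illustrative sketch: the setdiff() step and the assumption that the only new columns are those added by tidy_predict are not part of the package interface.

# columns added by tidy_predict (assumption: the prediction columns are the only new ones)
pred_cols <- setdiff(names(iris_val3), names(iris_val))

iris_val3 %>%
  yardstick::rmse(truth = Petal.Length, estimate = !!rlang::sym(pred_cols[1]))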
Instead of evaluating these predictions one by one, the step is automated with eval_preds. This function is specifically designed to evaluate prediction columns with the names given by tidy_predict.
iris_val3 %>% eval_preds()
tidy_shap has similar syntax to tidy_predict and can be used to get Shapley values from xgboost models on a validation set.
xgb_base %>% tidy_shap(newdata = iris_val, form = petal_form) -> shap_list
shap_list$shap_tbl        # Shapley values for each observation

shap_list$shap_summary    # per-feature summary of the Shapley values

shap_list$swarmplot       # swarm plot of Shapley values by feature

shap_list$scatterplots    # Shapley value vs. feature value scatterplots
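The Shapley table can also be summarised directly. A minimal sketch, assuming shap_list$shap_tbl holds one numeric column of Shapley values per predictor:

# rank features by mean absolute Shapley value
# (assumption: every column of shap_tbl is a numeric column of Shapley values)
shap_list$shap_tbl %>%
  dplyr::summarise(dplyr::across(dplyr::everything(), ~ mean(abs(.x)))) %>%
  tidyr::pivot_longer(dplyr::everything(), names_to = "feature", values_to = "mean_abs_shap") %>%
  dplyr::arrange(dplyr::desc(mean_abs_shap))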
Overfitting in the base configuration may be related to growing deep trees.
xgb_base %>%
  xgboost::xgb.plot.deepness()
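For comparison, the same plot can be drawn for the optimized fit (shown here as an optional check on the xgb_opt model fit above):

xgb_opt %>%
  xgboost::xgb.plot.deepness()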
Plot a single tree from the model. The small cover values in the terminal leaves suggest overfitting in the base model.
xgb_base %>% xgboost::xgb.plot.tree(model = ., trees = 1)