knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The first thing we will do is load the canonical data science data set.
set.seed(1234)
library(modelpipe)
library(knitr)
library(tidyverse)
library(ranger)
library(yardstick)
data(iris)
kable(iris[1:5, ], caption = "Iris")
Now we'll break out our target variable and introduce some missing values.
target <- iris$Sepal.Length
iris$Sepal.Length <- NULL
# map will convert to an integer
iris$Species <- as.character(iris$Species)
iris[] <- map(iris, function(x) ifelse(runif(length(x)) < 0.05, NA, x))
iris$target <- target
kable(iris[1:5, ], caption = "Iris")
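To confirm that roughly 5% of the values were blanked out, a per-column count of missing values is a quick check. The following is a minimal sketch in base R on a fresh copy of iris; `df` and `na_share` are illustrative names and not part of modelpipe.

```r
# Illustrative only: inject ~5% missing values, then measure them per column.
set.seed(1234)
df <- datasets::iris
df$Species <- as.character(df$Species)  # keep labels; a factor would become integer codes

# Replace each cell with NA with probability 0.05
df[] <- lapply(df, function(x) ifelse(runif(length(x)) < 0.05, NA, x))

# Share of missing values in each column
na_share <- colMeans(is.na(df))
print(round(na_share, 3))
```

Each column should show a share near 0.05, with some variation because the draws are random.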
This looks good. So now we split the data into train and test.
iris <- split_data(iris, perc_train = 0.80)
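For readers curious what a split like this involves, here is a minimal base R sketch. `split_sketch` is a hypothetical stand-in written for illustration; it is not modelpipe's implementation, and the real `split_data` may sample differently. It only mirrors the `perc_train` argument and the `df_train` / `df_test` list elements used in this vignette.

```r
# Hypothetical stand-in for a train/test splitter; not modelpipe's code.
split_sketch <- function(df, perc_train = 0.80) {
  n_train <- floor(nrow(df) * perc_train)
  train_idx <- sample(seq_len(nrow(df)), size = n_train)  # random row indices
  list(
    df_train = df[train_idx, , drop = FALSE],
    df_test  = df[-train_idx, , drop = FALSE]
  )
}

set.seed(1234)
parts <- split_sketch(datasets::iris, perc_train = 0.80)
nrow(parts$df_train)
nrow(parts$df_test)
```

Returning both pieces in one list keeps the train and test frames together, which is why the later steps can refer to them with `$df_train` and `$df_test`.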
Now we can process it with our data prep step, which is a light wrapper for vtreat.
iris_prepped <- prep_numeric(iris$df_train, iris$df_test)
You'll note that in the output data.frame the missing values have been treated and the categorical variables have been target encoded.
kable(iris_prepped$df_train[1:5, ], caption = "Iris Prepped")
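To make the two ideas concrete, the sketch below applies a deliberately simplified version of each to a small made-up data frame: mean imputation with a missingness flag for a numeric column, and per-level target means for a categorical column. This is an assumption-laden illustration in base R; vtreat's actual treatment is more careful (for example, about overfitting), and the column names here are invented.

```r
# Simplified illustration; not vtreat's algorithm. Data are made up.
train <- data.frame(
  width   = c(3.5, NA, 3.2, 3.1, 3.6, 3.0),
  species = c("setosa", "setosa", NA, "virginica", "virginica", "virginica"),
  target  = c(5.1, 4.9, 4.7, 6.3, 5.8, 7.1),
  stringsAsFactors = FALSE
)

# 1. Numeric column: flag missing values, then fill them with the training mean
width_mean <- mean(train$width, na.rm = TRUE)
train$width_missing <- as.integer(is.na(train$width))
train$width[is.na(train$width)] <- width_mean

# 2. Categorical column: treat NA as its own level, then replace each level
#    with the mean target observed for that level in the training data
lvl <- ifelse(is.na(train$species), "(missing)", train$species)
level_means <- tapply(train$target, lvl, mean)
train$species_encoded <- as.numeric(level_means[lvl])
train$species <- NULL

print(train)
```

The same `width_mean` and `level_means` learned on the training rows would then be applied to the test rows, so that no information from the test set leaks into the encoding.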
Now we can call our xgboost modeling function. Note that the function looks for a variable named "target" as our outcome.
mdl <- xgb_reg(iris_prepped$df_train, iris_prepped$df_test, tune_rounds = 5L, early_stopping_rounds = 10L)
print(paste0("Test RMSE is: ", round(mdl$test_rmse, 3)))
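As a reminder of what the reported number means, RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it by hand on made-up vectors; `obs` and `pred` are illustrative, and how `xgb_reg` computes `test_rmse` internally is not shown in this vignette.

```r
# Toy observed and predicted values (made up for illustration)
obs  <- c(5.1, 4.9, 6.3, 5.8, 7.1)
pred <- c(5.0, 5.2, 6.0, 6.1, 6.8)

rmse <- sqrt(mean((obs - pred)^2))  # root mean squared error
print(paste0("Test RMSE is: ", round(rmse, 3)))
```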
Most of the problems I model take way too long to rerun repeatedly just to get a sense of how much the RMSE or other typical regression metrics vary, so users are provided with the ability to plot bootstrap replicates of the model error metrics.
hist(mdl$boot_metrics$rmse, main = "RMSE")
hist(mdl$boot_metrics$mae, main = "MAE")
hist(mdl$boot_metrics$rsq, main = "RSquare")
hist(mdl$boot_metrics$spearman_cor, main = "Spearman Cor.")
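The idea behind bootstrap replicates of a metric can be sketched without refitting anything: resample the test rows with replacement and recompute the metric on each resample. The code below is a generic base R illustration on simulated predictions, under the assumption that the replicates are built from a fixed set of predictions; it is not necessarily how modelpipe fills `boot_metrics`.

```r
set.seed(1234)

# Simulated observed values and predictions (stand-ins for real model output)
obs  <- rnorm(200, mean = 5.8, sd = 0.8)
pred <- obs + rnorm(200, sd = 0.3)

rmse <- function(o, p) sqrt(mean((o - p)^2))

# Resample rows with replacement and recompute the metric each time
n_boot <- 500
boot_rmse <- replicate(n_boot, {
  idx <- sample(seq_along(obs), replace = TRUE)
  rmse(obs[idx], pred[idx])
})

# Spread of the metric across replicates
print(quantile(boot_rmse, c(0.025, 0.5, 0.975)))
hist(boot_rmse, main = "RMSE")
```

Because only the metric is recomputed, this costs a fraction of what refitting the model would, which is the appeal when a single fit is slow.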
And that's it. We've cleaned the data, tuned the xgboost hyperparameters, and tested the model in a few lines of code.