In prescient/modelpipe: Fast and opinionated model development pipelines.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Numeric prediction

First we load the packages we need and the canonical data science data set, iris.

set.seed(1234)
library(modelpipe)
library(knitr)
library(tidyverse)
library(ranger)
library(yardstick)
data(iris)
kable(iris[1:5, ], caption = "Iris")

Now we'll break out our target variable and introduce some missing values, blanking roughly 5% of the entries in each predictor column.

target <- iris$Sepal.Length
iris$Sepal.Length <- NULL
iris$Species <- as.character(iris$Species) # otherwise ifelse() below would return the factor's integer codes
iris[] <- map(iris, function(x) ifelse(runif(length(x)) < 0.05, NA, x))
iris$target <- target
kable(iris[1:5, ], caption = "Iris")
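
Before moving on, it can help to confirm that the missingness landed where we expect. The following is a minimal sketch in base R that repeats the same masking step on a fresh copy of the data (the name `iris_na` is only for illustration) and reports the share of missing values per column.

```r
set.seed(1234)

# Work on a fresh copy so the vignette's own `iris` object is untouched.
iris_na <- datasets::iris
iris_na$Species <- as.character(iris_na$Species)

# Blank each entry with probability 0.05, column by column.
iris_na[] <- lapply(iris_na, function(x) ifelse(runif(length(x)) < 0.05, NA, x))

# Share of missing values per column; each should sit near 0.05.
na_share <- colMeans(is.na(iris_na))
print(round(na_share, 3))
```

With only 150 rows the observed shares will wander around 5% rather than hit it exactly.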

This looks good, so now we split the data into training and test sets, with 80% of the rows going to training.

iris <- split_data(iris, perc_train = 0.80)
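
The internals of `split_data` are not shown here, but the idea of a random 80/20 split can be sketched in a few lines of base R. This is an illustration under the assumption of a simple random row sample, not the package's implementation; `split_demo` is a hypothetical name, with element names chosen to mirror `df_train` and `df_test` above.

```r
set.seed(1234)

df <- datasets::iris
n <- nrow(df)

# Draw 80% of the row indices at random for training; the rest form the test set.
train_idx <- sample(n, size = floor(0.80 * n))
split_demo <- list(
  df_train = df[train_idx, ],
  df_test  = df[-train_idx, ]
)

# Row counts of the two pieces.
print(vapply(split_demo, nrow, integer(1)))
```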

Now we can process it with our data prep step, which is a light wrapper around vtreat.

iris_prepped <- prep_numeric(iris$df_train, iris$df_test)

Note that in the output data.frame the missing values have been treated and the categorical variables have been target-encoded.

kable(iris_prepped$df_train[1:5, ], caption = "Iris Prepped")
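
To make the two ideas concrete, here is a deliberately simplified base R sketch of missing-value treatment and target encoding on a toy data frame. It is not what vtreat does internally (vtreat's treatments are more careful), and all column names below are made up for the illustration.

```r
# Toy training data with missing values in a numeric and a categorical column.
train <- data.frame(
  width   = c(3.5, NA, 3.2, 3.1, NA, 3.9),
  species = c("setosa", "setosa", "virginica", NA, "virginica", "setosa"),
  target  = c(5.1, 4.9, 6.3, 5.8, 6.5, 5.4),
  stringsAsFactors = FALSE
)

# Numeric column: flag the missing rows, then fill them with the training mean.
train$width_missing <- as.integer(is.na(train$width))
train$width[is.na(train$width)] <- mean(train$width, na.rm = TRUE)

# Categorical column: treat NA as its own level, then replace each level
# with the mean target observed for that level in the training data.
train$species[is.na(train$species)] <- "missing"
level_means <- tapply(train$target, train$species, mean)
train$species_enc <- as.numeric(level_means[train$species])

print(train)
```

In practice the fill value and the per-level means are learned on the training set only and then applied unchanged to the test set, which is why the prep function takes both data frames.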

Now we can call our xgboost modeling function. Note that the function looks for a variable named "target" as our outcome.

mdl <- xgb_reg(iris_prepped$df_train,
               iris_prepped$df_test,
               tune_rounds = 5L,
               early_stopping_rounds = 10L)
print(paste0("Test RMSE is: ", round(mdl$test_rmse, 3)))

Most of the problems I model take way too long to refit repeatedly, which makes it hard to get a sense of how much the RMSE or other typical regression metrics vary. Users are therefore given the ability to plot bootstrap replicates of the model error metrics.

hist(mdl$boot_metrics$rmse, main = "RMSE")
hist(mdl$boot_metrics$mae, main = "MAE")
hist(mdl$boot_metrics$rsq, main = "RSquare")
hist(mdl$boot_metrics$spearman_cor, main = "Spearman Cor.")
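
How the package produces its replicates is not spelled out here. One common approach, sketched below in base R, is to resample the held-out rows with replacement and recompute each metric on every resample, which avoids refitting the model. The `truth` and `pred` vectors are simulated stand-ins for a test set and a model's predictions, and the helper functions are defined only for this example.

```r
set.seed(1234)

# Hypothetical held-out outcomes and predictions standing in for a fitted model.
truth <- rnorm(100, mean = 5.8, sd = 0.8)
pred  <- truth + rnorm(100, sd = 0.4)

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
mae  <- function(y, yhat) mean(abs(y - yhat))

n_boot <- 500
boot_rmse <- numeric(n_boot)
boot_mae  <- numeric(n_boot)

for (b in seq_len(n_boot)) {
  # Resample test rows with replacement and score the same predictions.
  idx <- sample(length(truth), replace = TRUE)
  boot_rmse[b] <- rmse(truth[idx], pred[idx])
  boot_mae[b]  <- mae(truth[idx], pred[idx])
}

# A rough 95% percentile interval for the test RMSE.
print(quantile(boot_rmse, c(0.025, 0.975)))
```

Passing `boot_rmse` or `boot_mae` to `hist()` gives the same kind of picture as the plots above.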