knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Categorical prediction

The first thing we will do is load the titanic data. Note that there is a "target" variable. The package expects the training and testing data.frames to always contain a variable called "target", which is the variable we want to predict.

library(modelpipe)
library(knitr)
library(tidyverse)
data(titanic)
kable(titanic[1:5, ], caption = "Titanic data set")

Let's remove passenger_id, as it isn't useful for prediction.

titanic <- titanic %>%
  select(-passenger_id)

kable(titanic[1:5, ], caption = "Titanic data set")

Now we'll call three functions to prep the data and build the model. First, we will prep the data using prep_bin. This is simply a wrapper for the awesome vtreat package. The wrapper creates a cross-frame experiment on the training data, applies the resulting treatment plan to the test data, and cleans up the variable names.

titanic <- split_data(titanic, 0.8)
titanic_treated <- prep_bin(titanic$df_train,
                            titanic$df_test,
                            outcome_target = 1)
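For the curious, a rough sketch of what prep_bin presumably does under the hood, using vtreat directly (the function names below are vtreat's own; the exact arguments modelpipe passes are an assumption):

```r
library(vtreat)

vars <- setdiff(names(titanic$df_train), "target")

# Build a cross-frame experiment on the training data; the cross frame
# avoids the nested-model bias that naive target encoding introduces.
cfe <- mkCrossFrameCExperiment(titanic$df_train, vars,
                               outcomename   = "target",
                               outcometarget = 1)
df_train_treated <- cfe$crossFrame

# Apply the same treatment plan to the held-out test data.
df_test_treated <- prepare(cfe$treatments, titanic$df_test)
```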

Our prepped data set has converted categorical variables to numeric variables using target encoding.

kable(titanic_treated$df_train[1:5, ], caption = "Target encoded training set")
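To see what target encoding means, here is the idea by hand on a toy column in base R (vtreat's actual encoding also uses cross-validation and smoothing, so the treated values above will differ slightly):

```r
sex    <- c("male", "male", "female", "female", "female")
target <- c(0, 0, 1, 1, 0)

# Replace each categorical level with the mean of the target for that level.
level_means <- tapply(target, sex, mean)
sex_encoded <- unname(level_means[sex])
sex_encoded  # males encode to 0, females to 2/3
```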

Next we can take the data prepared by prep_bin and use it in our xgb_bin function. xgb_bin is a wrapper for xgboost that automatically performs hyperparameter tuning on an xgboost model using random search. The xgboost model is built with the lossguide growth policy, making it behave similarly to lightgbm.

mdl <- xgb_bin(df_train = titanic_treated$df_train,
               df_test  = titanic_treated$df_test,
               verbose  = FALSE)
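As a point of reference, a single xgboost fit with the lossguide growth policy looks roughly like the sketch below; the parameter values shown are illustrative, not the ones xgb_bin's random search would select:

```r
library(xgboost)

X <- as.matrix(titanic_treated$df_train[, setdiff(names(titanic_treated$df_train), "target")])
y <- titanic_treated$df_train$target

params <- list(
  objective   = "binary:logistic",
  tree_method = "hist",       # histogram method, required for lossguide
  grow_policy = "lossguide",  # grow leaf-wise, as lightgbm does
  max_depth   = 0,            # depth unrestricted under lossguide
  max_leaves  = 31,           # hypothetical value; normally tuned
  eta         = 0.1
)

fit <- xgboost(data = X, label = y, params = params,
               nrounds = 100, verbose = 0)
```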

If interested, we can take a look at the model hyperparameters selected and their impact on prediction accuracy (eval). Additionally, we could expand the number of hyperparameter combinations tried by specifying a larger tune_rounds in the xgb_bin function.

kable(mdl$params_tested, caption = "Hyperparameter combinations tested")

Finally, we can evaluate the model accuracy on test data.

print(paste0("Model AUC on test data is: ", round(mdl$roc_auc, 3)))
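As a sanity check on numbers like this, AUC has a simple interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. It can be computed by hand in base R (a small helper of our own, not a modelpipe function):

```r
auc <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # Mean over all positive/negative pairs; ties count as half.
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc(c(0.9, 0.8, 0.4, 0.3), c(1, 0, 1, 0))  # 0.75: 3 of 4 pairs ordered correctly
```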

And that's it. We treated some messy data and trained a fairly good model in three function calls.



prescient/modelpipe documentation built on Dec. 25, 2019, 3:20 a.m.