knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(tidyverse)
library(simplexgb)
library(janitor)

This vignette uses data from the House Prices Kaggle contest. It focuses on the simplest possible data cleaning and relies on the sensible defaults in simplexgb to construct a submission and evaluate the results on Kaggle. For a more comprehensive exploration of this data set, check out this kernel.

Get the data

If you have the Kaggle API, use kaggle competitions download -c house-prices-advanced-regression-techniques to obtain the data. See this blog for how to set up the Kaggle API, if you want to.
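
From R, the same download could look like the sketch below. This assumes the Kaggle CLI is installed and authenticated; the archive name follows Kaggle's usual convention but may differ:

# call the Kaggle CLI from R (assumes kaggle is on your PATH)
system("kaggle competitions download -c house-prices-advanced-regression-techniques")
# unpack into the directory used throughout this vignette
unzip("house-prices-advanced-regression-techniques.zip", exdir = "kaggle-house-prices")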

# drop the row identifier and standardise the column names
train <- read_csv("kaggle-house-prices/train.csv") %>% select(-Id) %>% clean_names()
test <- read_csv("kaggle-house-prices/test.csv") %>% clean_names()
test_id <- test$id # keep the test ids for the submission files
test <- test %>% select(-id)
sample_sub <- read_csv("kaggle-house-prices/sample_submission.csv")
sample_sub %>% head()
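
A quick glance at the training data shows the mix of numeric and character columns we will have to deal with:

dim(train)
glimpse(train)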

Dealing with missing values

This data set has a lot of missing values. If we removed every row that contained a missing value, we would lose a significant amount of data. If the user does not want to deal with missing values (see the detailed analysis done for this data set here), simplexgb uses a simple heuristic: for character variables, it treats "not available" as a new level, while for numeric variables it generates random draws from a normal distribution fit to the available values of that variable. Clearly, there are better ways to impute missing values that account for dependencies and correlations, but this will suffice for now.
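
To make the heuristic concrete, here is a minimal sketch of that style of imputation in plain R. This illustrates the idea only; impute_simple is a hypothetical helper, not the package's internal code:

impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.character(df[[col]])) {
      # treat missingness as its own level
      df[[col]][missing] <- "not available"
    } else if (is.numeric(df[[col]])) {
      # draw from a normal distribution fit to the observed values
      obs <- df[[col]][!missing]
      df[[col]][missing] <- rnorm(sum(missing), mean = mean(obs), sd = sd(obs))
    }
  }
  df
}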

Preparing the training structure

train_struct <- prepare_training_set(df = train, target_variable = "sale_price")
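
The exact contents of the returned structure depend on the package; a quick way to see what it holds is base R's str():

str(train_struct, max.level = 1)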

Cross validation

Guessing the hyperparameters

hyp <- guess_hyperparameters(train_structure = train_struct)
hyp %>% print()

Running cross validation

cv_results <- cross_validate_xgb(train_structure = train_struct, hyperparameters = hyp, nfold = 5)
print(cv_results$metric)

It is not obvious what that number means on its own, so let us see how we perform if we make a submission; for context, the scoring metric is sketched below.
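
This competition is scored by root mean squared error on the logarithm of the sale price (RMSLE), so errors are roughly relative rather than absolute. A small helper makes that concrete (this is the Kaggle scoring rule, not a simplexgb function):

rmsle <- function(actual, predicted) {
  sqrt(mean((log(actual) - log(predicted))^2))
}
# a model that overshoots every house by 10% scores about 0.095
rmsle(actual = c(100000, 200000), predicted = c(110000, 220000))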

Training a model

hyp <- guess_hyperparameters(train_structure = train_struct)
print(hyp)
xgbmodel <- train_model_xgb(train_structure = train_struct, hyperparameters = hyp)

Predicting on the test set

pred_df <- get_predictions_xgb(xgbmodel, test_df = test)
pred_df %>% head()

Constructing the submission

submission <- sample_sub
submission$Id <- test_id
# attach the model's predictions; the column name in pred_df is assumed here, check names(pred_df)
submission$SalePrice <- pred_df$prediction
submission %>% write_csv("kaggle-house-prices/submission_basic.csv")
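
If you set up the Kaggle CLI earlier, you can submit straight from R (the flags are standard kaggle CLI usage):

system('kaggle competitions submit -c house-prices-advanced-regression-techniques -f kaggle-house-prices/submission_basic.csv -m "simplexgb defaults"')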

We see that while simplexgb will get you from start to submission in no time at all, there is no shortcut for understanding the data, careful cleaning, and feature engineering. See this kernel for another example.

Linear model

model_struct <- train_linear_model(train_structure = train_struct, model_structure = xgbmodel, hyperparameters = hyp)
# abs() guards against the linear model predicting negative prices
pred_df <- get_predictions_linear(model_struct, test_df = test) %>% abs()
pred_df %>% head()
submission <- sample_sub
submission$Id <- test_id
# as before, the prediction column name is assumed
submission$SalePrice <- pred_df$prediction
submission %>% write_csv("kaggle-house-prices/submission_linear.csv")

Random forest model

model_struct <- train_rf_model(train_structure = train_struct, model_structure = model_struct, hyperparameters = hyp)
pred_rf <- get_predictions_rf(model_structure = model_struct, test_df = test)
pred_rf %>% head()
submission <- sample_sub
submission$Id <- test_id
# as before, the prediction column name is assumed
submission$SalePrice <- pred_rf$prediction
submission %>% write_csv("kaggle-house-prices/submission_rf.csv")

Averaging the models

lin_pred <- read_csv("kaggle-house-prices/submission_linear.csv")
xgb_pred <- read_csv("kaggle-house-prices/submission_basic.csv")
rf_pred <- read_csv("kaggle-house-prices/submission_rf.csv")
# average the three models; abs() keeps predicted prices positive
pred <- tibble(Id = lin_pred$Id,
               SalePrice = (abs(lin_pred$SalePrice) + abs(xgb_pred$SalePrice) + abs(rf_pred$SalePrice)) / 3)
pred %>% write_csv("kaggle-house-prices/submission_average.csv")
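
A plain average weights the three models equally. If cross-validation suggests one model is stronger, a weighted average is a natural refinement; the weights below are illustrative, not tuned:

# illustrative weights that sum to 1; tune against cross-validation results
w <- c(xgb = 0.5, lin = 0.25, rf = 0.25)
pred_weighted <- tibble(
  Id = lin_pred$Id,
  SalePrice = w[["xgb"]] * abs(xgb_pred$SalePrice) +
    w[["lin"]] * abs(lin_pred$SalePrice) +
    w[["rf"]] * abs(rf_pred$SalePrice)
)
pred_weighted %>% write_csv("kaggle-house-prices/submission_weighted.csv")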

