knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(tidyverse)
library(simplexgb)
library(janitor)
This vignette uses data from the house prices Kaggle contest. It focuses on the simplest possible data cleaning and relies on the sensible defaults in simplexgb
to construct a submission and evaluate the results on Kaggle. For a more comprehensive exploration of this data set, check this kernel.
Get the data
If you have the Kaggle API, use kaggle competitions download -c house-prices-advanced-regression-techniques
to obtain the data. See this blog for how to set up the Kaggle API, if you want to.
train <- read_csv("kaggle-house-prices/train.csv") %>%
  select(-Id) %>%
  clean_names()

test <- read_csv("kaggle-house-prices/test.csv") %>%
  clean_names()
test_id <- test$id
test <- test %>% select(-id)

sample_sub <- read_csv("kaggle-house-prices/sample_submission.csv")
sample_sub %>% head()
This data set has a lot of missing values. If we removed every row that has a missing value, we would lose a significant amount of data. For users who do not want to deal with missing values themselves (see the detailed analysis done for this data set here), simplexgb
uses a simple heuristic. For character variables, it adds "not available" as a new level, while for numeric variables it generates random numbers from a normal distribution fitted to the available values of that variable. There are clearly better imputation methods that account for dependencies and correlations, but this will suffice for now.
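The heuristic above can be sketched in base R. This is illustrative only; `impute_simple` is a hypothetical helper, not simplexgb's actual implementation:

```r
# A minimal sketch of the imputation heuristic described above
# (assumed behaviour; simplexgb's internals may differ):
impute_simple <- function(x) {
  if (is.character(x)) {
    # missing characters become a new level
    x[is.na(x)] <- "not available"
  } else if (is.numeric(x)) {
    # missing numerics are drawn from a normal fitted to the observed values
    obs <- x[!is.na(x)]
    x[is.na(x)] <- rnorm(sum(is.na(x)), mean = mean(obs), sd = sd(obs))
  }
  x
}

impute_simple(c("a", NA, "b"))
impute_simple(c(1, NA, 3, NA, 5))
```

Running this column by column over a data frame reproduces the behaviour described above, at the cost of ignoring any dependencies between variables.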
train_struct <- prepare_training_set(df = train, target_variable = "sale_price")
Guessing the hyperparameters
hyp <- guess_hyperparameters(train_structure = train_struct)
hyp %>% print()
Cross validation
cv_results <- cross_validate_xgb(
  train_structure = train_struct,
  hyperparameters = hyp,
  nfold = 5
)
print(cv_results$metric)
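For context, this competition scores submissions by root mean squared error on the logarithm of the sale price, so a held-out score can be sanity-checked with a small helper. The function name below is our own, not part of simplexgb:

```r
# RMSE on log-prices, the metric used by the house prices competition
# (hypothetical helper for sanity-checking predictions):
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(actual) - log(predicted))^2))
}

# identical predictions score 0; errors are relative, not absolute
rmse_log(c(200000, 150000), c(200000, 150000))
rmse_log(c(200000, 150000), c(210000, 140000))
</test>
```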
The cross-validation metric is hard to interpret on its own, so let us see how we perform if we make a submission.
Training a model
hyp <- guess_hyperparameters(train_structure = train_struct)
print(hyp)
xgbmodel <- train_model_xgb(train_structure = train_struct, hyperparameters = hyp)
Predicting on the test set
pred_df <- get_predictions_xgb(xgbmodel, test_df = test)
pred_df %>% head()
Constructing the submission
submission <- sample_sub
submission$Id <- test_id
# copy the predictions into the submission (prediction column name assumed)
submission$SalePrice <- pred_df$sale_price
submission %>% write_csv("kaggle-house-prices/submission_basic.csv")
We see that while simplexgb
will get you from start to submission in no time at all, there is no shortcut to understanding the data, careful cleaning, and feature engineering. See this kernel for another example.
model_struct <- train_linear_model(
  train_structure = train_struct,
  model_structure = xgbmodel,
  hyperparameters = hyp
)
pred_df <- get_predictions_linear(model_struct, test_df = test) %>% abs()
pred_df %>% head()

submission <- sample_sub
submission$Id <- test_id
# copy the predictions into the submission (prediction column name assumed)
submission$SalePrice <- pred_df$sale_price
submission %>% write_csv("kaggle-house-prices/submission_linear.csv")
model_struct <- train_rf_model(
  train_structure = train_struct,
  model_structure = model_struct,
  hyperparameters = hyp
)
pred_rf <- get_predictions_rf(model_structure = model_struct, test_df = test)
pred_rf %>% head()

submission <- sample_sub
submission$Id <- test_id
# copy the predictions into the submission (prediction column name assumed)
submission$SalePrice <- pred_rf$sale_price
submission %>% write_csv("kaggle-house-prices/submission_rf.csv")
Averaging the three submissions
lin_pred <- read_csv("kaggle-house-prices/submission_linear.csv")
xgb_pred <- read_csv("kaggle-house-prices/submission_basic.csv")
rf_pred <- read_csv("kaggle-house-prices/submission_rf.csv")

pred <- tibble(
  Id = lin_pred$Id,
  # average the three predictions (dividing by 3 was missing in the original)
  SalePrice = (abs(lin_pred$SalePrice) + abs(xgb_pred$SalePrice) + abs(rf_pred$SalePrice)) / 3
)
pred %>% write_csv("kaggle-house-prices/submission_average.csv")
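The equal-weight averaging above generalises to any number of prediction vectors; a small base-R sketch (our own helper, not a simplexgb function):

```r
# Equal-weight ensembling of several prediction vectors (illustrative sketch):
ensemble_mean <- function(...) {
  preds <- list(...)
  Reduce(`+`, preds) / length(preds)
}

ensemble_mean(c(100, 200), c(110, 190), c(90, 210))
```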