```r
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file()) # Set WD to project root
here::i_am("lab_mod/ch4-logreg-default.Rmd")
library(here)
```
My motivation and the code I use here are adapted from Dr. Julia Silge's blog post on bird baths. I've applied a similar modeling process to the `Default` data set from the {ISLR} package.
The goal is to build a logistic regression model to predict default status.
```r
library(tidyverse)
library(ISLR)
theme_set(theme_bw())
```
Let's take a look at the Default data set.
It has 2 numeric variables, balance and income, and 2 factor variables, default and student.
```r
summary(Default)
```
A more detailed summary can be obtained with skimr::skim():
```r
skimr::skim(Default)
```
Please note that there are no missing values, which is great!
The goal here is to build a logistic regression model to predict default.
I will explore the relationship between default and the other variables (income, balance, student). Let's make some plots!
```r
Default %>%
  ggplot(aes(balance, income, color = default)) +
  geom_point(alpha = 0.4) +
  scale_color_brewer(palette = "Set1", direction = -1) +
  labs(title = "Default status by income and balance")
```
By visual inspection, balance looks like a better predictor of default than income.
```r
p1 <- Default %>%
  ggplot(aes(balance, default, fill = student)) +
  geom_boxplot(alpha = 0.8) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Distribution of default",
       subtitle = "by balance and student status",
       caption = "Data from ISLR package")

p1
```
This plot suggests that balance and student could be decent predictors of default.
Goal: Outcome variable = default (factor)
First, I will use the glm() function to build a logistic regression model using all predictors.
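For reference, this restates the standard textbook setup (nothing new to this analysis): logistic regression models the log-odds of default as a linear function of the predictors,

$$\log\frac{p(\text{default})}{1 - p(\text{default})} = \beta_0 + \beta_1 \cdot \text{balance} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{student}$$

where student enters as a 0-1 dummy variable.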
```r
glm_fit_all <- glm(default ~ ., data = Default, family = binomial)
summary(glm_fit_all)
```
```r
p.value_income <- broom::tidy(glm_fit_all)$p.value[4]
```
The coefficient of income is not significant (p = `r p.value_income`). Let's try removing income.
```r
glm_fit_st_bal <- glm(default ~ student + balance, data = Default, family = binomial)
glm_fit_st_bal
```
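As a small aside (a sketch, not part of the original workflow): the coefficients are easier to interpret after exponentiating them into odds ratios, which broom::tidy() can do via its exponentiate argument.

```r
# Exponentiated coefficients = odds ratios, e.g. the multiplicative
# change in the odds of default per one-unit increase in balance,
# holding student status fixed.
broom::tidy(glm_fit_st_bal, exponentiate = TRUE, conf.int = TRUE)
```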
The model that includes only student and balance has lower estimates of test error (i.e., AIC & BIC).
```r
list("All Predictors" = glm_fit_all,
     "student + balance" = glm_fit_st_bal) %>%
  map_dfr(broom::glance, .id = "Model") %>%
  select(Model, AIC, BIC)
```
Now, let's apply the full modeling process with tidymodels.
```r
library(tidymodels)
```
First, we need to spend our data budget in 2 portions: training and testing data.
```r
set.seed(123)

# Split into training and test sets
Default_split <- initial_split(Default, strata = default)
Default_train <- training(Default_split) # 75% goes to training data
Default_test <- testing(Default_split)   # 25% goes to test data

Default_split
```
I will use 10-Fold Cross-Validation to resample the training data.
```r
set.seed(234)
Default_folds <- vfold_cv(Default_train, v = 10, strata = default)
Default_folds
```
Note that I use strata = default because the frequencies of the two classes are quite different.
```r
Default %>% count(default)
```
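To see the imbalance as proportions rather than raw counts, a quick sketch with standard dplyr verbs:

```r
# Class proportions: defaulters are a small minority,
# which is why stratified splitting/resampling matters here.
Default %>%
  count(default) %>%
  mutate(prop = n / sum(n))
```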
For each fold, 6750 rows will be spent on fitting the model (the analysis set) and 750 on assessing model performance (the assessment set):
```r
Default_folds$splits[[1]]
```
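These counts can be verified directly with rsample's accessor functions analysis() and assessment(); the expected sizes below assume the 75/25 initial split described earlier.

```r
# Each split holds an analysis set (for fitting) and an
# assessment set (for performance estimation).
first_fold <- Default_folds$splits[[1]]
nrow(analysis(first_fold))   # 9/10 of the training rows
nrow(assessment(first_fold)) # the held-out 1/10
```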
```r
glm_spec <- logistic_reg()
glm_spec
```
```r
rec_basic <- recipe(default ~ ., data = Default) %>%
  step_dummy(all_nominal_predictors())

summary(rec_basic)
```
Here is a preview of the engineered training data:

```r
rec_basic %>%
  prep(log_changes = TRUE) %>%
  bake(new_data = Default_train)
```

You can see that student was replaced by student_Yes with 0-1 encoding. What does this mean?
From the contrasts() function we can see the dummy encoding of the student factor: No = 0, Yes = 1.
```r
contrasts(Default$student)
```
workflow() bundles the feature engineering blueprint and the model specification together.
```r
wf_basic <- workflow(rec_basic, glm_spec)
wf_basic
```
```r
doParallel::registerDoParallel()

ctrl_preds <- control_resamples(save_pred = TRUE)

## Fit the workflow to each resample
rs_basic <- fit_resamples(wf_basic, resamples = Default_folds, control = ctrl_preds)

head(rs_basic)
```
The ROC curve bulges nicely toward the upper left.
```r
augment(rs_basic) %>%
  roc_curve(truth = default, .pred_No) %>%
  autoplot()
```
This corresponds to an AUC (area under the ROC curve) and accuracy close to 1.
```r
collect_metrics(rs_basic)
```
As we saw at the beginning, removing income from the predictors results in better estimates of test error by AIC and BIC.
Now, I will remove income from the recipes:
```r
rec_simple <- rec_basic %>%
  remove_role(income, old_role = "predictor")

summary(rec_simple)
```
And I will update the workflow:
```r
wf_simple <- workflow(rec_simple, glm_spec)
```
The rest is the same, so I wrote a simple wrapper function to do it.
```r
# Wrapper: fit a workflow across the CV folds, saving predictions
fit_wf_resamples <- function(wf) {
  ctrl_preds <- control_resamples(save_pred = TRUE)
  fit_resamples(wf, resamples = Default_folds, control = ctrl_preds)
}

rs_simple <- fit_wf_resamples(wf_simple)
```
```r
rs_ls <- list("All Predictors" = rs_basic, "student + balance" = rs_simple)
```
```r
roc_df <- rs_ls %>%
  map_dfr(~ augment(.x) %>% roc_curve(truth = default, .pred_No),
          .id = "Predictors")
```
This plot compares the ROC curves of the 2 logistic regression models (all predictors vs. student + balance).

```r
roc_df %>%
  ggplot(aes(1 - specificity, sensitivity, color = Predictors)) +
  geom_line(size = 1, alpha = 0.6, linetype = "dashed") +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted")
```
```r
rs_ls %>%
  map_dfr(collect_metrics, .id = "Features")
```
You can see that the simpler model results in a similar (or maybe slightly improved) AUC, so it's reasonable to prefer it over the more complicated model.
In this section, I will evaluate the simpler model with 2 predictors (student, balance).
First, fit the model to the training data.
```r
Default_fit <- fit(wf_simple, Default_train)
Default_fit
```
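If you want the underlying glm object back (for summary(), coefficients, etc.), {workflows} provides extract_fit_engine() for exactly this; a short sketch:

```r
# Pull the raw glm fit out of the fitted workflow
glm_final <- extract_fit_engine(Default_fit)
summary(glm_final)
```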
Then, use the test data to predict default.
```r
Default_pred <- augment(Default_fit, Default_test) %>%
  bind_cols(predict(Default_fit, Default_test, type = "conf_int"))
```
```r
library(latex2exp)

p2 <- Default_pred %>%
  ggplot(aes(balance, .pred_Yes, fill = student)) +
  geom_line(aes(color = student)) +
  geom_ribbon(aes(ymin = .pred_lower_Yes, ymax = .pred_upper_Yes), alpha = 0.3) +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  labs(y = TeX("$\\hat{p}(default)$"),
       title = "Predicted probability of default",
       subtitle = "by logistic regression with 2 predictors",
       caption = "Shaded area indicates 95% CI of the estimated probability")

p2
```
This plot shows the estimated probability of default given the predictors student and balance, using the logistic regression model fitted on the training data. The predictions were made by plugging the test data into the model.
```r
library(patchwork)
p1 + p2
```
The left panel shows default status by balance and student status as observed in the Default data set. The right panel shows the predicted probability of default from the multivariable logistic regression model (default on balance and student) fitted to the training data.
If you have any comments or notice any mistakes, please feel free to share them with me in the comment section.
Thanks a lot for reading.
```r
devtools::session_info()
```