In CalvinData/ds202calvin:

# this is taken from https://cs.calvin.edu/courses/data/202/21fa/ex/ex07/ex07-modeling-inst.html
library(learnr) # remotes::install_github("rstudio/learnr", build_vignettes = TRUE)
knitr::opts_chunk$set(echo = TRUE, comment = "")
#tutorial_options(exercise.reveal_solution = TRUE)

# https://bookdown.org/yihui/rmarkdown-cookbook/reuse-chunks.html#embed-chunk
<<user-facing-setup>>

library(tidyverse)
library(tidymodels)
theme_set(theme_bw())
options(scipen = 5) # encourage metrics to print in fixed-point notation
options(dplyr.summarise.inform = FALSE) # silence a warning message


<<load-and-subset-data>>
<<train-test-split>>
<<fit-linreg-ames>>
<<make-train-predictions>>

Getting started

The goal of this exercise is to start getting familiar with modeling. We'll try the same regression task that we saw in the slides, and we'll use a linear regression model just like we did there.

Topics:

Linear regression for prediction
Compute metrics for regression

If you want to read more about tidymodels, you can read some of the tidymodels docs or Tidy Modeling with R. But those get deep into the details quickly; stick with us here and you'll be okay.

Data

We'll be using the Ames home sales dataset that we saw in class. If you're curious, you can look at the Data dictionary that the author provided. In the original paper, the author suggests working with a subset of the data. So let's do that:

# Get the data from the "modeldata" package, which comes with tidymodels.
ames <- read_builtin("ames", package = "modeldata") %>% 
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>% 
  mutate(Sale_Price = Sale_Price / 1000)

Exploratory analysis

Good analysis starts by exploring the data. One compelling way to explore data is by making plots. Let's make one together.

How many homes are we working with, after the above filtering? Remember that the dataframe is stored in a variable called ames, and we want the number of rows.

nrow(ames)

Make a plot of how the sale price (Sale_Price) related to the number of square feet of above-grade living area (Gr_Liv_Area) for different building types (Bldg_Type).

Your graph might look something like this:

<<eda-living-area-solution>>

ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(size = .1, alpha = .25) +
  facet_wrap(vars(Bldg_Type)) +
  labs(x = "above-grade living area (ft^2)", y = "Sale Price ($1k)")

Modeling

We will now go through the basic steps of making a predictive model. We will add on to this workflow later, but this is a good start.

Each step will provide the full code for you.

Hold out some unseen data

Remember that our goal is to be able to predict what homes will sell for before they're sold.

But our dataset has only homes that were already sold. How can we possibly figure out how well we'd predict a sale price before it's sold?

Our strategy, which we'll discuss more in future weeks, will be to hold out a "testing set" of homes. We won't let our model see the actual sale price for these homes.

The homes where we do show the model the sale price we'll call the "training set" homes.

We'll make this split randomly but consistently: we'll first seed the random number generator so it always gives the same sequence of numbers.

set.seed(1234)

# Split our data randomly
ames_split <- initial_split(ames, prop = 2/3) 
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

Exercise 3: How many home sales were in the training set?

Exercise 4: How many home sales were in the testing set?

Construct a model specification

The basic tidymodels interface is

fit(model_spec, formula, data)

model_spec: the kind of model we want to fit. We'll use a linear_reg in regression mode (which is the default, but we'll be explicit).
formula
this goes thing-to-predict ~ var1 + var2 + ...
In our case, well use Gr_Liv_Area and Bldg_Type to predict Sale_Price.
data: we'll give the model-fitter our training data to use to try to find good coefficients for living area and the building types.

The result is a model object, which can make predictions for us.

linear_reg_fit <- fit(
  # What type of model?
  linear_reg(mode = "regression"),
  # Predict what target using what features?
  Sale_Price ~ Gr_Liv_Area + Bldg_Type,
  # What data to use?
  data = ames_train)

Visualize the Model Internals

If we just show the model object, we get a crude representation of its internals:

linear_reg_fit

Visualize the Model's Predictions

We can ask the model for its predictions by using predict. Let's get predictions for the training homes (the same homes we used to find the coefficients for our model):

ames_train_predictions_only <- 
  linear_reg_fit %>% 
    predict(ames_train)
ames_train_predictions_only

Exercise 5: How many predictions did the model make? Try to compute this without using the output we just computed.

Let's compare the predictions with the true sale prices. To do this, we'll want the predictions and the original data in the same data frame. Since they line up row-by-row, we can just staple the predictions column to the end:

bind_cols(ames_train_predictions_only, ames_train)

Usually I do this in a single pipeline:

# Verbose way:
ames_train_predictions <- 
  linear_reg_fit %>% 
    predict(ames_train) %>% 
    bind_cols(ames_train)

# Alternative shortcut:
ames_train_predictions <- linear_reg_fit %>% 
  augment(ames_train)

<<make-train-predictions>>

Residual by True

Make the following plot (which I intentionally haven't labeled well, to help you make it). (note: the column name is .pred, with a period at the beginning. They named it that way just in case there was already a column named pred.)

<<resid-by-true-solution>>

ames_train_predictions %>% 
  ggplot(aes(x = Sale_Price, y = Sale_Price - .pred)) +
    geom_point()

Exercise 6: Just looking at the plot: are these good predictions? About how far off are they? How would you fill in the blank: "The typical error for the predictions is about XXX dollars." Do this by eye, without running a calculation.

Quantify the Model's Performance

To quantify the model's error, let's compute the mean absolute error.

ames_train_predictions %>% 
  mutate(error = Sale_Price - .pred) %>% 
  summarize(mean(abs(error)))

The yardstick package includes functions that compute this and other metrics:

metrics <- yardstick::metric_set(rsq_trad, mae, mape, rmse)
ames_train_predictions %>% 
  metrics(truth = Sale_Price, estimate = .pred) %>% 
  select(-.estimator)

Now seeing that, write down a sentence that summarize the model's accuracy in two different ways (in your own notes).

Quantify the model's performance on the test set

So far we've been evaluating the model's performance on the training set.

But we already know the sale prices for those homes, so why do we really care about predicting them?

What we'd really like to know is how well we'd predict the sale price for homes that our model never actually saw.

Good thing we held out a test set!

Exercise 7: Repeat the visualization and quantification for the test set.

Appendix

Here's the code I used in the slides to plot the model predictions. I needed a few extra things:

sweep_model is a helper function to try out all combinations of multiple variables: all the living area values for each building type. The tidyr::expand_grid function helped me out there!
Individual geoms can be given a different data than the whole ggplot. I used that to draw the lines.

sweep_model <- function(model, ...) {
  X <- expand_grid(...)
  model %>% 
    predict(X) %>% 
    bind_cols(X)
}

ggplot(ames_train, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(alpha = .5, size = .5) +
  geom_line(data = sweep_model(
      linear_reg_fit,
      Gr_Liv_Area = seq(0, 4000, length.out = 500),
      Bldg_Type = levels(ames_train$Bldg_Type)
    ),
    mapping = aes(y = .pred),
    color = "red") +
  facet_wrap(vars(Bldg_Type)) +
  labs(x = "Living Area", y = "Sale Price ($1k)")

ames_test_predictions <- 
  linear_reg_fit %>% predict(ames_test) %>% bind_cols(ames_test)

Here's some results I got. I used this:

all_preds <- bind_rows(
  train = ames_train_predictions,
  test = ames_test_predictions,
  .id = 'set',
)

all_preds %>% 
  ggplot(aes(x = Sale_Price, y = Sale_Price - .pred, color = set)) +
    geom_point(size = .5, alpha = .5)

all_preds %>%
  group_by(set) %>% 
  metrics(truth = Sale_Price, estimate = .pred) %>% 
  arrange(desc(set)) %>% 
  select(-.estimator) %>% 
  pivot_wider(names_from = "set", values_from = ".estimate") %>% knitr::kable()