# this is taken from https://cs.calvin.edu/courses/data/202/21fa/ex/ex07/ex07-modeling-inst.html library(learnr) # remotes::install_github("rstudio/learnr", build_vignettes = TRUE) knitr::opts_chunk$set(echo = TRUE, comment = "") #tutorial_options(exercise.reveal_solution = TRUE) # https://bookdown.org/yihui/rmarkdown-cookbook/reuse-chunks.html#embed-chunk <<user-facing-setup>>
library(tidyverse) library(tidymodels) theme_set(theme_bw()) options(scipen = 5) # encourage metrics to print in fixed-point notation options(dplyr.summarise.inform = FALSE) # silence a warning message <<load-and-subset-data>> <<train-test-split>> <<fit-linreg-ames>> <<make-train-predictions>>
The goal of this exercise is to start getting familiar with modeling. We'll try the same regression task that we saw in the slides, and we'll use a linear regression model just like we did there.
Topics:
If you want to read more about tidymodels
, you can read some of the
tidymodels docs or Tidy Modeling with R.
But those get deep into the details quickly; stick with us here and you'll be okay.
We'll be using the Ames home sales dataset that we saw in class. If you're curious, you can look at the Data dictionary that the author provided. In the original paper, the author suggests working with a subset of the data. So let's do that:
# Get the data from the "modeldata" package, which comes with tidymodels. ames <- read_builtin("ames", package = "modeldata") %>% filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>% mutate(Sale_Price = Sale_Price / 1000)
Good analysis starts by exploring the data. One compelling way to explore data is by making plots. Let's make one together.
ames
, and we want the n
umber of row
s.nrow(ames)
Sale_Price
) related to the number of square feet of above-grade living area (Gr_Liv_Area
)
for different building types (Bldg_Type
).Your graph might look something like this:
<<eda-living-area-solution>>
ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price)) + geom_point(size = .1, alpha = .25) + facet_wrap(vars(Bldg_Type)) + labs(x = "above-grade living area (ft^2)", y = "Sale Price ($1k)")
We will now go through the basic steps of making a predictive model. We will add on to this workflow later, but this is a good start.
Each step will provide the full code for you.
Remember that our goal is to be able to predict what homes will sell for before they're sold.
But our dataset has only homes that were already sold. How can we possibly figure out how well we'd predict a sale price before it's sold?
Our strategy, which we'll discuss more in future weeks, will be to hold out a "testing set" of homes. We won't let our model see the actual sale price for these homes.
The homes where we do show the model the sale price we'll call the "training set" homes.
We'll make this split randomly but consistently: we'll first seed the random number generator so it always gives the same sequence of numbers.
set.seed(1234) # Split our data randomly ames_split <- initial_split(ames, prop = 2/3) ames_train <- training(ames_split) ames_test <- testing(ames_split)
Exercise 3: How many home sales were in the training set?
Exercise 4: How many home sales were in the testing set?
The basic tidymodels
interface is
fit(model_spec, formula, data)
model_spec
: the kind of model we want to fit. We'll use a linear_reg
in regression
mode (which is the default, but we'll be explicit).formula
~
var1 +
var2 +
...Gr_Liv_Area
and Bldg_Type
to predict Sale_Price
.data
: we'll give the model-fitter our training data to use to try to find good coefficients for living area and the building types.The result is a model object, which can make predictions for us.
linear_reg_fit <- fit( # What type of model? linear_reg(mode = "regression"), # Predict what target using what features? Sale_Price ~ Gr_Liv_Area + Bldg_Type, # What data to use? data = ames_train)
If we just show the model object, we get a crude representation of its internals:
linear_reg_fit
We can ask the model for its predictions by using predict
. Let's get predictions for the training homes (the same homes we used to find the coefficients for our model):
ames_train_predictions_only <- linear_reg_fit %>% predict(ames_train) ames_train_predictions_only
Exercise 5: How many predictions did the model make? Try to compute this without using the output we just computed.
Let's compare the predictions with the true sale prices. To do this, we'll want the predictions and the original data in the same data frame. Since they line up row-by-row, we can just staple the predictions column to the end:
bind_cols(ames_train_predictions_only, ames_train)
Usually I do this in a single pipeline:
# Verbose way: ames_train_predictions <- linear_reg_fit %>% predict(ames_train) %>% bind_cols(ames_train) # Alternative shortcut: ames_train_predictions <- linear_reg_fit %>% augment(ames_train)
<<make-train-predictions>>
Make the following plot (which I intentionally haven't labeled well, to help you make it). (note: the column name is .pred
, with a period at the beginning. They named it that way just in case there was already a column named pred
.)
<<resid-by-true-solution>>
ames_train_predictions %>% ggplot(aes(x = Sale_Price, y = Sale_Price - .pred)) + geom_point()
Exercise 6: Just looking at the plot: are these good predictions? About how far off are they? How would you fill in the blank: "The typical error for the predictions is about XXX dollars." Do this by eye, without running a calculation.
To quantify the model's error, let's compute the mean absolute error.
ames_train_predictions %>% mutate(error = Sale_Price - .pred) %>% summarize(mean(abs(error)))
The yardstick
package includes functions that compute this and other metrics:
metrics <- yardstick::metric_set(rsq_trad, mae, mape, rmse) ames_train_predictions %>% metrics(truth = Sale_Price, estimate = .pred) %>% select(-.estimator)
Now seeing that, write down a sentence that summarize the model's accuracy in two different ways (in your own notes).
So far we've been evaluating the model's performance on the training set.
But we already know the sale prices for those homes, so why do we really care about predicting them?
What we'd really like to know is how well we'd predict the sale price for homes that our model never actually saw.
Good thing we held out a test set!
Exercise 7: Repeat the visualization and quantification for the test set.
Here's the code I used in the slides to plot the model predictions. I needed a few extra things:
sweep_model
is a helper function to try out all combinations of multiple variables: all the living area values for each building type.
The tidyr::expand_grid
function helped me out there!geom
s can be given a different data
than the whole ggplot
. I used that to draw the lines.sweep_model <- function(model, ...) { X <- expand_grid(...) model %>% predict(X) %>% bind_cols(X) } ggplot(ames_train, aes(x = Gr_Liv_Area, y = Sale_Price)) + geom_point(alpha = .5, size = .5) + geom_line(data = sweep_model( linear_reg_fit, Gr_Liv_Area = seq(0, 4000, length.out = 500), Bldg_Type = levels(ames_train$Bldg_Type) ), mapping = aes(y = .pred), color = "red") + facet_wrap(vars(Bldg_Type)) + labs(x = "Living Area", y = "Sale Price ($1k)")
ames_test_predictions <- linear_reg_fit %>% predict(ames_test) %>% bind_cols(ames_test)
Here's some results I got. I used this:
all_preds <- bind_rows( train = ames_train_predictions, test = ames_test_predictions, .id = 'set', )
all_preds %>% ggplot(aes(x = Sale_Price, y = Sale_Price - .pred, color = set)) + geom_point(size = .5, alpha = .5) all_preds %>% group_by(set) %>% metrics(truth = Sale_Price, estimate = .pred) %>% arrange(desc(set)) %>% select(-.estimator) %>% pivot_wider(names_from = "set", values_from = ".estimate") %>% knitr::kable()
To run any code chunk from this tutorial in your own environment, use:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.