In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

The purpose of this homework is to practice fitting multiple regression with categorical and non-linear predictors, and to examine overfitting. Please fill in the appropriate code and write answers to all questions in the answer sections, then submit a compiled pdf with your answers through Gradescope by 11pm on Sunday November 10th.

As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.

# Note: I only have permission to use this data for educational purposes so please do not share the data
download.file('https://yale.box.com/shared/static/gzu5lhulepp3zsyxptwxoeafpst1ccdv.rda', 'car_transactions.rda')


SDS230::download_image("xkcd_regression_fits.png")

library(knitr)
library(latex2exp)
library(dplyr)   
library(ggplot2)

options(scipen=999)

opts_chunk$set(tidy.opts=list(width.cutoff=60)) 
set.seed(230)  # set the random number generator to always give the same random numbers

$\$

Part 1: Fitting linear regression models with categorical predictors

As you will recall in the last two problem sets we explored the relationship between the price used Toyota Corollas sold for and other explanatory variables. This analysis was motivated by the example of when I had to sell my used Toyota Corolla after it broke down on the side of the highway. However, not only did I need to sell my Corolla, but I also needed to buy a new car. Let's continue to explore multiple regression models on the Edmunds.com data to see if we can get some insight into which car brands would be the best to buy.

$\$

Part 1.1 (5 points):

Toward the end of my search for a new car, the car models I was considering were the Mazda 3 and the Subaru Impreza. When buying the new car, I knew that at some point I was going to have to sell the car, so I was still interested in looking at how the prices of these car models decline as they are driven more miles. Let's build some regression models that could be helpful for seeing how different makes/models of cars decline in price as a function of the number of miles driven and other variables.

To start our analyses, please use the Edmunds car_transactions data frame to create a data frame called three_car_data that only has the car brands Mazda 3's, Subaru Imprezas, and BMW 5 series (one can dream, right?). To create this data frame, please complete the following steps:

Get only the models of Mazda 3's, Subaru Imprezas, and BMW 5 series car makes (i.e., MAZDA3, Impreza and 5 Series). Be sure to use the %in% operator instead of ==, or connect multiple equality == statements with the "or" operator |, when you filter for the models of the cars - using a single == statement will produce an incorrect answer.
Get the data for only used cars.
Use the droplevels() function on this data frame to remove all levels of the categorical data we are not using (i.e., remove levels for other makes of cars).

If you've filtered the data correctly the data frame should have 497 cases, so be sure to check that this is the case before moving on to other exercises.

Once you have created the three_car_data data frame, use the plot() and points() functions to create a scatter plot of the the price of a car as a function of miles driven, where the BMW's are plotted as black circles, the Subaru's are plotted as blue circles, and the Mazda's are plotted as red circles. Be sure to set the ylim argument in the plot function to be from 0 to $65,000 so that it captures all the points in the plot.

Note: For the rest of this question the term brand will refer to the make/model of the different types of cars.

load("car_transactions.rda")

$\$

Part 1.2 (12 points): Let's now fit a linear model for predicting price as a function of miles driven with a separate intercept for each brand. Use the summary() function to extract information about the linear model, and then answer the following questions:

How much does the price of a car decrease for each additional mile driven?
What is the reference car brand that the other car brands are being compared to?
What is the difference in car prices for each of the other brands relative to the reference car brand?
Does there appear to be statistically significant differences between the y-intercept of reference brand and each of the other car brands?
How much of the total sum of squares of car prices is mileage and car brand accounting for in this model based on the $R^2$ statistic?

Answers:

$\$

Part 1.3 (6 points): Now recreate the scatter plot you created in part 1.1 using the same colors but also add on the regression lines with different y-intercepts that you fit in part 1.2 (using the appropriate colors to match the colors of the points). You can plot the regression lines by passing the coefficients for intercept and slope to the abline function.

Based on this visualization, does it appear the model you fit in part 1.2 is doing a good job capturing the trends in this data?

Answer

$\$

Part 1.4 (6 points): Next fit a linear regression model for car price as a function of miles driven, but use separate y-intercepts and slopes for each of the 3 car brands. Once you have fit this model, please answer the following questions:

How much of the total sum of squares of car prices is this model capturing?
Based on this model, if a BMW 5 Series and Mazda 3 both had been driven 50,000 miles, what would be the difference in the car prices that the model predicts?

Hint: Think about what you would write differently in defining the model compared to the previous questions.

Answers

$\$

Part 1.5 (6 points): Now let's visualize these regression models. Start by recreating the scatter plot you created in part 1.1 using the same colors, but also add on the regression line with different y-intercepts and different slopes based on the model you fit in part 1.4 (again use the appropriate colors). Based on this visualization, does it seem that the slopes are different for all 3 car brands?

Answer

$\$

Part 2: Polynomial regression

As discussed in class, we can create models that better fit our data by adding polynomial expanded terms; i.e., we can create models of the form: $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + ...$. Let's explore fitting these models now. In particular, let's use polynomial regression to build models that predict the price that a car was purchased for (price_bought) as a function mileage_bought taken to different powers (i.e., $mileage$, $mileage^2$, etc.). To build these models, let's explore used Toyota Corolla data again so that we can compare some of the linear models we fit on the previous homework to models that contain non-linear functions of the predictors.

The code below recreates the used_corollas_all data frame you created on homework 7. Please use this data frame for the next two parts of this homework.

# load the data set
load("car_transactions.rda")


# get the used Corollas data
used_corollas_all <- car_transactions |>
  select(price_bought,
         mileage_bought,
         model_bought,
         new_or_used_bought) |>
  filter(model_bought == "Corolla", 
         new_or_used_bought == "U") |>
  na.omit(used_corollas)


# check the size of the used_corollas_all data frame
dim(used_corollas_all)

$\$

Part 2.1 (10 points):

Let's start exploring how polynomial regression can lead to better model fits, by predicting Corolla prices as a function of miles driven using polynomial models of degree 1, up to degree 5. For each model, we will then print the coefficient of determination, $R^2$, to see how much of the variability in the y values the model is capturing.

To do this, use a for loop where the counter index i of the for loop goes from 1 to 5. Then, inside the for loop do the following:

Use the lm() and poly() functions to fit a model of degree i and save it to an object called curr_model.
Use the summary() function to extract summary statistics from the curr_model and save these statistics in an object curr_summary.
Use the print() function to print the $R^2$ value that is stored in the curr_summary object.

Once you have the code working, in the answer section below fill in a table reporting the different $R^2$ values. Based on these values, discuss which model you think has the best fit to the used Corolla data. Should we trust the model with the highest R-squared value? Why or why not?

Hint: We only used 3 lines of code inside the for loop.

Answer

Model degree | R^2
--------------|------------| 1 | | 2 | | 3 | | 4 | |
5 | |

$\$

Part 2.2 (15 points): To gain better insight into which model to use, let's plot all of these polynomial models. To do this, again create a for loop that goes from 1 to 5, and inside the for loop do the following steps:

Use the lm() and poly() functions to fit a model of degree i and save it to an object called curr_model.
Use the data x_vals_df data frame below, in conjunction with the predict() function, to make predicted $\hat{y}$ values for x-values in the x_vals_df using the current model. Save this vector of predicted $\hat{y}$ values to an object called y_vals_predicted.
Use the plot() function to create a scatter plot of the data; i.e., plot price_bought as a function of the mileage_bought.
Use the points() function to add a the regression line values (from step 2) on this plot as a red line.

If the predicted regression line values are being cut off by the edge of the figure, go back to step 3 and adjust the y-limits of the scatter plot using the ylim = c(lower_lim, upper_lim) argument.

Based on these plots, describe in the answer section below, which model do you think is the best model and why it is better than each of the other models.

par(mfrow = c(2, 3))
x_vals_df <- data.frame(mileage_bought = 0:300000)

Answer

$\$

Part 2.3 (15 points): As we discussed in class, the $R^2$ value always increases (or stays the same) as more variables (or a higher degrees polynomials) are added to the model. Thus if one was to judge how "good" a model was based on the $R^2$ statistic value, one would always choose the most complex model that has the most explanatory variables in it.

As we also discussed in class, there are other statistics and methods that can potentially give better measures of how well a model will be able to make predictions on new data. Let's empirically evaluate how well these methods work in terms of there ability to assess how well a model fits the data.

The statistics for model fit that we will compare are: $R^2$, $R^2_{adj}$, $AIC$ and $BIC$. To assess how good these methods are for evaluating models, please create objects called all_r_squared, all_adj_r_squared, all_aic, and all_bic that will store the results from evaluating our different model evaluation statistics. Then create a for loop that does the following:

Use the lm() and poly() functions to fit a model of degree i and save it to an object called curr_model.
Use the summary() function to get a summary of the model. Then extract the $R^2$ and adjusted $R^2$ statistics and save them in the ith slot of the all_r_squared and all_adj_r_squared vectors.
Use the AIC() and BIC() functions on the model object to extract the $AIC$ and $BIC$ statistics and save these values to the ith slot of the all_aic and all_bic vectors.

Then, outside of the for loop, use the which.max() function and which.min() functions to determine which model each of these statistics would select. Fill in the table below by placing an x in the appropriate column for the degree model that you would select based on each statistic (you will fill in the last row of the table in part 2.4). Also comment on which statistics you think lead to the best model choice (looking at the plots you created in part 2.2 could be helpful to answer this question).

Answer

| | 1 | 2 | 3 | 4 | 5 | |-------------|---------|---------|---------|---------|---------| | $R^2$ | | | | | | | $R^2_{adj}$ | | | | | | | AIC | | | | | | | BIC | | | | | | | cross-val | | | | | |

Answer

$\$

Part 2.4 (15 points):

As have discussed in class, we can also use cross-validation to assess model fit. To do this we build a model on a training set of data and then evaluating the accuracy of the predictions on a separate test set of data. When evaluating the model, we can use the mean squared prediction error (MSPE) as a measure of how accurately the model is making predictions on new data. This measure is defined as:

$$MSPE = \frac{1}{m}\sum_{i = 1}^m(y_i - \hat{y_i})^2$$

Where the $y_i$ come from the m points in the test set.

For more information on this measure, see Wikipedia.

The code below splits the data in half and creates a training set that we will use to learn the estimated coefficients $\hat{\beta}$'s and also creates a test set with the other half of the data that we can evaluate how accurately the model can make predictions. Use a for loop to create models of degree 1 to 5 using the training data, have the models make predictions on the test set, and then calculate the MSPE based on their predictions. Store all the results accumulated in the for loop in a vector called all_MSPE. Then, add to the table above (in part 2.3) which model has the minimum MPSE for cross-validation by putting an x in the appropriate column. Also be sure to print out the MPSE values you get to show your work.

# create the training set and the test set
total_num_points <- dim(used_corollas_all)[1]
num_training_points <- floor(total_num_points/2)

training_data <- used_corollas_all[1:num_training_points, ]
test_data <- used_corollas_all[(num_training_points + 1):total_num_points, ]



# run a for loop to calculate the MSPE for models of degree 1 to 5 
# then find the model with the minimal MSPE

$\$

Part 2.5 (5 points):

In class we discussed cross-validation, where (for three-fold cross-validation) we would:

Split the Dataset: Divide your dataset into three equal parts; i.e., D1, D2, and D3.

Train and Validate: Perform three rounds of training and validation: - In the first round, use D1 and D2 as the training set and D3 as the test set. - In the second round, use D1 and D3 as the training set and D2 as the test set. - In the third round, use D2 and D3 as the training set and D1as the test set.

Evaluation: Calculate the MPSE from each round, and then take the average. This average will be your final evaluation metric.

Please run three-fold cross-validation procedure using the following steps:

Split the data into a training and test set. The training set should have two-thirds of the data and the test set should have one-third of the data. Hint: To create the training and test splits, it can be useful to first find the indices (i.e., row numbers) of data points that should be in the test set. You can then use the setdiff() function from all indices (row numbers) to get the remaining points that should be in the training set.
Learn the model parameter estimates on the training set using the lm() function.
Make predictions on the test set and calculate the MSPE. Save the MSPE in a vector called curr_CV_results that stores the MSPE results from each cross-validation split.
Repeat steps 1-3 putting aside a different 1/3 of the data each time for the test set and training on the remaining data; i.e., you should have a for loop that loops over the data, creates different training and test sets each time, and saves the MSPE for the current training/test split in the vector curr_CV_results.
Average the MSPE results over the 3 cross-validation splits to get the mean cross-validation results for a model of a given degree; i.e., take the mean of the values in the curr_CV_results vector and store this result in a vector called all_degree_results.
Repeat this procedure for models of degree 1-5 (i.e., by having an outer for loop that loops over models of different degrees), so that your vector all_degree_results mean MPSE for each of the different degree polynomials.

Hint: a good strategy is to get the code working for the degree 1 polynomial fit and once this is working then add the outer loop to get the code to work for the other degree polynomial models.

Print the mean cross-validation results from the models of each degree (i.e., print the 5 MSPE values in the all_degree_results vector). Does the three-fold cross-validation results lead to the same conclusions compared to using only 1 training and test split as was done in part 2.4?

Note: this question is a bit more difficult, so try your best and if you can't get it, it's only worth 5 points so it will not have much impact on your grade in the class.

Answer

$\$