In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

The purpose of this homework is: 1) to examine how influential points can affect regression models, 2) to examine the analysis of variance for regression and 3) to practice building multiple linear regression models. Please fill in the appropriate code and write answers to all questions in the answer sections, then submit a compiled pdf with your answers through Gradescope by 11pm on Sunday November 3rd.

As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.

Note: for all plots on this homework, please only use base R graphics

# Note: I only have permission to use this data for educational purposes so please do not share the data
download.file('https://yale.box.com/shared/static/gzu5lhulepp3zsyxptwxoeafpst1ccdv.rda', 'car_transactions.rda')

library(knitr)
library(latex2exp)
library(dplyr)   

options(scipen=999)

opts_chunk$set(tidy.opts=list(width.cutoff=60)) 
set.seed(230)  # set the random number generator to always give the same random numbers

$\$

Part 1: The effect of unusual points on regression models

In class we discussed several ways that data points can be unusual, which are:

Outliers (large residuals): data points that have unusual y values.
High leverage points: data points that have unusual x values.
Influential points: data points that are both outliers and have high leverage.

Let's continue using the Edmunds.com car data to explore statistics that quantify these unusual points, and also examine how unusual points can affect regression models. Remember that this data set has been made available to this class for educational purposes, so please do not share this data outside of the class.

$\$

Part 1.1 (6 points):

Let's start our analysis by again reducing the Edmunds data to only used Toyota Corollas. However, for this analysis, we will create two data sets:

used_corollas_all: that has for all used Corollas.
used_corollas: that has only used Corollas that have been driven less than 150,000 miles.

To create the data set that has all used Toyota Corollas, please complete the following steps, and save the results in a data frame called used_corollas_all:

The only variables that should be in the used_corollas_all data frame are:

a. model_bought: the model of the car b. new_or_used_bought: whether a car was new or used when it was purchased c. price_bought: the price the car was purchased for d. mileage_bought: the number of miles the car had when it was purchased

The only cases that should be in the used_corollas_all data frame are:

a. Used cars b. Toyota Corollas

Finally use the na.omit() function on the used_corollas data frame to remove cases that have missing values.

If you have done this correctly, there should be 249 cases in the used_corollas_all data frame. Please print out the number of rows using the dim()to check this is correct before going on to the next set of exercises.

Also, please recreate the used_corollas data set that has only used Corollas with less than 150,000 miles (hint: you should be able to do this in 1-2 lines of code). If you use the dim() function on the used_corollas data frame, you will see that there are 248 cases. Thus there is only one car that had been driven over 150,000 miles. We will examine how much of an affect this one car could have on our regression model that contains data on 248 other cars.

# load the data set
load("car_transactions.rda")

$\$

Part 1.2 (12 points):

Now compare the regression model that is fit all the data (lm_fit_all), to the regression that is fit to the data without the 150k mileage car (lm_fit). To do this please complete the following steps:

Fit a linear regression model to the used_corollas_all data that shows the predicted price as a function of the number of miles driven. Save this model to an object called lm_fit_all and print out the regression coefficients.
Create a scatter plot of the price as a function of the number of miles driven using all the data.
Add a red line to the plot that shows the regression line for the model with all used Corollas to this scatter plot.
Next create a model called lm_fit, that predicts the price of a car as a function of the number of miles driven using only only cars with less than 150,000 miles (i.e., use the used_corollas data frame you created above to recreate the linear model you created on homework 6).
Then add a green regression line to the scatter plot you created above that shows the linear model for cars with less than 150,000 miles (i.e., the model where the one high leverage point is not included).
Finally, print out the predicted value for the price of a car that has been driven 150,000 miles using both the lm_fit_all based on the model using all the data, and the lm_fit based on using only cars with less than 150,000 miles.
In the answer section, report whether these models seem similar in terms of their regression slopes and predictions.

Hint: For your linear regression models with lm(), make sure to use the syntax lm(variable2 ~ variable1, data = df) rather than lm(df$variable2 ~ df$variable1). This is important for getting predict() to return only a single value when you're using the model to predict a single point, and when using lm() to do multiple regression.

# fit a regression model to all the data and save it to the object lm_fit_all
# (note: we fit the model by specifying y as as function of x)





# display the regression coefficients




# plot the data as a scatter plot of the price as a function of the mileage driven






# add the regression line to the plot showing the fit from using all the data




# fit a model using the data that excluded the car that had been drive over 150k miles




# add a green line to the scatter plot of the regression line from the model that 
# excludes cars with over 150k miles





# make a prediction for a car driven 150k miles using both models

Answers:

$\$

Part 1.3 (6 points): Now create a 95% confidence interval for the value of the regression slope $\beta_1$ based on the model that does not include cars with more than 150,000 miles (i.e., the lm_fit model).

If we were assuming that this confidence interval is accurate, would the value that was estimated as regression slope using all the data (i.e., the $\hat{\beta_1}$ based on using the data in used_corollas_all that includes the car with 300,000 miles) be a plausible value for what the true parameter value $\beta_1$?

Answer

$\$

Part 1.4 (12 points):

Now let's create confidence intervals and predictions intervals based on the model that was fit to all the data and see if the regression line from model that does not include the 150k mileage car falls in these confidence intervals. To do this, please complete the following steps:

Sort the data frame used_corollas_all so that the rows are in the order from smallest number of miles driven to the most number of miles driven, and store it again the same object called used_corollas_all. Refit the lm_fit_all using this sorted data frame (as a sanity check, the coefficients found should be the same as before).
Create and print out the confidence interval values for the regression coefficients from this model. (i.e., the confidence intervals for $\beta_0$ and $\beta_1$)
Recreate the scatter plot based on this sorted used_corollas_all data.
Add to this plot the 99% confidence intervals for the regression line in orange (using the model from the sorted used_corollas_all data).
Add to this plot the 99% prediction intervals in blue (again using the model from the sorted used_corollas_all data).
Finally, add to this plot the regression line from the model that does not include the car with over 150,000 miles as a green line.
In the answer section, describe whether the lm_fit regression line falls within the lm_fit_all confidence interval, over the range of mileage where most of the data lies (hint: creating the plot in the console and zooming in can make this easier to see).
Also in the answer section, describe whether the lm_fit_all 99% prediction interval appears to cover the range values where 99% of the observed data lies.

# arrange the data from all the cars and refit the model



# create confidence intervals for the regression coefficients (betas) from the model using all the data




# create confidence interval for the regression line mu_y from the model that used all the data



# create a prediction interval for the regression model that used all the data




# create a scatter plot of all the data





# add the confidence intervals (in orange) for the regression line you created above to the scatter plot




# plot prediction intervals (in blue) you created above to the scatter plot





# add the regression line (in green) based on the model that only uses data from cars with less than 150,000 miles

Answer

$\$

Part 1.5 (12 points): Let's analyze the leverage and Cook's distance for the data points in the used_corollas_all data set. To do this please:

Calculate the leverage for the data points in this model (i.e., the hat values).
Plot the 20 largest leverage values as a bar plot (remember all plots in the homework should use base R graphics).
Plot the residuals as a function of the leverage for each point (use just the regular residuals, not the standardized or studentized residuals).
Use R's built in plot() functions to plot Cook's distance for each data point.
Also, use R's built in plot() functions to plot the standardized residuals as a function of the leverage for each point.
Based on the 'rules of thumb' we discussed in class, in the answer section report how many points are considered 'very unusual' for the different measures of:

a. Cook's distance
b. Standardized residuals
c. Studentized residuals
d. Leverage

As always, please be sure to also "show your work" by printing out the number of points that are considered very unusual for each of these measures.

# get the h_i "hat values" and plot the 20 largest values




# plot residuals as a function of the hat values





# plot Cook's distances for each data point 



# plot the standarized residuals as a function of the leverage



# report number of highly unusual points based on Cook's distance



# report number of highly unusual points based on leverage 



# report number of highly unusual points based on standardized residuals



# report number of highly unusual points based on studentized residuals

Answer

a.
b.
c.
d.

$\$

Part 1.6 (5 points): Above you fit two models: lm_fit_all which contained all the used Corollas and lm_fit which did not contain the high leverage car with 300,000 miles driven. Describe below which you think is a better model and why? Also describe any limitation to these models.

Note, there is not necessarily one correct answer here, but just make sure your answer is reasonable.

Answer

$\$

Part 2: Analysis of variance (ANOVA) for regression

Let's continue to use the Edmunds data to explore the analysis of variance (ANOVA) for regression. For this part of the homework, please use the lm_fit model, based on the used_corollas data that does not include the car that has been driven more than 150,000 miles.

$\$

Part 2.1 (5 points):

As you recall from class, the ANOVA breaks down the total variability in a response variable y (SSTotal) as a sum of the variability explained by the model (SSModel) and the sum of the variability not explained by the model (SSResidual).

To do this decomposition, we can pass a linear model that has been fit (e.g., the lm_fit object), to the anova() function. Please do this now, and in the answer section report the values for the model sum of squares (SSModel), the residual sum of squares (the SSResidual), and the total sum of squares (SSTotal).

Answers:

$\$

Part 2.2 (5 points):

We can also extract the SSTotal, SSModel and SSResidual from the original data and from values stored in the lm_fit object. Please use the following equations to calculate the SSTotal, SSModel and SSResidual values directly fom the lm_fit object, and report these values:

Use the used_corolla data frame to calculate the SSTotal using the formula $\sum_{i=1}^n(y_i - \bar{y})^2$. Save the results in an object called SSTotal and print the results.
Use the fitted.values in the lm_fit object to calculate SSModel using the formula $\sum_{i=1}^n(\hat{y}_i - \bar{y})^2$. Save the results in an object called SSModel and print the results.
Use the residuals in the lm_fit object to calculate SSResidual using the formula $\sum_{i=1}^n(y_i - \hat{y}_i)^2$. Save the results in an object called SSResidual and print the results.

To check you have the right answers, look at the values you got in question 2.1 above. Also show that SSTotal = SSModel + SSResidual for the values you calculated.

$\$

Part 2.3 (6 points): As we discussed in class, the coefficient of determination, $r^2$, is equal to the percentage of the variance explained by the linear model, i.e., $r^2 = SSModel/SSTotal$. For simple linear regression, $r^2$ is equal to the correlation coefficient squared, which it is why it is denoted $r^2$.

Please calculate the correlation coefficient between mileage_bought and price_bought, and square it (hint, the cor() function could be helpful). Then using the values calculated in part 2.2, show that this value is equal to SSModel/SSTotal, and that it is also equal to 1 - SSResidual/SSTotal. As always, be sure to show your work by printing these values to get credit for this problem.

$\$

Part 2.4 (5 points): Let's look at the relationship between the ANOVA F-statistic and the t-statistic. Use the summary() function on the lm_fit model to see the value of the t-statistic (as you had previously done in homework 6, part 2.3). Then, show that the value of t_stat squared is (approximately) equal to the F-statistic found with the anova() function in part 2.1 above, by printing the $t^2$ value and reporting the $t^2$ value and the F-statistic value in the answer section below.

Answers:

$\$

Part 3: Multiple linear regression

Let's continue to use the Edmunds.com car data to explore multiple linear regression where we try to predict a response variable $y$, based on several explanatory variables $x_1$, $x_2$, ..., $x_p$.

$\$

Part 3.1 (5 points): Let's start by using dplyr to derive a new data set from the original Edmunds car_transactions data set. Please create this data set in an object called car_transactions2 that has the following properties:

Create a new variable called years_old which is the difference between the year the car was sold, and the model year of the car.
Reduce the data to only coming from used cars.
Reduce the data so it only contains the variables: price_bought, mileage_bought, years_old, and msrp_bought.

If you have created this data frame correctly it should have 17,134 cases, so please check this before continuing (for the sake of simplicity, do not use na.omit() here).

Also, report what is the maximum and minimum value for the variable years_old. Does it make sense that the minimum value of years_old is negative? Please see if you can come up with an explanation why. Finally, report the price that the least and most expensive used cars sold for.

Answer:

$\$

Part 3.2 (5 points): Now use the pairs() function to visualize the correlation between all pairs of variables in the car_transactions2 data frame. Report whether any variable looks like it has a particularly strong linear relationship with price_bought and whether it makes sense that there would be a strong relationship between these variables.

Answer:

$\$

Part 3.3 (10 points): Next fit a multiple linear regression model predicting the price a car was bought for using the three variables mileage_bought, years_old, msrp_bought and save the linear fit to an object called lm_cars. Then use the summary() function to get information about the the linear regression model you fit by: a) saving the output of the summary() function to an object called summary_lm_cars, and b) printing the output so you can see the result.

Report below the following information:

Are all the regression coefficients statistically significant at the $\alpha = 0.05$ level?
Do the signs for the regression coefficients make sense? Explain why.
Report what percentage of the total sum of squares is explained by this model by looking at the values stored in the summary_lm_cars object.

Answers:

$\$