$\$
The purpose of this homework is: 1) to examine how influential points can affect regression models, 2) to examine the analysis of variance for regression and 3) to practice building multiple linear regression models. Please fill in the appropriate code and write answers to all questions in the answer sections, then submit a compiled pdf with your answers through Gradescope by 11pm on Sunday November 5th.
As always, if you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.
Note: for all plots on this homework, please only use base R graphics
# Note: I only have permission to use this data for educational purposes so please do not share the data download.file('https://yale.box.com/shared/static/gzu5lhulepp3zsyxptwxoeafpst1ccdv.rda', 'car_transactions.rda')
library(knitr) library(latex2exp) library(dplyr) options(scipen=999) opts_chunk$set(tidy.opts=list(width.cutoff=60)) set.seed(230) # set the random number generator to always give the same random numbers
$\$
In class we discussed several ways that data points can be unusual, which are:
Let's continue using the Edmunds.com car data to explore statistics that quantify these unusual points, and also examine how unusual points can affect regression models. Remember that this data set has been made available to this class for educational purposes, so please do not share this data outside of the class.
$\$
Part 1.1 (6 points):
Let's start our analysis by again reducing the Edmunds data to only used Toyota Corollas. However, for this analysis, we will create two data sets:
used_corollas_all
: that has for all used Corollas. used_corollas
: that has only used Corollas that have been driven less than 150,000 miles. To create the data set that has all used Toyota Corollas, please complete the
following steps, and save the results in a data frame called used_corollas_all
:
used_corollas_all
data frame are: a. model_bought
: the model of the car
b. new_or_used_bought
: whether a car was new or used when it was purchased
c. price_bought
: the price the car was purchased for
d. mileage_bought
: the number of miles the car had when it was purchased
used_corollas_all
data frame are: a. Used cars b. Toyota Corollas
na.omit()
function on the used_corollas
data frame to
remove cases that have missing values.If you have done this correctly, there should be 249 cases in the
used_corollas_all
data frame. Please check this before going on to the next
set of exercises.
Also, please recreate the used_corollas
data set, that has only used Corollas
with less than 150,000 miles (hint: you should be able to do this in 1-2 lines
of code). If you look at the used_corollas
data frame, you will see that there
are 248 cases. Thus there is only one car that had been driven over 150,000
miles. We will examine how much of an affect this one car could have on our
regression model that contains data on 248 other cars.
# load the data set load("car_transactions.rda")
$\$
Part 1.2 (12 points):
Now compare the regression model that is fit all the data (lm_fit_all
), to the
regression that is fit to the data without the 150k mileage car (lm_fit
). To
do this please complete the following steps:
Fit a linear regression model to the used_corollas_all
data that shows the
predicted price as a function of the number of miles driven. Save this model to
an object called lm_fit_all
and print out the regression coefficients.
Create a scatter plot of the price as a function of the number of miles driven using all the data.
Add a red line to the plot that shows the regression line for the model with all used Corollas to this scatter plot.
Next create a model called lm_fit
, that predicts the price of a car as a
function of the number of miles driven using only only cars with less than
150,000 miles (i.e., use the used_corollas
data frame you created above to
recreate the linear model you created on homework 6).
Then add a green regression line to the scatter plot you created above that shows the linear model for cars with less than 150,000 miles (i.e., the model where the one high leverage point is not included).
Finally, print out the predicted value for the price of a car that has been
driven 150,000 miles using both the lm_fit_all
based on the model using all
the data, and the lm_fit
based on using only cars with less than 150,000
miles.
In the answer section, report whether these models seem similar in terms of their regression slopes and predictions.
Hint: For your linear regression models with lm()
, make sure to use the syntax
lm(variable2 ~ variable1, data = df)
rather than lm(df$variable2 ~
df$variable1)
. This is important for getting predict()
to return only a
single value when you're using the model to predict a single point, and when
using lm()
to do multiple regression.
# fit a regression model to all the data and save it to the object lm_fit_all # (note: we fit the model by specifying y as as function of x) # display the regression coefficients # plot the data as a scatter plot of the price as a function of the mileage driven # add the regression line to the plot showing the fit from using all the data # fit a model using the data that excluded the car that had been drive over 150k miles # add a green line to the scatter plot of the regression line from the model that # excludes cars with over 150k miles # make a prediction for a car driven 150k miles using both models
Answers:
$\$
Part 1.3 (6 points): Now create a 95% confidence interval for the value of
the regression slope $\beta_1$ based on the model that does not include cars
with more than 150,000 miles (i.e., the lm_fit
model).
If we were assuming that this confidence interval is accurate, would the value
that was estimated as regression slope using all the data (i.e., the
$\hat{\beta_1}$ based on using the data in used_corollas_all
that includes the
car with 300,000 miles) be a plausible value for what the true parameter value
$\beta_1$?
Answer
$\$
Part 1.4 (12 points):
Now let's create confidence intervals and predictions intervals based on the model that was fit to all the data and see if the regression line from model that does not include the 150k mileage car falls in these confidence intervals. To do this, please complete the following steps:
Sort the data frame used_corollas_all
so that the rows are in the order
from smallest number of miles driven to the most number of miles driven, and
store it again the same object called used_corollas_all
. Refit the
lm_fit_all
using this sorted data frame (as a sanity check, the coefficients
found should be the same as before).
Create and print out the confidence interval values for the regression coefficients from this model. (i.e., the confidence intervals for $\beta_0$ and $\beta_1$)
Recreate the scatter plot based on this sorted used_corollas_all
data.
Add to this plot the 99% confidence intervals for the regression line
in orange (using the model from the sorted used_corollas_all
data).
Add to this plot the 99% prediction intervals in blue (again using the
model from the sorted used_corollas_all
data).
Finally, add to this plot the regression line from the model that does not include the car with over 150,000 miles as a green line.
In the answer section, describe whether the lm_fit
regression line falls
within the lm_fit_all
confidence interval,
over the range of mileage where most of the data lies (hint: creating the plot
in the console and zooming in can make this easier to see).
Also in the answer section, describe whether the lm_fit_all
99% prediction
interval appears to cover the range values where 99% of the observed data lies.
# arrange the data from all the cars and refit the model # create confidence intervals for the regression coefficients (betas) from the model using all the data # create confidence interval for the regression line mu_y from the model that used all the data # create a prediction interval for the regression model that used all the data # create a scatter plot of all the data # add the confidence intervals (in orange) for the regression line you created above to the scatter plot # plot prediction intervals (in blue) you created above to the scatter plot # add the regression line (in green) based on the model that only uses data from cars with less than 150,000 miles
Answer
$\$
Part 1.5 (12 points): Let's analyze the leverage and Cook's distance for
the data points in the used_corollas_all
data set. To do this please:
Calculate the leverage for the data points in this model (i.e., the hat values).
Plot the 20 largest leverage values as a bar plot (remember all plots in the homework should use base R graphics).
Plot the residuals as a function of the leverage for each point (use just the regular residuals, not the standardized or studentized residuals).
Use R's built in plot()
functions to plot Cook's distance for each data point.
Also, use R's built in plot()
functions to plot the standardized residuals
as a function of the leverage for each point.
Based on the 'rules of thumb' we discussed in class, in the answer section report how many points are considered 'very unusual' for the different measures of:
a. Cook's distance
b. Standardized residuals
c. Studentized residuals
d. Leverage
As always, please be sure to also "show your work" by printing out the number of points that are considered very unusual for each of these measures.
# get the h_i "hat values" and plot the 20 largest values # plot residuals as a function of the hat values # plot Cook's distances for each data point # plot the standarized residuals as a function of the leverage # report number of highly unusual points based on Cook's distance # report number of highly unusual points based on leverage # report number of highly unusual points based on standardized residuals # report number of highly unusual points based on studentized residuals
Answer
a.
b.
c.
d.
$\$
Part 1.6 (5 points): Above you fit two models: lm_fit_all
which contained
all the used Corollas and lm_fit
which did not contain the high leverage car
with 300,000 miles driven. Describe below which you think is a better model and
why? Also describe any limitation to these models.
Note, there is not necessarily one correct answer here, but just make sure your answer is reasonable.
Answer
$\$
Let's continue to use the Edmunds data to explore the analysis of variance
(ANOVA) for regression. For this part of the homework, please use the lm_fit
model, based on the used_corollas
data that does not include the car that
has been driven more than 150,000 miles.
$\$
Part 2.1 (5 points):
As you recall from class, the ANOVA breaks down the total variability in a response variable y (SSTotal) as a sum of the variability explained by the model (SSModel) and the sum of the variability not explained by the model (SSResidual).
To do this decomposition, we can pass a linear model that has been fit (e.g.,
the lm_fit
object), to the anova()
function. Please do this now, and in the
answer section report the values for the model sum of squares (SSModel), the
residual sum of squares (the SSResidual), and the total sum of squares
(SSTotal).
Answers:
$\$
Part 2.2 (5 points):
This is a "challenge problem" that you should try to figure out without getting help from the TAs.
We can also extract the SSTotal, SSModel and SSResidual from the original data
and from values stored in the lm_fit
object. Please use the following
equations to calculate the SSTotal, SSModel and SSResidual values directly fom
the lm_fit
object, and report these values:
Use the used_corolla
data frame to calculate the SSTotal using the formula
$\sum_{i=1}^n(y_i - \bar{y})^2$. Save the results in an object called SSTotal
and print the results.
Use the fitted.values
in the lm_fit
object to calculate SSModel using the
formula $\sum_{i=1}^n(\hat{y}_i - \bar{y})^2$. Save the results in an object
called SSModel
and print the results.
Use the residuals
in the lm_fit
object to calculate SSResidual using the
formula $\sum_{i=1}^n(y_i - \hat{y}_i)^2$. Save the results in an object called
SSResidual
and print the results.
To check you have the right answers, look at the values you got in question 2.1 above. Also show that SSTotal = SSModel + SSResidual for the values you calculated.
$\$
Part 2.3 (6 points): As we discussed in class, the coefficient of determination, $r^2$, is equal to the percentage of the variance explained by the linear model, i.e., $r^2 = SSModel/SSTotal$. For simple linear regression, $r^2$ is equal to the correlation coefficient squared, which it is why it is denoted $r^2$.
Please calculate the correlation coefficient between mileage_bought
and
price_bought
, and square it (hint, the cor()
function could be helpful).
Then using the values calculated in part 2.2, show that this value is equal to
SSModel/SSTotal, and that it is also equal to 1 - SSResidual/SSTotal. As always,
be sure to show your work by printing these values to get credit for this
problem.
$\$
Part 2.4 (5 points): Let's look at the relationship between the ANOVA
F-statistic and the t-statistic. Use the summary()
function on the lm_fit
model to see the value of the t-statistic (as you had previously done in
homework 6, part 2.3). Then, show that the value of t_stat
squared is
(approximately) equal to the F-statistic found with the anova()
function in
part 2.1 above, by printing the $t^2$ value and reporting the $t^2$ value and
the F-statistic value in the answer section below.
Answers:
$\$
Let's continue to use the Edmunds.com car data to explore multiple linear regression where we try to predict a response variable $y$, based on several explanatory variables $x_1$, $x_2$, ..., $x_p$.
$\$
Part 3.1 (5 points): Let's start by using dplyr to derive a new data set
from the original Edmunds car_transactions
data set. Please create this data
set in an object called car_transactions2
that has the following properties:
Create a new variable called years_old
which is the difference between the
year the car was sold, and the model year of the car.
Reduce the data to only coming from used cars.
Reduce the data so it only contains the variables: price_bought
,
mileage_bought
, years_old
, and msrp_bought
.
If you have created this data frame correctly it should have 17,134 cases, so
please check this before continuing (for the sake of simplicity, do not use
na.omit()
here).
Also, report what is the maximum and minimum value for the variable years_old
.
Does it make sense that the minimum value of years_old
is negative? Please see
if you can come up with an explanation why. Finally, report the price that the
least and most expensive used cars sold for.
Answer:
$\$
Part 3.2 (5 points): Now use the pairs()
function to visualize the
correlation between all pairs of variables in the car_transactions2
data
frame. Report whether any variable looks like it has a particularly strong
linear relationship with price_bought
and whether it makes sense that there
would be a strong relationship between these variables.
Answer:
$\$
Part 3.3 (10 points): Next fit a multiple linear regression model
predicting the price a car was bought for using the three variables
mileage_bought
, years_old
, msrp_bought
and save the linear fit to an
object called lm_cars
. Then use the summary()
function to get information
about the the linear regression model you fit by: a) saving the output of the
summary()
function to an object called summary_lm_cars
, and b) printing the
output so you can see the result.
Report below the following information:
Are all the regression coefficients statistically significant at the $\alpha = 0.05$ level?
Do the signs for the regression coefficients make sense? Explain why.
Report what percentage of the total sum of squares is explained by this model
by looking at the values stored in the summary_lm_cars
object.
Answers:
$\$
Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 7.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.