# IPSUR: Introduction to Probability and Statistics Using R # Copyright (C) 2018 G. Jay Kerns # # Chapter: Simple Linear Regression # # This file is part of IPSUR. # # IPSUR is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # IPSUR is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with IPSUR. If not, see <http://www.gnu.org/licenses/>.
# This chapter's package dependencies library(HH) library(lmtest) library(RcmdrMisc)
What do I want them to know?
Here we have two variables (X) and (Y). For our purposes, (X) is not random (so we will write (x)), but (Y) is random. We believe that (Y) depends in some way on (x). Some typical examples of ((x,Y)) pairs are
Given information about the relationship between (x) and (Y), we would like to predict future values of (Y) for particular values of (x). This turns out to be a difficult problem[^slr-1], so instead we first tackle an easier problem: we estimate (\mathbb{E}Y). How can we accomplish this? Well, we know that (Y) depends somehow on (x), so it stands to reason that \begin{equation} \mathbb{E} Y = \mu(x),\ \mbox{a function of }x. \end{equation}
[^slr-1]: Yogi Berra once said, "It is always difficult to make predictions, especially about the future."
But we should be able to say more than that. To focus our efforts we impose some structure on the functional form of (\mu). For instance,
This helps us in the sense that we concentrate on the estimation of just a few parameters, (\beta_{0}) and (\beta_{1}), say, rather than some nebulous function. Our modus operandi is simply to perform the random experiment (n) times and observe the (n) ordered pairs of data ((x_{1},Y_{1}),\ (x_{2},Y_{2}),\ \ldots,(x_{n},Y_{n})). We use these (n) data points to estimate the parameters.
More to the point, there are three simple linear regression (SLR) assumptions \index{regression assumptions} that will form the basis for the rest of this chapter:
\bigskip
```{block, type="assumption"} We assume that (\mu) is a linear function of (x), that is, \begin{equation} \mu(x)=\beta_{0}+\beta_{1}x, \end{equation} where (\beta_{0}) and (\beta_{1}) are unknown constants to be estimated. \bigskip
```{block, type="assumption"} We further assume that (Y_{i}) is (\mu(x_{i})), a "signal", plus some "error" (represented by the symbol (\epsilon_{i})): \begin{equation} Y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}, \quad i = 1,2,\ldots,n. \end{equation}
\bigskip ```{block, type="assumption"} We lastly assume that the errors are IID normal with mean 0 and variance \(\sigma^{2}\): \begin{equation} \epsilon_{1},\epsilon_{2},\ldots,\epsilon_{n}\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma). \end{equation}
\bigskip
```{block, type="rem"} We assume both the normality of the errors (\epsilon) and the linearity of the mean function (\mu). Recall from Proposition \@ref(pro:mvnorm-cond-dist) of Chapter \@ref(cha-multivariable-distributions) that if ((X,Y) \sim \mathsf{mvnorm}) then the mean of (Y|x) is a linear function of (x). This is not a coincidence. In more advanced classes we study the case that both (X) and (Y) are random, and in particular, when they are jointly normally distributed.
### What does it all mean? See Figure \@ref(fig:philosophy). Shown in the figure is a solid line, the regression line \index{regression line} \(\mu\), which in this display has slope 0.5 and *y*-intercept 2.5, that is, \(\mu(x)=2.5 + 0.5x\). The intuition is that for each given value of \(x\), we observe a random value of \(Y\) which is normally distributed with a mean equal to the height of the regression line at that \(x\) value. Normal densities are superimposed on the plot to drive this point home; in principle, the densities stand outside of the page, perpendicular to the plane of the paper. The figure shows three such values of \(x\), namely, \(x = 1\), \(x = 2.5\), and \(x = 4\). Not only do we assume that the observations at the three locations are independent, but we also assume that their distributions have the same spread. In mathematical terms this means that the normal densities all along the line have identical standard deviations -- there is no "fanning out" or "scrunching in" of the normal densities as \(x\) increases[^slr-2]. [^slr-2]: In practical terms, this constant variance assumption is often violated, in that we often observe scatterplots that fan out from the line as \(x\) gets large or small. We say under those circumstances that the data show *heteroscedasticity*. There are methods to address it, but they fall outside the realm of SLR. ```r plot(c(0,5), c(0,6.5), type = "n", xlab="x", ylab="y") abline(h = 0, v = 0, col = "gray60") abline(a = 2.5, b = 0.5, lwd = 2) x <- 600:3000/600 y <- dnorm(x, mean = 3, sd = 0.5) lines(y + 1.0, x) lines(y + 2.5, x + 0.75) lines(y + 4.0, x + 1.5) abline(v = c(1, 2.5, 4), lty = 2, col = "grey") segments(1, 3, 1 + dnorm(0,0,0.5),3, lty = 2, col = "gray") segments(2.5, 3.75, 2.5 + dnorm(0,0,0.5), 3.75, lty = 2, col = "gray") segments(4,4.5, 4 + dnorm(0,0,0.5),4.5, lty = 2, col = "gray")
(ref:cap-philosophy) \small Philosophical foundations of SLR.
\bigskip
``{example, name="Speed and stopping distance of cars"}
We will use the data frame \texttt{cars} \index{Data
sets!cars@\texttt{cars}} from the
datasetspackage
[@datasets]. It has two variables:
speedand
dist`. We can take
a look at some of the values in the data frame:
```r head(cars)
The speed
represents how fast the car was going ((x)) in miles per
hour and dist
((Y)) measures how far it took the car to stop, in
feet. We first make a simple scatterplot of the data.
plot(dist ~ speed, data = cars)
(ref:cap-carscatter) \small A scatterplot of dist
versus speed
for the cars
data. There is clearly an upward trend to the plot which is approximately linear.
You can see the output in Figure \@ref(fig:carscatter), which was produced by the following code.
plot(dist ~ speed, data = cars)
There is a pronounced upward trend to the data points, and the pattern looks approximately linear. There does not appear to be substantial fanning out of the points or extreme values.
Where is (\mu(x))? In essence, we would like to "fit" a line to the points. But how do we determine a "good" line? Is there a best line? We will use maximum likelihood \index{maximum likelihood} to find it. We know: \begin{equation} Y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i},\quad i=1,\ldots,n, \end{equation} where the (\epsilon_{i}) are IID (\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma)). Thus (Y_{i}\sim\mathsf{norm}(\mathtt{mean}=\beta_{0}+\beta_{1}x_{i},\,\mathtt{sd}=\sigma),\ i=1,\ldots,n). Furthermore, (Y_{1},\ldots,Y_{n}) are independent -- but not identically distributed. The likelihood function \index{likelihood function} is: \begin{alignat}{1} L(\beta_{0},\beta_{1},\sigma)= & \prod_{i=1}^{n}f_{Y_{i}}(y_{i}),\ = & \prod_{i=1}^{n}(2\pi\sigma^{2})^{-1/2}\exp\left{ \frac{-(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}}{2\sigma^{2}}\right} ,\ = & (2\pi\sigma^{2})^{-n/2}\exp\left{ \frac{-\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}}{2\sigma^{2}}\right} . \end{alignat} We take the natural logarithm to get \begin{equation} \label{eq-regML-lnL} \ln L(\beta_{0},\beta_{1},\sigma)=-\frac{n}{2}\ln(2\pi\sigma^{2})-\frac{\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}}{2\sigma^{2}}. \end{equation} We would like to maximize this function of (\beta_{0}) and (\beta_{1}). See Appendix \@ref(sec-multivariable-calculus) which tells us that we should find critical points by means of the partial derivatives. Let us start by differentiating with respect to (\beta_{0}): \begin{equation} \frac{\partial}{\partial\beta_{0}}\ln L=0-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}2(y_{i}-\beta_{0}-\beta_{1}x_{i})(-1), \end{equation} and the partial derivative equals zero when (\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i}) = 0), that is, when \begin{equation} \label{eq-regML-a} n \beta_{0} + \beta_{1} \sum_{i=1}^{n} x_{i} = \sum_{i = 1}^{n}y_{i}. \end{equation} Moving on, we next take the partial derivative of (\ln L) (Equation \eqref{eq-regML-lnL}) with respect to (\beta_{1}) to get
\begin{alignat}{1} \frac{\partial}{\partial \beta_{1}} \ln L = \ & 0 - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} 2 (y_{i} - \beta_{0} - \beta_{1} x_{i})(-x_{i}),\ = & \frac{1}{\sigma^{2}}\sum_{i = 1}^{n}\left(x_{i} y_{i} - \beta_{0}x_{i} - \beta_{1}x_{i}^{2}\right), \end{alignat} and this equals zero when the last sum equals zero, that is, when \begin{equation} \label{eq-regML-b} \beta_{0} \sum_{i = 1}^{n}x_{i} + \beta_{1} \sum_{i = 1}^{n}x_{i}^{2} = \sum_{i = 1}^{n}x_{i}y_{i}. \end{equation} Solving the system of equations \eqref{eq-regML-a} and \eqref{eq-regML-b} \begin{eqnarray} n\beta_{0} + \beta_{1}\sum_{i = 1}^{n}x_{i} & = & \sum_{i = 1}^{n}y_{i}\ \beta_{0}\sum_{i = 1}^{n}x_{i}+\beta_{1}\sum_{i = 1}^{n}x_{i}^{2} & = & \sum_{i = 1}^{n}x_{i}y_{i} \end{eqnarray} for (\beta_{0}) and (\beta_{1}) (in Exercise \@ref(xca:find-mles-slr)) gives \begin{equation} \label{eq-regline-slope-formula} \hat{\beta}{1} = \frac{\sum{i = 1}^{n}x_{i}y_{i} - \left.\left(\sum_{i = 1}^{n}x_{i}\right)\left(\sum_{i = 1}^{n}y_{i}\right)\right] n}{\sum_{i = 1}^{n}x_{i}^{2} - \left.\left(\sum_{i = 1}^{n}x_{i}\right)^{2}\right/ n} \end{equation} and \begin{equation} \hat{\beta}{0} = \overline{y} - \hat{\beta}{1}\overline{x}. \end{equation}
The conclusion? To estimate the mean line \begin{equation} \mu(x) = \beta_{0} + \beta_{1}x, \end{equation} we use the "line of best fit" \begin{equation} \hat{\mu}(x) = \hat{\beta}{0} + \hat{\beta}{1}x, \end{equation} where (\hat{\beta}{0}) and (\hat{\beta}{1}) are given as above. For notation we will usually write (b_{0} = \hat{\beta_{0}}) and (b_{1}=\hat{\beta_{1}}) so that (\hat{\mu}(x) = b_{0} + b_{1}x).
\bigskip
```{block, type="rem"} The formula for (b_{1}) in Equation \eqref{eq-regline-slope-formula} gets the job done but does not really make any sense. There are many equivalent formulas for (b_{1}) that are more intuitive, or at the least are easier to remember. One of the author's favorites is \begin{equation} \label{eq-sample-correlation-formula} b_{1} = r\frac{s_{y}}{s_{x}}, \end{equation} where (r), (s_{y}), and (s_{x}) are the sample correlation coefficient and the sample standard deviations of the (Y) and (x) data, respectively. See Exercise \@ref(xca:show-alternate-slope-formula). Also, notice the similarity between Equation \eqref{eq-sample-correlation-formula} and Equation \eqref{eq-population-slope-slr}.
#### How to do it with R ```r tmpcoef <- round(as.numeric(coef(lm(dist ~ speed, cars))), 2)
Here we go. R will calculate the linear regression line
with the lm
function. We will store the result in an object which we
will call cars.lm
. Here is how it works:
cars.lm <- lm(dist ~ speed, data = cars)
The first part of the input to the lm
function, dist ~ speed
, is a
model formula, read like this: dist
is described (or modeled) by
speed
. The data = cars
argument tells R where to look
for the variables quoted in the model formula. The output object
cars.lm
contains a multitude of information. Let's first take a look
at the coefficients of the fitted regression line, which are extracted
by the coef
function (alternatively, we could just type cars.lm
to
see the same thing):
coef(cars.lm)
The parameter estimates (b_{0}) and (b_{1}) for the intercept and slope, respectively, are shown above.
It is good practice to visually inspect the data with the regression
line added to the plot. To do this we first scatterplot the original
data and then follow with a call to the abline
function. The inputs
to abline
are the coefficients of cars.lm
; see Figure
\@ref(fig:carline).
plot(dist ~ speed, data = cars) abline(coef(cars.lm))
(ref:cap-carline) \small A scatterplot with an added regression line for the cars
data.
To calculate points on the regression line we may simply plug the
desired (x) value(s) into (\hat{\mu}), either by hand, or with
the predict
function. The inputs to predict
are the fitted linear
model object, cars.lm
, and the desired (x) value(s) represented by
a data frame. See the example below.
\bigskip
``{example, label="regline-cars-interpret"}
Using the regression line for the
cars` data:
1. What is the meaning of \(\mu(8) = \beta_{0} + \beta_{1}(8)\)? This represents the average stopping distance (in feet) for a car going 8 mph. 2. Interpret the slope \(\beta_{1}\). The true slope \(\beta_{1}\) represents the increase in average stopping distance for each mile per hour faster that the car drives. In this case, we estimate the car to take approximately `r tmpcoef[2]` additional feet to stop for each additional mph increase in speed. 3. Interpret the intercept \(\beta_{0}\). This would represent the mean stopping distance for a car traveling 0 mph (which our regression line estimates to be \(`r tmpcoef[1]`\). Of course, this interpretation does not make any sense for this example, because a car travelling 0 mph takes 0 ft to stop (it was not moving in the first place)! What went wrong? Looking at the data, we notice that the smallest speed for which we have measured data is 4 mph. Therefore, if we predict what would happen for slower speeds then we would be *extrapolating*, a dangerous practice which often gives nonsensical results. ### Point Estimates of the Regression Line {#sub-slr-point-est-regline} We said at the beginning of the chapter that our goal was to estimate \(\mu = \mathbb{E} Y\), and the arguments in Section \@ref(sub-point-estimate-mle-slr) showed how to obtain an estimate \(\hat{\mu}\) of \(\mu\) when the regression assumptions hold. Now we will reap the benefits of our work in more ways than we previously disclosed. Given a particular value \(x_{0}\), there are two values we would like to estimate: 1. the mean value of \(Y\) at \(x_{0}\), and 2. a future value of \(Y\) at \(x_{0}\). The first is a number, \(\mu(x_{0})\), and the second is a random variable, \(Y(x_{0})\), but our point estimate is the same for both: \(\hat{\mu}(x_{0})\). \bigskip ```{example, label="regline-cars-pe-8mph"} We may use the regression line to obtain a point estimate of the mean stopping distance for a car traveling 8 mph: \(\hat{\mu}(15) = b_{0} + (8) (b_{1})\) which is approximately 13.88. We would also use 13.88 as a point estimate for the stopping distance of a future car traveling 8 mph.
Note that we actually have observed data for a car traveling 8 mph;
its stopping distance was 16 ft as listed in the fifth row of the
cars
data (which we saw in Example \@ref(ex:speed-and-stopping)).
cars[5, ]
There is a special name for estimates (\hat{\mu}(x_{0})) when ( x_{0}) matches an observed value (x_{i}) from the data set. They are called fitted values, they are denoted by (\hat{Y}{1}), (\hat{Y}{2}), ..., (\hat{Y}_{n}) (ignoring repetition), and they play an important role in the sections that follow.
In an abuse of notation we will sometimes write (\hat{Y}) or (\hat{Y}(x_{0})) to denote a point on the regression line even when (x_{0}) does not belong to the original data if the context of the statement obviates any danger of confusion.
We saw in Example \@ref(ex:regline-cars-interpret) that spooky things can happen when we are cavalier about point estimation. While it is usually acceptable to predict/estimate at values of (x_{0}) that fall within the range of the original (x) data, it is reckless to use (\hat{\mu}) for point estimates at locations outside that range. Such estimates are usually worthless. Do not extrapolate unless there are compelling external reasons, and even then, temper it with a good deal of caution.
The fitted values are automatically computed as a byproduct of the
model fitting procedure and are already stored as a component of the
cars.lm
object. We may access them with the fitted
function (we
only show the first five entries):
fitted(cars.lm)[1:5]
Predictions at (x) values that are not necessarily part of the
original data are done with the predict
function. The first argument
is the original cars.lm
object and the second argument newdata
accepts a dataframe (in the same form that was used to fit cars.lm
)
that contains the locations at which we are seeking predictions. Let
us predict the average stopping distances of cars traveling 6 mph, 8
mph, and 21 mph:
predict(cars.lm, newdata = data.frame(speed = c(6, 8, 21)))
Note that there were no observed cars that traveled 6 mph or 21 mph. Also note that our estimate for a car traveling 8 mph matches the value we computed by hand in Example \@ref(ex:regline-cars-pe-8mph).
To find the MLE of (\sigma^{2}) we consider the partial derivative \begin{equation} \frac{\partial}{\partial\sigma^{2}}\ln L=\frac{n}{2\sigma^{2}}-\frac{1}{2(\sigma^{2})^{2}}\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}, \end{equation} and after plugging in (\hat{\beta}{0}) and (\hat{\beta}{1}) and setting equal to zero we get \begin{equation} \hat{\sigma^{2}}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{\beta}{0}-\hat{\beta}{1}x_{i})^{2}=\frac{1}{n}\sum_{i=1}^{n}[y_{i}-\hat{\mu}(x_{i})]^{2}. \end{equation} We write (\hat{Yi}=\hat{\mu}(x_{i})), and we let (E_{i}=Y_{i}-\hat{Y_{i}}) be the (i^{\mathrm{th}}) residual. We see \begin{equation} n\hat{\sigma^{2}}=\sum_{i=1}^{n}E_{i}^{2}=SSE=\mbox{ the sum of squared errors.} \end{equation} For a point estimate of (\sigma^{2}) we use the mean square error (S^{2}) defined by \begin{equation} S^{2}=\frac{SSE}{n-2}, \end{equation} and we estimate (\sigma) with the standard error (S=\sqrt{S^{2}})[^slr-3].
[^slr-3]: Be careful not to confuse the mean square error (S^{2}) with the sample variance (S^{2}) in Chapter \ref{cha-Describing-Data-Distributions}. Other notation the reader may encounter is the lowercase (s^{2}) or the bulky (MSE).
The residuals for the model may be obtained with the residuals
function; we only show the first few entries in the interest of space:
residuals(cars.lm)[1:5]
tmpred <- round(as.numeric(predict(cars.lm, newdata = data.frame(speed = 8))), 2) tmps <- round(summary(cars.lm)$sigma, 2)
In the last section, we calculated the fitted value for (x=8) and
found it to be approximately (\hat{\mu}(8) \approx) r tmpred
. Now, it turns out that there was only one recorded observation
at (x = 8), and we have seen this value in the output of
head(cars)
in Example \@ref(ex:speed-and-stopping); it was
(\mathtt{dist} = 16) ft for a car with (\mathtt{speed} = 8) mph. Therefore, the
residual should be (E = Y - \hat{Y}) which is (E \approx 16 -)
r tmpred
. Now take a look at the last entry of
residuals(cars.lm)
, above. It is not a coincidence.
The estimate (S) for (\sigma) is called the Residual standard
error
and for the cars
data is shown a few lines up on the
summary(cars.lm)
output (see How to do it with R in
Section \@ref(sub-slr-interval-est-params)). We may read it from there
to be (S\approx) r tmps
, or we can access it directly from the
summary
object.
carsumry <- summary(cars.lm) carsumry$sigma
We discussed general interval estimation in Chapter \@ref(cha-estimation). There we found that we could use what we know about the sampling distribution of certain statistics to construct confidence intervals for the parameter being estimated. We will continue in that vein, and to get started we will determine the sampling distributions of the parameter estimates, (b_{1}) and (b_{0}).
To that end, we can see from Equation \eqref{eq-regline-slope-formula} (and it is made clear in Chapter \@ref(cha-multiple-linear-regression)) that (b_{1}) is just a linear combination of normally distributed random variables, so (b_{1}) is normally distributed too. Further, it can be shown that \begin{equation} b_{1}\sim\mathsf{norm}\left(\mathtt{mean}=\beta_{1},\,\mathtt{sd}=\sigma_{b_{1}}\right) \end{equation} where \begin{equation} \sigma_{b_{1}}=\frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}} \end{equation} is called the standard error of (b_{1}) which unfortunately depends on the unknown value of (\sigma). We do not lose heart, though, because we can estimate (\sigma) with the standard error (S) from the last section. This gives us an estimate (S_{b_{1}}) for (\sigma_{b_{1}}) defined by \begin{equation} S_{b_{1}}=\frac{S}{\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}. \end{equation}
Now, it turns out that (b_{0}), (b_{1}), and (S) are mutually independent (see the footnote in Section \@ref(sub-mlr-interval-est-params)). Therefore, the quantity \begin{equation} T=\frac{b_{1}-\beta_{1}}{S_{b_{1}}} \end{equation} has a (\mathsf{t}(\mathtt{df}=n-2)) distribution and a (100(1-\alpha)\%) confidence interval for (\beta_{1}) is given by \begin{equation} b_{1}\pm\mathsf{t}{\alpha/2}(\mathtt{df}=n-1)\, S{b_{1}.} \end{equation}
It is also sometimes of interest to construct a confidence interval for (\beta_{0}) in which case we will need the sampling distribution of (b_{0}). It is shown in Chapter \@ref(cha-multiple-linear-regression) that \begin{equation} b_{0}\sim\mathsf{norm}\left(\mathtt{mean}=\beta_{0},\,\mathtt{sd}=\sigma_{b_{0}}\right), \end{equation} where (\sigma_{b_{0}}) is given by \begin{equation} \sigma_{b_{0}}=\sigma\sqrt{\frac{1}{n}+\frac{\overline{x}^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}, \end{equation} and which we estimate with the (S_{b_{0}}) defined by \begin{equation} S_{b_{0}}=S\sqrt{\frac{1}{n}+\frac{\overline{x}^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}. \end{equation} Thus the quantity \begin{equation} T=\frac{b_{0}-\beta_{0}}{S_{b_{0}}} \end{equation} has a (\mathsf{t}(\mathtt{df}=n-2)) distribution and a (100(1-\alpha)\%) confidence interval for (\beta_{0}) is given by \begin{equation} b_{0}\pm\mathsf{t}{\alpha/2}(\mathtt{df}=n-1)\, S{b_{0}}. \end{equation}
A <- matrix(as.numeric(round(carsumry$coef, 3)), nrow = 2) B <- round(confint(cars.lm), 3)
Let us take a look at the output from summary(cars.lm)
:
summary(cars.lm)
In the Coefficients
section we find the parameter estimates and
their respective standard errors in the second and third columns; the
other columns are discussed in Section \@ref(sec-model-utility-slr). If we
wanted, say, a 95% confidence interval for (\beta_{1}) we could use
(b_{1} =) r A[2,1]
and (S_{b_{1}} =) r A[2,2]
together with a (\mathsf{t}{0.025}(\mathtt{df}=23))
critical value to calculate (b{1} \pm
\mathsf{t}{0.025}(\mathtt{df} = 23) \cdot S{b_{1}}). Or, we could
use the confint
function.
confint(cars.lm)
With 95% confidence, the random interval r B[2,1]
to
r B[2,2]
covers the parameter (\beta_{1}).
We have seen how to estimate the coefficients of regression line with both point estimates and confidence intervals. We even saw how to estimate a value (\hat{\mu}(x)) on the regression line for a given value of (x), such as (x=15).
But how good is our estimate (\hat{\mu}(15))? How much confidence do we have in this estimate? Furthermore, suppose we were going to observe another value of (Y) at (x=15). What could we say?
Intuitively, it should be easier to get bounds on the mean (average) value of (Y) at (x_{0}) -- called a confidence interval for the mean value of (Y) at (x_{0}) -- than it is to get bounds on a future observation of (Y) (called a prediction interval for (Y) at (x_{0})). As we shall see, the intuition serves us well and confidence intervals are shorter for the mean value, longer for the individual value.
Our point estimate of (\mu(x_{0})) is of course (\hat{Y}=\hat{Y}(x_{0})), so for a confidence interval we will need to know (\hat{Y})'s sampling distribution. It turns out (see Section ) that (\hat{Y}=\hat{\mu}(x_{0})) is distributed \begin{equation} \hat{Y}\sim\mathsf{norm}\left(\mathtt{mean}=\mu(x_{0}),\:\mathtt{sd}=\sigma\sqrt{\frac{1}{n}+\frac{(x_{0}-\overline{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}\right). \end{equation}
Since (\sigma) is unknown we estimate it with (S) (we should expect the appearance of a (\mathsf{t}(\mathtt{df}=n-2)) distribution in the near future).
A (100(1-\alpha)\%) confidence interval (CI) for (\mu(x_{0})) is given by \begin{equation} \label{eq-SLR-conf-int-formula} \hat{Y}\pm\mathsf{t}{\alpha/2}(\mathtt{df}=n-2)\, S\sqrt{\frac{1}{n}+\frac{(x{0}-\overline{x}^{2})}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}. \end{equation} Prediction intervals are a little bit different. In order to find confidence bounds for a new observation of (Y) (we will denote it (Y_{\mbox{new}})) we use the fact that \begin{equation} Y_{\mbox{new}}\sim\mathtt{norm}\left(\mathtt{mean}=\mu(x_{0}),\,\mathtt{sd}=\sigma\sqrt{1+\frac{1}{n}+\frac{(x_{0}-\overline{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}\right). \end{equation} Of course, (\sigma) is unknown so we estimate it with (S) and a (100(1-\alpha)\%) prediction interval (PI) for a future value of (Y) at (x_{0}) is given by \begin{equation} \label{eq-SLR-pred-int-formula} \hat{Y}(x_{0})\pm\mathsf{t}{\alpha/2}(\mathtt{df}=n-1)\: S\,\sqrt{1+\frac{1}{n}+\frac{(x{0}-\overline{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}. \end{equation} We notice that the prediction interval in Equation \eqref{eq-SLR-pred-int-formula} is wider than the confidence interval in Equation \eqref{eq-SLR-conf-int-formula}, as we expected at the beginning of the section.
Confidence and prediction intervals are calculated in R
with the predict
\index{predict@\texttt{predict}} function, which we
encountered in Section \@ref(sub-slr-point-est-regline). There we neglected to
take advantage of its additional interval
argument. The general
syntax follows.
\bigskip
We will find confidence and prediction intervals for the stopping distance of a car travelling 5, 6, and 21 mph (note from the graph that there are no collected data for these speeds). We have computed `cars.lm` earlier, and we will use this for input to the `predict` function. Also, we need to tell R the values of \(x_{0}\) at which we want the predictions made, and store the \(x_{0}\) values in a data frame whose variable is labeled with the correct name. *This is important*.
new <- data.frame(speed = c(5, 6, 21))
Next we instruct R to calculate the intervals. Confidence intervals are given by
predict(cars.lm, newdata = new, interval = "confidence")
carsCI <- round(predict(cars.lm, newdata = new, interval = "confidence"), 2)
Prediction intervals are given by
predict(cars.lm, newdata = new, interval = "prediction")
carsPI <- round(predict(cars.lm, newdata = new, interval = "prediction"), 2)
The type of interval is dictated by the interval
argument (which is
none
by default), and the default confidence level is 95\% (which
can be changed with the level
argument).
\bigskip
Using the `cars` data,
r carsCI[1,1]
, so a point
estimate would be r carsCI[1,1]
ft. The 95% CI is
given by r carsCI[1,2]
to r carsCI[1,3]
, so
with 95% confidence the mean stopping distance lies somewhere
between r carsCI[1,2]
ft and r carsCI[1,3]
ft.r carsPI[3,1]
, so a point
prediction for the stopping distance is r carsPI[3,1]
ft. The
95% PI is r carsPI[3,2]
to r carsPI[3,3]
, so with
95% confidence we may assert that the hypothetical stopping
distance for a car travelling 21 mph would lie somewhere between
r carsPI[3,2]
ft and r carsPI[3,3]
ft.We earlier guessed that a bound on the value of a single new observation would be inherently less certain than a bound for an average (mean) value; therefore, we expect the CIs for the mean to be tighter than the PIs for a new observation. A close look at the standard deviations in Equations \eqref{eq-SLR-conf-int-formula} and \eqref{eq-SLR-pred-int-formula} confirms our guess, but we would like to see a picture to drive the point home.
We may plot the confidence and prediction intervals with one fell
swoop using the ci.plot
function from the HH
package
[@HH]. The graph is displayed in Figure \@ref(fig:carscipi).
library(HH) ci.plot(cars.lm)
Notice that the bands curve outward from the regression line as the (x) values move away from the center. This is expected once we notice the ((x_{0}-\overline{x})^{2}) term in the standard deviation formulas in Equations \eqref{eq-SLR-conf-int-formula} and \eqref{eq-SLR-pred-int-formula}.
print(ci.plot(cars.lm))
(ref:cap-carscipi) \small A scatterplot with confidence/prediction bands for the cars
data.
Much of the attention of SLR is directed toward (\beta_{1}) because when (\beta_{1}\neq 0) the mean value of (Y) increases (or decreases) as (x) increases. It is really boring when (\beta_{1}=0), because in that case the mean value of (Y) remains the same, regardless of the value of (x) (when the regression assumptions hold, of course). It is thus very important to decide whether or not (\beta_{1} = 0). We address the question with a statistical test of the null hypothesis (H_{0}:\beta_{1}=0) versus the alternative hypothesis (H_{1}:\beta_{1}\neq0), and to do that we need to know the sampling distribution of (b_{1}) when the null hypothesis is true.
To this end we already know from Section \@ref(sub-slr-interval-est-params) that the quantity
\begin{equation} T=\frac{b_{1}-\beta_{1}}{S_{b_{1}}} \end{equation} has a (\mathsf{t}(\mathtt{df}=n-2)) distribution; therefore, when (\beta_{1}=0) the quantity (b_{1}/S_{b_{1}}) has a (\mathsf{t}(\mathtt{df}=n-2)) distribution and we can compute a (p)-value by comparing the observed value of (b_{1}/S{}{b{1}}) with values under a (\mathsf{t}(\mathtt{df}=n-2)) curve.
Similarly, we may test the hypothesis (H_{0}:\beta_{0}=0) versus the alternative (H_{1}:\beta_{0}\neq0) with the statistic (T=b_{0}/S_{b_{0}}), where (S_{b_{0}}) is given in Section \@ref(sub-slr-interval-est-params). The test is conducted the same way as for (\beta_{1}).
Let us take another look at the output from summary(cars.lm)
:
summary(cars.lm)
In the Coefficients
section we find the (t) statistics and the
(p)-values associated with the tests that the respective parameters
are zero in the fourth and fifth columns. Since the (p)-values are
(much) less than 0.05, we conclude that there is strong evidence that
the parameters (\beta_{1}\neq0) and (\beta_{0}\neq0), and as such,
we say that there is a statistically significant linear relationship
between dist
and speed
.
It would be nice to have a single number that indicates how well our linear regression model is doing, and the simple coefficient of determination is designed for that purpose. In what follows, we observe the values (Y_{1}), (Y_{2}), ...,(Y_{n}), and the goal is to estimate (\mu(x_{0})), the mean value of (Y) at the location (x_{0}).
If we disregard the dependence of (Y) and (x) and base our estimate only on the (Y) values then a reasonable choice for an estimator is just the MLE of (\mu), which is (\overline{Y}). Then the errors incurred by the estimate are just (Y_{i}-\overline{Y}) and the variation about the estimate as measured by the sample variance is proportional to \begin{equation} SSTO=\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}. \end{equation} The acronym (SSTO) stands for total sum of squares. And we have additional information, namely, we have values (x_{i}) associated with each value of (Y_{i}). We have seen that this information leads us to the estimate (\hat{Y_{i}}) and the errors incurred are just the residuals, (E_{i}=Y_{i}-\hat{Y_{i}}). The variation associated with these errors can be measured with \begin{equation} SSE=\sum_{i=1}^{n}(Y_{i}-\hat{Y_{i}})^{2}. \end{equation} We have seen the (SSE) before, which stands for the sum of squared errors or error sum of squares. Of course, we would expect the error to be less in the latter case, since we have used more information. The improvement in our estimation as a result of the linear regression model can be measured with the difference [ (Y_{i}-\overline{Y})-(Y_{i}-\hat{Y_{i}})=\hat{Y_{i}}-\overline{Y}, ] and we measure the variation in these errors with \begin{equation} SSR=\sum_{i=1}^{n}(\hat{Y_{i}}-\overline{Y})^{2}, \end{equation} also known as the regression sum of squares. It is not obvious, but some algebra proved a famous result known as the ANOVA Equality: \begin{equation} \label{eq-anovaeq} \sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}=\sum_{i=1}^{n}(\hat{Y_{i}}-\overline{Y})^{2}+\sum_{i=1}^{n}(Y_{i}-\hat{Y_{i}})^{2} \end{equation} or in other words, \begin{equation} SSTO=SSR+SSE. \end{equation} This equality has a nice interpretation. Consider (SSTO) to be the total variation of the errors. Think of a decomposition of the total variation into pieces: one piece measuring the reduction of error from using the linear regression model, or explained variation ((SSR)), while the other represents what is left over, that is, the errors that the linear regression model doesn't explain, or unexplained variation ((SSE)). In this way we see that the ANOVA equality merely partitions the variation into [ \mbox{total variation}=\mbox{explained variation}+\mbox{unexplained variation}. ] For a single number to summarize how well our model is doing we use the simple coefficient of determination (r^{2}), defined by \begin{equation} r^{2}=1-\frac{SSE}{SSTO}. \end{equation} We interpret (r^{2}) as the proportion of total variation that is explained by the simple linear regression model. When (r^{2}) is large, the model is doing a good job; when (r^{2}) is small, the model is not doing a good job.
Related to the simple coefficient of determination is the sample correlation coefficient, (r). As you can guess, the way we get (r) is by the formula (|r|=\sqrt{r^{2}}). The sign of (r) is equal the sign of the slope estimate (b_{1}). That is, if the regression line (\hat{\mu}(x)) has positive slope, then (r=\sqrt{r^{2}}). Likewise, if the slope of (\hat{\mu}(x)) is negative, then (r=-\sqrt{r^{2}}).
The primary method to display partitioned sums of squared errors is
with an ANOVA table. The command in R to produce such a
table is anova
. The input to anova
is the result of an lm
call
which for the cars
data is cars.lm
.
anova(cars.lm)
The output gives [ r^{2}=1-\frac{SSE}{SSR+SSE}=1-\frac{11353.5}{21185.5+11353.5}\approx0.65. ]
The interpretation should be: "The linear regression line accounts for
approximately 65% of the variation of dist
as explained by speed
".
The value of (r^{2}) is stored in the r.squared
component of
summary(cars.lm)
, which we called carsumry
.
carsumry$r.squared
We already knew this. We saw it in the next to the last line of the
summary(cars.lm)
output where it was called Multiple R-squared
. Listed right beside it is the Adjusted R-squared
which
we will discuss in Chapter \@ref(cha-multiple-linear-regression). For the cars
data, we find (r) to be
sqrt(carsumry$r.squared)
We choose the principal square root because the slope of the regression line is positive.
There is another way to test the significance of the linear regression model. In SLR, the new way also tests the hypothesis (H_{0}:\beta_{1}=0) versus (H_{1}:\beta_{1}\neq0), but it is done with a new test statistic called the overall F statistic. It is defined by \begin{equation} \label{eq-slr-overall-F-statistic} F=\frac{SSR}{SSE/(n-2)}. \end{equation}
Under the regression assumptions and when (H_{0}) is true, the (F) statistic has an (\mathtt{f}(\mathtt{df1}=1,\,\mathtt{df2}=n-2)) distribution. We reject (H_{0}) when (F) is large -- that is, when the explained variation is large relative to the unexplained variation.
All this being said, we have not yet gained much from the overall (F) statistic because we already knew from Section \@ref(sub-slr-hypoth-test-params) how to test (H_{0}:\beta_{1} = 0)... we use the Student's (t) statistic. What is worse is that (in the simple linear regression model) it can be proved that the (F) in Equation \eqref{eq-slr-overall-F-statistic} is exactly the Student's (t) statistic for (\beta_{1}) squared,
\begin{equation} F=\left(\frac{b_{1}}{S_{b_{1}}}\right)^{2}. \end{equation}
So why bother to define the (F) statistic? Why not just square the (t) statistic and be done with it? The answer is that the (F) statistic has a more complicated interpretation and plays a more important role in the multiple linear regression model which we will study in Chapter \@ref(cha-multiple-linear-regression). See Section \@ref(sub-mlr-overall-f-test) for details.
The overall (F) statistic and (p)-value are displayed in the
bottom line of the summary(cars.lm)
output. It is also shown in the
final columns of anova(cars.lm)
:
anova(cars.lm)
tmpf <- round(as.numeric(carsumry$fstatistic[1]), 2)
Here we see that the (F) statistic is r tmpf
with a
(p)-value very close to zero. The conclusion: there is very strong
evidence that (H_{0}:\beta_{1} = 0) is false, that is, there is
strong evidence that (\beta_{1} \neq 0). Moreover, we conclude
that the regression relationship between dist
and speed
is
significant.
Note that the value of the (F) statistic is the same as the
Student's (t) statistic for speed
squared.
We know from our model that (Y=\mu(x)+\epsilon), or in other words, (\epsilon=Y-\mu(x)). Further, we know that (\epsilon\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma)). We may estimate (\epsilon_{i}) with the residual (E_{i}=Y_{i}-\hat{Y_{i}}), where (\hat{Y_{i}}=\hat{\mu}(x_{i})). If the regression assumptions hold, then the residuals should be normally distributed. We check this in Section \@ref(sub-normality-assumption). Further, the residuals should have mean zero with constant variance (\sigma^{2}), and we check this in Section \@ref(sub-constant-variance-assumption). Last, the residuals should be independent, and we check this in Section \@ref(sub-independence-assumption).
In every case, we will begin by looking at residual plots -- that is, scatterplots of the residuals (E_{i}) versus index or predicted values (\hat{Y_{i}}) -- and follow up with hypothesis tests.
We can assess the normality of the residuals with graphical methods and hypothesis tests. To check graphically whether the residuals are normally distributed we may look at histograms or q-q plots. We first examine a histogram in Figure \@ref(fig:normal-hist-cars), which was produced by the following code.
hist(residuals(cars.lm))
(ref:cap-normal-hist-cars) \small A histogram of the residuals from the linear regression model for the cars
data. Used for checking the normality assumption. Look out for any skewness or extreme values; hopefully the graph is mound shaped and symmetric about zero, with no outliers.
There we see that the distribution of
the residuals appears to be mound shaped, for the most part. We can
plot the order statistics of the studentized residuals versus quantiles from a
(\mathsf{t}(\mathtt{mean}=0,\,\mathtt{sd}=1)) distribution with
the command qqPlot(cars.lm)
from the RcmdrMisc
package [@RcmdrMisc], and the results are in Figure
\@ref(fig:normal-q-q-plot-cars). If the assumption of normality were
true, then we would expect points randomly scattered about the dotted
straight line displayed in the figure. In this case, we see a slight
departure from normality in that the dots show a bit of clustering
on one side or the other of the line. The points on the upper end of
the plot also begin to stray from the line. We would say there
is some evidence that the residuals are not perfectly normal.
library(RcmdrMisc) qqPlot(cars.lm)
(ref:cap-normal-q-q-plot-cars) \small A quantile-quantile plot of the studentized residuals from the linear regression model for the cars
data. Used for checking the normality assumption. Look out for any curvature or substantial departures from the straight line; hopefully the dots hug the line closely and stay within the bands.
Even though we may be concerned about the plots, we can use tests to determine if the evidence present is statistically significant, or if it could have happened merely by chance. There are many statistical tests of normality. We will use the Shapiro-Wilk test, since it is known to be a good test and to be quite powerful. However, there are many other fine tests of normality including the Anderson-Darling test and the Lillefors test, just to mention two of them.
The Shapiro-Wilk test is based on the statistic
\begin{equation}
W=\frac{\left(\sum_{i=1}^{n}a_{i}E_{(i)}\right)^{2}}{\sum_{j=1}^{n}E_{j}^{2}},
\end{equation}
where the (E_{(i)}) are the ordered residuals and the (a_{i}) are
constants derived from the order statistics of a sample of size (n)
from a normal distribution. See Section \@ref(sub-shapiro-wilk-normality-test).
We perform the Shapiro-Wilk test below, using the shapiro.test
function from the stats
package [@stats]. The hypotheses are [
H_{0}:\mbox{ the residuals are normally distributed } ] versus [
H_{1}:\mbox{ the residuals are not normally distributed.} ] The
results from R are
shapiro.test(residuals(cars.lm))
For these data we would reject the assumption of normality of the residuals at the (\alpha=0.05) significance level, but do not lose heart, because the regression model is reasonably robust to departures from the normality assumption. As long as the residual distribution is not highly skewed, then the regression estimators will perform reasonably well. Moreover, departures from constant variance and independence will sometimes affect the quantile plots and histograms, therefore it is wise to delay final decisions regarding normality until all diagnostic measures have been investigated.
We will again go to residual plots to try and determine if the spread of the residuals is changing over time (or index). However, it is unfortunately not that easy because the residuals do not have constant variance! In fact, it can be shown that the variance of the residual (E_{i}) is \begin{equation} \mbox{Var$(E_{i})$}=\sigma^{2}(1-h_{ii}),\quad i=1,2,\ldots,n, \end{equation} where (h_{ii}) is a quantity called the leverage which is defined below. Consequently, in order to check the constant variance assumption we must standardize the residuals before plotting. We estimate the standard error of (E_{i}) with (s_{E_{i}}=s\sqrt{(1-h_{ii})}) and define the standardized residuals (R_{i}), (i=1,2,\ldots,n), by \begin{equation} R_{i}=\frac{E_{i}}{s\,\sqrt{1-h_{ii}}},\quad i=1,2,\ldots,n. \end{equation} For the constant variance assumption we do not need the sign of the residual so we will plot (\sqrt{|R_{i}|}) versus the fitted values. As we look at a scatterplot of (\sqrt{|R_{i}|}) versus (\hat{Y}_{i}) we would expect under the regression assumptions to see a constant band of observations, indicating no change in the magnitude of the observed distance from the line. We want to watch out for a fanning-out of the residuals, or a less common funneling-in of the residuals. Both patterns indicate a change in the residual variance and a consequent departure from the regression assumptions, the first an increase, the second a decrease.
In this case, we plot the standardized residuals versus the fitted values. The graph may be seen in Figure \@ref(fig:std-resids-fitted-cars). For these data there does appear to be somewhat of a slight fanning-out of the residuals.
plot(cars.lm, which = 3)
(ref:cap-std-resids-fitted-cars) \small Plot of standardized residuals against the fitted values for the cars
data. Used for checking the constant variance assumption. Watch out for any fanning out (or in) of the dots; hopefully they fall in a constant band.
We will use the Breusch-Pagan test to decide whether the variance of the residuals is nonconstant. The null hypothesis is that the variance is the same for all observations, and the alternative hypothesis is that the variance is not the same for all observations. The test statistic is found by fitting a linear model to the centered squared residuals, \begin{equation} W_{i} = E_{i}^{2} - \frac{SSE}{n}, \quad i=1,2,\ldots,n. \end{equation}
By default the same explanatory variables are used in the new model
which produces fitted values (\hat{W}{i}), (i=1,2,\ldots,n). The
Breusch-Pagan test statistic in R is then calculated with
\begin{equation}
BP=n\sum{i=1}^{n}\hat{W}{i}^{2}\div\sum{i=1}^{n}W_{i}^{2}.
\end{equation}
We reject the null hypothesis if (BP) is too large, which happens
when the explained variation i the new model is large relative to the
unexplained variation in the original model. We do it in
R with the bptest
function from the lmtest
package
[@lmtest].
library(lmtest) bptest(cars.lm)
For these data we would not reject the null hypothesis at the (\alpha=0.05) level. There is relatively weak evidence against the assumption of constant variance.
One of the strongest of the regression assumptions is the one regarding independence. Departures from the independence assumption are often exhibited by correlation (or autocorrelation, literally, self-correlation) present in the residuals. There can be positive or negative correlation.
Positive correlation is displayed by positive residuals followed by positive residuals, and negative residuals followed by negative residuals. Looking from left to right, this is exhibited by a cyclical feature in the residual plots, with long sequences of positive residuals being followed by long sequences of negative ones.
On the other hand, negative correlation implies positive residuals followed by negative residuals, which are then followed by positive residuals, etc. Consequently, negatively correlated residuals are often associated with an alternating pattern in the residual plots. We examine the residual plot in Figure \@ref(fig:resids-fitted-cars). There is no obvious cyclical wave pattern or structure to the residual plot.
plot(cars.lm, which = 1)
(ref:cap-resids-fitted-cars) \small Plot of the residuals versus the fitted values for the cars
data. Used for checking the independence assumption. Watch out for any patterns or structure; hopefully the points are randomly scattered on the plot.
We may statistically test whether there is evidence of autocorrelation
in the residuals with the Durbin-Watson test. The test is based on the
statistic
\begin{equation}
D=\frac{\sum_{i=2}^{n}(E_{i}-E_{i-1})^{2}}{\sum_{j=1}^{n}E_{j}^{2}}.
\end{equation}
Exact critical values are difficult to obtain, but R will
calculate the p-value to great accuracy. It is performed with the
dwtest
function from the lmtest
package [@lmtest]. We will
conduct a two sided test that the correlation is not zero, which is
not the default (the default is to test that the autocorrelation is
positive).
dwtest(cars.lm, alternative = "two.sided")
In this case we do not reject the null hypothesis at the (\alpha=0.05) significance level; there is very little evidence of nonzero autocorrelation in the residuals.
We often find problems with our model that suggest that at least one of the three regression assumptions is violated. What do we do then? There are many measures at the statistician's disposal, and we mention specific steps one can take to improve the model under certain types of violation.
There are two types of observations with which we must be especially careful:
We will discuss outliers first because the notation builds sequentially in that order.
There are three ways that an observation ((x_{i},y_{i})) might be identified as an outlier: it can have an (x_{i}) value which falls far from the other (x) values, it can have a (y_{i}) value which falls far from the other (y) values, or it can have both its (x_{i}) and (y_{i}) values falling far from the other (x) and (y) values.
Leverage statistics are designed to identify observations which have (x) values that are far away from the rest of the data. In the simple linear regression model the leverage of (x_{i}) is denoted by (h_{ii}) and defined by \begin{equation} h_{ii}=\frac{1}{n}+\frac{(x_{i}-\overline{x})^{2}}{\sum_{k=1}^{n}(x_{k}-\overline{x})^{2}},\quad i=1,2,\ldots,n. \end{equation} The formula has a nice interpretation in the SLR model: if the distance from (x_{i}) to (\overline{x}) is large relative to the other (x)'s then (h_{ii}) will be close to 1.
Leverages have nice mathematical properties; for example, they satisfy \begin{equation} \label{eq-slr-leverage-between} 0\leq h_{ii}\leq1, \end{equation} and their sum is \begin{eqnarray} \label{eq-slr-average-leverage} \sum_{i=1}^{n}h_{ii} & = & \sum_{i=1}^{n}\left[\frac{1}{n}+\frac{(x_{i}-\overline{x})^{2}}{\sum_{k=1}^{n}(x_{k}-\overline{x})^{2}}\right],\ & = & \frac{n}{n}+\frac{\sum_{i}(x_{i}-\overline{x})^{2}}{\sum_{k}(x_{k}-\overline{x})^{2}},\ & = & 2. \end{eqnarray}
A rule of thumb is to consider leverage values to be large if they are more than double their average size (which is (2/n) according to Equation \eqref{eq-slr-average-leverage}). So leverages larger than (4/n) are suspect. Another rule of thumb is to say that values bigger than 0.5 indicate high leverage, while values between 0.3 and 0.5 indicate moderate leverage.
We have already encountered the standardized residuals (r_{i}) in Section \@ref(sub-constant-variance-assumption); they are merely residuals that have been divided by their respective standard deviations: \begin{equation} R_{i}=\frac{E_{i}}{S\sqrt{1-h_{ii}}},\quad i=1,2,\ldots,n. \end{equation} Values of (|R_{i}| > 2) are extreme and suggest that the observation has an outlying (y)-value.
Now delete the (i^{\mathrm{th}}) case and fit the regression function to the remaining (n - 1) cases, producing a fitted value (\hat{Y}{(i)}) with deleted residual (D{i}=Y_{i}-\hat{Y}{(i)}). It is shown in later classes that \begin{equation} \mbox{Var $(D{i})$}=\frac{S_{(i)}^{2}}{1-h_{ii}},\quad i=1,2,\ldots,n, \end{equation} so that the studentized deleted residuals (t_{i}) defined by \begin{equation} \label{eq-slr-studentized-deleted-resids} t_{i}=\frac{D_{i}}{S_{(i)}/(1-h_{ii})},\quad i=1,2,\ldots,n, \end{equation} have a (\mathsf{t}(\mathtt{df}=n-3)) distribution and we compare observed values of (t_{i}) to this distribution to decide whether or not an observation is extreme.
The folklore in regression classes is that a test based on the statistic in Equation \eqref{eq-slr-studentized-deleted-resids} can be too liberal. A rule of thumb is if we suspect an observation to be an outlier before seeing the data then we say it is significantly outlying if its two-tailed (p)-value is less than (\alpha), but if we suspect an observation to be an outlier after seeing the data then we should only say it is significantly outlying if its two-tailed (p)-value is less than (\alpha/n). The latter rule of thumb is called the Bonferroni approach and can be overly conservative for large data sets. The responsible statistician should look at the data and use his/her best judgement, in every case.
We can calculate the standardized residuals with the rstandard
function. The input is the lm
object, which is cars.lm
.
sres <- rstandard(cars.lm) sres[1:5]
We can find out which observations have studentized residuals larger than two with the command
sres[which(abs(sres) > 2)]
In this case, we see that observations 23, 35, and 49 are potential
outliers with respect to their (y)-value. We can compute the
studentized deleted residuals with rstudent
:
sdelres <- rstudent(cars.lm) sdelres[1:5]
We should compare these values with critical values from a (\mathsf{t}(\mathtt{df}=n-3)) distribution, which in this case is (\mathsf{t}(\mathtt{df}=50-3=47)). We can calculate a 0.005 quantile and check with
t0.005 <- qt(0.005, df = 47, lower.tail = FALSE) sdelres[which(abs(sdelres) > t0.005)]
This means that observations 23 and 49 have a large studentized
deleted residual. The leverages can be found with the hatvalues
function:
leverage <- hatvalues(cars.lm) leverage[which(leverage > 4/50)]
Here we see that observations 1, 2, and 50 have leverages bigger than double their mean value. These observations would be considered outlying with respect to their (x) value (although they may or may not be influential).
Any time we do a statistical analysis, we are confronted with the variability of data. It is always a concern when an observation plays too large a role in our regression model, and we would not like or procedures to be overly influenced by the value of a single observation. Hence, it becomes desirable to check to see how much our estimates and predictions would change if one of the observations were not included in the analysis. If an observation changes the estimates/predictions a large amount, then the observation is influential and should be subjected to a higher level of scrutiny.
We measure the change in the parameter estimates as a result of
deleting an observation with (DFBETAS). The (DFBETAS) for the
intercept (b_{0}) are given by
\begin{equation}
(DFBETAS){0(i)}=\frac{b{0}-b_{0(i)}}{S_{(i)}\sqrt{\frac{1}{n}+\frac{\overline{x}^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}},\quad i=1,2,\ldots,n.
\end{equation}
and the (DFBETAS) for the slope (b_{1}) are given by
\begin{equation}
(DFBETAS){1(i)}=\frac{b{1}-b_{1(i)}}{S_{(i)}\left[\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}\right]^{-1/2}},\quad i=1,2,\ldots,n.
\end{equation}
See Section \@ref(sec-residual-analysis-mlr) for a better way to write
these. The signs of the (DFBETAS) indicate whether the coefficients
would increase or decrease as a result of including the
observation. If the (DFBETAS) are large, then the observation has a
large impact on those regression coefficients. We label observations
as suspicious if their (DFBETAS) have magnitude greater 1 for small
data or (2/\sqrt{n}) for large data sets. We can calculate the
(DFBETAS) with the dfbetas
function (some output has been
omitted):
dfb <- dfbetas(cars.lm) head(dfb)
We see that the inclusion of the first observation slightly increases
the Intercept
and slightly decreases the coefficient on speed
.
We can measure the influence that an observation has on its fitted
value with (DFFITS). These are calculated by deleting an
observation, refitting the model, recalculating the fit, then
standardizing. The formula is
\begin{equation}
(DFFITS){i}=\frac{\hat{Y{i}}-\hat{Y}{(i)}}{S{(i)}\sqrt{h_{ii}}},\quad i=1,2,\ldots,n.
\end{equation}
The value represents the number of standard deviations of
(\hat{Y_{i}}) that the fitted value (\hat{Y_{i}}) increases or
decreases with the inclusion of the (i^{\textrm{th}})
observation. We can compute them with the dffits
function.
dff <- dffits(cars.lm) dff[1:5]
A rule of thumb is to flag observations whose (DFFIT) exceeds one in absolute value, but there are none of those in this data set.
The (DFFITS) are good for measuring the influence on a single fitted value, but we may want to measure the influence an observation has on all of the fitted values simultaneously. The statistics used for measuring this are Cook's distances which may be calculated[^slr-4] by the formula \begin{equation} \label{eq-slr-cooks-distance} D_{i}=\frac{E_{i}^{2}}{(p+1)S^{2}}\cdot\frac{h_{ii}}{(1-h_{ii})^{2}},\quad i=1,2,\ldots,n. \end{equation} It shows that Cook's distance depends both on the residual (E_{i}) and the leverage (h_{ii}) and in this way (D_{i}) contains information about outlying (x) and (y) values.
To assess the significance of (D), we compare to quantiles of an (\mathsf{f}(\mathtt{df1}=2,\,\mathtt{df2}=n-2)) distribution. A rule of thumb is to classify observations falling higher than the (50^{\mathrm{th}}) percentile as being extreme.
[^slr-4]: Cook's distances are actually defined by a different formula than the one shown. The formula in Equation \eqref{eq-slr-cooks-distance} is algebraically equivalent to the defining formula and is, in the author's opinion, more transparent.
We can calculate the Cook's Distances with the cooks.distance
function.
cooksD <- cooks.distance(cars.lm) cooksD[1:4]
We can look at a plot of the Cook's distances with the command
plot(cars.lm, which = 4)
.
plot(cars.lm, which = 4)
(ref:cap-cooks-distance-cars) \small Cook's distances for the cars
data. Used for checking for influential and/our outlying observations. Values with large Cook's distance merit further investigation.
Observations with the largest Cook's D values are labeled, hence we see that observations 23, 39, and 49 are suspicious. However, we need to compare to the quantiles of an (\mathsf{f}(\mathtt{df1} = 2, \,\mathtt{df2} = 48)) distribution:
F0.50 <- qf(0.5, df1 = 2, df2 = 48) any(cooksD > F0.50)
We see that with this data set there are no observations with extreme Cook's distance, after all.
We can display the result of diagnostic checking all at once in one
table, with potentially influential points displayed. We do it with
the command influence.measures(cars.lm)
:
influence.measures(cars.lm)
The output is a huge matrix display, which we have omitted in the interest of brevity. A point is identified if it is classified to be influential with respect to any of the diagnostic measures. Here we see that observations 2, 11, 15, and 18 merit further investigation.
We can also look at all diagnostic plots at once with the commands
plot(cars.lm)
The par
command is used so that (2\times 2 = 4) plots will be
shown on the same display. The diagnostic plots for the cars
data
are shown in Figure \@ref(fig:diagnostic-plots-cars).
par(mfrow = c(2,2)) plot(cars.lm) par(mfrow = c(1,1))
(ref:cap-diagnostic-plots-cars) \small Diagnostic plots for the cars
data.
We have discussed all of the plots except the last, which is possibly the most interesting. It shows Residuals vs. Leverage, which will identify outlying (y) values versus outlying (x) values. Here we see that observation 23 has a high residual, but low leverage, and it turns out that observations 1 and 2 have relatively high leverage but low/moderate leverage (they are on the right side of the plot, just above the horizontal line). Observation 49 has a large residual with a comparatively large leverage.
We can identify the observations with the identify
command; it
allows us to display the observation number of dots on the
plot. First, we plot the graph, then we call identify
:
plot(cars.lm, which = 5) # std'd resids vs lev plot identify(leverage, sres, n = 4) # identify 4 points
The graph with the identified points is omitted (but the plain plot is shown in the bottom right corner of Figure \@ref(fig:diagnostic-plots-cars)). Observations 1 and 2 fall on the far right side of the plot, near the horizontal axis.
\bigskip
```{block, type="xca"} Prove the ANOVA equality, Equation \eqref{eq-anovaeq}. Hint: show that [ \sum_{i=1}^{n}(Y_{i}-\hat{Y_{i}})(\hat{Y_{i}}-\overline{Y})=0. ]
\bigskip ```{block, type="xca",label="xca-find-mles-slr"} Solve the following system of equations for \(\beta_{1}\) and \(\beta_{0}\) to find the MLEs for slope and intercept in the simple linear regression model. \begin{eqnarray*} n\beta_{0}+\beta_{1}\sum_{i=1}^{n}x_{i} & = & \sum_{i=1}^{n}y_{i}\\ \beta_{0}\sum_{i=1}^{n}x_{i}+\beta_{1}\sum_{i=1}^{n}x_{i}^{2} & = & \sum_{i=1}^{n}x_{i}y_{i} \end{eqnarray*}
\bigskip
```{block, type="xca", label="xca-show-alternate-slope-formula"} Show that the formula given in Equation \eqref{eq-sample-correlation-formula} is equivalent to [ \hat{\beta}{1} = \frac{\sum{i=1}^{n}x_{i}y_{i}-\left.\left(\sum_{i=1}^{n}x_{i}\right)\left(\sum_{i=1}^{n}y_{i}\right)\right/ n}{\sum_{i=1}^{n}x_{i}^{2}-\left.\left(\sum_{i=1}^{n}x_{i}\right)^{2}\right/ n}. ]
```
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.