# Simple Linear Regression {#cha-simple-linear-regression}

#    IPSUR: Introduction to Probability and Statistics Using R
#    Copyright (C) 2018  G. Jay Kerns
#
#    Chapter: Simple Linear Regression
#
#    This file is part of IPSUR.
#
#    IPSUR is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    IPSUR is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with IPSUR.  If not, see <http://www.gnu.org/licenses/>.
```r
# This chapter's package dependencies
library(HH)
library(lmtest)
library(RcmdrMisc)
```

**What do I want them to know?**

## Basic Philosophy {#sec-basic-philosophy}

Here we have two variables \(X\) and \(Y\). For our purposes, \(X\) is not random (so we will write \(x\)), but \(Y\) is random. We believe that \(Y\) depends in some way on \(x\). Some typical examples of \((x,Y)\) pairs are

Given information about the relationship between \(x\) and \(Y\), we would like to predict future values of \(Y\) for particular values of \(x\). This turns out to be a difficult problem[^slr-1], so instead we first tackle an easier problem: we estimate \(\mathbb{E} Y\). How can we accomplish this? Well, we know that \(Y\) depends somehow on \(x\), so it stands to reason that
\begin{equation}
\mathbb{E} Y = \mu(x),\ \mbox{a function of }x.
\end{equation}

[^slr-1]: Yogi Berra once said, "It is always difficult to make predictions, especially about the future."

But we should be able to say more than that. To focus our efforts we impose some structure on the functional form of \(\mu\). For instance,

This helps us in the sense that we concentrate on the estimation of just a few parameters, \(\beta_{0}\) and \(\beta_{1}\), say, rather than some nebulous function. Our modus operandi is simply to perform the random experiment \(n\) times and observe the \(n\) ordered pairs of data \((x_{1},Y_{1}),\ (x_{2},Y_{2}),\ \ldots,(x_{n},Y_{n})\). We use these \(n\) data points to estimate the parameters.

More to the point, there are three simple linear regression (SLR) assumptions \index{regression assumptions} that will form the basis for the rest of this chapter:

\bigskip

```{block, type="assumption"}
We assume that \(\mu\) is a linear function of \(x\), that is,
\begin{equation}
\mu(x)=\beta_{0}+\beta_{1}x,
\end{equation}
where \(\beta_{0}\) and \(\beta_{1}\) are unknown constants to be estimated.
```

\bigskip

```{block, type="assumption"}
We further assume that \(Y_{i}\) is \(\mu(x_{i})\), a "signal", plus some "error" (represented by the symbol \(\epsilon_{i}\)):
\begin{equation}
Y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}, \quad i = 1,2,\ldots,n.
\end{equation}
```

\bigskip

```{block, type="assumption"}
We lastly assume that the errors are IID normal with mean 0 and variance \(\sigma^{2}\):
\begin{equation}
\epsilon_{1},\epsilon_{2},\ldots,\epsilon_{n}\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma).
\end{equation}
```

\bigskip

```{block, type="rem"}
We assume both the normality of the errors \(\epsilon\) and the linearity of the mean function \(\mu\). Recall from Proposition \@ref(pro:mvnorm-cond-dist) of Chapter \@ref(cha-multivariable-distributions) that if \((X,Y) \sim \mathsf{mvnorm}\) then the mean of \(Y|x\) is a linear function of \(x\). This is not a coincidence. In more advanced classes we study the case that both \(X\) and \(Y\) are random, and in particular, when they are jointly normally distributed.
```

### What does it all mean?
See Figure \@ref(fig:philosophy). Shown in the figure is a solid line, the
regression line \index{regression line} \(\mu\), which in
this display has slope 0.5 and *y*-intercept 2.5, that is, \(\mu(x)=2.5 + 0.5x\). The intuition is that for each given value of \(x\), we
observe a random value of \(Y\) which is normally distributed with a
mean equal to the height of the regression line at that \(x\)
value. Normal densities are superimposed on the plot to drive this
point home; in principle, the densities stand outside of the page,
perpendicular to the plane of the paper. The figure shows three such
values of \(x\), namely, \(x = 1\), \(x = 2.5\), and \(x = 4\). Not only do we assume that the observations at the three locations
are independent, but we also assume that their distributions have the
same spread. In mathematical terms this means that the normal
densities all along the line have identical standard deviations --
there is no "fanning out" or "scrunching in" of the normal densities
as \(x\) increases[^slr-2].

[^slr-2]: In practical terms, this constant variance assumption is
often violated, in that we often observe scatterplots that fan out
from the line as \(x\) gets large or small. We say under those
circumstances that the data show *heteroscedasticity*. There are
methods to address it, but they fall outside the realm of SLR.


```r
plot(c(0,5), c(0,6.5), type = "n", xlab="x", ylab="y")
abline(h = 0, v = 0, col = "gray60")
abline(a = 2.5, b = 0.5, lwd = 2)
x <- 600:3000/600
y <- dnorm(x, mean = 3, sd = 0.5)
lines(y + 1.0, x)
lines(y + 2.5, x + 0.75)
lines(y + 4.0, x + 1.5)
abline(v = c(1, 2.5, 4), lty = 2, col = "grey")
segments(1, 3, 1 + dnorm(0,0,0.5),3, lty = 2, col = "gray")
segments(2.5, 3.75, 2.5 + dnorm(0,0,0.5), 3.75, lty = 2, col = "gray")
segments(4, 4.5, 4 + dnorm(0, 0, 0.5), 4.5, lty = 2, col = "gray")
```

(ref:cap-philosophy) \small Philosophical foundations of SLR.

\bigskip

```{example, name="Speed and stopping distance of cars", label="speed-and-stopping"}
We will use the data frame \texttt{cars} \index{Data sets!cars@\texttt{cars}} from the `datasets` package [@datasets]. It has two variables: `speed` and `dist`. We can take a look at some of the values in the data frame:
```

```r 
head(cars)
```

The `speed` represents how fast the car was going \((x)\) in miles per hour and `dist` \((Y)\) measures how far it took the car to stop, in feet. We first make a simple scatterplot of the data.

```r
plot(dist ~ speed, data = cars)
```

(ref:cap-carscatter) \small A scatterplot of dist versus speed for the cars data. There is clearly an upward trend to the plot which is approximately linear.

You can see the output in Figure \@ref(fig:carscatter), which was produced by the following code.

```r
plot(dist ~ speed, data = cars)
```

There is a pronounced upward trend to the data points, and the pattern looks approximately linear. There does not appear to be substantial fanning out of the points or extreme values.

## Estimation {#sec-slr-estimation}

### Point Estimates of the Parameters {#sub-point-estimate-mle-slr}

Where is \(\mu(x)\)? In essence, we would like to "fit" a line to the points. But how do we determine a "good" line? Is there a best line? We will use maximum likelihood \index{maximum likelihood} to find it. We know:
\begin{equation}
Y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i},\quad i=1,\ldots,n,
\end{equation}
where the \(\epsilon_{i}\) are IID \(\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma)\). Thus \(Y_{i}\sim\mathsf{norm}(\mathtt{mean}=\beta_{0}+\beta_{1}x_{i},\,\mathtt{sd}=\sigma)\), \(i=1,\ldots,n\). Furthermore, \(Y_{1},\ldots,Y_{n}\) are independent -- but not identically distributed. The likelihood function \index{likelihood function} is:
\begin{alignat}{1}
L(\beta_{0},\beta_{1},\sigma) = & \prod_{i=1}^{n}f_{Y_{i}}(y_{i}),\\
= & \prod_{i=1}^{n}(2\pi\sigma^{2})^{-1/2}\exp\left\{ \frac{-(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}}{2\sigma^{2}}\right\},\\
= & (2\pi\sigma^{2})^{-n/2}\exp\left\{ \frac{-\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}}{2\sigma^{2}}\right\}.
\end{alignat}
We take the natural logarithm to get
\begin{equation}
\label{eq-regML-lnL}
\ln L(\beta_{0},\beta_{1},\sigma)=-\frac{n}{2}\ln(2\pi\sigma^{2})-\frac{\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2}}{2\sigma^{2}}.
\end{equation}
We would like to maximize this function of \(\beta_{0}\) and \(\beta_{1}\). See Appendix \@ref(sec-multivariable-calculus), which tells us that we should find critical points by means of the partial derivatives. Let us start by differentiating with respect to \(\beta_{0}\):
\begin{equation}
\frac{\partial}{\partial\beta_{0}}\ln L=0-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}2(y_{i}-\beta_{0}-\beta_{1}x_{i})(-1),
\end{equation}
and the partial derivative equals zero when \(\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i}) = 0\), that is, when
\begin{equation}
\label{eq-regML-a}
n \beta_{0} + \beta_{1} \sum_{i=1}^{n} x_{i} = \sum_{i = 1}^{n}y_{i}.
\end{equation}
Moving on, we next take the partial derivative of \(\ln L\) (Equation \eqref{eq-regML-lnL}) with respect to \(\beta_{1}\) to get

\begin{alignat}{1}
\frac{\partial}{\partial \beta_{1}} \ln L = \ & 0 - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} 2 (y_{i} - \beta_{0} - \beta_{1} x_{i})(-x_{i}),\\
= & \frac{1}{\sigma^{2}}\sum_{i = 1}^{n}\left(x_{i} y_{i} - \beta_{0}x_{i} - \beta_{1}x_{i}^{2}\right),
\end{alignat}
and this equals zero when the last sum equals zero, that is, when
\begin{equation}
\label{eq-regML-b}
\beta_{0} \sum_{i = 1}^{n}x_{i} + \beta_{1} \sum_{i = 1}^{n}x_{i}^{2} = \sum_{i = 1}^{n}x_{i}y_{i}.
\end{equation}
Solving the system of equations \eqref{eq-regML-a} and \eqref{eq-regML-b}
\begin{eqnarray}
n\beta_{0} + \beta_{1}\sum_{i = 1}^{n}x_{i} & = & \sum_{i = 1}^{n}y_{i}\\
\beta_{0}\sum_{i = 1}^{n}x_{i}+\beta_{1}\sum_{i = 1}^{n}x_{i}^{2} & = & \sum_{i = 1}^{n}x_{i}y_{i}
\end{eqnarray}
for \(\beta_{0}\) and \(\beta_{1}\) (in Exercise \@ref(xca:find-mles-slr)) gives
\begin{equation}
\label{eq-regline-slope-formula}
\hat{\beta}_{1} = \frac{\sum_{i = 1}^{n}x_{i}y_{i} - \left.\left(\sum_{i = 1}^{n}x_{i}\right)\left(\sum_{i = 1}^{n}y_{i}\right)\right/ n}{\sum_{i = 1}^{n}x_{i}^{2} - \left.\left(\sum_{i = 1}^{n}x_{i}\right)^{2}\right/ n}
\end{equation}
and
\begin{equation}
\hat{\beta}_{0} = \overline{y} - \hat{\beta}_{1}\overline{x}.
\end{equation}

The conclusion? To estimate the mean line
\begin{equation}
\mu(x) = \beta_{0} + \beta_{1}x,
\end{equation}
we use the "line of best fit"
\begin{equation}
\hat{\mu}(x) = \hat{\beta}_{0} + \hat{\beta}_{1}x,
\end{equation}
where \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) are given as above. For notation we will usually write \(b_{0} = \hat{\beta}_{0}\) and \(b_{1}=\hat{\beta}_{1}\) so that \(\hat{\mu}(x) = b_{0} + b_{1}x\).
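
To make the formulas concrete, here is a minimal sketch (using the `cars` data introduced above) that evaluates them directly and compares the result with the coefficients returned by `lm` in the next subsection.

```r
# slope and intercept computed directly from the closed-form MLE formulas
x <- cars$speed; y <- cars$dist; n <- length(x)
b1 <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(dist ~ speed, data = cars))   # should agree
```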

\bigskip

```{block, type="rem"}
The formula for \(b_{1}\) in Equation \eqref{eq-regline-slope-formula} gets the job done but does not really make any sense. There are many equivalent formulas for \(b_{1}\) that are more intuitive, or at the least are easier to remember. One of the author's favorites is
\begin{equation}
\label{eq-sample-correlation-formula}
b_{1} = r\frac{s_{y}}{s_{x}},
\end{equation}
where \(r\), \(s_{y}\), and \(s_{x}\) are the sample correlation coefficient and the sample standard deviations of the \(Y\) and \(x\) data, respectively. See Exercise \@ref(xca:show-alternate-slope-formula). Also, notice the similarity between Equation \eqref{eq-sample-correlation-formula} and Equation \eqref{eq-population-slope-slr}.
```
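
A quick numerical check of the alternate formula with the `cars` data:

```r
# b1 = r * s_y / s_x, computed from sample statistics
r <- cor(cars$speed, cars$dist)
r * sd(cars$dist) / sd(cars$speed)
```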

#### How to do it with R

```r
tmpcoef <- round(as.numeric(coef(lm(dist ~ speed, cars))), 2)
```

Here we go. R will calculate the linear regression line with the `lm` function. We will store the result in an object which we will call `cars.lm`. Here is how it works:

```r
cars.lm <- lm(dist ~ speed, data = cars)
```

The first part of the input to the `lm` function, `dist ~ speed`, is a model formula, read like this: `dist` is described (or modeled) by `speed`. The `data = cars` argument tells R where to look for the variables quoted in the model formula. The output object `cars.lm` contains a multitude of information. Let's first take a look at the coefficients of the fitted regression line, which are extracted by the `coef` function (alternatively, we could just type `cars.lm` to see the same thing):

```r
coef(cars.lm)
```

The parameter estimates \(b_{0}\) and \(b_{1}\) for the intercept and slope, respectively, are shown above.

It is good practice to visually inspect the data with the regression line added to the plot. To do this we first scatterplot the original data and then follow with a call to the `abline` function. The inputs to `abline` are the coefficients of `cars.lm`; see Figure \@ref(fig:carline).

```r
plot(dist ~ speed, data = cars)
abline(coef(cars.lm))
```

(ref:cap-carline) \small A scatterplot with an added regression line for the cars data.

To calculate points on the regression line we may simply plug the desired \(x\) value(s) into \(\hat{\mu}\), either by hand, or with the `predict` function. The inputs to `predict` are the fitted linear model object, `cars.lm`, and the desired \(x\) value(s) represented by a data frame. See the example below.

\bigskip

```{example, label="regline-cars-interpret"}
Using the regression line for the `cars` data:
```

1. What is the meaning of \(\mu(8) = \beta_{0} + \beta_{1}(8)\)?
   This represents the average stopping distance (in feet) for a car
   going 8 mph.
2. Interpret the slope \(\beta_{1}\). The true slope \(\beta_{1}\)
   represents the increase in average stopping distance for each mile
   per hour faster that the car drives. In this case, we estimate the
   car to take approximately `r tmpcoef[2]` additional feet
   to stop for each additional mph increase in speed.
3. Interpret the intercept \(\beta_{0}\). This would represent the
   mean stopping distance for a car traveling 0 mph (which our
   regression line estimates to be \(`r tmpcoef[1]`\)). Of
   course, this interpretation does not make any sense for this
   example, because a car travelling 0 mph takes 0 ft to stop (it was
   not moving in the first place)! What went wrong? Looking at the
   data, we notice that the smallest speed for which we have measured
   data is 4 mph. Therefore, if we predict what would happen for
   slower speeds then we would be *extrapolating*, a dangerous
   practice which often gives nonsensical results.


### Point Estimates of the Regression Line {#sub-slr-point-est-regline}

We said at the beginning of the chapter that our goal was to estimate
\(\mu = \mathbb{E} Y\), and the arguments in Section
\@ref(sub-point-estimate-mle-slr) showed how to obtain an estimate \(\hat{\mu}\) of \(\mu\) when the regression assumptions hold. Now we
will reap the benefits of our work in more ways than we previously
disclosed. Given a particular value \(x_{0}\), there are two values we
would like to estimate:

1. the mean value of \(Y\) at \(x_{0}\), and
2. a future value of \(Y\) at \(x_{0}\). The first is a number,
   \(\mu(x_{0})\), and the second is a random variable, \(Y(x_{0})\),
   but our point estimate is the same for both: \(\hat{\mu}(x_{0})\).

\bigskip

```{example, label="regline-cars-pe-8mph"}
We may use the regression line to obtain
a point estimate of the mean stopping distance for a car traveling 8
mph: \(\hat{\mu}(15) = b_{0} + (8) (b_{1})\) which is approximately
13.88. We would also use 13.88 as a point estimate for the stopping
distance of a future car traveling 8 mph.

Note that we actually have observed data for a car traveling 8 mph; its stopping distance was 16 ft as listed in the fifth row of the cars data (which we saw in Example \@ref(ex:speed-and-stopping)).

```r
cars[5, ]
```

There is a special name for estimates \(\hat{\mu}(x_{0})\) when \(x_{0}\) matches an observed value \(x_{i}\) from the data set. They are called *fitted values*, they are denoted by \(\hat{Y}_{1}\), \(\hat{Y}_{2}\), ..., \(\hat{Y}_{n}\) (ignoring repetition), and they play an important role in the sections that follow.

In an abuse of notation we will sometimes write \(\hat{Y}\) or \(\hat{Y}(x_{0})\) to denote a point on the regression line even when \(x_{0}\) does not belong to the original data, if the context of the statement obviates any danger of confusion.

We saw in Example \@ref(ex:regline-cars-interpret) that spooky things can happen when we are cavalier about point estimation. While it is usually acceptable to predict/estimate at values of \(x_{0}\) that fall within the range of the original \(x\) data, it is reckless to use \(\hat{\mu}\) for point estimates at locations outside that range. Such estimates are usually worthless. Do not extrapolate unless there are compelling external reasons, and even then, temper it with a good deal of caution.

#### How to do it with R

The fitted values are automatically computed as a byproduct of the model fitting procedure and are already stored as a component of the `cars.lm` object. We may access them with the `fitted` function (we only show the first five entries):

```r
fitted(cars.lm)[1:5]
```

Predictions at \(x\) values that are not necessarily part of the original data are done with the `predict` function. The first argument is the original `cars.lm` object and the second argument `newdata` accepts a data frame (in the same form that was used to fit `cars.lm`) that contains the locations at which we are seeking predictions. Let us predict the average stopping distances of cars traveling 6 mph, 8 mph, and 21 mph:

```r
predict(cars.lm, newdata = data.frame(speed = c(6, 8, 21)))
```

Note that there were no observed cars that traveled 6 mph or 21 mph. Also note that our estimate for a car traveling 8 mph matches the value we computed by hand in Example \@ref(ex:regline-cars-pe-8mph).

### Mean Square Error and Standard Error

To find the MLE of \(\sigma^{2}\) we consider the partial derivative
\begin{equation}
\frac{\partial}{\partial\sigma^{2}}\ln L=-\frac{n}{2\sigma^{2}}+\frac{1}{2(\sigma^{2})^{2}}\sum_{i=1}^{n}(y_{i}-\beta_{0}-\beta_{1}x_{i})^{2},
\end{equation}
and after plugging in \(\hat{\beta}_{0}\) and \(\hat{\beta}_{1}\) and setting equal to zero we get
\begin{equation}
\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i})^{2}=\frac{1}{n}\sum_{i=1}^{n}[y_{i}-\hat{\mu}(x_{i})]^{2}.
\end{equation}
We write \(\hat{Y}_{i}=\hat{\mu}(x_{i})\), and we let \(E_{i}=Y_{i}-\hat{Y}_{i}\) be the \(i^{\mathrm{th}}\) residual. We see
\begin{equation}
n\hat{\sigma}^{2}=\sum_{i=1}^{n}E_{i}^{2}=SSE=\mbox{ the sum of squared errors.}
\end{equation}
For a point estimate of \(\sigma^{2}\) we use the mean square error \(S^{2}\) defined by
\begin{equation}
S^{2}=\frac{SSE}{n-2},
\end{equation}
and we estimate \(\sigma\) with the standard error \(S=\sqrt{S^{2}}\)[^slr-3].

[^slr-3]: Be careful not to confuse the mean square error \(S^{2}\) with the sample variance \(S^{2}\) in Chapter \@ref(cha-describing-data-distributions). Other notation the reader may encounter is the lowercase \(s^{2}\) or the bulky \(MSE\).

#### How to do it with R

The residuals for the model may be obtained with the `residuals` function; we only show the first few entries in the interest of space:

```r
residuals(cars.lm)[1:5]
tmpred <- round(as.numeric(predict(cars.lm, newdata = data.frame(speed = 8))), 2)
tmps <- round(summary(cars.lm)$sigma, 2)
```

In the last section, we calculated the fitted value for \(x=8\) and found it to be approximately \(\hat{\mu}(8) \approx\) `r tmpred`. Now, it turns out that there was only one recorded observation at \(x = 8\), and we have seen this value in the output of `head(cars)` in Example \@ref(ex:speed-and-stopping); it was \(\mathtt{dist} = 16\) ft for a car with \(\mathtt{speed} = 8\) mph. Therefore, the residual should be \(E = Y - \hat{Y}\) which is \(E \approx 16 -\) `r tmpred`. Now take a look at the last entry of `residuals(cars.lm)`, above. It is not a coincidence.

The estimate \(S\) for \(\sigma\) is called the *Residual standard error* and for the `cars` data is shown a few lines up on the `summary(cars.lm)` output (see "How to do it with R" in Section \@ref(sub-slr-interval-est-params)). We may read it from there to be \(S\approx\) `r tmps`, or we can access it directly from the summary object.

```r
carsumry <- summary(cars.lm)
carsumry$sigma
```
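
As a check on the formulas, here is a sketch that computes \(SSE\), \(S^{2}\), and \(S\) directly from the residuals; the last line should agree with `carsumry$sigma` above.

```r
# mean square error and standard error from the definitions
E   <- residuals(cars.lm)
SSE <- sum(E^2)
S2  <- SSE / (length(E) - 2)   # mean square error
sqrt(S2)                       # standard error S
```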

### Interval Estimates of the Parameters {#sub-slr-interval-est-params}

We discussed general interval estimation in Chapter \@ref(cha-estimation). There we found that we could use what we know about the sampling distribution of certain statistics to construct confidence intervals for the parameter being estimated. We will continue in that vein, and to get started we will determine the sampling distributions of the parameter estimates, \(b_{1}\) and \(b_{0}\).

To that end, we can see from Equation \eqref{eq-regline-slope-formula} (and it is made clear in Chapter \@ref(cha-multiple-linear-regression)) that \(b_{1}\) is just a linear combination of normally distributed random variables, so \(b_{1}\) is normally distributed too. Further, it can be shown that
\begin{equation}
b_{1}\sim\mathsf{norm}\left(\mathtt{mean}=\beta_{1},\,\mathtt{sd}=\sigma_{b_{1}}\right)
\end{equation}
where
\begin{equation}
\sigma_{b_{1}}=\frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}
\end{equation}
is called the *standard error of \(b_{1}\)*, which unfortunately depends on the unknown value of \(\sigma\). We do not lose heart, though, because we can estimate \(\sigma\) with the standard error \(S\) from the last section. This gives us an estimate \(S_{b_{1}}\) for \(\sigma_{b_{1}}\) defined by
\begin{equation}
S_{b_{1}}=\frac{S}{\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}.
\end{equation}

Now, it turns out that \(b_{0}\), \(b_{1}\), and \(S\) are mutually independent (see the footnote in Section \@ref(sub-mlr-interval-est-params)). Therefore, the quantity
\begin{equation}
T=\frac{b_{1}-\beta_{1}}{S_{b_{1}}}
\end{equation}
has a \(\mathsf{t}(\mathtt{df}=n-2)\) distribution and a \(100(1-\alpha)\%\) confidence interval for \(\beta_{1}\) is given by
\begin{equation}
b_{1}\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-2)\, S_{b_{1}}.
\end{equation}

It is also sometimes of interest to construct a confidence interval for \(\beta_{0}\), in which case we will need the sampling distribution of \(b_{0}\). It is shown in Chapter \@ref(cha-multiple-linear-regression) that
\begin{equation}
b_{0}\sim\mathsf{norm}\left(\mathtt{mean}=\beta_{0},\,\mathtt{sd}=\sigma_{b_{0}}\right),
\end{equation}
where \(\sigma_{b_{0}}\) is given by
\begin{equation}
\sigma_{b_{0}}=\sigma\sqrt{\frac{1}{n}+\frac{\overline{x}^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}},
\end{equation}
and which we estimate with the \(S_{b_{0}}\) defined by
\begin{equation}
S_{b_{0}}=S\sqrt{\frac{1}{n}+\frac{\overline{x}^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}.
\end{equation}
Thus the quantity
\begin{equation}
T=\frac{b_{0}-\beta_{0}}{S_{b_{0}}}
\end{equation}
has a \(\mathsf{t}(\mathtt{df}=n-2)\) distribution and a \(100(1-\alpha)\%\) confidence interval for \(\beta_{0}\) is given by
\begin{equation}
b_{0}\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-2)\, S_{b_{0}}.
\end{equation}

#### How to do it with R

```r
A <- matrix(as.numeric(round(carsumry$coef, 3)), nrow = 2)
B <- round(confint(cars.lm), 3)
```

Let us take a look at the output from `summary(cars.lm)`:

```r
summary(cars.lm)
```

In the `Coefficients` section we find the parameter estimates and their respective standard errors in the second and third columns; the other columns are discussed in Section \@ref(sec-model-utility-slr). If we wanted, say, a 95% confidence interval for \(\beta_{1}\) we could use \(b_{1} =\) `r A[2,1]` and \(S_{b_{1}} =\) `r A[2,2]` together with a \(\mathsf{t}_{0.025}(\mathtt{df}=48)\) critical value to calculate \(b_{1} \pm \mathsf{t}_{0.025}(\mathtt{df} = 48) \cdot S_{b_{1}}\). Or, we could use the `confint` function.

```r
confint(cars.lm)
```

With 95% confidence, the random interval `r B[2,1]` to `r B[2,2]` covers the parameter \(\beta_{1}\).
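
As a sanity check, the same interval can be computed by hand from the formula, pulling \(b_{1}\) and \(S_{b_{1}}\) out of the coefficient table:

```r
# 95% CI for the slope computed from b1 +/- t * S_b1
b1    <- coef(cars.lm)[2]
Sb1   <- coef(summary(cars.lm))[2, 2]
tcrit <- qt(0.975, df = nrow(cars) - 2)
b1 + c(-1, 1) * tcrit * Sb1    # compare with confint(cars.lm)[2, ]
```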

### Interval Estimates of the Regression Line {#sub-slr-interval-est-regline}

We have seen how to estimate the coefficients of the regression line with both point estimates and confidence intervals. We even saw how to estimate a value \(\hat{\mu}(x)\) on the regression line for a given value of \(x\), such as \(x = 8\).

But how good is our estimate \(\hat{\mu}(8)\)? How much confidence do we have in this estimate? Furthermore, suppose we were going to observe another value of \(Y\) at \(x = 8\). What could we say?

Intuitively, it should be easier to get bounds on the mean (average) value of \(Y\) at \(x_{0}\) -- called a *confidence interval for the mean value of \(Y\) at \(x_{0}\)* -- than it is to get bounds on a future observation of \(Y\) (called a *prediction interval for \(Y\) at \(x_{0}\)*). As we shall see, the intuition serves us well and confidence intervals are shorter for the mean value, longer for the individual value.

Our point estimate of \(\mu(x_{0})\) is of course \(\hat{Y}=\hat{Y}(x_{0})\), so for a confidence interval we will need to know \(\hat{Y}\)'s sampling distribution. It turns out (see Section ) that \(\hat{Y}=\hat{\mu}(x_{0})\) is distributed
\begin{equation}
\hat{Y}\sim\mathsf{norm}\left(\mathtt{mean}=\mu(x_{0}),\:\mathtt{sd}=\sigma\sqrt{\frac{1}{n}+\frac{(x_{0}-\overline{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}\right).
\end{equation}

Since \(\sigma\) is unknown we estimate it with \(S\) (we should expect the appearance of a \(\mathsf{t}(\mathtt{df}=n-2)\) distribution in the near future).

A \(100(1-\alpha)\%\) confidence interval (CI) for \(\mu(x_{0})\) is given by
\begin{equation}
\label{eq-SLR-conf-int-formula}
\hat{Y}\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-2)\, S\sqrt{\frac{1}{n}+\frac{(x_{0}-\overline{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}.
\end{equation}
Prediction intervals are a little bit different. In order to find confidence bounds for a new observation of \(Y\) (we will denote it \(Y_{\mbox{new}}\)) we use the fact that
\begin{equation}
Y_{\mbox{new}}\sim\mathsf{norm}\left(\mathtt{mean}=\mu(x_{0}),\,\mathtt{sd}=\sigma\sqrt{1+\frac{1}{n}+\frac{(x_{0}-\overline{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}\right).
\end{equation}
Of course, \(\sigma\) is unknown so we estimate it with \(S\) and a \(100(1-\alpha)\%\) prediction interval (PI) for a future value of \(Y\) at \(x_{0}\) is given by
\begin{equation}
\label{eq-SLR-pred-int-formula}
\hat{Y}(x_{0})\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-2)\: S\,\sqrt{1+\frac{1}{n}+\frac{(x_{0}-\overline{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}.
\end{equation}
We notice that the prediction interval in Equation \eqref{eq-SLR-pred-int-formula} is wider than the confidence interval in Equation \eqref{eq-SLR-conf-int-formula}, as we expected at the beginning of the section.

#### How to do it with R

Confidence and prediction intervals are calculated in R with the predict \index{predict@\texttt{predict}} function, which we encountered in Section \@ref(sub-slr-point-est-regline). There we neglected to take advantage of its additional interval argument. The general syntax follows.

\bigskip

We will find confidence and prediction intervals for the stopping
distance of a car travelling 5, 6, and 21 mph (note from the graph
that there are no collected data for these speeds). We have computed
`cars.lm` earlier, and we will use this for input to the `predict`
function. Also, we need to tell R the values of \(x_{0}\)
at which we want the predictions made, and store the \(x_{0}\) values
in a data frame whose variable is labeled with the correct name. *This is important*.

```r
new <- data.frame(speed = c(5, 6, 21))
```

Next we instruct R to calculate the intervals. Confidence intervals are given by

```r
predict(cars.lm, newdata = new, interval = "confidence")
carsCI <- round(predict(cars.lm, newdata = new, interval = "confidence"), 2)
```

Prediction intervals are given by

```r
predict(cars.lm, newdata = new, interval = "prediction")
carsPI <- round(predict(cars.lm, newdata = new, interval = "prediction"), 2)
```

The type of interval is dictated by the `interval` argument (which is `"none"` by default), and the default confidence level is 95% (which can be changed with the `level` argument).

\bigskip

Using the `cars` data,

1. Report a point estimate of and a 95% confidence interval for the mean stopping distance for a car travelling 5 mph. The fitted value for \(x = 5\) is `r carsCI[1,1]`, so a point estimate would be `r carsCI[1,1]` ft. The 95% CI is given by `r carsCI[1,2]` to `r carsCI[1,3]`, so with 95% confidence the mean stopping distance lies somewhere between `r carsCI[1,2]` ft and `r carsCI[1,3]` ft.
2. Report a point prediction for and a 95% prediction interval for the stopping distance of a hypothetical car travelling 21 mph. The fitted value for \(x = 21\) is `r carsPI[3,1]`, so a point prediction for the stopping distance is `r carsPI[3,1]` ft. The 95% PI is `r carsPI[3,2]` to `r carsPI[3,3]`, so with 95% confidence we may assert that the hypothetical stopping distance for a car travelling 21 mph would lie somewhere between `r carsPI[3,2]` ft and `r carsPI[3,3]` ft.
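
To connect the output of `predict` with Equations \eqref{eq-SLR-conf-int-formula} and \eqref{eq-SLR-pred-int-formula}, here is a sketch that reproduces the intervals at \(x_{0}=5\) by hand:

```r
# confidence and prediction intervals at x0 = 5 from the formulas
x <- cars$speed; n <- length(x); x0 <- 5
S     <- summary(cars.lm)$sigma
yhat  <- sum(coef(cars.lm) * c(1, x0))
se.ci <- S * sqrt(1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
se.pi <- S * sqrt(1 + 1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
tcrit <- qt(0.975, df = n - 2)
rbind(ci = yhat + c(-1, 1) * tcrit * se.ci,
      pi = yhat + c(-1, 1) * tcrit * se.pi)
```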

### Graphing the Confidence and Prediction Bands

We earlier guessed that a bound on the value of a single new observation would be inherently less certain than a bound for an average (mean) value; therefore, we expect the CIs for the mean to be tighter than the PIs for a new observation. A close look at the standard deviations in Equations \eqref{eq-SLR-conf-int-formula} and \eqref{eq-SLR-pred-int-formula} confirms our guess, but we would like to see a picture to drive the point home.

We may plot the confidence and prediction intervals with one fell swoop using the `ci.plot` function from the `HH` package [@HH]. The graph is displayed in Figure \@ref(fig:carscipi).

```r
library(HH)
ci.plot(cars.lm)
```

Notice that the bands curve outward from the regression line as the \(x\) values move away from the center. This is expected once we notice the \((x_{0}-\overline{x})^{2}\) term in the standard deviation formulas in Equations \eqref{eq-SLR-conf-int-formula} and \eqref{eq-SLR-pred-int-formula}.

```r
print(ci.plot(cars.lm))
```

(ref:cap-carscipi) \small A scatterplot with confidence/prediction bands for the cars data.

## Model Utility and Inference {#sec-model-utility-slr}

### Hypothesis Tests for the Parameters {#sub-slr-hypoth-test-params}

Much of the attention of SLR is directed toward \(\beta_{1}\) because when \(\beta_{1}\neq 0\) the mean value of \(Y\) increases (or decreases) as \(x\) increases. It is really boring when \(\beta_{1}=0\), because in that case the mean value of \(Y\) remains the same, regardless of the value of \(x\) (when the regression assumptions hold, of course). It is thus very important to decide whether or not \(\beta_{1} = 0\). We address the question with a statistical test of the null hypothesis \(H_{0}:\beta_{1}=0\) versus the alternative hypothesis \(H_{1}:\beta_{1}\neq0\), and to do that we need to know the sampling distribution of \(b_{1}\) when the null hypothesis is true.

To this end we already know from Section \@ref(sub-slr-interval-est-params) that the quantity

\begin{equation}
T=\frac{b_{1}-\beta_{1}}{S_{b_{1}}}
\end{equation}
has a \(\mathsf{t}(\mathtt{df}=n-2)\) distribution; therefore, when \(\beta_{1}=0\) the quantity \(b_{1}/S_{b_{1}}\) has a \(\mathsf{t}(\mathtt{df}=n-2)\) distribution and we can compute a \(p\)-value by comparing the observed value of \(b_{1}/S_{b_{1}}\) with values under a \(\mathsf{t}(\mathtt{df}=n-2)\) curve.

Similarly, we may test the hypothesis \(H_{0}:\beta_{0}=0\) versus the alternative \(H_{1}:\beta_{0}\neq0\) with the statistic \(T=b_{0}/S_{b_{0}}\), where \(S_{b_{0}}\) is given in Section \@ref(sub-slr-interval-est-params). The test is conducted the same way as for \(\beta_{1}\).

#### How to do it with R

Let us take another look at the output from `summary(cars.lm)`:

```r
summary(cars.lm)
```

In the `Coefficients` section we find the \(t\) statistics and the \(p\)-values associated with the tests that the respective parameters are zero in the fourth and fifth columns. Since the \(p\)-values are (much) less than 0.05, we conclude that there is strong evidence that \(\beta_{1}\neq0\) and \(\beta_{0}\neq0\), and as such, we say that there is a statistically significant linear relationship between `dist` and `speed`.
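
The \(t\) statistic and \(p\)-value for the slope can be reproduced by hand from the pieces we already have; a sketch:

```r
# t statistic and two-sided p-value for H0: beta1 = 0
b1    <- coef(cars.lm)[2]
Sb1   <- coef(summary(cars.lm))[2, 2]
tstat <- b1 / Sb1
2 * pt(abs(tstat), df = nrow(cars) - 2, lower.tail = FALSE)
```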

### Simple Coefficient of Determination

It would be nice to have a single number that indicates how well our linear regression model is doing, and the simple coefficient of determination is designed for that purpose. In what follows, we observe the values \(Y_{1}\), \(Y_{2}\), ..., \(Y_{n}\), and the goal is to estimate \(\mu(x_{0})\), the mean value of \(Y\) at the location \(x_{0}\).

If we disregard the dependence of \(Y\) on \(x\) and base our estimate only on the \(Y\) values then a reasonable choice for an estimator is just the MLE of \(\mu\), which is \(\overline{Y}\). Then the errors incurred by the estimate are just \(Y_{i}-\overline{Y}\) and the variation about the estimate as measured by the sample variance is proportional to
\begin{equation}
SSTO=\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}.
\end{equation}
The acronym \(SSTO\) stands for *total sum of squares*.

But we have additional information, namely, we have values \(x_{i}\) associated with each value of \(Y_{i}\). We have seen that this information leads us to the estimate \(\hat{Y}_{i}\) and the errors incurred are just the residuals, \(E_{i}=Y_{i}-\hat{Y}_{i}\). The variation associated with these errors can be measured with
\begin{equation}
SSE=\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^{2}.
\end{equation}
We have seen the \(SSE\) before, which stands for the *sum of squared errors* or *error sum of squares*. Of course, we would expect the error to be less in the latter case, since we have used more information. The improvement in our estimation as a result of the linear regression model can be measured with the difference
\[
(Y_{i}-\overline{Y})-(Y_{i}-\hat{Y}_{i})=\hat{Y}_{i}-\overline{Y},
\]
and we measure the variation in these errors with
\begin{equation}
SSR=\sum_{i=1}^{n}(\hat{Y}_{i}-\overline{Y})^{2},
\end{equation}
also known as the *regression sum of squares*. It is not obvious, but some algebra proves a famous result known as the ANOVA Equality:
\begin{equation}
\label{eq-anovaeq}
\sum_{i=1}^{n}(Y_{i}-\overline{Y})^{2}=\sum_{i=1}^{n}(\hat{Y}_{i}-\overline{Y})^{2}+\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^{2}
\end{equation}
or in other words,
\begin{equation}
SSTO=SSR+SSE.
\end{equation}

This equality has a nice interpretation. Consider \(SSTO\) to be the total variation of the errors. Think of a decomposition of the total variation into pieces: one piece measuring the reduction of error from using the linear regression model, or *explained variation* (\(SSR\)), while the other represents what is left over, that is, the errors that the linear regression model doesn't explain, or *unexplained variation* (\(SSE\)). In this way we see that the ANOVA equality merely partitions the variation into
\[
\mbox{total variation}=\mbox{explained variation}+\mbox{unexplained variation}.
\]
For a single number to summarize how well our model is doing we use the *simple coefficient of determination* \(r^{2}\), defined by
\begin{equation}
r^{2}=1-\frac{SSE}{SSTO}.
\end{equation}
We interpret \(r^{2}\) as the proportion of total variation that is explained by the simple linear regression model. When \(r^{2}\) is large, the model is doing a good job; when \(r^{2}\) is small, the model is not doing a good job.
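
A quick numerical check of the ANOVA equality and of \(r^{2}\) for the `cars` fit (assuming `cars.lm` from before):

```r
# verify SSTO = SSR + SSE and compute r^2
Y    <- cars$dist
Yhat <- fitted(cars.lm)
SSTO <- sum((Y - mean(Y))^2)
SSR  <- sum((Yhat - mean(Y))^2)
SSE  <- sum((Y - Yhat)^2)
c(SSTO = SSTO, SSR.plus.SSE = SSR + SSE, r.squared = 1 - SSE/SSTO)
```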

Related to the simple coefficient of determination is the sample correlation coefficient, \(r\). As you can guess, the way we get \(r\) is by the formula \(|r|=\sqrt{r^{2}}\). The sign of \(r\) is equal to the sign of the slope estimate \(b_{1}\). That is, if the regression line \(\hat{\mu}(x)\) has positive slope, then \(r=\sqrt{r^{2}}\). Likewise, if the slope of \(\hat{\mu}(x)\) is negative, then \(r=-\sqrt{r^{2}}\).

#### How to do it with R

The primary method to display partitioned sums of squared errors is with an ANOVA table. The command in R to produce such a table is `anova`. The input to `anova` is the result of an `lm` call, which for the `cars` data is `cars.lm`.

```r
anova(cars.lm)
```

The output gives
\[
r^{2}=1-\frac{SSE}{SSR+SSE}=1-\frac{11353.5}{21185.5+11353.5}\approx0.65.
\]

The interpretation should be: "The linear regression line accounts for approximately 65% of the variation of `dist` as explained by `speed`."

The value of \(r^{2}\) is stored in the `r.squared` component of `summary(cars.lm)`, which we called `carsumry`.

```r
carsumry$r.squared
```

We already knew this. We saw it in the next to the last line of the `summary(cars.lm)` output where it was called `Multiple R-squared`. Listed right beside it is the `Adjusted R-squared`, which we will discuss in Chapter \@ref(cha-multiple-linear-regression). For the `cars` data, we find \(r\) to be

```r
sqrt(carsumry$r.squared)
```

We choose the principal square root because the slope of the regression line is positive.

### Overall F statistic {#sub-slr-overall-f-statistic}

There is another way to test the significance of the linear regression model. In SLR, the new way also tests the hypothesis \(H_{0}:\beta_{1}=0\) versus \(H_{1}:\beta_{1}\neq0\), but it is done with a new test statistic called the *overall \(F\) statistic*. It is defined by
\begin{equation}
\label{eq-slr-overall-F-statistic}
F=\frac{SSR}{SSE/(n-2)}.
\end{equation}

Under the regression assumptions and when \(H_{0}\) is true, the \(F\) statistic has an \(\mathsf{f}(\mathtt{df1}=1,\,\mathtt{df2}=n-2)\) distribution. We reject \(H_{0}\) when \(F\) is large -- that is, when the explained variation is large relative to the unexplained variation.

All this being said, we have not yet gained much from the overall \(F\) statistic because we already knew from Section \@ref(sub-slr-hypoth-test-params) how to test \(H_{0}:\beta_{1} = 0\)... we use the Student's \(t\) statistic. What is worse is that (in the simple linear regression model) it can be proved that the \(F\) in Equation \eqref{eq-slr-overall-F-statistic} is exactly the Student's \(t\) statistic for \(\beta_{1}\) squared,

\begin{equation} F=\left(\frac{b_{1}}{S_{b_{1}}}\right)^{2}. \end{equation}

So why bother to define the \(F\) statistic? Why not just square the \(t\) statistic and be done with it? The answer is that the \(F\) statistic has a more complicated interpretation and plays a more important role in the multiple linear regression model which we will study in Chapter \@ref(cha-multiple-linear-regression). See Section \@ref(sub-mlr-overall-f-test) for details.

#### How to do it with R

The overall \(F\) statistic and \(p\)-value are displayed in the bottom line of the `summary(cars.lm)` output. It is also shown in the final columns of `anova(cars.lm)`:

```r
anova(cars.lm)
tmpf <- round(as.numeric(carsumry$fstatistic[1]), 2)
```

Here we see that the \(F\) statistic is `r tmpf` with a \(p\)-value very close to zero. The conclusion: there is very strong evidence that \(H_{0}:\beta_{1} = 0\) is false, that is, there is strong evidence that \(\beta_{1} \neq 0\). Moreover, we conclude that the regression relationship between `dist` and `speed` is significant.

Note that the value of the \(F\) statistic is the square of the Student's \(t\) statistic for `speed`.
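
A one-line check of the identity \(F = (b_{1}/S_{b_{1}})^{2}\):

```r
# the overall F statistic equals the squared t statistic for speed
c(t.squared = coef(summary(cars.lm))[2, 3]^2,
  F = as.numeric(carsumry$fstatistic[1]))
```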

## Residual Analysis {#sec-residual-analysis-slr}

We know from our model that \(Y=\mu(x)+\epsilon\), or in other words, \(\epsilon=Y-\mu(x)\). Further, we know that \(\epsilon\sim\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma)\). We may estimate \(\epsilon_{i}\) with the residual \(E_{i}=Y_{i}-\hat{Y}_{i}\), where \(\hat{Y}_{i}=\hat{\mu}(x_{i})\). If the regression assumptions hold, then the residuals should be normally distributed. We check this in Section \@ref(sub-normality-assumption). Further, the residuals should have mean zero with constant variance \(\sigma^{2}\), and we check this in Section \@ref(sub-constant-variance-assumption). Last, the residuals should be independent, and we check this in Section \@ref(sub-independence-assumption).

In every case, we will begin by looking at residual plots -- that is, scatterplots of the residuals \(E_{i}\) versus index or predicted values \(\hat{Y}_{i}\) -- and follow up with hypothesis tests.

### Normality Assumption {#sub-normality-assumption}

We can assess the normality of the residuals with graphical methods and hypothesis tests. To check graphically whether the residuals are normally distributed we may look at histograms or q-q plots. We first examine a histogram in Figure \@ref(fig:normal-hist-cars), which was produced by the following code.

```r
hist(residuals(cars.lm))
```

(ref:cap-normal-hist-cars) \small A histogram of the residuals from the linear regression model for the cars data. Used for checking the normality assumption. Look out for any skewness or extreme values; hopefully the graph is mound shaped and symmetric about zero, with no outliers.

There we see that the distribution of the residuals appears to be mound shaped, for the most part. We can plot the order statistics of the studentized residuals versus quantiles from a \(\mathsf{t}\) distribution with the command `qqPlot(cars.lm)` from the `RcmdrMisc` package [@RcmdrMisc], and the results are in Figure \@ref(fig:normal-q-q-plot-cars). If the assumption of normality were true, then we would expect points randomly scattered about the dotted straight line displayed in the figure. In this case, we see a slight departure from normality in that the dots show a bit of clustering on one side or the other of the line. The points on the upper end of the plot also begin to stray from the line. We would say there is some evidence that the residuals are not perfectly normal.

```r
library(RcmdrMisc)
qqPlot(cars.lm)
```

(ref:cap-normal-q-q-plot-cars) \small A quantile-quantile plot of the studentized residuals from the linear regression model for the cars data. Used for checking the normality assumption. Look out for any curvature or substantial departures from the straight line; hopefully the dots hug the line closely and stay within the bands.

#### Testing the Normality Assumption

Even though we may be concerned about the plots, we can use tests to determine if the evidence present is statistically significant, or if it could have happened merely by chance. There are many statistical tests of normality. We will use the Shapiro-Wilk test, since it is known to be a good test and to be quite powerful. However, there are many other fine tests of normality, including the Anderson-Darling test and the Lilliefors test, just to mention two of them.

The Shapiro-Wilk test is based on the statistic
\begin{equation}
W=\frac{\left(\sum_{i=1}^{n}a_{i}E_{(i)}\right)^{2}}{\sum_{j=1}^{n}E_{j}^{2}},
\end{equation}
where the \(E_{(i)}\) are the ordered residuals and the \(a_{i}\) are constants derived from the order statistics of a sample of size \(n\) from a normal distribution. See Section \@ref(sub-shapiro-wilk-normality-test). We perform the Shapiro-Wilk test below, using the `shapiro.test` function from the `stats` package [@stats]. The hypotheses are
\[
H_{0}:\mbox{ the residuals are normally distributed}
\]
versus
\[
H_{1}:\mbox{ the residuals are not normally distributed.}
\]
The results from R are

```r
shapiro.test(residuals(cars.lm))
```

For these data we would reject the assumption of normality of the residuals at the \(\alpha=0.05\) significance level, but do not lose heart, because the regression model is reasonably robust to departures from the normality assumption. As long as the residual distribution is not highly skewed, the regression estimators will perform reasonably well. Moreover, departures from constant variance and independence will sometimes affect the quantile plots and histograms; therefore, it is wise to delay final decisions regarding normality until all diagnostic measures have been investigated.

### Constant Variance Assumption {#sub-constant-variance-assumption}

We will again go to residual plots to try and determine if the spread of the residuals is changing over time (or index). However, it is unfortunately not that easy because the residuals do not have constant variance! In fact, it can be shown that the variance of the residual \(E_{i}\) is
\begin{equation}
\mathrm{Var}(E_{i})=\sigma^{2}(1-h_{ii}),\quad i=1,2,\ldots,n,
\end{equation}
where \(h_{ii}\) is a quantity called the *leverage* which is defined below. Consequently, in order to check the constant variance assumption we must standardize the residuals before plotting. We estimate the standard error of \(E_{i}\) with \(S_{E_{i}}=S\sqrt{1-h_{ii}}\) and define the *standardized residuals* \(R_{i}\), \(i=1,2,\ldots,n\), by
\begin{equation}
R_{i}=\frac{E_{i}}{S\,\sqrt{1-h_{ii}}},\quad i=1,2,\ldots,n.
\end{equation}
For the constant variance assumption we do not need the sign of the residual so we will plot \(\sqrt{|R_{i}|}\) versus the fitted values. As we look at a scatterplot of \(\sqrt{|R_{i}|}\) versus \(\hat{Y}_{i}\) we would expect under the regression assumptions to see a constant band of observations, indicating no change in the magnitude of the observed distance from the line. We want to watch out for a fanning-out of the residuals, or a less common funneling-in of the residuals. Both patterns indicate a change in the residual variance and a consequent departure from the regression assumptions, the first an increase, the second a decrease.

In this case, we plot \(\sqrt{|R_{i}|}\) versus the fitted values. The graph may be seen in Figure \@ref(fig:std-resids-fitted-cars). For these data there does appear to be somewhat of a slight fanning-out of the residuals.

```r
plot(cars.lm, which = 3)
```

(ref:cap-std-resids-fitted-cars) \small Plot of standardized residuals against the fitted values for the cars data. Used for checking the constant variance assumption. Watch out for any fanning out (or in) of the dots; hopefully they fall in a constant band.
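
The standardized residuals plotted above can be reproduced from their definition with the leverages (the `hatvalues` function is discussed in Section \@ref(sec-other-diagnostic-tools-slr)); a sketch:

```r
# standardized residuals from the definition R_i = E_i / (S * sqrt(1 - h_ii))
E <- residuals(cars.lm)
S <- summary(cars.lm)$sigma
h <- hatvalues(cars.lm)
R <- E / (S * sqrt(1 - h))
cbind(by.hand = R[1:5], rstandard = rstandard(cars.lm)[1:5])
```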

#### Testing the Constant Variance Assumption

We will use the Breusch-Pagan test to decide whether the variance of the residuals is nonconstant. The null hypothesis is that the variance is the same for all observations, and the alternative hypothesis is that the variance is not the same for all observations. The test statistic is found by fitting a linear model to the centered squared residuals,
\begin{equation}
W_{i} = E_{i}^{2} - \frac{SSE}{n}, \quad i=1,2,\ldots,n.
\end{equation}

By default the same explanatory variables are used in the new model, which produces fitted values \(\hat{W}_{i}\), \(i=1,2,\ldots,n\). The Breusch-Pagan test statistic in R is then calculated with
\begin{equation}
BP=n\sum_{i=1}^{n}\hat{W}_{i}^{2}\div\sum_{i=1}^{n}W_{i}^{2}.
\end{equation}
We reject the null hypothesis if \(BP\) is too large, which happens when the explained variation in the new model is large relative to the unexplained variation in the original model. We do it in R with the `bptest` function from the `lmtest` package [@lmtest].

```r
library(lmtest)
bptest(cars.lm)
```

For these data we would not reject the null hypothesis at the \(\alpha=0.05\) level. There is relatively weak evidence against the assumption of constant variance.
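
The statistic reported by `bptest` can be reproduced from the formula above (assuming the default studentized version of the test); a sketch:

```r
# Breusch-Pagan statistic computed from the centered squared residuals
E   <- residuals(cars.lm)
n   <- length(E)
W   <- E^2 - sum(E^2)/n
aux <- lm(W ~ speed, data = cars)            # same explanatory variable
BP  <- n * sum(fitted(aux)^2) / sum(W^2)
c(BP = BP, p.value = pchisq(BP, df = 1, lower.tail = FALSE))
```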

### Independence Assumption {#sub-independence-assumption}

One of the strongest of the regression assumptions is the one regarding independence. Departures from the independence assumption are often exhibited by correlation (or autocorrelation, literally, self-correlation) present in the residuals. There can be positive or negative correlation.

Positive correlation is displayed by positive residuals followed by positive residuals, and negative residuals followed by negative residuals. Looking from left to right, this is exhibited by a cyclical feature in the residual plots, with long sequences of positive residuals being followed by long sequences of negative ones.

On the other hand, negative correlation implies positive residuals followed by negative residuals, which are then followed by positive residuals, etc. Consequently, negatively correlated residuals are often associated with an alternating pattern in the residual plots. We examine the residual plot in Figure \@ref(fig:resids-fitted-cars). There is no obvious cyclical wave pattern or structure to the residual plot.

```r
plot(cars.lm, which = 1)
```

(ref:cap-resids-fitted-cars) \small Plot of the residuals versus the fitted values for the cars data. Used for checking the independence assumption. Watch out for any patterns or structure; hopefully the points are randomly scattered on the plot.

#### Testing the Independence Assumption

We may statistically test whether there is evidence of autocorrelation in the residuals with the Durbin-Watson test. The test is based on the statistic
\begin{equation}
D=\frac{\sum_{i=2}^{n}(E_{i}-E_{i-1})^{2}}{\sum_{j=1}^{n}E_{j}^{2}}.
\end{equation}
Exact critical values are difficult to obtain, but R will calculate the \(p\)-value to great accuracy. It is performed with the `dwtest` function from the `lmtest` package [@lmtest]. We will conduct a two-sided test that the correlation is not zero, which is not the default (the default is to test that the autocorrelation is positive).

```r
dwtest(cars.lm, alternative = "two.sided")
```

In this case we do not reject the null hypothesis at the \(\alpha=0.05\) significance level; there is very little evidence of nonzero autocorrelation in the residuals.
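
The Durbin-Watson statistic itself is easy to compute directly from the residuals, which makes a nice check on the `dwtest` output:

```r
# Durbin-Watson statistic from its definition
E <- residuals(cars.lm)
sum(diff(E)^2) / sum(E^2)
```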

### Remedial Measures

We often find problems with our model that suggest that at least one of the three regression assumptions is violated. What do we do then? There are many measures at the statistician's disposal, and we mention specific steps one can take to improve the model under certain types of violation.

## Other Diagnostic Tools {#sec-other-diagnostic-tools-slr}

There are two types of observations with which we must be especially careful: *outliers* and *influential observations*.

We will discuss outliers first because the notation builds sequentially in that order.

### Outliers

There are three ways that an observation \((x_{i},y_{i})\) might be identified as an outlier: it can have an \(x_{i}\) value which falls far from the other \(x\) values, it can have a \(y_{i}\) value which falls far from the other \(y\) values, or it can have both its \(x_{i}\) and \(y_{i}\) values falling far from the other \(x\) and \(y\) values.

#### Leverage

Leverage statistics are designed to identify observations which have \(x\) values that are far away from the rest of the data. In the simple linear regression model the leverage of \(x_{i}\) is denoted by \(h_{ii}\) and defined by
\begin{equation}
h_{ii}=\frac{1}{n}+\frac{(x_{i}-\overline{x})^{2}}{\sum_{k=1}^{n}(x_{k}-\overline{x})^{2}},\quad i=1,2,\ldots,n.
\end{equation}
The formula has a nice interpretation in the SLR model: if the distance from \(x_{i}\) to \(\overline{x}\) is large relative to the other \(x\)'s then \(h_{ii}\) will be close to 1.

Leverages have nice mathematical properties; for example, they satisfy
\begin{equation}
\label{eq-slr-leverage-between}
0\leq h_{ii}\leq1,
\end{equation}
and their sum is
\begin{eqnarray}
\label{eq-slr-average-leverage}
\sum_{i=1}^{n}h_{ii} & = & \sum_{i=1}^{n}\left[\frac{1}{n}+\frac{(x_{i}-\overline{x})^{2}}{\sum_{k=1}^{n}(x_{k}-\overline{x})^{2}}\right],\\
 & = & \frac{n}{n}+\frac{\sum_{i}(x_{i}-\overline{x})^{2}}{\sum_{k}(x_{k}-\overline{x})^{2}},\\
 & = & 2.
\end{eqnarray}

A rule of thumb is to consider leverage values to be large if they are more than double their average size (which is \(2/n\) according to Equation \eqref{eq-slr-average-leverage}). So leverages larger than \(4/n\) are suspect. Another rule of thumb is to say that values bigger than 0.5 indicate high leverage, while values between 0.3 and 0.5 indicate moderate leverage.
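
The leverages are computed in R with the `hatvalues` function (used again below); here is a sketch that checks the SLR formula and the fact that the leverages sum to 2:

```r
# leverages from the SLR formula, compared with hatvalues()
x <- cars$speed
h <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
cbind(by.hand = h[1:5], hatvalues = hatvalues(cars.lm)[1:5])
sum(h)   # equals 2 in simple linear regression
```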

#### Standardized and Studentized Deleted Residuals

We have already encountered the standardized residuals \(R_{i}\) in Section \@ref(sub-constant-variance-assumption); they are merely residuals that have been divided by their respective standard deviations:
\begin{equation}
R_{i}=\frac{E_{i}}{S\sqrt{1-h_{ii}}},\quad i=1,2,\ldots,n.
\end{equation}
Values of \(|R_{i}| > 2\) are extreme and suggest that the observation has an outlying \(y\)-value.

Now delete the \(i^{\mathrm{th}}\) case and fit the regression function to the remaining \(n - 1\) cases, producing a fitted value \(\hat{Y}_{(i)}\) with deleted residual \(D_{i}=Y_{i}-\hat{Y}_{(i)}\). It is shown in later classes that
\begin{equation}
\mathrm{Var}(D_{i})=\frac{S_{(i)}^{2}}{1-h_{ii}},\quad i=1,2,\ldots,n,
\end{equation}
so that the *studentized deleted residuals* \(t_{i}\) defined by
\begin{equation}
\label{eq-slr-studentized-deleted-resids}
t_{i}=\frac{D_{i}}{S_{(i)}/\sqrt{1-h_{ii}}},\quad i=1,2,\ldots,n,
\end{equation}
have a \(\mathsf{t}(\mathtt{df}=n-3)\) distribution and we compare observed values of \(t_{i}\) to this distribution to decide whether or not an observation is extreme.

The folklore in regression classes is that a test based on the statistic in Equation \eqref{eq-slr-studentized-deleted-resids} can be too liberal. A rule of thumb is if we suspect an observation to be an outlier before seeing the data then we say it is significantly outlying if its two-tailed (p)-value is less than (\alpha), but if we suspect an observation to be an outlier after seeing the data then we should only say it is significantly outlying if its two-tailed (p)-value is less than (\alpha/n). The latter rule of thumb is called the Bonferroni approach and can be overly conservative for large data sets. The responsible statistician should look at the data and use his/her best judgement, in every case.

#### How to do it with R

We can calculate the standardized residuals with the `rstandard` function. The input is the `lm` object, which is `cars.lm`.

```r
sres <- rstandard(cars.lm)
sres[1:5]
```

We can find out which observations have standardized residuals larger than two with the command

```r
sres[which(abs(sres) > 2)]
```

In this case, we see that observations 23, 35, and 49 are potential outliers with respect to their \(y\)-value. We can compute the studentized deleted residuals with `rstudent`:

```r
sdelres <- rstudent(cars.lm)
sdelres[1:5]
```

We should compare these values with critical values from a \(\mathsf{t}(\mathtt{df}=n-3)\) distribution, which in this case is \(\mathsf{t}(\mathtt{df}=50-3=47)\). We can calculate a 0.005 quantile and check with

```r
t0.005 <- qt(0.005, df = 47, lower.tail = FALSE)
sdelres[which(abs(sdelres) > t0.005)]
```

This means that observations 23 and 49 have a large studentized deleted residual. The leverages can be found with the `hatvalues` function:

```r
leverage <- hatvalues(cars.lm)
leverage[which(leverage > 4/50)]
```

Here we see that observations 1, 2, and 50 have leverages bigger than double their mean value. These observations would be considered outlying with respect to their \(x\) value (although they may or may not be influential).

### Influential Observations

#### \(DFBETAS\) and \(DFFITS\)

Any time we do a statistical analysis, we are confronted with the variability of data. It is always a concern when an observation plays too large a role in our regression model, and we would not like our procedures to be overly influenced by the value of a single observation. Hence, it becomes desirable to check to see how much our estimates and predictions would change if one of the observations were not included in the analysis. If an observation changes the estimates/predictions a large amount, then the observation is influential and should be subjected to a higher level of scrutiny.

We measure the change in the parameter estimates as a result of deleting an observation with \(DFBETAS\). The \(DFBETAS\) for the intercept \(b_{0}\) are given by
\begin{equation}
(DFBETAS)_{0(i)}=\frac{b_{0}-b_{0(i)}}{S_{(i)}\sqrt{\frac{1}{n}+\frac{\overline{x}^{2}}{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}}},\quad i=1,2,\ldots,n,
\end{equation}
and the \(DFBETAS\) for the slope \(b_{1}\) are given by
\begin{equation}
(DFBETAS)_{1(i)}=\frac{b_{1}-b_{1(i)}}{S_{(i)}\left[\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}\right]^{-1/2}},\quad i=1,2,\ldots,n.
\end{equation}
See Section \@ref(sec-residual-analysis-mlr) for a better way to write these. The signs of the \(DFBETAS\) indicate whether the coefficients would increase or decrease as a result of including the observation. If the \(DFBETAS\) are large, then the observation has a large impact on those regression coefficients. We label observations as suspicious if their \(DFBETAS\) have magnitude greater than 1 for small data sets or \(2/\sqrt{n}\) for large data sets. We can calculate the \(DFBETAS\) with the `dfbetas` function (some output has been omitted):

```r
dfb <- dfbetas(cars.lm)
head(dfb)
```

We see that the inclusion of the first observation slightly increases the intercept and slightly decreases the coefficient on `speed`.

We can measure the influence that an observation has on its fitted value with \(DFFITS\). These are calculated by deleting an observation, refitting the model, recalculating the fit, then standardizing. The formula is
\begin{equation}
(DFFITS)_{i}=\frac{\hat{Y}_{i}-\hat{Y}_{(i)}}{S_{(i)}\sqrt{h_{ii}}},\quad i=1,2,\ldots,n.
\end{equation}
The value represents the number of standard deviations by which the fitted value \(\hat{Y}_{i}\) increases or decreases with the inclusion of the \(i^{\textrm{th}}\) observation. We can compute them with the `dffits` function.

```r
dff <- dffits(cars.lm)
dff[1:5]
```

A rule of thumb is to flag observations whose \(DFFITS\) exceeds one in absolute value, but there are none of those in this data set.
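
A convenient identity (not derived here) writes \((DFFITS)_{i} = t_{i}\sqrt{h_{ii}/(1-h_{ii})}\) in terms of the studentized deleted residual and the leverage; a quick numerical check:

```r
# DFFITS via the identity with rstudent and the leverages
h <- hatvalues(cars.lm)
cbind(by.hand = (rstudent(cars.lm) * sqrt(h/(1 - h)))[1:5],
      dffits  = dffits(cars.lm)[1:5])
```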

#### Cook's Distance

The \(DFFITS\) are good for measuring the influence on a single fitted value, but we may want to measure the influence an observation has on all of the fitted values simultaneously. The statistics used for measuring this are Cook's distances, which may be calculated[^slr-4] by the formula
\begin{equation}
\label{eq-slr-cooks-distance}
D_{i}=\frac{E_{i}^{2}}{(p+1)S^{2}}\cdot\frac{h_{ii}}{(1-h_{ii})^{2}},\quad i=1,2,\ldots,n.
\end{equation}
It shows that Cook's distance depends both on the residual \(E_{i}\) and the leverage \(h_{ii}\), and in this way \(D_{i}\) contains information about outlying \(x\) and \(y\) values.

To assess the significance of \(D\), we compare to quantiles of an \(\mathsf{f}(\mathtt{df1}=2,\,\mathtt{df2}=n-2)\) distribution. A rule of thumb is to classify observations falling higher than the \(50^{\mathrm{th}}\) percentile as being extreme.

[^slr-4]: Cook's distances are actually defined by a different formula than the one shown. The formula in Equation \eqref{eq-slr-cooks-distance} is algebraically equivalent to the defining formula and is, in the author's opinion, more transparent.

#### How to do it with R

We can calculate the Cook's distances with the `cooks.distance` function.

```r
cooksD <- cooks.distance(cars.lm)
cooksD[1:4]
```
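
As a check on Equation \eqref{eq-slr-cooks-distance} (with \(p = 1\) predictor, so \(p + 1 = 2\)), a sketch that recomputes the first few distances by hand:

```r
# Cook's distances from the residuals and leverages
E  <- residuals(cars.lm)
h  <- hatvalues(cars.lm)
S2 <- summary(cars.lm)$sigma^2
D  <- E^2 / (2 * S2) * h / (1 - h)^2
cbind(by.hand = D[1:4], cooks.distance = cooksD[1:4])
```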

We can look at a plot of the Cook's distances with the command `plot(cars.lm, which = 4)`.

```r
plot(cars.lm, which = 4)
```

(ref:cap-cooks-distance-cars) \small Cook's distances for the cars data. Used for checking for influential and/or outlying observations. Values with large Cook's distance merit further investigation.

Observations with the largest Cook's D values are labeled, hence we see that observations 23, 39, and 49 are suspicious. However, we need to compare to the quantiles of an \(\mathsf{f}(\mathtt{df1} = 2,\,\mathtt{df2} = 48)\) distribution:

```r
F0.50 <- qf(0.5, df1 = 2, df2 = 48)
any(cooksD > F0.50)
```

We see that with this data set there are no observations with extreme Cook's distance, after all.

### All Influence Measures Simultaneously

We can display the result of diagnostic checking all at once in one table, with potentially influential points displayed. We do it with the command `influence.measures(cars.lm)`:

```r
influence.measures(cars.lm)
```

The output is a huge matrix display, which we have omitted in the interest of brevity. A point is identified if it is classified to be influential with respect to any of the diagnostic measures. Here we see that observations 2, 11, 15, and 18 merit further investigation.

We can also look at all diagnostic plots at once with the commands

```r
plot(cars.lm)
```

The `par` command is used so that \(2\times 2 = 4\) plots will be shown on the same display. The diagnostic plots for the `cars` data are shown in Figure \@ref(fig:diagnostic-plots-cars).

```r
par(mfrow = c(2,2))
plot(cars.lm)
par(mfrow = c(1,1))
```

(ref:cap-diagnostic-plots-cars) \small Diagnostic plots for the cars data.

We have discussed all of the plots except the last, which is possibly the most interesting. It shows Residuals vs. Leverage, which will identify outlying \(y\) values versus outlying \(x\) values. Here we see that observation 23 has a high residual, but low leverage, and it turns out that observations 1 and 2 have relatively high leverage but low/moderate residuals (they are on the right side of the plot, just above the horizontal line). Observation 49 has a large residual with a comparatively large leverage.

We can identify the observations with the `identify` command; it allows us to display the observation numbers of points on the plot. First, we plot the graph, then we call `identify`:

```r
plot(cars.lm, which = 5)          # std'd resids vs lev plot
identify(leverage, sres, n = 4)   # identify 4 points
```

The graph with the identified points is omitted (but the plain plot is shown in the bottom right corner of Figure \@ref(fig:diagnostic-plots-cars)). Observations 1 and 2 fall on the far right side of the plot, near the horizontal axis.

## Exercises

\bigskip

```{block, type="xca"}
Prove the ANOVA equality, Equation \eqref{eq-anovaeq}. *Hint*: show that
\[
\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})(\hat{Y}_{i}-\overline{Y})=0.
\]
```

\bigskip

```{block, type="xca",label="xca-find-mles-slr"}
Solve the following system of equations for
\(\beta_{1}\) and \(\beta_{0}\) to find the MLEs for slope and
intercept in the simple linear regression model.
\begin{eqnarray*}
n\beta_{0}+\beta_{1}\sum_{i=1}^{n}x_{i} & = & \sum_{i=1}^{n}y_{i}\\
\beta_{0}\sum_{i=1}^{n}x_{i}+\beta_{1}\sum_{i=1}^{n}x_{i}^{2} & = & \sum_{i=1}^{n}x_{i}y_{i}
\end{eqnarray*}
```

\bigskip

```{block, type="xca", label="xca-show-alternate-slope-formula"}
Show that the formula given in Equation \eqref{eq-sample-correlation-formula} is equivalent to
\[
\hat{\beta}_{1} = \frac{\sum_{i=1}^{n}x_{i}y_{i}-\left.\left(\sum_{i=1}^{n}x_{i}\right)\left(\sum_{i=1}^{n}y_{i}\right)\right/ n}{\sum_{i=1}^{n}x_{i}^{2}-\left.\left(\sum_{i=1}^{n}x_{i}\right)^{2}\right/ n}.
\]

```


