# Multiple Linear Regression {#cha-multiple-linear-regression}

```r
#    IPSUR: Introduction to Probability and Statistics Using R
#    Copyright (C) 2018 G. Jay Kerns
#
#    Chapter: Multiple Linear Regression
#
#    This file is part of IPSUR.
#
#    IPSUR is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    IPSUR is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with IPSUR.  If not, see <http://www.gnu.org/licenses/>.

# This chapter's package dependencies
library(scatterplot3d)
library(lattice)
```

We know a lot about simple linear regression models, and a next step is to study multiple regression models that have more than one independent (explanatory) variable. In the discussion that follows we will assume that we have (p) explanatory variables, where (p > 1).

The language is phrased in matrix terms -- for two reasons. First, it is quicker to write and (arguably) more pleasant to read. Second, the matrix approach will be required for later study of the subject; the reader might as well be introduced to it now.

Most of the results are stated without proof or with only a cursory justification. Those yearning for more should consult an advanced text in linear regression for details, such as Applied Linear Regression Models [@Neter1996] or Linear Models: Least Squares and Alternatives [@Rao1999].

What do I want them to know?

## The Multiple Linear Regression Model {#sec-the-mlr-model}

The first thing to do is get some better notation. We will write
\begin{equation}
\mathbf{Y}_{\mathrm{n}\times 1}=\begin{bmatrix}y_{1}\\ y_{2}\\ \vdots\\ y_{n}\end{bmatrix}, \quad \mbox{and}\quad \mathbf{X}_{\mathrm{n}\times(\mathrm{p}+1)}=\begin{bmatrix}1 & x_{11} & x_{21} & \cdots & x_{p1}\\ 1 & x_{12} & x_{22} & \cdots & x_{p2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{1n} & x_{2n} & \cdots & x_{pn}\end{bmatrix}.
\end{equation}
The vector \(\mathbf{Y}\) is called the response vector \index{response vector} and the matrix \(\mathbf{X}\) is called the model matrix \index{model matrix}. As in Chapter \@ref(cha-simple-linear-regression), the most general assumption that relates \(\mathbf{Y}\) to \(\mathbf{X}\) is
\begin{equation}
\mathbf{Y}=\mu(\mathbf{X})+\upepsilon,
\end{equation}
where \(\mu\) is some function (the signal) and \(\upepsilon\) is the noise (everything else). We usually impose some structure on \(\mu\) and \(\upepsilon\). In particular, the standard multiple linear regression model \index{model!multiple linear regression} assumes
\begin{equation}
\mathbf{Y}=\mathbf{X}\upbeta+\upepsilon,
\end{equation}
where the parameter vector \(\upbeta\) looks like
\begin{equation}
\upbeta_{(\mathrm{p}+1)\times 1}=\begin{bmatrix}\beta_{0} & \beta_{1} & \cdots & \beta_{p}\end{bmatrix}^{\mathrm{T}},
\end{equation}
and the random vector \(\upepsilon_{\mathrm{n}\times 1}=\begin{bmatrix}\epsilon_{1} & \epsilon_{2} & \cdots & \epsilon_{n}\end{bmatrix}^{\mathrm{T}}\) is assumed to be distributed
\begin{equation}
\upepsilon\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{0}_{\mathrm{n}\times 1},\,\mathtt{sigma}=\sigma^{2}\mathbf{I}_{\mathrm{n}\times\mathrm{n}}\right).
\end{equation}

The assumption on (\upepsilon) is equivalent to the assumption that (\epsilon_{1}), (\epsilon_{2}), ..., (\epsilon_{n}) are IID (\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma)). It is a linear model because the quantity (\mu(\mathbf{X})=\mathbf{X}\upbeta) is linear in the parameters (\beta_{0}), (\beta_{1}), ..., (\beta_{p}). It may be helpful to see the model in expanded form; the above matrix formulation is equivalent to the more lengthy \begin{equation} Y_{i}=\beta_{0}+\beta_{1}x_{1i}+\beta_{2}x_{2i}+\cdots+\beta_{p}x_{pi}+\epsilon_{i},\quad i=1,2,\ldots,n. \end{equation}
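
To make the matrix formulation concrete, here is a minimal sketch (added for illustration; the numbers are made up) that simulates a small data set obeying the model with \(p=2\) explanatory variables and assembles the response as \(\mathbf{X}\upbeta+\upepsilon\).

```r
# Simulate Y = X beta + epsilon for n = 10 observations and p = 2 predictors
set.seed(42)
n <- 10
x1 <- runif(n, 8, 20)                  # hypothetical explanatory variables
x2 <- runif(n, 60, 90)
X <- cbind(1, x1, x2)                  # model matrix: a column of 1s plus the predictors
beta <- c(-58, 4.7, 0.3)               # chosen (not estimated) parameter vector
epsilon <- rnorm(n, mean = 0, sd = 3)  # IID normal errors
Y <- X %*% beta + epsilon              # the multiple linear regression model
cbind(Y, X)
```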

\bigskip

```{example, name="Girth, Height, and Volume for Black Cherry trees"}
\index{Data sets!trees@\texttt{trees}} Measurements were made of the girth, height, and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground.
```

The variables are

1. `Girth`: tree diameter in inches (denoted \(x_{1}\))
2. `Height`: tree height in feet (denoted \(x_{2}\))
3. `Volume`: volume of the tree in cubic feet (denoted \(y\))

The data are in the `datasets` package [@datasets] and are already
on the search path; they can be viewed with

```r
head(trees)
```

Let us take a look at a visual display of the data. For multiple variables, instead of a simple scatterplot we use a scatterplot matrix which is made with the splom function in the lattice package [@lattice] as shown below. The plot is shown in Figure \@ref(fig:splom-trees).

```r
splom(trees)
```

(ref:cap-splom-trees) \small A scatterplot matrix of the trees data.

The dependent (response) variable Volume is listed in the first row of the scatterplot matrix. Moving from left to right, we see an approximately linear relationship between Volume and the independent (explanatory) variables Height and Girth. A first guess at a model for these data might be \begin{equation} Y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\epsilon, \end{equation} in which case the quantity (\mu(x_{1},x_{2})=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}) would represent the mean value of (Y) at the point ((x_{1},x_{2})).

What does it mean?

The interpretation is simple. The intercept (\beta_{0}) represents the mean Volume when all other independent variables are zero. The parameter (\beta_{i}) represents the change in mean Volume when there is a unit increase in (x_{i}), while the other independent variable is held constant. For the trees data, (\beta_{1}) represents the change in average Volume as Girth increases by one unit when the Height is held constant, and (\beta_{2}) represents the change in average Volume as Height increases by one unit when the Girth is held constant.

In simple linear regression, we had one independent variable and our linear regression surface was 1D, simply a line. In multiple regression there are many independent variables and so our linear regression surface will be many-D... in general, a hyperplane. But when there are only two explanatory variables the hyperplane is just an ordinary plane and we can look at it with a 3D scatterplot.

One way to do this is with the R Commander in the Rcmdr package [@Rcmdr]. It has a 3D scatterplot option under the Graphs menu. It is especially great because the resulting graph is dynamic; it can be moved around with the mouse, zoomed, etc. But that particular display does not translate well to a printed book.

Another way to do it is with the scatterplot3d function in the scatterplot3d package [@scatterplot3d]. The code follows, and the result is shown in Figure \@ref(fig:3D-scatterplot-trees).

```r
s3d <- with(trees, scatterplot3d(Girth, Height, Volume, 
            pch = 16, highlight.3d = TRUE, angle = 60))
```

(ref:cap-3D-scatterplot-trees) \small A 3D scatterplot with regression plane for the trees data.

Looking at the graph we see that the data points fall close to a plane in three dimensional space. (The plot looks remarkably good. In the author's experience it is rare to see points fit so well to the plane without some additional work.)

## Estimation and Prediction {#sec-estimation-and-prediction-mlr}

### Parameter estimates {#sub-mlr-parameter-estimates}

We will proceed exactly like we did in Section \@ref(sec-slr-estimation). We know
\begin{equation}
\upepsilon\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{0}_{\mathrm{n}\times 1},\,\mathtt{sigma}=\sigma^{2}\mathbf{I}_{\mathrm{n}\times\mathrm{n}}\right),
\end{equation}
which means that \(\mathbf{Y}=\mathbf{X}\upbeta+\upepsilon\) has an \(\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{X}\upbeta,\,\mathtt{sigma}=\sigma^{2}\mathbf{I}_{\mathrm{n}\times\mathrm{n}}\right)\) distribution. Therefore, the likelihood function \index{likelihood function} is
\begin{equation}
L(\upbeta,\sigma)=\frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\left\{ -\frac{1}{2\sigma^{2}}\left(\mathbf{Y}-\mathbf{X}\upbeta\right)^{\mathrm{T}}\left(\mathbf{Y}-\mathbf{X}\upbeta\right)\right\}.
\end{equation}

To maximize the likelihood \index{maximum likelihood} in (\upbeta), we need to minimize the quantity (g(\upbeta)=\left(\mathbf{Y}-\mathbf{X}\upbeta\right)^{\mathrm{T}}\left(\mathbf{Y}-\mathbf{X}\upbeta\right)). We do this by differentiating (g) with respect to (\upbeta). (It may be a good idea to brush up on the material in Appendices \@ref(sec-linear-algebra) and \@ref(sec-multivariable-calculus).) First we will rewrite (g): \begin{equation} g(\upbeta)=\mathbf{Y}^{\mathrm{T}}\mathbf{Y}-\mathbf{Y}^{\mathrm{T}}\mathbf{X}\upbeta-\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{Y}+\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta, \end{equation} which can be further simplified to (g(\upbeta)=\mathbf{Y}^{\mathrm{T}}\mathbf{Y}-2\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{Y}+\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta) since (\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{Y}) is (1\times1) and thus equal to its transpose. Now we differentiate to get \begin{equation} \frac{\partial g}{\partial\upbeta}=\mathbf{0}-2\mathbf{X}^{\mathrm{T}}\mathbf{Y}+2\mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta, \end{equation} since (\mathbf{X}^{\mathrm{T}}\mathbf{X}) is symmetric. Setting the derivative equal to the zero vector yields the so called "normal equations" \index{normal equations} \begin{equation} \mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta=\mathbf{X}^{\mathrm{T}}\mathbf{Y}. \end{equation}

In the case that (\mathbf{X}^{\mathrm{T}}\mathbf{X}) is invertible[^mlr-1], we may solve the equation for (\upbeta) to get the maximum likelihood estimator of (\upbeta) which we denote by (\mathbf{b}): \begin{equation} \label{eq-b-formula-matrix} \mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}. \end{equation}

[^mlr-1]: We can find solutions of the normal equations even when (\mathbf{X}^{\mathrm{T}}\mathbf{X}) is not of full rank, but the topic falls outside the scope of this book. The interested reader can consult an advanced text such as Rao [@Rao1999].

\bigskip

```{block, type="rem"}
The formula in Equation \eqref{eq-b-formula-matrix} is convenient for mathematical study but is inconvenient for numerical computation. Researchers have devised much more efficient algorithms for the actual calculation of the parameter estimates, and we do not explore them here.
```
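
As a quick illustration of Equation \eqref{eq-b-formula-matrix} (a sketch added here, not part of the original text, and certainly not how one should compute in practice), we can build \(\mathbf{X}\) and \(\mathbf{Y}\) for the `trees` data and solve the normal equations directly; the numbers agree with what `lm` reports in the next subsection.

```r
# Naive computation of b = (X'X)^{-1} X'Y for the trees data (illustration only)
X <- cbind(1, trees$Girth, trees$Height)   # model matrix with an intercept column
Y <- trees$Volume
b <- solve(t(X) %*% X, t(X) %*% Y)         # solve the normal equations X'X b = X'Y
b
coef(lm(Volume ~ Girth + Height, data = trees))   # same estimates from lm
```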

\bigskip

```{block2, type="rem"}
We have only found a critical value, and have not actually shown that
the critical value is a minimum. We omit the details and refer the
interested reader to [@Rao1999].
```

#### How to do it with R

We do all of the above just as we would in simple linear regression. The powerhouse is the lm \index{lm@\texttt{lm}} function. Everything else is based on it. We separate explanatory variables in the model formula by a plus sign.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)
trees.lm <- lm(Volume ~ Girth + Height, data = trees)
trees.lm
```

We see from the output that for the `trees` data our parameter estimates are
\[
\mathbf{b}=\begin{bmatrix}-58.0 & 4.7 & 0.3\end{bmatrix},
\]
and consequently our estimate of the mean response is \(\hat{\mu}\) given by
\begin{alignat}{1}
\hat{\mu}(x_{1},x_{2}) = & \ b_{0} + b_{1} x_{1} + b_{2}x_{2},\\
\approx & \ -58.0 + 4.7 x_{1} + 0.3 x_{2}.
\end{alignat}

We could see the entire model matrix (\mathbf{X}) with the model.matrix \index{model.matrix@\texttt{model.matrix}} function, but in the interest of brevity we only show the first few rows.

```r
head(model.matrix(trees.lm))
```

### Point Estimates of the Regression Surface {#sub-mlr-point-est-regsurface}

The parameter estimates \(\mathbf{b}\) make it easy to find the fitted values \index{fitted values}, \(\hat{\mathbf{Y}}\). We write them individually as \(\hat{Y}_{i}\), \(i=1,2,\ldots,n\), and recall that they are defined by
\begin{eqnarray}
\hat{Y}_{i} & = & \hat{\mu}(x_{1i},x_{2i}),\\
 & = & b_{0}+b_{1}x_{1i}+b_{2}x_{2i},\quad i=1,2,\ldots,n.
\end{eqnarray}
They are expressed more compactly by the matrix equation
\begin{equation}
\hat{\mathbf{Y}}=\mathbf{X}\mathbf{b}.
\end{equation}
From Equation \eqref{eq-b-formula-matrix} we know that \(\mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}\), so we can rewrite
\begin{eqnarray}
\hat{\mathbf{Y}} & = & \mathbf{X}\left[\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}\right],\\
 & = & \mathbf{H}\mathbf{Y},
\end{eqnarray}
where \(\mathbf{H}=\mathbf{X}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\) is appropriately named the hat matrix \index{hat matrix} because it "puts the hat on \(\mathbf{Y}\)". The hat matrix is very important in later courses. Two useful facts about \(\mathbf{H}\) are that it is symmetric and idempotent, that is, \(\mathbf{H}^{\mathrm{T}}=\mathbf{H}\) and \(\mathbf{H}^{2}=\mathbf{H}\) (and the same is true of \(\mathbf{I}-\mathbf{H}\)).
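
Here is a brief numerical check of these facts for the `trees` data (a sketch added for illustration; it assumes the fit `trees.lm` computed earlier in this section).

```r
# Hat matrix for the trees fit, and a check that H Y reproduces the fitted values
X <- model.matrix(trees.lm)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(as.numeric(H %*% trees$Volume), as.numeric(fitted(trees.lm)))
all.equal(H %*% H, H)   # idempotent
all.equal(t(H), H)      # symmetric
```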

Now let us write a column vector \(\mathbf{x}_{0}=(x_{10},x_{20})^{\mathrm{T}}\) to denote given values of the explanatory variables `Girth` \(=x_{10}\) and `Height` \(=x_{20}\). These values may match those of the collected data, or they may be completely new values not observed in the original data set. We may use the parameter estimates to find \(\hat{Y}(\mathbf{x}_{0})\), which will give us 1. an estimate of \(\mu(\mathbf{x}_{0})\), the mean value of a future observation at \(\mathbf{x}_{0}\), and 2. a prediction for \(Y(\mathbf{x}_{0})\), the actual value of a future observation at \(\mathbf{x}_{0}\).

We can represent \(\hat{Y}(\mathbf{x}_{0})\) by the matrix equation
\begin{equation}
\label{eq-mlr-single-yhat-matrix}
\hat{Y}(\mathbf{x}_{0})=\mathbf{x}_{0}^{\mathrm{T}}\mathbf{b},
\end{equation}
which is just a fancy way to write
\begin{equation}
\hat{Y}(x_{10},x_{20})=b_{0}+b_{1}x_{10}+b_{2}x_{20}.
\end{equation}

\bigskip

If we wanted to predict the average volume of black cherry trees that
have `Girth = 15` in and are `Height = 77` ft tall then we would use
the estimate

\begin{alignat}{1}
\hat{\mu}(15,\,77)= & \ -58+4.7(15)+0.3(77),\\
\approx & \ 35.6\text{ ft}^{3}.
\end{alignat}

We would use the same estimate (\hat{Y}=35.6) to predict the measured Volume of another black cherry tree -- yet to be observed -- that has Girth = 15 in and is Height = 77 ft tall.
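
As a sanity check (an illustration added here, reusing `trees.lm` from above), the same kind of estimate can be computed by hand as the inner product \(\mathbf{x}_{0}^{\mathrm{T}}\mathbf{b}\); the answer differs slightly from 35.6 because the arithmetic above used coefficients rounded to one decimal place.

```r
# Point estimate/prediction at Girth = 15 and Height = 77, computed by hand
x0 <- c(1, 15, 77)          # the leading 1 matches the intercept column
sum(x0 * coef(trees.lm))    # x0' b with the unrounded coefficients
```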

#### How to do it with R

The fitted values are stored inside trees.lm and may be accessed with the fitted function. We only show the first five fitted values.

```r
fitted(trees.lm)[1:5]
```

The syntax for general prediction does not change much from simple linear regression. The computations are done with the predict function as described below.

The only difference from SLR is in the way we tell R the values of the explanatory variables for which we want predictions. In SLR we had only one independent variable but in MLR we have many (for the trees data we have two). We will store values for the independent variables in the data frame new, which has two columns (one for each independent variable) and three rows (we shall make predictions at three different locations).

```r
new <- data.frame(Girth = c(9.1, 11.6, 12.5), 
                  Height = c(69, 74, 87))
```

We can view the locations at which we will predict:

```r
new
```

We continue just like we would have done in SLR.

```r
predict(trees.lm, newdata = new)
treesFIT <- round(predict(trees.lm, newdata = new), 1)
```

Example. Using the trees data,

  1. Report a point estimate of the mean `Volume` of a tree of `Girth` 9.1 in and `Height` 69 ft. The fitted value for \(x_{1}=9.1\) and \(x_{2}=69\) is `r treesFIT[1]`, so a point estimate would be `r treesFIT[1]` cubic feet.
  2. Report a point prediction for and a 95% prediction interval for the `Volume` of a hypothetical tree of `Girth` 12.5 in and `Height` 87 ft. The fitted value for \(x_{1}=12.5\) and \(x_{2}=87\) is `r treesFIT[3]`, so a point prediction for the `Volume` is `r treesFIT[3]` cubic feet.

### Mean Square Error and Standard Error {#sub-mlr-mse-se}

The residuals are given by
\begin{equation}
\mathbf{E}=\mathbf{Y}-\hat{\mathbf{Y}}=\mathbf{Y}-\mathbf{H}\mathbf{Y}=(\mathbf{I}-\mathbf{H})\mathbf{Y}.
\end{equation}
Now we can use Theorem \@ref(thm:mvnorm-dist-matrix-prod) to see that the residuals are distributed
\begin{equation}
\mathbf{E}\sim\mathsf{mvnorm}(\mathtt{mean}=\mathbf{0},\,\mathtt{sigma}=\sigma^{2}(\mathbf{I}-\mathbf{H})),
\end{equation}
since \((\mathbf{I}-\mathbf{H})\mathbf{X}\upbeta=\mathbf{X}\upbeta-\mathbf{X}\upbeta=\mathbf{0}\) and \((\mathbf{I}-\mathbf{H})\,(\sigma^{2}\mathbf{I})\,(\mathbf{I}-\mathbf{H})^{\mathrm{T}}=\sigma^{2}(\mathbf{I}-\mathbf{H})^{2}=\sigma^{2}(\mathbf{I}-\mathbf{H})\). The sum of squared errors (SSE) is just
\begin{equation}
SSE=\mathbf{E}^{\mathrm{T}}\mathbf{E}=\mathbf{Y}^{\mathrm{T}}(\mathbf{I}-\mathbf{H})(\mathbf{I}-\mathbf{H})\mathbf{Y}=\mathbf{Y}^{\mathrm{T}}(\mathbf{I}-\mathbf{H})\mathbf{Y}.
\end{equation}
Recall that in SLR we had two parameters (\(\beta_{0}\) and \(\beta_{1}\)) in our regression model and we estimated \(\sigma^{2}\) with \(s^{2}=SSE/(n-2)\). In MLR, we have \(p+1\) parameters in our regression model and we might guess that to estimate \(\sigma^{2}\) we would use the mean square error \(S^{2}\) defined by
\begin{equation}
S^{2}=\frac{SSE}{n-(p+1)}.
\end{equation}
That would be a good guess. The residual standard error is \(S=\sqrt{S^{2}}\).

#### How to do it with R

The residuals are also stored with trees.lm and may be accessed with the residuals function. We only show the first five residuals.

```r
residuals(trees.lm)[1:5]
```

The summary function output (shown later) lists the Residual Standard Error which is just (S=\sqrt{S^{2}}). It is stored in the sigma component of the summary object.

```r
treesumry <- summary(trees.lm)
treesumry$sigma
```

For the `trees` data we find \(s \approx\) `r round(treesumry$sigma,3)`.
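
The following sketch (added for illustration) verifies the formula for \(S^{2}\) directly from the residuals; for the `trees` fit, \(n=31\) and \(p=2\).

```r
# Residual standard error from its definition, S = sqrt(SSE / (n - p - 1))
SSE <- sum(residuals(trees.lm)^2)
n <- nrow(trees); p <- 2
sqrt(SSE / (n - p - 1))   # matches treesumry$sigma
```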

### Interval Estimates of the Parameters {#sub-mlr-interval-est-params}

We showed in Section \@ref(sub-mlr-parameter-estimates) that (\mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}), which is really just a big matrix -- namely (\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}) -- multiplied by (\mathbf{Y}). It stands to reason that the sampling distribution of (\mathbf{b}) would be intimately related to the distribution of (\mathbf{Y}), which we assumed to be \begin{equation} \mathbf{Y}\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{X}\upbeta,\,\mathtt{sigma}=\sigma^{2}\mathbf{I}\right). \end{equation} Now recall Theorem \@ref(thm:mvnorm-dist-matrix-prod) that we said we were going to need eventually (the time is now). That proposition guarantees that \begin{equation} \label{eq-distn-b-mlr} \mathbf{b}\sim\mathsf{mvnorm}\left(\mathtt{mean}=\upbeta,\,\mathtt{sigma}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\right), \end{equation} since \begin{equation} \mathbb{E}\mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}(\mathbf{X}\upbeta)=\upbeta, \end{equation} and \begin{equation} \mbox{Var}(\mathbf{b})=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}(\sigma^{2}\mathbf{I})\mathbf{X}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}, \end{equation} the first equality following because the matrix (\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}) is symmetric.

There is a lot that we can glean from Equation \eqref{eq-distn-b-mlr}. First, it follows that the estimator \(\mathbf{b}\) is unbiased (see Section \@ref(sec-point-estimation-1)). Second, the variances of \(b_{0}\), \(b_{1}\), ..., \(b_{p}\) are exactly the diagonal elements of \(\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\), which is completely known except for that pesky parameter \(\sigma^{2}\). Third, we can estimate the standard error of \(b_{i}\) (denoted \(S_{b_{i}}\)) with the residual standard error \(S\) (defined in the previous section) multiplied by the square root of the corresponding diagonal element of \(\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\). Finally, given estimates of the standard errors we may construct confidence intervals for \(\beta_{i}\) with an interval that looks like
\begin{equation}
b_{i}\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-p-1)\,S_{b_{i}}.
\end{equation}
The degrees of freedom for the Student's \(t\) distribution[^mlr-2] are the same as the denominator of \(S^{2}\).

[^mlr-2]: We are taking great leaps over the mathematical details. In particular, we have yet to show that (s^{2}) has a chi-square distribution and we have not even come close to showing that (b_{i}) and (s_{b_{i}}) are independent. But these are entirely outside the scope of the present book and the reader may rest assured that the proofs await in later classes. See C.R. Rao for more.

#### How to do it with R

To get confidence intervals for the parameters we need only use confint \index{confint@\texttt{confint}}:

```r
confint(trees.lm)
treesPAR <- round(confint(trees.lm), 1)
```

For example, using the calculations above we say that for the regression model `Volume ~ Girth + Height` we are 95% confident that the parameter \(\beta_{1}\) lies somewhere in the interval `r treesPAR[2,1]` to `r treesPAR[2,2]`.
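
To connect this output back to the formula \(b_{i}\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-p-1)\,S_{b_{i}}\), here is a small check (added for illustration) that reproduces the interval for \(\beta_{1}\) by hand from the estimated covariance matrix of \(\mathbf{b}\).

```r
# Reconstruct the 95% CI for the Girth coefficient from its standard error
b1 <- coef(trees.lm)["Girth"]
se1 <- sqrt(diag(vcov(trees.lm)))["Girth"]   # S times sqrt of a diagonal element of (X'X)^{-1}
tcrit <- qt(0.975, df = nrow(trees) - 3)     # n - (p + 1) = 31 - 3 degrees of freedom
b1 + c(-1, 1) * tcrit * se1                  # agrees with confint(trees.lm)["Girth", ]
```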

### Confidence and Prediction Intervals

We saw in Section \@ref(sub-mlr-point-est-regsurface) how to make point estimates of the mean value of additional observations and predict values of future observations, but how good are our estimates? We need confidence and prediction intervals to gauge their accuracy, and lucky for us the formulas look similar to the ones we saw in SLR.

In Equation \eqref{eq-mlr-single-yhat-matrix} we wrote \(\hat{Y}(\mathbf{x}_{0})=\mathbf{x}_{0}^{\mathrm{T}}\mathbf{b}\), and in Equation \eqref{eq-distn-b-mlr} we saw that
\begin{equation}
\mathbf{b}\sim\mathsf{mvnorm}\left(\mathtt{mean} = \upbeta,\,\mathtt{sigma}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\right).
\end{equation}
The following is therefore immediate from Theorem \@ref(thm:mvnorm-dist-matrix-prod):
\begin{equation}
\hat{Y}(\mathbf{x}_{0})\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{x}_{0}^{\mathrm{T}}\upbeta,\,\mathtt{sigma}=\sigma^{2}\mathbf{x}_{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}_{0}\right).
\end{equation}
It should be no surprise that confidence intervals for the mean value of a future observation at the location \(\mathbf{x}_{0}=\begin{bmatrix}x_{10} & x_{20} & \ldots & x_{p0}\end{bmatrix}^{\mathrm{T}}\) are given by
\begin{equation}
\hat{Y}(\mathbf{x}_{0})\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-p-1)\, S\sqrt{\mathbf{x}_{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}_{0}}.
\end{equation}
Intuitively, \(\mathbf{x}_{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}_{0}\) measures the distance of \(\mathbf{x}_{0}\) from the center of the data. The degrees of freedom in the Student's \(t\) critical value are \(n-(p+1)\) because we need to estimate \(p+1\) parameters.

Prediction intervals for a new observation at \(\mathbf{x}_{0}\) are given by
\begin{equation}
\hat{Y}(\mathbf{x}_{0})\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-p-1)\, S\sqrt{1+\mathbf{x}_{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}_{0}}.
\end{equation}
The prediction intervals are wider than the confidence intervals, just as in Section \@ref(sub-slr-interval-est-regline).
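
The quantity under the square root is easy to compute directly. The sketch below (added for illustration, reusing `trees.lm` and the point \(\mathbf{x}_{0}=(1,15,77)^{\mathrm{T}}\)) reproduces a 95% confidence interval for the mean response by hand; `predict` gives the same answer.

```r
# 95% CI for the mean Volume at Girth = 15 and Height = 77, from the formula above
X <- model.matrix(trees.lm)
x0 <- c(1, 15, 77)
S <- summary(trees.lm)$sigma
fit0 <- sum(x0 * coef(trees.lm))
se.fit0 <- S * sqrt(as.numeric(t(x0) %*% solve(t(X) %*% X) %*% x0))
fit0 + c(-1, 1) * qt(0.975, df = nrow(trees) - 3) * se.fit0
# compare: predict(trees.lm, data.frame(Girth = 15, Height = 77), interval = "confidence")
```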

#### How to do it with R

The syntax is identical to that used in SLR, with the proviso that we need to specify values of the independent variables in the data frame new as we did in Section \@ref(sub-slr-interval-est-regline) (which we repeat here for illustration).

```r
new <- data.frame(Girth = c(9.1, 11.6, 12.5), 
                  Height = c(69, 74, 87))
```

Confidence intervals are given by

```r
predict(trees.lm, newdata = new, interval = "confidence")
treesCI <- round(predict(trees.lm, newdata = new, interval = "confidence"), 1)
```

Prediction intervals are given by

```r
predict(trees.lm, newdata = new, interval = "prediction")
treesPI <- round(predict(trees.lm, newdata = new, interval = "prediction"), 1)
```

As before, the interval type is decided by the interval argument and the default confidence level is 95% (which can be changed with the level argument).

Example. Using the trees data,

  1. Report a 95% confidence interval for the mean `Volume` of a tree of `Girth` 9.1 in and `Height` 69 ft. The 95% CI is given by `r treesCI[1,2]` to `r treesCI[1,3]`, so with 95% confidence the mean `Volume` lies somewhere between `r treesCI[1,2]` cubic feet and `r treesCI[1,3]` cubic feet.
  2. Report a 95% prediction interval for the `Volume` of a hypothetical tree of `Girth` 12.5 in and `Height` 87 ft. The 95% prediction interval is given by `r treesPI[3,2]` to `r treesPI[3,3]`, so with 95% confidence we may assert that the hypothetical `Volume` of a tree of `Girth` 12.5 in and `Height` 87 ft would lie somewhere between `r treesPI[3,2]` cubic feet and `r treesPI[3,3]` cubic feet.

## Model Utility and Inference {#sec-model-utility-and-mlr}

### Multiple Coefficient of Determination

We saw in Section \@ref(sub-mlr-mse-se) that the error sum of squares (SSE) can be conveniently written in MLR as \begin{equation} \label{eq-mlr-sse-matrix} SSE=\mathbf{Y}^{\mathrm{T}}(\mathbf{I}-\mathbf{H})\mathbf{Y}. \end{equation} It turns out that there are equally convenient formulas for the total sum of squares (SSTO) and the regression sum of squares (SSR). They are: \begin{alignat}{1} \label{eq-mlr-ssto-matrix} SSTO= & \mathbf{Y}^{\mathrm{T}}\left(\mathbf{I}-\frac{1}{n}\mathbf{J}\right)\mathbf{Y} \end{alignat} and \begin{alignat}{1} \label{eq-mlr-ssr-matrix} SSR= & \mathbf{Y}^{\mathrm{T}}\left(\mathbf{H}-\frac{1}{n}\mathbf{J}\right)\mathbf{Y}. \end{alignat} (The matrix (\mathbf{J}) is defined in Appendix \@ref(sec-linear-algebra).) Immediately from Equations \eqref{eq-mlr-sse-matrix}, \eqref{eq-mlr-ssto-matrix}, and \eqref{eq-mlr-ssr-matrix} we get the Anova Equality \begin{equation} SSTO=SSE+SSR. \end{equation} (See Exercise \@ref(xca-anova-equality).) We define the multiple coefficient of determination by the formula \begin{equation} R^{2}=1-\frac{SSE}{SSTO}. \end{equation}

We interpret (R^{2}) as the proportion of total variation that is explained by the multiple regression model. In MLR we must be careful, however, because the value of (R^{2}) can be artificially inflated by the addition of explanatory variables to the model, regardless of whether or not the added variables are useful with respect to prediction of the response variable. In fact, it can be proved that the addition of a single explanatory variable to a regression model will increase the value of (R^{2}), no matter how worthless the explanatory variable is. We could model the height of the ocean tides, then add a variable for the length of cheetah tongues on the Serengeti plain, and our (R^{2}) would inevitably increase.

This is a problem, because as the philosopher, Occam, once said: "causes should not be multiplied beyond necessity". We address the problem by penalizing (R^{2}) when parameters are added to the model. The result is an adjusted (R^{2}) which we denote by (\overline{R}^{2}). \begin{equation} \overline{R}^{2}=\left(R^{2}-\frac{p}{n-1}\right)\left(\frac{n-1}{n-p-1}\right). \end{equation} It is good practice for the statistician to weigh both (R^{2}) and (\overline{R}^{2}) during assessment of model utility. In many cases their values will be very close to each other. If their values differ substantially, or if one changes dramatically when an explanatory variable is added, then (s)he should take a closer look at the explanatory variables in the model.
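
As an illustration of these formulas (a sketch added here, reusing `trees.lm`), both quantities can be computed from the sums of squares and compared with the values reported in the next subsection.

```r
# R^2 and adjusted R^2 for the trees fit, straight from the definitions
SSE <- sum(residuals(trees.lm)^2)
SSTO <- sum((trees$Volume - mean(trees$Volume))^2)
R2 <- 1 - SSE/SSTO
n <- nrow(trees); p <- 2
c(R2 = R2, adjR2 = (R2 - p/(n - 1)) * (n - 1)/(n - p - 1))
```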

#### How to do it with R

For the trees data, we can get (R^{2}) and (\overline{R}^{2}) from the summary output or access the values directly by name as shown (recall that we stored the summary object in treesumry).

```r
treesumry$r.squared
treesumry$adj.r.squared
```

High values of \(R^{2}\) and \(\overline{R}^{2}\) such as these indicate that the model fits very well, which agrees with what we saw in Figure \@ref(fig:3D-scatterplot-trees).

### Overall F-Test {#sub-mlr-overall-f-test}

Another way to assess the model's utility is to test the hypothesis
\[
H_{0}:\beta_{1}=\beta_{2}=\cdots=\beta_{p}=0\mbox{ versus }H_{1}:\mbox{ at least one }\beta_{i}\neq0.
\]
The idea is that if all \(\beta_{i}\)'s were zero, then the explanatory variables \(X_{1},\ldots,X_{p}\) would be worthless predictors for the response variable \(Y\). We can test the above hypothesis with the overall \(F\) statistic, which in MLR is defined by
\begin{equation}
F=\frac{SSR/p}{SSE/(n-p-1)}.
\end{equation}
When the regression assumptions hold and under \(H_{0}\), it can be shown that \(F\sim\mathsf{f}(\mathtt{df1}=p,\,\mathtt{df2}=n-p-1)\). We reject \(H_{0}\) when \(F\) is large, that is, when the explained variation is large relative to the unexplained variation.
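
The sketch below (added for illustration) computes the overall \(F\) statistic and its \(p\)-value from the sums of squares; the numbers agree with the `summary` output shown in the next subsection.

```r
# Overall F statistic for the trees fit, computed from its definition
SSE <- sum(residuals(trees.lm)^2)
SSTO <- sum((trees$Volume - mean(trees$Volume))^2)
SSR <- SSTO - SSE
n <- nrow(trees); p <- 2
Fstat <- (SSR/p) / (SSE/(n - p - 1))
c(F = Fstat, p.value = pf(Fstat, df1 = p, df2 = n - p - 1, lower.tail = FALSE))
```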

#### How to do it with R

The overall (F) statistic and its associated p-value is listed at the bottom of the summary output, or we can access it directly by name; it is stored in the fstatistic component of the summary object.

```r
treesumry$fstatistic
```

For the `trees` data, we see that \(F=\) `r treesumry$fstatistic[1]` with a \(p\)-value of less than `2.2e-16`. Consequently we reject \(H_{0}\), that is, the data provide strong evidence that not all \(\beta_{i}\)'s are zero.

### Student's t Tests {#sub-mlr-students-t-tests}

We know that
\begin{equation}
\mathbf{b}\sim\mathsf{mvnorm}\left(\mathtt{mean}=\upbeta,\,\mathtt{sigma}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\right)
\end{equation}
and we have seen how to test the hypothesis \(H_{0}:\beta_{1}=\beta_{2}=\cdots=\beta_{p}=0\), but let us now consider the test
\begin{equation}
H_{0}:\beta_{i}=0\mbox{ versus }H_{1}:\beta_{i}\neq0,
\end{equation}
where \(\beta_{i}\) is the coefficient for the \(i^{\textrm{th}}\) independent variable. We test the hypothesis by calculating a statistic, examining its null distribution, and rejecting \(H_{0}\) if the \(p\)-value is small. If \(H_{0}\) is rejected, then we conclude that there is a significant relationship between \(Y\) and \(x_{i}\) in the regression model \(Y\sim(x_{1},\ldots,x_{p})\). This last part of the sentence is very important because the significance of the variable \(x_{i}\) sometimes depends on the presence of other independent variables in the model[^mlr-3].

[^mlr-3]: In other words, a variable might be highly significant one moment but then fail to be significant when another variable is added to the model. When this happens it often indicates a problem with the explanatory variables, such as multicollinearity. See Section \@ref(sub-multicollinearity).

To test the hypothesis we need to find the sampling distribution of \(b_{i}\), the estimator of the corresponding parameter \(\beta_{i}\), when the null hypothesis is true. We saw in Section \@ref(sub-mlr-interval-est-params) that
\begin{equation}
T_{i}=\frac{b_{i}-\beta_{i}}{S_{b_{i}}}
\end{equation}
has a Student's \(t\) distribution with \(n-(p+1)\) degrees of freedom. (Remember, we are estimating \(p+1\) parameters.) Consequently, under the null hypothesis \(H_{0}:\beta_{i}=0\) the statistic \(t_{i}=b_{i}/S_{b_{i}}\) has a \(\mathsf{t}(\mathtt{df}=n-p-1)\) distribution.
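
For concreteness (a sketch added here), the \(t\) statistic and two-sided \(p\)-value for the `Girth` coefficient can be reproduced from `coef` and `vcov`; they match the `summary` table shown next.

```r
# t statistic and p-value for H0: beta_Girth = 0 in the trees fit
b1 <- coef(trees.lm)["Girth"]
se1 <- sqrt(diag(vcov(trees.lm)))["Girth"]
t1 <- b1 / se1
c(t = t1, p.value = 2 * pt(abs(t1), df = nrow(trees) - 3, lower.tail = FALSE))
```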

#### How to do it with R

The Student's (t) tests for significance of the individual explanatory variables are shown in the summary output.

```r
treesumry
```

We see from the p-values that there is a significant linear relationship between Volume and Girth and between Volume and Height in the regression model Volume ~ Girth + Height. Further, it appears that the Intercept is significant in the aforementioned model.

## Polynomial Regression {#sec-polynomial-regression}

### Quadratic Regression Model

In each of the previous sections we assumed that (\mu) was a linear function of the explanatory variables. For example, in SLR we assumed that (\mu(x)=\beta_{0}+\beta_{1}x), and in our previous MLR examples we assumed (\mu(x_{1},x_{2}) = \beta_{0}+\beta_{1}x_{1} + \beta_{2}x_{2}). In every case the scatterplots indicated that our assumption was reasonable. Sometimes, however, plots of the data suggest that the linear model is incomplete and should be modified.

```r
plot(Volume ~ Girth, data = trees)
```

(ref:cap-scatterplot-volume-girth-trees) \small Scatterplot of Volume versus Girth for the trees data.

For example, let us examine a scatterplot of Volume versus Girth a little more closely. See Figure \@ref(fig:scatterplot-volume-girth-trees). There might be a slight curvature to the data; the volume curves ever so slightly upward as the girth increases. After looking at the plot we might try to capture the curvature with a mean response such as \begin{equation} \mu(x_{1})=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{1}^{2}. \end{equation} The model associated with this choice of (\mu) is \begin{equation} Y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{1}^{2}+\epsilon. \end{equation} The regression assumptions are the same. Almost everything indeed is the same. In fact, it is still called a "linear regression model", since the mean response (\mu) is linear in the parameters (\beta_{0}), (\beta_{1}), and (\beta_{2}).

However, there is one important difference. When we introduce the squared variable in the model we inadvertently also introduce strong dependence between the terms which can cause significant numerical problems when it comes time to calculate the parameter estimates. Therefore, we should usually rescale the independent variable to have mean zero (and even variance one if we wish) before fitting the model. That is, we replace the (x_{i})'s with (x_{i}-\overline{x}) (or ((x_{i}-\overline{x})/s)) before fitting the model[^mlr-4].

[^mlr-4]: Rescaling the data gets the job done but a better way to avoid the multicollinearity introduced by the higher order terms is with orthogonal polynomials, whose coefficients are chosen just right so that the polynomials are not correlated with each other. This is beginning to linger outside the scope of this book, however, so we will content ourselves with a brief mention and then stick with the rescaling approach in the discussion that follows. A nice example of orthogonal polynomials in action can be run with example(cars).
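
To see the dependence that the preceding discussion warns about (a quick illustration added here), compare the correlation between `Girth` and its square before and after centering.

```r
# Correlation between a variable and its square, raw versus centered
with(trees, cor(Girth, Girth^2))                                 # very close to 1
with(trees, cor(Girth - mean(Girth), (Girth - mean(Girth))^2))   # noticeably weaker
```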

#### How to do it with R

There are multiple ways to fit a quadratic model to the variables Volume and Girth using R.

  1. One way would be to square the values for Girth and save them in a vector Girthsq. Next, fit the linear model Volume ~ Girth + Girthsq.
  2. A second way would be to use the insulate function in R, denoted by `I`: `Volume ~ Girth + I(Girth^2)`. The second method is shorter than the first but the end result is the same. And once we calculate and store the fitted model (in, say, `treesquad.lm`) all of the previous comments regarding R apply.
  3. A third and "right" way to do it is with orthogonal polynomials: `Volume ~ poly(Girth, degree = 2)`. See `?poly` and `?cars` for more information. Note that we can recover the approach in 2 with `poly(Girth, degree = 2, raw = TRUE)`.

\bigskip

We will fit the quadratic model to the `trees` data and display the
results with `summary`, being careful to rescale the data before
fitting the model. We may rescale the `Girth` variable to have zero
mean and unit variance on-the-fly with the `scale` function.

```r
treesquad.lm <- lm(Volume ~ scale(Girth) + I(scale(Girth)^2), 
                   data = trees)
summary(treesquad.lm)
```

We see that the (F) statistic indicates the overall model including Girth and Girth^2 is significant. Further, there is strong evidence that both Girth and Girth^2 are significantly related to Volume. We may examine a scatterplot together with the fitted quadratic function using the lines function, which adds a line to the plot tracing the estimated mean response.

```r
plot(Volume ~ scale(Girth), data = trees)
lines(fitted(treesquad.lm) ~ scale(Girth), data = trees)
```

(ref:cap-fitting-the-quadratic) \small A quadratic model for the trees data.

The plot is shown in Figure \@ref(fig:fitting-the-quadratic). Pay attention to the scale on the (x)-axis: it is on the scale of the transformed Girth data and not on the original scale.

\bigskip

```{block, type="rem"}
When a model includes a quadratic term for an independent variable, it is customary to also include the linear term in the model. The principle is called parsimony. More generally, if the researcher decides to include \(x^{m}\) as a term in the model, then (s)he should also include all lower order terms \(x\), \(x^{2}\), ..., \(x^{m-1}\) in the model.
```

We do estimation/prediction the same way that we did in Section
\@ref(sub-mlr-point-est-regsurface), except we do not need a `Height`
column in the data frame `new` since the variable is not included in
the quadratic model.

```r
new <- data.frame(Girth = c(9.1, 11.6, 12.5))
predict(treesquad.lm, newdata = new, interval = "prediction")
```

The predictions and intervals are slightly different from what they were previously. Notice that it was not necessary to rescale the Girth prediction data before input to the predict function; the model did the rescaling for us automatically.

\bigskip

```{block, type="rem"}
We have mentioned on several occasions that it is important to rescale the explanatory variables for polynomial regression. Watch what happens if we ignore this advice:
```

```r
summary(lm(Volume ~ Girth + I(Girth^2), data = trees))
```

Now nothing is significant in the model except Girth^2. We could delete the Intercept and Girth from the model, but the model would no longer be parsimonious. A novice may see the output and be confused about how to proceed, while the seasoned statistician recognizes immediately that Girth and Girth^2 are highly correlated (see Section \@ref(sub-multicollinearity)). The only remedy to this ailment is to rescale Girth, which we should have done in the first place.

We will return to this ill-conditioned fit in Example \@ref(ex:mlr-trees-poly-no-rescale) of Section \@ref(sec-partial-f-statistic).

Note: The `trees` data do not have any qualitative explanatory variables, so we will construct one for illustrative purposes[^mlr-5]. We will leave the `Girth` variable alone, but we will replace the variable `Height` by a new variable `Tall` which indicates whether or not the cherry tree is taller than a certain threshold (which for the sake of argument will be the sample median height of 76 ft). That is, `Tall` will be defined by
\begin{equation}
\mathtt{Tall} = \begin{cases} \mathtt{yes}, & \mbox{if }\mathtt{Height} > 76,\\ \mathtt{no}, & \mbox{if }\mathtt{Height}\leq 76. \end{cases}
\end{equation}

We can construct Tall very quickly in R with the cut function:

```r
trees$Tall <- cut(trees$Height, breaks = c(-Inf, 76, Inf), 
                  labels = c("no","yes"))
trees$Tall[1:5]
```

Note that Tall is automatically generated to be a factor with the labels in the correct order. See ?cut for more.

[^mlr-5]: This procedure of replacing a continuous variable by a discrete/qualitative one is called binning, and is almost never the right thing to do. We are in a bind at this point, however, because we have invested this chapter in the trees data and I do not want to switch mid-discussion. I am currently searching for a data set with pre-existing qualitative variables that also conveys the same points present in the trees data, and when I find it I will update this chapter accordingly.

Once we have `Tall`, we include it in the regression model just like we would any other variable. It is handled internally in a special way. Define a "dummy variable" `Tallyes` that takes values
\begin{equation}
\mathtt{Tallyes} = \begin{cases} 1, & \mbox{if }\mathtt{Tall}=\mathtt{yes},\\ 0, & \mbox{otherwise.} \end{cases}
\end{equation}
That is, `Tallyes` is an indicator variable which indicates when a respective tree is tall. The model may now be written as
\begin{equation}
\mathtt{Volume}=\beta_{0}+\beta_{1}\mathtt{Girth}+\beta_{2}\mathtt{Tallyes}+\epsilon.
\end{equation}
Let us take a look at what this definition does to the mean response. Trees with `Tall = yes` will have the mean response
\begin{equation}
\mu(\mathtt{Girth})=(\beta_{0}+\beta_{2})+\beta_{1}\mathtt{Girth},
\end{equation}
while trees with `Tall = no` will have the mean response
\begin{equation}
\mu(\mathtt{Girth})=\beta_{0}+\beta_{1}\mathtt{Girth}.
\end{equation}
In essence, we are fitting two regression lines: one for tall trees, and one for short trees. The regression lines have the same slope but they have different \(y\)-intercepts (which are exactly \(|\beta_{2}|\) far apart).

#### How to do it with R

The important thing is to double check that the qualitative variable in question is stored as a factor. The way to check is with the class command. For example,

```r
class(trees$Tall)
```

If the qualitative variable is not yet stored as a factor then we may convert it to one with the factor command. See Section \@ref(sub-qualitative-data). Other than this we perform MLR as we normally would.

```r
treesdummy.lm <- lm(Volume ~ Girth + Tall, data = trees)
summary(treesdummy.lm)
```

From the output we see that all parameter estimates are statistically significant and we conclude that the mean response differs for trees with Tall = yes and trees with Tall = no.
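
As a small follow-up (added for illustration), the estimated intercepts of the two parallel lines described above can be read off directly from the coefficients.

```r
# Estimated intercepts for short (Tall = no) and tall (Tall = yes) trees
b <- coef(treesdummy.lm)
c(short = unname(b["(Intercept)"]),
  tall  = unname(b["(Intercept)"] + b["Tallyes"]))   # shifted by the dummy coefficient
```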

\bigskip

```{block2, type="rem"}
We were somewhat disingenuous when we defined the dummy variable `Tallyes` because, in truth, R defines `Tallyes` automatically without input from the user[^mlr-6]. Indeed, the author fit the model beforehand and wrote the discussion afterward with the knowledge of what R would do so that the output the reader saw would match what (s)he had previously read. The way that R handles factors internally is part of a much larger topic concerning contrasts, which falls outside the scope of this book. The interested reader should see Neter et al [@Neter1996] or Fox [@Fox1997] for more.
```

[^mlr-6]: That is, R by default handles contrasts
according to its internal settings which may be customized by the user
for fine control. Given that we will not investigate contrasts further
in this book it does not serve the discussion to delve into those
settings, either. The interested reader should check `?contrasts` for
details.

\bigskip

```{block, type="rem"}
In general, if an explanatory variable `foo` is qualitative with \(n\)
levels `bar1`, `bar2`, ..., `barn` then R will by default
automatically define \(n-1\) indicator variables in the following way:
\begin{eqnarray*} \mathtt{foobar2} & = & \begin{cases} 1, & \mbox{if }\mathtt{foo}=\mathtt{"bar2"},\\ 0, & \mbox{otherwise.}\end{cases},\,\ldots,\,\mathtt{foobarn}=\begin{cases} 1, & \mbox{if }\mathtt{foo}=\mathtt{"barn"},\\ 0, & \mbox{otherwise.}\end{cases} \end{eqnarray*}
The level `bar1` is represented by
\(\mathtt{foobar2}=\cdots=\mathtt{foobarn}=0\). We just need to make
sure that `foo` is stored as a factor and R will take
care of the rest.
```

#### Graphing the Regression Lines

We can see a plot of the two regression lines with the following mouthful of code.

```r
treesTall <- split(trees, trees$Tall)
treesTall[["yes"]]$Fit <- predict(treesdummy.lm, treesTall[["yes"]])
treesTall[["no"]]$Fit <- predict(treesdummy.lm, treesTall[["no"]])
plot(Volume ~ Girth, data = trees, type = "n")
points(Volume ~ Girth, data = treesTall[["yes"]], pch = 1)
points(Volume ~ Girth, data = treesTall[["no"]], pch = 2)
lines(Fit ~ Girth, data = treesTall[["yes"]])
lines(Fit ~ Girth, data = treesTall[["no"]])
```

(ref:cap-dummy-variable-trees) \small A dummy variable model for the trees data.

It may look intimidating but there is reason to the madness. First we split the `trees` data into two pieces, with groups determined by the `Tall` variable. Next we add the fitted values to each piece via `predict`. Then we set up a plot for the variables `Volume` versus `Girth`, but we do not plot anything yet (`type = "n"`) because we want to use different symbols for the two groups. Next we add points to the plot for the `Tall = yes` trees and use an open circle for a plot character (`pch = 1`), followed by points for the `Tall = no` trees with a triangle character (`pch = 2`). Finally, we add regression lines to the plot, one for each group.

There are other -- shorter -- ways to plot regression lines by groups, namely the scatterplot function in the car package [@car] and the xyplot function in the lattice package [@lattice]. We elected to introduce the reader to the above approach since many advanced plots in R are done in a similar, consecutive fashion.

## Partial F Statistic {#sec-partial-f-statistic}

We saw in Section \@ref(sub-mlr-overall-f-test) how to test \(H_{0}:\beta_{1}=\beta_{2}=\cdots=\beta_{p}=0\) with the overall \(F\) statistic and we saw in Section \@ref(sub-mlr-students-t-tests) how to test \(H_{0}:\beta_{i}=0\), that is, that a particular coefficient \(\beta_{i}\) is zero. Sometimes, however, we would like to test whether a certain part of the model is significant. Consider the regression model
\begin{equation}
Y=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{j}x_{j}+\beta_{j+1}x_{j+1}+\cdots+\beta_{p}x_{p}+\epsilon,
\end{equation}
where \(j\geq1\) and \(p\geq2\). Now we wish to test the hypothesis
\begin{equation}
H_{0}:\beta_{j+1}=\beta_{j+2}=\cdots=\beta_{p}=0
\end{equation}
versus the alternative
\begin{equation}
H_{1}:\mbox{at least one of }\beta_{j+1},\ \beta_{j+2},\ \ldots,\ \beta_{p}\neq0.
\end{equation}

The interpretation of \(H_{0}\) is that none of the variables \(x_{j+1}\), ..., \(x_{p}\) is significantly related to \(Y\) and the interpretation of \(H_{1}\) is that at least one of \(x_{j+1}\), ..., \(x_{p}\) is significantly related to \(Y\). In essence, for this hypothesis test there are two competing models under consideration:
\begin{align}
\mbox{the full model:} & \quad y=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}+\epsilon,\\
\mbox{the reduced model:} & \quad y=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{j}x_{j}+\epsilon.
\end{align}

Of course, the full model will always explain the data better than the reduced model, but does the full model explain the data significantly better than the reduced model? This question is exactly what the partial (F) statistic is designed to answer.

We first calculate (SSE_{f}), the unexplained variation in the full model, and (SSE_{r}), the unexplained variation in the reduced model. We base our test on the difference (SSE_{r}-SSE_{f}) which measures the reduction in unexplained variation attributable to the variables (x_{j+1}), ..., (x_{p}). In the full model there are (p+1) parameters and in the reduced model there are (j+1) parameters, which gives a difference of (p-j) parameters (hence degrees of freedom). The partial F statistic is \begin{equation} F=\frac{(SSE_{r}-SSE_{f})/(p-j)}{SSE_{f}/(n-p-1)}. \end{equation} It can be shown when the regression assumptions hold under (H_{0}) that the partial (F) statistic has an (\mathsf{f}(\mathtt{df1}=p-j,\,\mathtt{df2}=n-p-1)) distribution. We calculate the (p)-value of the observed partial (F) statistic and reject (H_{0}) if the (p)-value is small.

### How to do it with R

The key ingredient above is that the two competing models are nested in the sense that the reduced model is entirely contained within the complete model. The way to test whether the improvement is significant is to compute lm objects both for the complete model and the reduced model then compare the answers with the anova function.

\bigskip

```{example, mlr-trees-poly-no-rescale, name="The Trees data"}
For the `trees` data, let us fit a polynomial regression model and for the sake of argument we will ignore our own good advice and fail to rescale the explanatory variables.
```

```r
treesfull.lm <- lm(Volume ~ Girth + I(Girth^2) + Height + 
                   I(Height^2), data = trees)
summary(treesfull.lm)
```

In this ill-formed model nothing is significant except `Girth` and `Girth^2`. Let us continue down this path and suppose that we would like to try a reduced model which contains nothing but `Girth` and `Girth^2` (not even an `Intercept`). Our two models are now
\begin{align}
\mbox{the full model:} & \quad Y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{1}^{2}+\beta_{3}x_{2}+\beta_{4}x_{2}^{2}+\epsilon,\\
\mbox{the reduced model:} & \quad Y=\beta_{1}x_{1}+\beta_{2}x_{1}^{2}+\epsilon.
\end{align}
We fit the reduced model with `lm` and store the results:

```r
treesreduced.lm <- lm(Volume ~ -1 + Girth + I(Girth^2), data = trees)
```

To delete the intercept from the model we used -1 in the model formula. Next we compare the two models with the anova function. The convention is to list the models from smallest to largest.

```r
anova(treesreduced.lm, treesfull.lm)
```

We see from the output that the complete model is highly significant compared to the model that does not incorporate Height or the Intercept. We wonder (with our tongue in our cheek) if the Height^2 term in the full model is causing all of the trouble. We will fit an alternative reduced model that only deletes Height^2.

```r
treesreduced2.lm <- lm(Volume ~ Girth + I(Girth^2) + Height, 
                       data = trees)
anova(treesreduced2.lm, treesfull.lm)
```

In this case, the improvement to the reduced model that is attributable to Height^2 is not significant, so we can delete Height^2 from the model with a clear conscience. We notice that the p-value for this latest partial (F) test is 0.8865, which seems to be remarkably close to the p-value we saw for the univariate t test of Height^2 at the beginning of this example. In fact, the p-values are exactly the same. Perhaps now we gain some insight into the true meaning of the univariate tests.
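
To tie the `anova` output back to the partial \(F\) formula (a sketch added for illustration, reusing the two fits above), the statistic for this last comparison can be computed from the residual sums of squares directly.

```r
# Partial F statistic comparing treesreduced2.lm (reduced) with treesfull.lm (full)
SSEr <- deviance(treesreduced2.lm)   # unexplained variation, reduced model
SSEf <- deviance(treesfull.lm)       # unexplained variation, full model
n <- nrow(trees)
Fpart <- ((SSEr - SSEf) / 1) / (SSEf / (n - 5))   # p - j = 1; the full model has p + 1 = 5 parameters
c(F = Fpart, p.value = pf(Fpart, df1 = 1, df2 = n - 5, lower.tail = FALSE))
```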

## Residual Analysis and Diagnostic Tools {#sec-residual-analysis-mlr}

We encountered many, many diagnostic measures for simple linear regression in Sections \@ref(sec-residual-analysis-slr) and \@ref(sec-other-diagnostic-tools-slr). All of these are valid in multiple linear regression, too, but there are some slight changes that we need to make for the multivariate case. We list these below, and apply them to the trees example.

## Additional Topics {#sec-additional-topics-mlr}

### Nonlinear Regression

We spent the entire chapter talking about the `trees` data, and all of our models looked like `Volume ~ Girth + Height` or a variant of this model. But let us think again: we know from elementary school that the volume of a rectangular box is \(V=lwh\) and the volume of a cylinder (which is closer to what a black cherry tree looks like) is
\begin{equation}
V=\pi r^{2}h\quad \mbox{or}\quad V=\frac{\pi}{4}\, d^{2}h,
\end{equation}
where \(r\) and \(d\) represent the radius and diameter of the tree, respectively. With this in mind, it would seem that a more appropriate model for \(\mu\) might be
\begin{equation}
\label{eq-trees-nonlin-reg}
\mu(x_{1},x_{2})=\beta_{0}x_{1}^{\beta_{1}}x_{2}^{\beta_{2}},
\end{equation}
where \(\beta_{1}\) and \(\beta_{2}\) are parameters to adjust for the fact that a black cherry tree is not a perfect cylinder.

How can we fit this model? The model is not linear in the parameters any more, so our linear regression methods will not work... or will they? In the trees example we may take the logarithm of both sides of Equation \eqref{eq-trees-nonlin-reg} to get \begin{equation} \mu^{\ast}(x_{1},x_{2})=\ln\left[\mu(x_{1},x_{2})\right]=\ln\beta_{0}+\beta_{1}\ln x_{1}+\beta_{2}\ln x_{2}, \end{equation} and this new model (\mu^{\ast}) is linear in the parameters (\beta_{0}^{\ast}=\ln\beta_{0}), (\beta_{1}^{\ast}=\beta_{1}) and (\beta_{2}^{\ast}=\beta_{2}). We can use what we have learned to fit a linear model log(Volume) ~ log(Girth) + log(Height), and everything will proceed as before, with one exception: we will need to be mindful when it comes time to make predictions because the model will have been fit on the log scale, and we will need to transform our predictions back to the original scale (by exponentiating with exp) to make sense.

```r
treesNonlin.lm <- lm(log(Volume) ~ log(Girth) + 
                       log(Height), data = trees)
summary(treesNonlin.lm)
```

This is our best model yet (judging by (R^{2}) and (\overline{R}^{2})), all of the parameters are significant, it is simpler than the quadratic or interaction models, and it even makes theoretical sense. It rarely gets any better than that.

We may get confidence intervals for the parameters, but remember that it is usually better to transform back to the original scale for interpretation purposes:

```r
exp(confint(treesNonlin.lm))
```

(Note that we did not update the row labels of the matrix to show that we exponentiated and so they are misleading as written.) We do predictions just as before. Remember to transform the response variable back to the original scale after prediction.

```r
new <- data.frame(Girth = c(9.1, 11.6, 12.5), 
                  Height = c(69, 74, 87))
exp(predict(treesNonlin.lm, newdata = new, 
            interval = "confidence"))
```

The predictions and intervals are slightly different from those calculated earlier, but they are close. Note that we did not need to transform the Girth and Height arguments in the dataframe new. All transformations are done for us automatically.

### Real Nonlinear Regression

We saw with the trees data that a nonlinear model might be more appropriate for the data based on theoretical considerations, and we were lucky because the functional form of (\mu) allowed us to take logarithms to transform the nonlinear model to a linear one. The same trick will not work in other circumstances, however. We need techniques to fit general models of the form \begin{equation} \mathbf{Y}=\mu(\mathbf{X})+\epsilon, \end{equation} where (\mu) is some crazy function that does not lend itself to linear transformations.

There are a host of methods to address problems like these which are studied in advanced regression classes. The interested reader should see Neter et al [@Neter1996] or Tabachnick and Fidell [@Tabachnick2006].

It turns out that John Fox has posted an Appendix to his book [@Fox2002] which discusses some of the methods and issues associated with nonlinear regression; see http://cran.r-project.org/doc/contrib/Fox-Companion/appendix.html for more. Here is an example of how it works, based on a question from R-help.

```r
# fake data
set.seed(1)
x <- seq(from = 0, to = 1000, length.out = 200)
y <- 1 + 2*(sin((2*pi*x/360) - 3))^2 + rnorm(200, sd = 2)
# plot(x, y)
acc.nls <- nls(y ~ a + b*(sin((2*pi*x/360) - c))^2, 
               start = list(a = 0.9, b = 2.3, c = 2.9))
summary(acc.nls)
# plot(x, fitted(acc.nls))
```

### Multicollinearity {#sub-multicollinearity}

A multiple regression model exhibits multicollinearity when two or more of the explanatory variables are substantially correlated with each other. We can measure multicollinearity by having one of the explanatory variables play the role of "dependent variable" and regressing it on the remaining explanatory variables. If the \(R^{2}\) of the resulting model is near one, then we say that the model is multicollinear or shows multicollinearity.

Multicollinearity is a problem because it causes instability in the regression model. The instability is a consequence of redundancy in the explanatory variables: a high (R^{2}) indicates a strong dependence between the selected independent variable and the others. The redundant information inflates the variance of the parameter estimates which can cause them to be statistically insignificant when they would have been significant otherwise. To wit, multicollinearity is usually measured by what are called variance inflation factors.
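
As a concrete illustration (a sketch added here; the `vif` function in the **car** package computes the same quantity), the variance inflation factor for `Girth` in the `trees` model can be obtained by regressing `Girth` on the remaining explanatory variable(s).

```r
# Variance inflation factor for Girth: VIF = 1/(1 - R^2), where R^2 comes from
# regressing Girth on the other explanatory variables (here just Height)
r2 <- summary(lm(Girth ~ Height, data = trees))$r.squared
1 / (1 - r2)
```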

Once multicollinearity has been diagnosed there are several approaches to remediate it.

We decided to omit a thorough discussion of multicollinearity because we are not equipped to handle the mathematical details. Perhaps the topic will receive more attention in a later edition.

### Akaike's Information Criterion

\[
AIC = -2\ln L + 2(p + 1)
\]
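
As a brief illustration (added here; it assumes the fits from earlier in the chapter are still in the workspace), R's built-in `AIC` function reports this criterion for `lm` fits and can be used to compare candidate models with the same response. Note that R counts the error variance \(\sigma^{2}\) among the estimated parameters, so its penalty term is \(2(p+2)\) rather than \(2(p+1)\); the difference is a constant and does not affect comparisons between models fit to the same data.

```r
# Smaller AIC indicates a better trade-off between fit and complexity
AIC(trees.lm)       # Volume ~ Girth + Height
AIC(treesdummy.lm)  # Volume ~ Girth + Tall
-2 * as.numeric(logLik(trees.lm)) + 2 * attr(logLik(trees.lm), "df")   # how R computes AIC
```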

## Exercises

```{exercise, xca-anova-equality}
Use Equations \eqref{eq-mlr-sse-matrix}, \eqref{eq-mlr-ssto-matrix}, and \eqref{eq-mlr-ssr-matrix} to prove the Anova Equality:
\[
SSTO = SSE + SSR.
\]
```


