# IPSUR: Introduction to Probability and Statistics Using R # Copyright (C) 2018 G. Jay Kerns # # Chapter: Multiple Linear Regression # # This file is part of IPSUR. # # IPSUR is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # IPSUR is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with IPSUR. If not, see <http://www.gnu.org/licenses/>.
# This chapter's package dependencies library(scatterplot3d) library(lattice)
We know a lot about simple linear regression models, and a next step is to study multiple regression models that have more than one independent (explanatory) variable. In the discussion that follows we will assume that we have (p) explanatory variables, where (p > 1).
The language is phrased in matrix terms -- for two reasons. First, it is quicker to write and (arguably) more pleasant to read. Second, the matrix approach will be required for later study of the subject; the reader might as well be introduced to it now.
Most of the results are stated without proof or with only a cursory justification. Those yearning for more should consult an advanced text in linear regression for details, such as Applied Linear Regression Models [@Neter1996] or Linear Models: Least Squares and Alternatives [@Rao1999].
What do I want them to know?
The first thing to do is get some better notation. We will write \begin{equation} \mathbf{Y}{\mathrm{n}\times1}= \begin{bmatrix}y{1}\ y_{2}\ \vdots\ y_{n} \end{bmatrix}, \quad \mbox{and}\quad \mathbf{X}{\mathrm{n}\times(\mathrm{p}+1)}= \begin{bmatrix}1 & x{11} & x_{21} & \cdots & x_{p1}\ 1 & x_{12} & x_{22} & \cdots & x_{p2}\ \vdots & \vdots & \vdots & \ddots & \vdots\ 1 & x_{1n} & x_{2n} & \cdots & x_{pn} \end{bmatrix}. \end{equation} The vector (\mathbf{Y}) is called the response vector \index{response vector} and the matrix (\mathbf{X}) is called the model matrix \index{model matrix}. As in Chapter \@ref(cha-simple-linear-regression), the most general assumption that relates (\mathbf{Y}) to (\mathbf{X}) is \begin{equation} \mathbf{Y}=\mu(\mathbf{X})+\upepsilon, \end{equation} where (\mu) is some function (the signal) and (\upepsilon) is the noise (everything else). We usually impose some structure on (\mu) and (\upepsilon). In particular, the standard multiple linear regression model \index{model!multiple linear regression} assumes \begin{equation} \mathbf{Y}=\mathbf{X}\upbeta+\upepsilon, \end{equation} where the parameter vector (\upbeta) looks like \begin{equation} \upbeta_{(\mathrm{p}+1)\times1}=\begin{bmatrix}\beta_{0} & \beta_{1} & \cdots & \beta_{p}\end{bmatrix}^{\mathrm{T}}, \end{equation} and the random vector (\upepsilon_{\mathrm{n}\times1}=\begin{bmatrix}\epsilon_{1} & \epsilon_{2} & \cdots & \epsilon_{n}\end{bmatrix}^{\mathrm{T}}) is assumed to be distributed \begin{equation} \upepsilon\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{0}{\mathrm{n}\times1},\,\mathtt{sigma}=\sigma^{2}\mathbf{I}{\mathrm{n}\times\mathrm{n}}\right). \end{equation}
The assumption on (\upepsilon) is equivalent to the assumption that (\epsilon_{1}), (\epsilon_{2}), ..., (\epsilon_{n}) are IID (\mathsf{norm}(\mathtt{mean}=0,\,\mathtt{sd}=\sigma)). It is a linear model because the quantity (\mu(\mathbf{X})=\mathbf{X}\upbeta) is linear in the parameters (\beta_{0}), (\beta_{1}), ..., (\beta_{p}). It may be helpful to see the model in expanded form; the above matrix formulation is equivalent to the more lengthy \begin{equation} Y_{i}=\beta_{0}+\beta_{1}x_{1i}+\beta_{2}x_{2i}+\cdots+\beta_{p}x_{pi}+\epsilon_{i},\quad i=1,2,\ldots,n. \end{equation}
\bigskip
```{example, name="Girth, Height, and Volume for Black Cherry trees"} \index{Data sets!trees@\texttt{trees}} Measurements were made of the girth, height, and volume of timber in 31 felled black cherry trees. Note that girth is the diameter of the tree (in inches) measured at 4 ft 6 in above the ground. The variables are
1. `Girth`: tree diameter in inches (denoted \(x_{1}\)) 2. `Height`: tree height in feet (\(x_{2}\)). 3. `Volume`: volume of the tree in cubic feet. (\(y\)) The data are in the `datasets` package [@datasets] and are already on the search path; they can be viewed with ```r head(trees)
Let us take a look at a visual display of the data. For multiple
variables, instead of a simple scatterplot we use a scatterplot matrix
which is made with the splom
function in the lattice
package
[@lattice] as shown below. The plot is shown in Figure
\@ref(fig:splom-trees).
splom(trees)
(ref:cap-splom-trees) \small A scatterplot matrix of the trees
data.
The dependent (response) variable Volume
is listed in the first row
of the scatterplot matrix. Moving from left to right, we see an
approximately linear relationship between Volume
and the independent
(explanatory) variables Height
and Girth
. A first guess at a model
for these data might be
\begin{equation}
Y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\epsilon,
\end{equation}
in which case the quantity
(\mu(x_{1},x_{2})=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}) would
represent the mean value of (Y) at the point ((x_{1},x_{2})).
The interpretation is simple. The intercept (\beta_{0}) represents
the mean Volume
when all other independent variables are zero. The
parameter (\beta_{i}) represents the change in mean Volume
when
there is a unit increase in (x_{i}), while the other independent
variable is held constant. For the trees
data, (\beta_{1})
represents the change in average Volume
as Girth
increases by one
unit when the Height
is held constant, and (\beta_{2}) represents
the change in average Volume
as Height
increases by one unit when
the Girth
is held constant.
In simple linear regression, we had one independent variable and our linear regression surface was 1D, simply a line. In multiple regression there are many independent variables and so our linear regression surface will be many-D... in general, a hyperplane. But when there are only two explanatory variables the hyperplane is just an ordinary plane and we can look at it with a 3D scatterplot.
One way to do this is with the R Commander in the Rcmdr
package [@Rcmdr]. It has a 3D scatterplot option under the
Graphs
menu. It is especially great because the resulting graph is
dynamic; it can be moved around with the mouse, zoomed, etc. But
that particular display does not translate well to a printed book.
Another way to do it is with the scatterplot3d
function in the
scatterplot3d
package [@scatterplot3d]. The code follows, and
the result is shown in Figure \@ref(fig:3D-scatterplot-trees).
s3d <- with(trees, scatterplot3d(Girth, Height, Volume, pch = 16, highlight.3d = TRUE, angle = 60))
(ref:cap-3D-scatterplot-trees) \small A 3D scatterplot with regression plane for the trees
data.
Looking at the graph we see that the data points fall close to a plane in three dimensional space. (The plot looks remarkably good. In the author's experience it is rare to see points fit so well to the plane without some additional work.)
We will proceed exactly like we did in Section \@ref(sec-slr-estimation). We know \begin{equation} \upepsilon\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{0}{\mathrm{n}\times1},\,\mathtt{sigma}=\sigma^{2}\mathbf{I}{\mathrm{n}\times\mathrm{n}}\right), \end{equation} which means that (\mathbf{Y}=\mathbf{X}\upbeta+\upepsilon) has an (\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{X}\upbeta,\,\mathtt{sigma}=\sigma^{2}\mathbf{I}_{\mathrm{n}\times\mathrm{n}}\right)) distribution. Therefore, the likelihood function \index{likelihood function} is \begin{equation} L(\upbeta,\sigma)=\frac{1}{2\pi^{n/2}\sigma}\exp\left{ -\frac{1}{2\sigma^{2}}\left(\mathbf{Y}-\mathbf{X}\upbeta\right)^{\mathrm{T}}\left(\mathbf{Y}-\mathbf{X}\upbeta\right)\right}. \end{equation}
To maximize the likelihood \index{maximum likelihood} in (\upbeta), we need to minimize the quantity (g(\upbeta)=\left(\mathbf{Y}-\mathbf{X}\upbeta\right)^{\mathrm{T}}\left(\mathbf{Y}-\mathbf{X}\upbeta\right)). We do this by differentiating (g) with respect to (\upbeta). (It may be a good idea to brush up on the material in Appendices \@ref(sec-linear-algebra) and \@ref(sec-multivariable-calculus).) First we will rewrite (g): \begin{equation} g(\upbeta)=\mathbf{Y}^{\mathrm{T}}\mathbf{Y}-\mathbf{Y}^{\mathrm{T}}\mathbf{X}\upbeta-\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{Y}+\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta, \end{equation} which can be further simplified to (g(\upbeta)=\mathbf{Y}^{\mathrm{T}}\mathbf{Y}-2\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{Y}+\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta) since (\upbeta^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{Y}) is (1\times1) and thus equal to its transpose. Now we differentiate to get \begin{equation} \frac{\partial g}{\partial\upbeta}=\mathbf{0}-2\mathbf{X}^{\mathrm{T}}\mathbf{Y}+2\mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta, \end{equation} since (\mathbf{X}^{\mathrm{T}}\mathbf{X}) is symmetric. Setting the derivative equal to the zero vector yields the so called "normal equations" \index{normal equations} \begin{equation} \mathbf{X}^{\mathrm{T}}\mathbf{X}\upbeta=\mathbf{X}^{\mathrm{T}}\mathbf{Y}. \end{equation}
In the case that (\mathbf{X}^{\mathrm{T}}\mathbf{X}) is invertible[^mlr-1], we may solve the equation for (\upbeta) to get the maximum likelihood estimator of (\upbeta) which we denote by (\mathbf{b}): \begin{equation} \label{eq-b-formula-matrix} \mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}. \end{equation}
[^mlr-1]: We can find solutions of the normal equations even when (\mathbf{X}^{\mathrm{T}}\mathbf{X}) is not of full rank, but the topic falls outside the scope of this book. The interested reader can consult an advanced text such as Rao [@Rao1999].
\bigskip
```{block, type="rem"} The formula in Equation \eqref{eq-b-formula-matrix} is convenient for mathematical study but is inconvenient for numerical computation. Researchers have devised much more efficient algorithms for the actual calculation of the parameter estimates, and we do not explore them here.
\bigskip ```{block2, type="rem"} We have only found a critical value, and have not actually shown that the critical value is a minimum. We omit the details and refer the interested reader to [@Rao1999].
We do all of the above just as we would in simple linear
regression. The powerhouse is the lm
\index{lm@\texttt{lm}}
function. Everything else is based on it. We separate explanatory
variables in the model formula by a plus sign.
fit <- lm(Volume ~ Girth + Height, data = trees)
trees.lm <- lm(Volume ~ Girth + Height, data = trees) trees.lm
We see from the output that for the trees
data our parameter
estimates are [ \mathbf{b}=\begin{bmatrix}-58.0 & 4.7 &
0.3\end{bmatrix}, ] and consequently our estimate of the mean
response is (\hat{\mu}) given by
\begin{alignat}{1}
\hat{\mu}(x_{1},x_{2}) = & \ b_{0} + b_{1} x_{1} + b_{2}x_{2},\ \approx & -58.0 + 4.7 x_{1} + 0.3 x_{2}.
\end{alignat}
We could see the entire model matrix (\mathbf{X}) with the
model.matrix
\index{model.matrix@\texttt{model.matrix}}
function, but in the interest of brevity we only show the first few
rows.
head(model.matrix(trees.lm))
The parameter estimates (\mathbf{b}) make it easy to find the fitted values \index{fitted values}, (\hat{\mathbf{Y}}). We write them individually as (\hat{Y}{i}), (i=1,2,\ldots,n), and recall that they are defined by \begin{eqnarray} \hat{Y}{i} & = & \hat{\mu}(x_{1i},x_{2i}),\ & = & b_{0}+b_{1}x_{1i}+b_{2}x_{2i},\quad i=1,2,\ldots,n. \end{eqnarray} They are expressed more compactly by the matrix equation \begin{equation} \hat{\mathbf{Y}}=\mathbf{X}\mathbf{b}. \end{equation} From Equation \eqref{eq-b-formula-matrix} we know that (\mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}), so we can rewrite \begin{eqnarray} \hat{\mathbf{Y}} & = & \mathbf{X}\left[\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}\right],\ & = & \mathbf{H}\mathbf{Y}, \end{eqnarray} where (\mathbf{H}=\mathbf{X}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}) is appropriately named the hat matrix \index{hat matrix} because it "puts the hat on (\mathbf{Y})". The hat matrix is very important in later courses. Some facts about (\mathbf{H}) are
Now let us write a column vector
(\mathbf{x}{0}=(x{10},x_{20})^{\mathrm{T}}) to denote given values
of the explanatory variables Girth ==
(x_{10}) and Height ==
(x_{20}). These values may match those of the collected data, or
they may be completely new values not observed in the original data
set. We may use the parameter estimates to find
(\hat{Y}(\mathbf{x}{0})), which will give us
1. an estimate of (\mu(\mathbf{x}{0})), the mean value of a future
observation at (\mathbf{x}{0}), and
2. a prediction for (Y(\mathbf{x}{0})), the actual value of a
future observation at (\mathbf{x}_{0}).
We can represent (\hat{Y}(\mathbf{x}{0})) by the matrix equation \begin{equation} \label{eq-mlr-single-yhat-matrix} \hat{Y}(\mathbf{x}{0})=\mathbf{x}{0}^{\mathrm{T}}\mathbf{b}, \end{equation} which is just a fancy way to write \begin{equation} \hat{Y}(x{10},x_{20})=b_{0}+b_{1}x_{10}+b_{2}x_{20}. \end{equation}
\bigskip
If we wanted to predict the average volume of black cherry trees that have `Girth = 15` in and are `Height = 77` ft tall then we would use the estimate
\begin{alignat}{1} \hat{\mu}(15,\,77)= & -58+4.7(15)+0.3(77),\ \approx & 35.6\text{ft}^{3}. \end{alignat}
We would use the same estimate (\hat{Y}=35.6) to predict the
measured Volume
of another black cherry tree -- yet to be observed
-- that has Girth = 15
in and is Height = 77
ft tall.
The fitted values are stored inside trees.lm
and may be accessed
with the fitted
function. We only show the first five fitted values.
fitted(trees.lm)[1:5]
The syntax for general prediction does not change much from simple
linear regression. The computations are done with the predict
function as described below.
The only difference from SLR is in the way we tell R the
values of the explanatory variables for which we want predictions. In
SLR we had only one independent variable but in MLR we have many (for
the trees
data we have two). We will store values for the
independent variables in the data frame new
, which has two columns
(one for each independent variable) and three rows (we shall make
predictions at three different locations).
new <- data.frame(Girth = c(9.1, 11.6, 12.5), Height = c(69, 74, 87))
We can view the locations at which we will predict:
new
We continue just like we would have done in SLR.
predict(trees.lm, newdata = new)
treesFIT <- round(predict(trees.lm, newdata = new), 1)
Example. Using the trees
data,
Volume
of a tree of Girth
9.1 in and Height
69 ft. The fitted value for (x_{1}=9.1) and
(x_{2} = 69) is r treesFIT[1]
, so a point estimate
would be r treesFIT[1]
cubic feet.Volume
of a hypothetical tree of Girth
12.5 in and Height
87
ft. The fitted value for (x_{1} = 12.5) and (x_{2} = 87) is
r treesFIT[3]
, so a point prediction for the Volume
is r treesFIT[3]
cubic feet.The residuals are given by \begin{equation} \mathbf{E}=\mathbf{Y}-\hat{\mathbf{Y}}=\mathbf{Y}-\mathbf{H}\mathbf{Y}=(\mathbf{I}-\mathbf{H})\mathbf{Y}. \end{equation} Now we can use Theorem \@ref(thm:mvnorm-dist-matrix-prod) to see that the residuals are distributed \begin{equation} \mathbf{E}\sim\mathsf{mvnorm}(\mathtt{mean}=\mathbf{0},\,\mathtt{sigma}=\sigma^{2}(\mathbf{I}-\mathbf{H})), \end{equation} since ((\mathbf{I}-\mathbf{H})\mathbf{X}\upbeta=\mathbf{X}\upbeta-\mathbf{X}\upbeta=\mathbf{0}) and ((\mathbf{I}-\mathbf{H})\,(\sigma^{2}\mathbf{I})\,(\mathbf{I}-\mathbf{H})^{\mathrm{T}}=\sigma^{2}(\mathbf{I}-\mathbf{H})^{2}=\sigma^{2}(\mathbf{I}-\mathbf{H})). Thesum of squared errors (SSE) is just \begin{equation} SSE=\mathbf{E}^{\mathrm{T}}\mathbf{E}=\mathbf{Y}^{\mathrm{T}}(\mathbf{I}-\mathbf{H})(\mathbf{I}-\mathbf{H})\mathbf{Y}=\mathbf{Y}^{\mathrm{T}}(\mathbf{I}-\mathbf{H})\mathbf{Y}. \end{equation} Recall that in SLR we had two parameters ((\beta_{0}) and (\beta_{1})) in our regression model and we estimated (\sigma^{2}) with (s^{2}=SSE/(n-2)). In MLR, we have (p+1) parameters in our regression model and we might guess that to estimate (\sigma^{2}) we would use the mean square error (S^{2}) defined by \begin{equation} S^{2}=\frac{SSE}{n-(p+1)}. \end{equation} That would be a good guess. The residual standard error is (S=\sqrt{S^{2}}).
The residuals are also stored with trees.lm
and may be accessed with
the residuals
function. We only show the first five residuals.
residuals(trees.lm)[1:5]
The summary
function output (shown later) lists the Residual
Standard Error
which is just (S=\sqrt{S^{2}}). It is stored in the
sigma
component of the summary
object.
treesumry <- summary(trees.lm) treesumry$sigma
For the trees
data we find (s \approx) r round(treesumry$sigma,3)
.
We showed in Section \@ref(sub-mlr-parameter-estimates) that (\mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}), which is really just a big matrix -- namely (\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}) -- multiplied by (\mathbf{Y}). It stands to reason that the sampling distribution of (\mathbf{b}) would be intimately related to the distribution of (\mathbf{Y}), which we assumed to be \begin{equation} \mathbf{Y}\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{X}\upbeta,\,\mathtt{sigma}=\sigma^{2}\mathbf{I}\right). \end{equation} Now recall Theorem \@ref(thm:mvnorm-dist-matrix-prod) that we said we were going to need eventually (the time is now). That proposition guarantees that \begin{equation} \label{eq-distn-b-mlr} \mathbf{b}\sim\mathsf{mvnorm}\left(\mathtt{mean}=\upbeta,\,\mathtt{sigma}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\right), \end{equation} since \begin{equation} \mathbb{E}\mathbf{b}=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}(\mathbf{X}\upbeta)=\upbeta, \end{equation} and \begin{equation} \mbox{Var}(\mathbf{b})=\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathrm{T}}(\sigma^{2}\mathbf{I})\mathbf{X}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}, \end{equation} the first equality following because the matrix (\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}) is symmetric.
There is a lot that we can glean from Equation \eqref{eq-distn-b-mlr}. First, it follows that the estimator (\mathbf{b}) is unbiased (see Section \@ref(sec-point-estimation-1)). Second, the variances of (b_{0}), (b_{1}), ..., (b_{n}) are exactly the diagonal elements of (\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}), which is completely known except for that pesky parameter (\sigma^{2}). Third, we can estimate the standard error of (b_{i}) (denoted (S_{b_{i}})) with the mean square error (S) (defined in the previous section) multiplied by the corresponding diagonal element of (\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}). Finally, given estimates of the standard errors we may construct confidence intervals for (\beta_{i}) with an interval that looks like \begin{equation} b_{i}\pm\mathsf{t}{\alpha/2}(\mathtt{df}=n-p-1)S{b_{i}}. \end{equation} The degrees of freedom for the Student's (t) distribution[^mlr-2] are the same as the denominator of (S^{2}).
[^mlr-2]: We are taking great leaps over the mathematical details. In particular, we have yet to show that (s^{2}) has a chi-square distribution and we have not even come close to showing that (b_{i}) and (s_{b_{i}}) are independent. But these are entirely outside the scope of the present book and the reader may rest assured that the proofs await in later classes. See C.R. Rao for more.
To get confidence intervals for the parameters we need only use
confint
\index{confint@\texttt{confint}}:
confint(trees.lm)
treesPAR <- round(confint(trees.lm), 1)
For example, using the calculations above we say that for the
regression model Volume ~ Girth + Height
we are 95% confident that
the parameter (\beta_{1}) lies somewhere in the interval
r treesPAR[2,1]
to r treesPAR[2,2]
.
We saw in Section \@ref(sub-mlr-point-est-regsurface) how to make point estimates of the mean value of additional observations and predict values of future observations, but how good are our estimates? We need confidence and prediction intervals to gauge their accuracy, and lucky for us the formulas look similar to the ones we saw in SLR.
In Equation \eqref{eq-mlr-single-yhat-matrix} we wrote $\hat{Y}(\mathbf{x}0)=\mathbf{x}_0^{\mathrm{T}}\mathbf{b}$, and in Equation \eqref{eq-distn-b-mlr} we saw that \begin{equation} \mathbf{b}\sim\mathsf{mvnorm}\left(\mathtt{mean} = \upbeta,\,\mathtt{sigma}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\right). \end{equation} The following is therefore immediate from Theorem \@ref(thm:mvnorm-dist-matrix-prod): \begin{equation} \hat{Y}(\mathbf{x}{0})\sim\mathsf{mvnorm}\left(\mathtt{mean}=\mathbf{x}{0}^{\mathrm{T}}\upbeta,\,\mathtt{sigma}=\sigma^{2}\mathbf{x}{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}{0}\right). \end{equation} It should be no surprise that confidence intervals for the mean value of a future observation at the location (\mathbf{x}{0}=\begin{bmatrix}x_{10} & x_{20} & \ldots & x_{p0}\end{bmatrix}^{\mathrm{T}}) are given by \begin{equation} \hat{Y}(\mathbf{x}{0})\pm\mathsf{t}{\alpha/2}(\mathtt{df}=n-p-1)\, S\sqrt{\mathbf{x}{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}{0}}. \end{equation} Intuitively, (\mathbf{x}{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}{0}) measures the distance of (\mathbf{x}_{0}) from the center of the data. The degrees of freedom in the Student's (t) critical value are (n-(p+1)) because we need to estimate (p+1) parameters.
Prediction intervals for a new observation at (\mathbf{x}{0}) are given by \begin{equation} \hat{Y}(\mathbf{x}{0})\pm\mathsf{t}{\alpha/2}(\mathtt{df}=n-p-1)\, S\sqrt{1+\mathbf{x}{0}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\mathbf{x}_{0}}. \end{equation} The prediction intervals are wider than the confidence intervals, just as in Section \@ref(sub-slr-interval-est-regline).
The syntax is identical to that used in SLR, with the proviso that we
need to specify values of the independent variables in the data frame
new
as we did in Section \@ref(sub-slr-interval-est-regline) (which we
repeat here for illustration).
new <- data.frame(Girth = c(9.1, 11.6, 12.5), Height = c(69, 74, 87))
Confidence intervals are given by
predict(trees.lm, newdata = new, interval = "confidence")
treesCI <- round(predict(trees.lm, newdata = new, interval = "confidence"), 1)
Prediction intervals are given by
predict(trees.lm, newdata = new, interval = "prediction")
treesPI <- round(predict(trees.lm, newdata = new, interval = "prediction"), 1)
As before, the interval type is decided by the interval
argument and
the default confidence level is 95% (which can be changed with the
level
argument).
Example. Using the trees
data,
Volume
of a tree of
Girth
9.1 in and Height
69 ft. The 95% CI is given by
r treesCI[1,2]
to r treesCI[1,3]
, so with 95%
confidence the mean Volume
lies somewhere between
r treesCI[1, 2]
cubic feet and r treesCI[1,3]
cubic feet.Volume
of a hypothetical
tree of Girth
12.5 in and Height
87 ft. The 95% prediction
interval is given by r treesCI[3,2]
to
r treesCI[3,3]
, so with 95% confidence we may assert that
the hypothetical Volume
of a tree of Girth
12.5 in and Height
87 ft would lie somewhere between r treesCI[3, 2]
cubic feet and r treesCI[3,3]
feet.We saw in Section \@ref(sub-mlr-mse-se) that the error sum of squares (SSE) can be conveniently written in MLR as \begin{equation} \label{eq-mlr-sse-matrix} SSE=\mathbf{Y}^{\mathrm{T}}(\mathbf{I}-\mathbf{H})\mathbf{Y}. \end{equation} It turns out that there are equally convenient formulas for the total sum of squares (SSTO) and the regression sum of squares (SSR). They are: \begin{alignat}{1} \label{eq-mlr-ssto-matrix} SSTO= & \mathbf{Y}^{\mathrm{T}}\left(\mathbf{I}-\frac{1}{n}\mathbf{J}\right)\mathbf{Y} \end{alignat} and \begin{alignat}{1} \label{eq-mlr-ssr-matrix} SSR= & \mathbf{Y}^{\mathrm{T}}\left(\mathbf{H}-\frac{1}{n}\mathbf{J}\right)\mathbf{Y}. \end{alignat} (The matrix (\mathbf{J}) is defined in Appendix \@ref(sec-linear-algebra).) Immediately from Equations \eqref{eq-mlr-sse-matrix}, \eqref{eq-mlr-ssto-matrix}, and \eqref{eq-mlr-ssr-matrix} we get the Anova Equality \begin{equation} SSTO=SSE+SSR. \end{equation} (See Exercise \@ref(xca-anova-equality).) We define the multiple coefficient of determination by the formula \begin{equation} R^{2}=1-\frac{SSE}{SSTO}. \end{equation}
We interpret (R^{2}) as the proportion of total variation that is explained by the multiple regression model. In MLR we must be careful, however, because the value of (R^{2}) can be artificially inflated by the addition of explanatory variables to the model, regardless of whether or not the added variables are useful with respect to prediction of the response variable. In fact, it can be proved that the addition of a single explanatory variable to a regression model will increase the value of (R^{2}), no matter how worthless the explanatory variable is. We could model the height of the ocean tides, then add a variable for the length of cheetah tongues on the Serengeti plain, and our (R^{2}) would inevitably increase.
This is a problem, because as the philosopher, Occam, once said: "causes should not be multiplied beyond necessity". We address the problem by penalizing (R^{2}) when parameters are added to the model. The result is an adjusted (R^{2}) which we denote by (\overline{R}^{2}). \begin{equation} \overline{R}^{2}=\left(R^{2}-\frac{p}{n-1}\right)\left(\frac{n-1}{n-p-1}\right). \end{equation} It is good practice for the statistician to weigh both (R^{2}) and (\overline{R}^{2}) during assessment of model utility. In many cases their values will be very close to each other. If their values differ substantially, or if one changes dramatically when an explanatory variable is added, then (s)he should take a closer look at the explanatory variables in the model.
For the trees
data, we can get (R^{2}) and (\overline{R}^{2})
from the summary
output or access the values directly by name as
shown (recall that we stored the summary
object in treesumry
).
treesumry$r.squared
treesumry$adj.r.squared
High values of (R^{2}) and $\overline{R}^2$ such as these indicate that the model fits very well, which agrees with what we saw in Figure \@ref(fig:3D-scatterplot-trees).
Another way to assess the model's utility is to to test the hypothesis [ H_{0}:\beta_{1}=\beta_{2}=\cdots=\beta_{p}=0\mbox{ versus }H_{1}:\mbox{ at least one $\beta_{i}\neq0$}. ] The idea is that if all (\beta_{i})'s were zero, then the explanatory variables (X_{1},\ldots,X_{p}) would be worthless predictors for the response variable (Y). We can test the above hypothesis with the overall (F) statistic, which in MLR is defined by \begin{equation} F=\frac{SSR/p}{SSE/(n-p-1)}. \end{equation} When the regression assumptions hold and under (H_{0}), it can be shown that (F\sim\mathsf{f}(\mathtt{df1}=p,\,\mathtt{df2}=n-p-1)). We reject (H_{0}) when (F) is large, that is, when the explained variation is large relative to the unexplained variation.
The overall (F) statistic and its associated p-value is listed at
the bottom of the summary
output, or we can access it directly by
name; it is stored in the fstatistic
component of the summary
object.
treesumry$fstatistic
For the trees
data, we see that ( F = ) r treesumry$fstatistic[1]
with a p-value =< 2.2e-16
. Consequently
we reject (H_{0}), that is, the data provide strong evidence that
not all (\beta_{i})'s are zero.
We know that \begin{equation} \mathbf{b}\sim\mathsf{mvnorm}\left(\mathtt{mean}=\upbeta,\,\mathtt{sigma}=\sigma^{2}\left(\mathbf{X}^{\mathrm{T}}\mathbf{X}\right)^{-1}\right) \end{equation} and we have seen how to test the hypothesis (H_{0}:\beta_{1}=\beta_{2}=\cdots=\beta_{p}=0), but let us now consider the test \begin{equation} H_{0}:\beta_{i}=0\mbox{ versus }H_{1}:\beta_{i}\neq0, \end{equation} where (\beta_{i}) is the coefficient for the (i^{\textrm{th}}) independent variable. We test the hypothesis by calculating a statistic, examining it's null distribution, and rejecting (H_{0}) if the p-value is small. If (H_{0}) is rejected, then we conclude that there is a significant relationship between (Y) and (x_{i}) in the regression model (Y\sim(x_{1},\ldots,x_{p})). This last part of the sentence is very important because the significance of the variable (x_{i}) sometimes depends on the presence of other independent variables in the model[^mlr-3].
[^mlr-3]: In other words, a variable might be highly significant one moment but then fail to be significant when another variable is added to the model. When this happens it often indicates a problem with the explanatory variables, such as multicollinearity. See Section \ref{sub-Multicollinearity}.
To test the hypothesis we go to find the sampling distribution of ( b_{i} ), the estimator of the corresponding parameter (\beta_{i}), when the null hypothesis is true. We saw in Section \@ref(sub-mlr-interval-est-params) that \begin{equation} T_{i}=\frac{b_{i}-\beta_{i}}{S_{b_{i}}} \end{equation} has a Student's (t) distribution with (n-(p+1)) degrees of freedom. (Remember, we are estimating (p+1) parameters.) Consequently, under the null hypothesis (H_{0}:\beta_{i}=0) the statistic (t_{i}=b_{i}/S_{b_{i}}) has a (\mathsf{t}(\mathtt{df}=n-p-1)) distribution.
The Student's (t) tests for significance of the individual explanatory variables are shown in the summary
output.
treesumry
We see from the p-values that there is a significant linear
relationship between Volume
and Girth
and between Volume
and
Height
in the regression model Volume ~ Girth + Height
. Further,
it appears that the Intercept
is significant in the aforementioned
model.
In each of the previous sections we assumed that (\mu) was a linear function of the explanatory variables. For example, in SLR we assumed that (\mu(x)=\beta_{0}+\beta_{1}x), and in our previous MLR examples we assumed (\mu(x_{1},x_{2}) = \beta_{0}+\beta_{1}x_{1} + \beta_{2}x_{2}). In every case the scatterplots indicated that our assumption was reasonable. Sometimes, however, plots of the data suggest that the linear model is incomplete and should be modified.
plot(Volume ~ Girth, data = trees)
(ref:cap-scatterplot-volume-girth-trees) \small Scatterplot of Volume
versus Girth
for the trees
data.
For example, let us examine a scatterplot of Volume
versus Girth
a
little more closely. See Figure
\@ref(fig:scatterplot-volume-girth-trees). There might be a slight
curvature to the data; the volume curves ever so slightly upward as
the girth increases. After looking at the plot we might try to capture
the curvature with a mean response such as \begin{equation}
\mu(x_{1})=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{1}^{2}.
\end{equation} The model associated with this choice of (\mu) is
\begin{equation}
Y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{1}^{2}+\epsilon.
\end{equation} The regression assumptions are the same. Almost
everything indeed is the same. In fact, it is still called a "linear
regression model", since the mean response (\mu) is linear in the
parameters (\beta_{0}), (\beta_{1}), and (\beta_{2}).
However, there is one important difference. When we introduce the squared variable in the model we inadvertently also introduce strong dependence between the terms which can cause significant numerical problems when it comes time to calculate the parameter estimates. Therefore, we should usually rescale the independent variable to have mean zero (and even variance one if we wish) before fitting the model. That is, we replace the (x_{i})'s with (x_{i}-\overline{x}) (or ((x_{i}-\overline{x})/s)) before fitting the model[^mlr-4].
[^mlr-4]: Rescaling the data gets the job done but a better way to
avoid the multicollinearity introduced by the higher order terms is
with orthogonal polynomials, whose coefficients are chosen just
right so that the polynomials are not correlated with each other. This
is beginning to linger outside the scope of this book, however, so we
will content ourselves with a brief mention and then stick with the
rescaling approach in the discussion that follows. A nice example of
orthogonal polynomials in action can be run with example(cars)
.
There are multiple ways to fit a quadratic model to the variables
Volume
and Girth
using R.
Girth
and save them in
a vector Girthsq
. Next, fit the linear model Volume ~ Girth +
Girthsq
.I
: Volume ~ Girth + I(Girth^2)
The second method is shorter than the first but the end result is
the same. And once we calculate and store the fitted model (in,
say, treesquad.lm
) all of the previous comments regarding
R apply.Volume ~ poly(Girth, degree = 2)
See ?poly
and ?cars
for more information. Note that we can
recover the approach in 2 with poly(Girth, degree = 2, raw = TRUE)
.\bigskip
We will fit the quadratic model to the `trees` data and display the results with `summary`, being careful to rescale the data before fitting the model. We may rescale the `Girth` variable to have zero mean and unit variance on-the-fly with the `scale` function.
treesquad.lm <- lm(Volume ~ scale(Girth) + I(scale(Girth)^2), data = trees) summary(treesquad.lm)
We see that the (F) statistic indicates the overall model including
Girth
and Girth^2
is significant. Further, there is strong
evidence that both Girth
and Girth^2
are significantly related to
Volume
. We may examine a scatterplot together with the fitted
quadratic function using the lines
function, which adds a line to
the plot tracing the estimated mean response.
plot(Volume ~ scale(Girth), data = trees) lines(fitted(treesquad.lm) ~ scale(Girth), data = trees)
(ref:cap-fitting-the-quadratic) \small A quadratic model for the trees
data.
The plot is shown in Figure \@ref(fig:fitting-the-quadratic). Pay
attention to the scale on the (x)-axis: it is on the scale of the
transformed Girth
data and not on the original scale.
\bigskip
```{block, type="rem"} When a model includes a quadratic term for an independent variable, it is customary to also include the linear term in the model. The principle is called parsimony. More generally, if the researcher decides to include (x^{m}) as a term in the model, then (s)he should also include all lower order terms (x), (x^{2}), ...,(x^{m-1}) in the model.
We do estimation/prediction the same way that we did in Section \@ref(sub-mlr-point-est-regsurface), except we do not need a `Height` column in the dataframe =new= since the variable is not included in the quadratic model. ```r new <- data.frame(Girth = c(9.1, 11.6, 12.5)) predict(treesquad.lm, newdata = new, interval = "prediction")
The predictions and intervals are slightly different from what they
were previously. Notice that it was not necessary to rescale the
Girth
prediction data before input to the predict
function; the
model did the rescaling for us automatically.
\bigskip
```{block, type="rem"} We have mentioned on several occasions that it is important to rescale the explanatory variables for polynomial regression. Watch what happens if we ignore this advice:
```r summary(lm(Volume ~ Girth + I(Girth^2), data = trees))
Now nothing is significant in the model except Girth^2
. We could
delete the Intercept
and Girth
from the model, but the model would
no longer be parsimonious. A novice may see the output and be
confused about how to proceed, while the seasoned statistician
recognizes immediately that Girth
and Girth^2
are highly
correlated (see Section \@ref(sub-multicollinearity)). The only remedy to this
ailment is to rescale Girth
, which we should have done in the first
place.
In Example \@ref(ex:mlr-trees-poly-no-rescale) of Section \@ref(sec-partial-f-statistic).
Note: The trees
data do not have any qualitative explanatory
variables, so we will construct one for illustrative
purposes[^mlr-5]. We will leave the Girth
variable alone,
but we will replace the variable Height
by a new variable Tall
which indicates whether or not the cherry tree is taller than a
certain threshold (which for the sake of argument will be the sample
median height of 76 ft). That is, Tall
will be defined by
\begin{equation} \mathtt{Tall} = \begin{cases} \mathtt{yes}, & \mbox{if }\mathtt{Height} > 76,\ \mathtt{no}, & \mbox{if }\mathtt{Height}\leq 76. \end{cases} \end{equation}
We can construct Tall
very quickly in R with the cut
function:
trees$Tall <- cut(trees$Height, breaks = c(-Inf, 76, Inf), labels = c("no","yes")) trees$Tall[1:5]
Note that Tall
is automatically generated to be a factor with the
labels in the correct order. See ?cut
for more.
[^mlr-5]: This procedure of replacing a continuous variable by a
discrete/qualitative one is called binning, and is almost never
the right thing to do. We are in a bind at this point, however,
because we have invested this chapter in the trees
data and I do not
want to switch mid-discussion. I am currently searching for a data set
with pre-existing qualitative variables that also conveys the same
points present in the trees data, and when I find it I will update
this chapter accordingly.
Once we have Tall
, we include it in the regression model just like
we would any other variable. It is handled internally in a special
way. Define a "dummy variable" Tallyes
that takes values
\begin{equation} \mathtt{Tallyes} = \begin{cases} 1, & \mbox{if }\mathtt{Tall}=\mathtt{yes},\ 0, & \mbox{otherwise.} \end{cases} \end{equation}
That is, Tallyes
is an indicator variable which indicates when a
respective tree is tall. The model may now be written as
\begin{equation}
\mathtt{Volume}=\beta_{0}+\beta_{1}\mathtt{Girth}+\beta_{2}\mathtt{Tallyes}+\epsilon.
\end{equation}
Let us take a look at what this definition does to the mean
response. Trees with Tall = yes
will have the mean response
\begin{equation}
\mu(\mathtt{Girth})=(\beta_{0}+\beta_{2})+\beta_{1}\mathtt{Girth},
\end{equation}
while trees with Tall = no
will have the mean response
\begin{equation}
\mu(\mathtt{Girth})=\beta_{0}+\beta_{1}\mathtt{Girth}.
\end{equation}
In essence, we are fitting two regression lines: one for tall trees,
and one for short trees. The regression lines have the same slope but
they have different (y) intercepts (which are exactly
(|\beta_{2}|) far apart).
The important thing is to double check that the qualitative variable
in question is stored as a factor. The way to check is with the
class
command. For example,
class(trees$Tall)
If the qualitative variable is not yet stored as a factor then we may
convert it to one with the factor
command. See Section
\@ref(sub-qualitative-data). Other than this we perform MLR as we
normally would.
treesdummy.lm <- lm(Volume ~ Girth + Tall, data = trees) summary(treesdummy.lm)
From the output we see that all parameter estimates are statistically
significant and we conclude that the mean response differs for trees
with Tall = yes
and trees with Tall = no
.
\bigskip
``{block2, type="rem"}
We were somewhat disingenuous when we defined the dummy variable
Tallyesbecause, in truth, R defines
Tallyes`
automatically without input from the user[^mlr-6]. Indeed, the
author fit the model beforehand and wrote the discussion afterward
with the knowledge of what R would do so that the output
the reader saw would match what (s)he had previously read. The way
that R handles factors internally is part of a much
larger topic concerning contrasts, which falls outside the scope of
this book. The interested reader should see Neter et al
[@Neter1996] or Fox [@Fox1997] for more.
[^mlr-6]: That is, R by default handles contrasts according to its internal settings which may be customized by the user for fine control. Given that we will not investigate contrasts further in this book it does not serve the discussion to delve into those settings, either. The interested reader should check `?contrasts` for details. \bigskip ```{block, type="rem"} In general, if an explanatory variable `foo` is qualitative with \(n\) levels `bar1`, `bar2`, ..., `barn` then R will by default automatically define \(n-1\) indicator variables in the following way: \begin{eqnarray*} \mathtt{foobar2} & = & \begin{cases} 1, & \mbox{if }\mathtt{foo}=\mathtt{"bar2"},\\ 0, & \mbox{otherwise.}\end{cases},\,\ldots,\,\mathtt{foobarn}=\begin{cases} 1, & \mbox{if }\mathtt{foo}=\mathtt{"barn"},\\ 0, & \mbox{otherwise.}\end{cases} \end{eqnarray*} The level `bar1` is represented by \(\mathtt{foobar2}=\cdots=\mathtt{foobarn}=0\). We just need to make sure that `foo` is stored as a factor and R will take care of the rest.
We can see a plot of the two regression lines with the following mouthful of code.
treesTall <- split(trees, trees$Tall) treesTall[["yes"]]$Fit <- predict(treesdummy.lm, treesTall[["yes"]]) treesTall[["no"]]$Fit <- predict(treesdummy.lm, treesTall[["no"]]) plot(Volume ~ Girth, data = trees) points(Volume ~ Girth, data = treesTall[["yes"]], pch = 1) points(Volume ~ Girth, data = treesTall[["no"]], pch = 2) lines(Fit ~ Girth, data = treesTall[["yes"]]) lines(Fit ~ Girth, data = treesTall[["no"]])
(ref:cap-dummy-variable-trees) \small A dummy variable model for the trees
data.
It may look intimidating but there is reason to the madness. First we
split
the trees
data into two pieces, with groups determined by
the Tall
variable. Next we add the fitted values to each piece via
predict
. Then we set up a plot
for the variables Volume
versus
Girth
, but we do not plot anything yet (type = n
) because we want
to use different symbols for the two groups. Next we add points
to
the plot for the Tall = yes
trees and use an open circle for a plot
character (pch = 1
), followed by points
for the Tall = no
trees
with a triangle character (pch = 2
). Finally, we add regression
lines
to the plot, one for each group.
There are other -- shorter -- ways to plot regression lines by groups,
namely the scatterplot
function in the car
package [@car] and
the xyplot
function in the lattice
package [@lattice]. We
elected to introduce the reader to the above approach since many
advanced plots in R are done in a similar, consecutive
fashion.
We saw in Section \@ref(sub-mlr-overall-f-test) how to test (H_{0}:\beta_{0}=\beta_{1}=\cdots=\beta_{p}=0) with the overall (F) statistic and we saw in Section \@ref(sub-mlr-students-t-tests) how to test (H_{0}:\beta_{i}=0) that a particular coefficient (\beta_{i}) is zero. Sometimes, however, we would like to test whether a certain part of the model is significant. Consider the regression model \begin{equation} Y=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{j}x_{j}+\beta_{j+1}x_{j+1}+\cdots+\beta_{p}x_{p}+\epsilon, \end{equation} where (j\geq1) and (p\geq2). Now we wish to test the hypothesis \begin{equation} H_{0}:\beta_{j+1}=\beta_{j+2}=\cdots=\beta_{p}=0 \end{equation} versus the alternative \begin{equation} H_{1}:\mbox{at least one of $\beta_{j+1},\ \beta_{j+2},\ ,\ldots,\beta_{p}\neq0$}. \end{equation}
The interpretation of (H_{0}) is that none of the variables (x_{j+1}), ...,(x_{p}) is significantly related to (Y) and the interpretation of (H_{1}) is that at least one of (x_{j+1}), ...,(x_{p}) is significantly related to (Y). In essence, for this hypothesis test there are two competing models under consideration: \begin{align} \mbox{the full model:} & \quad y=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}+\epsilon,\ \mbox{the reduced model:} & \quad y=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{j}x_{j}+\epsilon, \end{align}
Of course, the full model will always explain the data better than the reduced model, but does the full model explain the data significantly better than the reduced model? This question is exactly what the partial (F) statistic is designed to answer.
We first calculate (SSE_{f}), the unexplained variation in the full model, and (SSE_{r}), the unexplained variation in the reduced model. We base our test on the difference (SSE_{r}-SSE_{f}) which measures the reduction in unexplained variation attributable to the variables (x_{j+1}), ..., (x_{p}). In the full model there are (p+1) parameters and in the reduced model there are (j+1) parameters, which gives a difference of (p-j) parameters (hence degrees of freedom). The partial F statistic is \begin{equation} F=\frac{(SSE_{r}-SSE_{f})/(p-j)}{SSE_{f}/(n-p-1)}. \end{equation} It can be shown when the regression assumptions hold under (H_{0}) that the partial (F) statistic has an (\mathsf{f}(\mathtt{df1}=p-j,\,\mathtt{df2}=n-p-1)) distribution. We calculate the (p)-value of the observed partial (F) statistic and reject (H_{0}) if the (p)-value is small.
The key ingredient above is that the two competing models are nested
in the sense that the reduced model is entirely contained within the
complete model. The way to test whether the improvement is significant
is to compute lm
objects both for the complete model and the reduced
model then compare the answers with the anova
function.
\bigskip
``{example, name="The Trees data"}
For the
trees` data, let us fit a polynomial regression model and for
the sake of argument we will ignore our own good advice and fail to
rescale the explanatory variables.
```r treesfull.lm <- lm(Volume ~ Girth + I(Girth^2) + Height + I(Height^2), data = trees) summary(treesfull.lm)
In this ill-formed model nothing is significant except Girth
and
Girth^2
. Let us continue down this path and suppose that we would
like to try a reduced model which contains nothing but Girth
and
Girth^2
(not even an Intercept
). Our two models are now
\begin{align}
\mbox{the full model:} & \quad Y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{1}^{2}+\beta_{3}x_{2}+\beta_{4}x_{2}^{2}+\epsilon,\
\mbox{the reduced model:} & \quad Y=\beta_{1}x_{1}+\beta_{2}x_{1}^{2}+\epsilon,
\end{align}
We fit the reduced model with lm
and store the results:
treesreduced.lm <- lm(Volume ~ -1 + Girth + I(Girth^2), data = trees)
To delete the intercept from the model we used -1
in the model
formula. Next we compare the two models with the anova
function. The
convention is to list the models from smallest to largest.
anova(treesreduced.lm, treesfull.lm)
We see from the output that the complete model is highly significant
compared to the model that does not incorporate Height
or the
Intercept
. We wonder (with our tongue in our cheek) if the
Height^2
term in the full model is causing all of the trouble. We
will fit an alternative reduced model that only deletes Height^2
.
treesreduced2.lm <- lm(Volume ~ Girth + I(Girth^2) + Height, data = trees) anova(treesreduced2.lm, treesfull.lm)
In this case, the improvement to the reduced model that is
attributable to Height^2
is not significant, so we can delete
Height^2
from the model with a clear conscience. We notice that the
p-value for this latest partial (F) test is 0.8865, which seems to
be remarkably close to the p-value we saw for the univariate t
test of Height^2
at the beginning of this example. In fact, the
p-values are exactly the same. Perhaps now we gain some insight
into the true meaning of the univariate tests.
We encountered many, many diagnostic measures for simple linear regression in Sections \@ref(sec-residual-analysis-slr) and \@ref(sec-other-diagnostic-tools-slr). All of these are valid in multiple linear regression, too, but there are some slight changes that we need to make for the multivariate case. We list these below, and apply them to the trees example.
We spent the entire chapter talking about the trees
data, and all of
our models looked like Volume ~ Girth + Height
or a variant of this
model. But let us think again: we know from elementary school that the
volume of a rectangle is (V=lwh) and the volume of a cylinder (which
is closer to what a black cherry tree looks like) is
\begin{equation}
V=\pi r^{2}h\quad \mbox{or}\quad V=4\pi dh,
\end{equation}
where (r) and (d) represent the radius and diameter of the tree,
respectively. With this in mind, it would seem that a more appropriate
model for (\mu) might be
\begin{equation}
\label{eq-trees-nonlin-reg}
\mu(x_{1},x_{2})=\beta_{0}x_{1}^{\beta_{1}}x_{2}^{\beta_{2}},
\end{equation}
where (\beta_{1}) and (\beta_{2}) are parameters to adjust for the
fact that a black cherry tree is not a perfect cylinder.
How can we fit this model? The model is not linear in the parameters
any more, so our linear regression methods will not work... or will
they? In the trees
example we may take the logarithm of both sides
of Equation \eqref{eq-trees-nonlin-reg} to get
\begin{equation}
\mu^{\ast}(x_{1},x_{2})=\ln\left[\mu(x_{1},x_{2})\right]=\ln\beta_{0}+\beta_{1}\ln x_{1}+\beta_{2}\ln x_{2},
\end{equation}
and this new model (\mu^{\ast}) is linear in the parameters
(\beta_{0}^{\ast}=\ln\beta_{0}), (\beta_{1}^{\ast}=\beta_{1}) and
(\beta_{2}^{\ast}=\beta_{2}). We can use what we have learned to fit
a linear model log(Volume) ~ log(Girth) + log(Height)
, and
everything will proceed as before, with one exception: we will need to
be mindful when it comes time to make predictions because the model
will have been fit on the log scale, and we will need to transform our
predictions back to the original scale (by exponentiating with exp
)
to make sense.
treesNonlin.lm <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees) summary(treesNonlin.lm)
This is our best model yet (judging by (R^{2}) and (\overline{R}^{2})), all of the parameters are significant, it is simpler than the quadratic or interaction models, and it even makes theoretical sense. It rarely gets any better than that.
We may get confidence intervals for the parameters, but remember that it is usually better to transform back to the original scale for interpretation purposes:
exp(confint(treesNonlin.lm))
(Note that we did not update the row labels of the matrix to show that we exponentiated and so they are misleading as written.) We do predictions just as before. Remember to transform the response variable back to the original scale after prediction.
new <- data.frame(Girth = c(9.1, 11.6, 12.5), Height = c(69, 74, 87)) exp(predict(treesNonlin.lm, newdata = new, interval = "confidence"))
The predictions and intervals are slightly different from those
calculated earlier, but they are close. Note that we did not need to
transform the Girth
and Height
arguments in the dataframe
new
. All transformations are done for us automatically.
We saw with the trees
data that a nonlinear model might be more
appropriate for the data based on theoretical considerations, and we
were lucky because the functional form of (\mu) allowed us to take
logarithms to transform the nonlinear model to a linear one. The same
trick will not work in other circumstances, however. We need
techniques to fit general models of the form
\begin{equation}
\mathbf{Y}=\mu(\mathbf{X})+\epsilon,
\end{equation}
where (\mu) is some crazy function that does not lend itself to
linear transformations.
There are a host of methods to address problems like these which are studied in advanced regression classes. The interested reader should see Neter et al [@Neter1996] or Tabachnick and Fidell [@Tabachnick2006].
It turns out that John Fox has posted an Appendix to his book [@Fox2002] which discusses some of the methods and issues associated with nonlinear regression; see http://cran.r-project.org/doc/contrib/Fox-Companion/appendix.html for more. Here is an example of how it works, based on a question from R-help.
# fake data set.seed(1) x <- seq(from = 0, to = 1000, length.out = 200) y <- 1 + 2*(sin((2*pi*x/360) - 3))^2 + rnorm(200, sd = 2) # plot(x, y) acc.nls <- nls(y ~ a + b*(sin((2*pi*x/360) - c))^2, start = list(a = 0.9, b = 2.3, c = 2.9)) summary(acc.nls) #plot(x, fitted(acc.nls))
A multiple regression model exhibits multicollinearity when two or more of the explanatory variables are substantially correlated with each other. We can measure multicollinearity by having one of the explanatory play the role of "dependent variable" and regress it on the remaining explanatory variables. The the (R^{2}) of the resulting model is near one, then we say that the model is multicollinear or shows multicollinearity.
Multicollinearity is a problem because it causes instability in the regression model. The instability is a consequence of redundancy in the explanatory variables: a high (R^{2}) indicates a strong dependence between the selected independent variable and the others. The redundant information inflates the variance of the parameter estimates which can cause them to be statistically insignificant when they would have been significant otherwise. To wit, multicollinearity is usually measured by what are called variance inflation factors.
Once multicollinearity has been diagnosed there are several approaches to remediate it. Here are a couple of important ones.
We decided to omit a thorough discussion of multicollinearity because we are not equipped to handle the mathematical details. Perhaps the topic will receive more attention in a later edition.
[ AIC = -2\ln L + 2(p + 1) ]
Exercise. Use Equations \eqref{eq-mlr-sse-matrix}, \eqref{eq-mlr-ssto-matrix}, and \eqref{eq-mlr-ssr-matrix} to prove the Anova Equality: [ SSTO = SSE + SSR.]
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.