knitr::opts_chunk$set( warning = FALSE, message = FALSE ) options(digits=4) par(mar=c(5,4,1,1)+.1)

This vignette illustrates the process of transforming a set of variables to a new set of uncorrelated (orthogonal)
variables. It carries out the Gram-Schmidt process **directly** by successively projecting each successive variable
on the previous ones and subtracting (taking residuals). This is equivalent by replacing each successive variable by
its residuals from a least squares regression on the previous variables.

When this method is used on the predictors in a regression problem,
the resulting orthogonal variables have exactly the same `anova()`

summary (based on "Type I", sequential sums of squares)
as do original variables.

We use the `class`

data set, but convert the character factor `sex`

to a dummy (0/1) variable `male`

.

library(matlib) data(class) class$male <- as.numeric(class$sex=="M")

For later use in regression, we create a variable `IQ`

as a response variable

class <- transform(class, IQ = round(20 + height + 3*age -.1*weight -3*male + 10*rnorm(nrow(class)))) head(class)

Reorder the predictors we want, forming a numeric matrix, `X`

.

X <- as.matrix(class[,c(3,4,2,5)]) head(X)

The Gram-Schmidt process treats the variables in a given order, according to the columns in `X`

.
We start with a new matrix `Z`

consisting of `X[,1]`

.
Then, find a new variable `Z[,2]`

orthogonal to `Z[,1]`

by subtracting the projection of `X[,2]`

on `Z[,1]`

.

Z <- cbind(X[,1], 0, 0, 0) Z[,2] <- X[,2] - Proj(X[,2], Z[,1]) crossprod(Z[,1], Z[,2]) # verify orthogonality

Continue in the same way, subtracting the projections of `X[,3]`

on the previous columns, and so forth

Z[,3] <- X[,3] - Proj(X[,3], Z[,1]) - Proj(X[,3], Z[,2]) Z[,4] <- X[,4] - Proj(X[,4], Z[,1]) - Proj(X[,4], Z[,2]) - Proj(X[,4], Z[,3])

Note that if any column of `X`

is a linear combination of the previous columns, the corresponding
column of `Z`

will be all zeros.

These computations are similar to the following set of linear regressions:

z2 <- residuals(lm(X[,2] ~ X[,1]), type="response") z3 <- residuals(lm(X[,3] ~ X[,1:2]), type="response") z4 <- residuals(lm(X[,4] ~ X[,1:3]), type="response")

The columns of `Z`

are now orthogonal, but not of unit length,

zapsmall(crossprod(Z)) # check orthogonality

We make standardize column to unit length, giving `Z`

as an **orthonormal** matrix, such that $Z' Z = I$.

Z <- Z %*% diag(1 / len(Z)) # make each column unit length zapsmall(crossprod(Z)) # check orthonormal colnames(Z) <- colnames(X)

The QR method uses essentially the same process, factoring the matrix $\mathbf{X}$ as $\mathbf{X = Q R}$,
where $\mathbf{Q}$ is the orthonormal matrix corresponding to `Z`

and $\mathbf{R}$ is upper triangualar.
However, the signs of the columns of $\mathbf{Q}$ are arbitrary, and `QR()`

returns `QR(X)$Q`

with
signs reversed, compared to `Z`

.

# same result as QR(X)$Q, but with signs reversed head(Z, 5) head(-QR(X)$Q, 5) all.equal( unname(Z), -QR(X)$Q )

We carry out two regressions of `IQ`

on the variables in `X`

and in `Z`

. These are equivalent, in the sense that

- The $R^2$ and MSE are the same in both models
- Residuals are the same
- The Type I tests given by
`anova()`

are the same.

class2 <- data.frame(Z, IQ=class$IQ)

Regression of IQ on the original variables in `X`

mod1 <- lm(IQ ~ height + weight + age + male, data=class) anova(mod1)

Regression of IQ on the orthogonalized variables in `Z`

mod2 <- lm(IQ ~ height + weight + age + male, data=class2) anova(mod2)

This illustrates that `anova()`

tests for linear models are *sequential* tests. They test hypotheses about the
extra contribution of each variable over and above all previous ones, in a given order. These usually do not make
substantive sense, except in testing ordered ("hierarchical") models.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.