library(LM2GLMM)
options(width = 110)
knitr::opts_chunk$set(error = TRUE,
                      cache = TRUE,
                      cache.path = "./cache_knitr/LM_fitting/",
                      fig.path = "./fig_knitr/LM_fitting/")

The Linear Model: LM


[Back to main menu](./Title.html#2)

You will learn in this session r .emo("info")

Data

Our toy dataset r .emo("alien")

Let's simulate the size of 12 aliens using the simulate_Aliens() function provided in this LM2GLMM package:

set.seed(123)
(Alien <- simulate_Aliens()) ## by default, N = 12, intercept = 50, slope = 1.5 and sigma2 = 25

Fitting LM with stats::lm() & spaMM::fitme()

A note on package::fn r .emo("info")

In R, when you call a function, R looks for it in the global environment (i.e. where your objects are stored) and then in all the packages attached to the so-called search path:

head(search(), n = 10) ## shows first 10 elements

The packages there are those that have been attached by the system or by you (e.g. using library()).

Using the syntax package::fn() allows you to access a function from an installed package that has not been attached, which can be useful if you don't want to attach a whole package for a single call (e.g. as when we do drat::addRepo("courtiol")).

Importantly, it is also a way to be explicit about where a function is coming from.

So stats::lm() means that lm() comes from the built-in package {stats}, and spaMM::fitme() means that fitme() comes from the optional package {spaMM}.

A note on the R package {spaMM} r .emo("info")

{spaMM} is an R package developed by François Rousset (a population geneticist from Montpellier, France).


We will use it a lot because it is very flexible and powerful.

Fitting the Alien data using stats::lm() r .emo("practice")

fit_alien_lm <- lm(size ~ humans_eaten, data = Alien)


The object created stores a lot of information that you can retrieve directly or (better) indirectly using lm() companion functions:

names(fit_alien_lm)

Many functions can take such an object as input and thus will use its inner components:

head(methods(class = class(fit_alien_lm)), n = 18) ## shows first 18 elements (items depend on packages attached)

Fitting the Alien data using spaMM::fitme() r .emo("practice")

library(spaMM)
fit_alien_spaMM <- fitme(size ~ humans_eaten, data = Alien)


The object created also stores a lot of information:

names(fit_alien_spaMM)

You can again retrieve the components directly or (better) indirectly using fitme() companion extractor functions (more on that later).

Extraction of the design matrix r .emo("practice")

From fit with stats::lm()

model.matrix(fit_alien_lm)

Extraction of the design matrix r .emo("practice")

From fit with spaMM::fitme()

model.matrix(fit_alien_spaMM)

Extraction of model parameter estimates r .emo("practice")

From fit with stats::lm()

coef(fit_alien_lm) ## you can also do fit_alien_lm$coefficients but using dedicated 
                   ## extractor (if available) is a better idea!

From fit with spaMM::fitme()

fixef(fit_alien_spaMM)

Extraction of the residual variance r .emo("practice")

From fit with stats::lm()

summary(fit_alien_lm)$sigma^2 ## we expect something close to 25
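The same value can be recomputed by hand as the RSS divided by the residual degrees of freedom $n - p$:

sum(residuals(fit_alien_lm)^2) / df.residual(fit_alien_lm) ## RSS / (n - p), same as summary(fit_alien_lm)$sigma^2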

From fit with spaMM::fitme()

fit_alien_spaMM$phi * nrow(Alien) / (nrow(model.matrix(fit_alien_spaMM)) - ncol(model.matrix(fit_alien_spaMM)))

Note: fitme() estimates the residual variance phi by maximum likelihood (i.e. with denominator $n$, a biased estimator), so we rescale it by $n / (n - p)$ to recover the unbiased estimate that lm() reports.

Extraction of the residuals r .emo("practice")

From fit with stats::lm()

residuals(fit_alien_lm)

From fit with spaMM::fitme()

residuals(fit_alien_spaMM)

Extraction of the log-likelihood r .emo("practice")

We will soon see what the likelihood is, but for now consider it a quantity measuring how well the fit approximates the observations.

From fit with stats::lm()

logLik(fit_alien_lm)

From fit with spaMM::fitme()

logLik(fit_alien_spaMM)

Extraction of the residual sum of squares r .emo("practice")

This is another statistic quantifying how well the fit approximates the observations, which we will also study.

From fit with stats::lm()

anova(fit_alien_lm)[2, "Sum Sq"] ## this is the same as sum(residuals(fit_alien_lm)^2)

From fit with spaMM::fitme()

sum(residuals(fit_alien_spaMM)^2) ## no extractor I can think of 
                                  ## (even if in the specific case of LM you may use deviance(fit_alien_spaMM))

How does fitting work?

How to fit a LM without stats::lm()? r .emo("info")

In general, by:

Maximum Likelihood (ML)

We want to find the $\widehat{\beta}$ (the column vector of model parameter estimates) maximizing the likelihood.

The likelihood of the model given the data is equal to the probability density assumed for those data given those parameter values, that is: $\displaystyle {\mathcal {L}}(\theta \mid x) = P(x \mid \theta)$.
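For instance, here is the density of the first alien's size under the true simulation parameters (a toy illustration using the parameters stated above, not estimates):

dnorm(x = Alien$size[1], mean = 50 + 1.5 * Alien$humans_eaten[1], sd = 5) ## intercept = 50, slope = 1.5, sd = sqrt(25)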


In the case of LM (and not GLM, LMM, GLMM), this is equivalent to performing the fit by:

Ordinary Least Squares (OLS)

We want to find the $\widehat{\beta}$ minimizing the Residual Sum of Squares (RSS).

The RSS = $\displaystyle\sum_{i=1}^{n}{\varepsilon_i^2}$.
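On our toy fit, the RSS can thus be obtained directly from the residuals:

sum(residuals(fit_alien_lm)^2) ## same value as anova(fit_alien_lm)[2, "Sum Sq"]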

Fitting LM by Ordinary Least Squares (OLS)

The Residual Sum of Squares (RSS) r .emo("info")

(figures: graphical illustration of the residuals and of the RSS)

Home made LM fitting by OLS r .emo("nerd") r .emo("proof")

We write a function that computes the RSS given the parameter estimates, the formula, and the data:

compute_rss <- function(vector.param, formula, data) {
  useful.data <- model.frame(formula = formula, data = data) ## ditch useless data and flag response
  X <- model.matrix(object = formula, data = useful.data) ## create the design matrix
  fitted_values <- X %*% matrix(vector.param) ## compute the fitted values given the coef
  response <- model.response(useful.data) ## extract the response
  sum((response - fitted_values)^2) ## compute and output the RSS
}

We can test that the function works by computing the RSS at the parameter values estimated by lm():

compute_rss(vector.param = coef(fit_alien_lm), formula = size ~ humans_eaten, data = Alien)
anova(fit_alien_lm)[2, "Sum Sq"] ## value for reference

Home made LM fitting by OLS r .emo("nerd") r .emo("proof")

We can now feed this function to an optimisation procedure that will find the parameter values minimising its output by trying many combinations of parameters:

res_optim_rss <- optim(par = c("intercept" = 0, "slope" = 1), ## the starting values for the parameters
                       fn = compute_rss, ## the function computing the output that must be minimised
                       formula = size ~ humans_eaten, ## any additional arguments for "fn"
                       data = Alien)                  ## any additional arguments for "fn"

Let's compare parameter and RSS values obtained to those obtained with lm():

res_optim_rss$par - coef(fit_alien_lm) ## very close to what is produced by lm() :-) !!!
res_optim_rss$value - anova(fit_alien_lm)[2, "Sum Sq"] ## very close to what is produced by lm() :-) !!!

Fitting LM by Maximum (log-)Likelihood (ML)

The maximum (log-)likelihood (ML) r .emo("info")

Alien$fitted_values <- fitted.values(fit_alien_lm)
Alien$residuals <- residuals(fit_alien_lm)
Alien$residual_var <- sum(Alien$residuals^2) / nrow(Alien) ## need biased estimate here!!
Alien$density <- dnorm(x = Alien$size, mean = Alien$fitted_values, sd = sqrt(Alien$residual_var))
Alien[1, ] ## we look in details at the first row, but you can look at all of them 
The log-likelihood is simply the sum of these log-densities:
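For instance, using the density column computed above (this should match the value returned by logLik()):

sum(log(Alien$density)) ## sum of the log-densities
logLik(fit_alien_lm)    ## value for reference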

Home made LM fitting by ML r .emo("nerd") r .emo("proof")

compute_logLik <- function(vector.param, formula, data) {
  residual_var <- vector.param[1]  ## var for the normal distrib
  if (residual_var < 0) return(NA) ## negative variance not accepted
  useful.data <- model.frame(formula = formula, data = data) ## ditch useless data and flag response
  X <- model.matrix(object = formula, data = useful.data) ## create the design matrix
  fitted_values <- X %*% matrix(vector.param[-1]) ## compute the fitted values given the coef
  response <- model.response(useful.data) ## extract the response
  sum(dnorm(response, mean = fitted_values, sd = sqrt(residual_var), log = TRUE)) ## compute logLik
}

We can test that the function works by computing the logLik at parameter values estimated with lm():

compute_logLik(vector.param = c(sum(residuals(fit_alien_lm)^2)/nrow(Alien), coef(fit_alien_lm)),
               formula = size ~ humans_eaten, data = Alien)
logLik(fit_alien_lm)[1] ## value for reference

Home made LM fitting by ML r .emo("nerd") r .emo("proof")

We can now feed this function to an optimisation procedure that will find parameter values maximising the output:

res_optim_ML <- optim(par = c("residvar" = 10, "intercept" = 0, "slope" = 1),
                      fn = compute_logLik,
                      formula = size ~ humans_eaten,
                      data = Alien,
                      control = list(fnscale = -1)) ## to indicate to maximise and not minimise

Let's look at the output:

res_optim_ML

Fitting methods comparison

Comparing the 3 methods on a REAL dataset r .emo("nerd")

fit_test_lm <- lm(height ~ weight*sex + cigarettes + drink, data = UK)

fit_test_homemade_OLS <- optim(par = rep(0, 8),
                               fn = compute_rss,
                               formula = height ~ weight*sex + cigarettes + drink,
                               data = UK,
                               method = "BFGS") ## better optim method (default is getting lost...)


fit_test_homemade_ML <- optim(par = c(10, rep(0, 8)),
                              fn = compute_logLik,
                              formula = height ~ weight*sex + cigarettes + drink, data = UK,
                              control = list(fnscale = -1), method = "BFGS")
round(rbind(lm = coef(fit_test_lm), OLS = fit_test_homemade_OLS$par, ML = fit_test_homemade_ML$par[-1]), 3)

Comparing the 3 methods on a REAL dataset r .emo("nerd")

Let's now perform a benchmark to compare the computational efficiency of the 3 fitting methods:

UKsmall <- UK[1:500, ] ## we reduce the sample size for saving time
microbenchmark::microbenchmark(
  lm = lm(height ~ weight*sex + cigarettes + drink, data = UKsmall),
  OLS = optim(par = rep(0, 8),
              fn = compute_rss,
              formula = height ~ weight*sex + cigarettes + drink,
              data = UKsmall,
              method = "BFGS"),
  ML = optim(par = c(10, rep(0, 8)),
             fn = compute_logLik,
             formula = height ~ weight*sex + cigarettes + drink, data = UKsmall,
             control = list(fnscale = -1), method = "BFGS"),
  times = 10)

So our homemade functions work, but they are hugely inefficient! Each call to compute_rss() or compute_logLik() rebuilds the model frame and design matrix, and optim() needs many such calls, whereas lm() solves the problem in a single pass.

Fitting LM analytically

What the statistical textbooks tell us r .emo("nerd")

$\widehat{\beta} = (X^\text{T}X)^{-1}X^\text{T}Y$

This gives the best estimates directly (no need for optimisation / trial and error):

Y <- matrix(Alien$size)
X <- model.matrix(fit_alien_lm)
solve(t(X) %*% X) %*% t(X) %*% Y  ## Tip: solve(x) returns the inverse of the matrix x
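A slightly more idiomatic way to write the same computation in R is to solve the normal equations $(X^\text{T}X)\widehat{\beta} = X^\text{T}Y$ without explicitly inverting $X^\text{T}X$:

solve(crossprod(X), crossprod(X, Y)) ## crossprod(X) is t(X) %*% X, computed more efficiently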

... but lm() does not even do that, because explicitly inverting $X^\text{T}X$ is still (somewhat) inefficient and can be numerically unstable.


Geek note: $\widehat{Y} = X \widehat{\beta} = X (X^TX)^{-1}X^TY=HY$, and $H$ is called the hat matrix (it is a projection matrix and it can be used for various purposes).
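A quick numerical check of this identity on our toy fit (computing $H$ explicitly, which one would not do for large $n$):

H <- X %*% solve(t(X) %*% X) %*% t(X) ## the hat matrix (an n * n projection matrix)
all.equal(c(H %*% Y), unname(fitted(fit_alien_lm))) ## TRUE: HY gives the fitted values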

lm()-like fitting using QR decomposition r .emo("nerd")

The goal is simply to decompose the design matrix into a product of two new matrices whose properties allow the maths to be done super efficiently.

qr_list <- qr(X)  ## same as fit_alien_lm$qr
Q <- qr.Q(qr_list, complete = TRUE)  ## orthogonal matrix n * n (transpose = inverse)
R <- qr.R(qr_list, complete = TRUE)  ## upper triangular matrix n * p
all.equal(Q %*% R, X, check.attributes = FALSE)  ## TRUE: Q %*% R is equal to X!!
QTY <- t(Q) %*% Y ## same as fit_alien_lm$effects
backsolve(R, QTY) ## solve B from RB = QTY
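Base R wraps these last steps for you: qr.coef() recovers the estimates directly from a QR object:

qr.coef(qr_list, Y) ## same estimates as backsolve(R, QTY)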

Actual lm() fitting procedure r .emo("nerd")

First, lm() processes information for lm.fit(); it is quite a difficult function to read but here are the important steps:

lm(size ~ humans_eaten, data = Alien)
mf <- model.frame(size ~ humans_eaten, data = Alien)
Y  <- model.response(mf)
X  <- model.matrix(~ humans_eaten, data = Alien)
lm.fit(X, Y)

Then, lm.fit() calls a C function (Cdqrls) that calls a Fortran function (dqrls, which will compute estimates after QR) calling another Fortran function (dqrdc2, which performs the QR decomposition).

C     Dqrdc2 is a *modification* of Linpack's dqrdc ('DQRDC') for R
c
c     dqrdc2 uses householder transformations to compute the qr
c     factorization of an n by p matrix x.  a limited column
c     pivoting strategy based on the 2-norms of the reduced columns
c     moves columns with near-zero norm to the right-hand edge of
c     the x matrix.  this strategy means that sequential one
c     degree-of-freedom effects can be computed in a natural way.
c
c     i am very nervous about modifying linpack code in this way.
c     if you are a computational linear algebra guru and you really
c     understand how to solve this problem please feel free to
c     suggest improvements to this code.
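Note that base R also exposes this fast path directly: .lm.fit() skips the formula and model-frame machinery and works on the design matrix and response (a quick sketch, reusing the X and Y defined earlier):

.lm.fit(X, Y)$coefficients ## bare-bones QR-based fitter: no model frame, no extractors, minimal checks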

Dissecting the output from lm() r .emo("nerd")

names(fit_alien_lm)

We have already seen all non-trivial components!

We have not yet seen:

What you need to remember r .emo("goal")

Table of contents

The Linear Model: LM


[Back to main menu](./Title.html#2)

