library(LM2GLMM)
options(width = 110)
knitr::opts_chunk$set(error = TRUE,
                      cache = TRUE,
                      cache.path = "./cache_knitr/LM_fitting/",
                      fig.path = "./fig_knitr/LM_fitting/")

The Linear Model: LM


[Back to main menu](./Title.html#2)

You will learn in this session r .emo("info")

Data

Our toy dataset r .emo("alien")

Let's simulate the size of 12 aliens using the simulate_Aliens() function provided in this LM2GLMM package:

set.seed(123)
(Alien <- simulate_Aliens()) ## by default, N = 12, intercept = 50, slope = 1.5 and sigma2 = 25

Fitting LM with stats::lm() & spaMM::fitme()

A note on package::fn r .emo("info")

In R, when you call a function, R looks for it in the global environment (i.e. where your objects are stored) and then in all the packages attached to the so-called search path:

head(search(), n = 10) ## shows first 10 elements

The packages there are those that have been attached by the system or by you (e.g. using library()).

Using the syntax package::fn() allows you to access a function from an installed package that has not been attached, which can be useful if you don't want to attach a whole package for a single call (e.g. as when we do drat::addRepo("courtiol")).

Importantly, it is also a way to be explicit about where a function is coming from.

So stats::lm() means that lm() comes from the built-in package {stats}, and spaMM::fitme() means that fitme() comes from the optional package {spaMM}.

A note on the R package {spaMM} r .emo("info")

{spaMM} is an R package developed by François Rousset (a population geneticist from Montpellier, France).


We will use it a lot because it is very flexible and powerful.

Fitting the Alien data using stats::lm() r .emo("practice")

fit_alien_lm <- lm(size ~ humans_eaten, data = Alien)


The object created stores a lot of information that you can retrieve directly or (better) indirectly using lm() companion functions:

names(fit_alien_lm)

Many functions can take such an object as input and thus will use its inner components:

head(methods(class = class(fit_alien_lm)), n = 18) ## shows first 18 elements (items depend on packages attached)

Fitting the Alien data using spaMM::fitme() r .emo("practice")

library(spaMM)
fit_alien_spaMM <- fitme(size ~ humans_eaten, data = Alien)


The object created also stores a lot of information:

names(fit_alien_spaMM)

You can again retrieve the components directly or (better) indirectly using fitme() companion extractor functions (more on that later).

Extraction of the design matrix r .emo("practice")

From fit with stats::lm()

model.matrix(fit_alien_lm)

Extraction of the design matrix r .emo("practice")

From fit with spaMM::fitme()

model.matrix(fit_alien_spaMM)

Extraction of model parameter estimates r .emo("practice")

From fit with stats::lm()

coef(fit_alien_lm) ## you can also do fit_alien_lm$coefficients but using dedicated 
                   ## extractor (if available) is a better idea!

From fit with spaMM::fitme()

fixef(fit_alien_spaMM)

Extraction of the residual variance r .emo("practice")

From fit with stats::lm()

summary(fit_alien_lm)$sigma^2 ## we expect something close to 25
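The same value can be recomputed by hand as the RSS divided by the residual degrees of freedom $n - p$:

sum(residuals(fit_alien_lm)^2) / df.residual(fit_alien_lm) ## RSS / (n - p), same as summary(fit_alien_lm)$sigma^2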

From fit with spaMM::fitme()

fit_alien_spaMM$phi * nrow(Alien) / (nrow(model.matrix(fit_alien_spaMM)) - ncol(model.matrix(fit_alien_spaMM)))

Note: fitme() estimates the residual variance phi by maximum likelihood (i.e. with denominator $n$, a biased estimator), so we rescale it by $n / (n - p)$ to recover the unbiased estimate that lm() reports.

Extraction of the residuals r .emo("practice")

From fit with stats::lm()

residuals(fit_alien_lm)

From fit with spaMM::fitme()

residuals(fit_alien_spaMM)

Extraction of the log-likelihood r .emo("practice")

We will soon see what the likelihood is, but for now consider it a quantity measuring how well the fit approximates the observations.

From fit with stats::lm()

logLik(fit_alien_lm)

From fit with spaMM::fitme()

logLik(fit_alien_spaMM)

Extraction of the residual sum of squares r .emo("practice")

This is another statistic quantifying how well the fit approximates the observations, which we will also study.

From fit with stats::lm()

anova(fit_alien_lm)[2, "Sum Sq"] ## this is the same as sum(residuals(fit_alien_lm)^2)

From fit with spaMM::fitme()

sum(residuals(fit_alien_spaMM)^2) ## no extractor I can think of 
                                  ## (even if in the specific case of LM you may use deviance(fit_alien_spaMM))

How does fitting work?

How to fit a LM without stats::lm()? r .emo("info")

In general, by:

Maximum Likelihood (ML)

We want to find the $\widehat{\beta}$ (the column vector of model parameter estimates) maximizing the likelihood.

The likelihood of the model given the data is equal to the probability density assumed for those data given those parameter values, that is: $\displaystyle {\mathcal {L}}(\theta \mid x) = P(x \mid \theta)$.
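For instance, here is the density of the first alien's size under the true simulation parameters (a toy illustration using the parameters stated above, not estimates):

dnorm(x = Alien$size[1], mean = 50 + 1.5 * Alien$humans_eaten[1], sd = 5) ## intercept = 50, slope = 1.5, sd = sqrt(25)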


In the case of LM (and not GLM, LMM, GLMM), this is equivalent to performing the fit by:

Ordinary Least Squares (OLS)

We want to find the $\widehat{\beta}$ minimizing the Residual Sum of Squares (RSS).

The RSS = $\displaystyle\sum_{i=1}^{n}{\varepsilon_i^2}$.
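On our toy fit, the RSS can thus be obtained directly from the residuals:

sum(residuals(fit_alien_lm)^2) ## same value as anova(fit_alien_lm)[2, "Sum Sq"]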

Fitting LM by Ordinary Least Squares (OLS)

The Residual Sum of Squares (RSS) r .emo("info")

(figures: graphical illustration of the residuals and of the RSS)

Home made LM fitting by OLS r .emo("nerd") r .emo("proof")

We write a function that computes the RSS given the parameter estimates, the formula, and the data:

compute_rss <- function(vector.param, formula, data) {
  useful.data <- model.frame(formula = formula, data = data) ## ditch useless data and flag response
  X <- model.matrix(object = formula, data = useful.data) ## create the design matrix
  fitted_values <- X %*% matrix(vector.param) ## compute the fitted values given the coef
  response <- model.response(useful.data) ## extract the response
  sum((response - fitted_values)^2) ## compute and output the RSS
}

We can test that the function works by computing the RSS at the parameter values estimated by lm():

compute_rss(vector.param = coef(fit_alien_lm), formula = size ~ humans_eaten, data = Alien)
anova(fit_alien_lm)[2, "Sum Sq"] ## value for reference

Home made LM fitting by OLS r .emo("nerd") r .emo("proof")

We can now feed this function to an optimisation procedure that will find the parameter values minimising its output by trying many combinations of parameters:

res_optim_rss <- optim(par = c("intercept" = 0, "slope" = 1), ## the starting values for the parameters
                       fn = compute_rss, ## the function computing the output that must be minimised
                       formula = size ~ humans_eaten, ## any additional arguments for "fn"
                       data = Alien)                  ## any additional arguments for "fn"

Let's compare parameter and RSS values obtained to those obtained with lm():

res_optim_rss$par - coef(fit_alien_lm) ## very close to what is produced by lm() :-) !!!
res_optim_rss$value - anova(fit_alien_lm)[2, "Sum Sq"] ## very close to what is produced by lm() :-) !!!

Fitting LM by Maximum (log-)Likelihood (ML)

The maximum (log-)likelihood (ML) r .emo("info")

Alien$fitted_values <- fitted.values(fit_alien_lm)
Alien$residuals <- residuals(fit_alien_lm)
Alien$residual_var <- sum(Alien$residuals^2) / nrow(Alien) ## need biased estimate here!!
Alien$density <- dnorm(x = Alien$size, mean = Alien$fitted_values, sd = sqrt(Alien$residual_var))
Alien[1, ] ## we look in details at the first row, but you can look at all of them 
The log-likelihood is simply the sum of these log-densities:
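For instance, using the density column computed above (this should match the value returned by logLik()):

sum(log(Alien$density)) ## sum of the log-densities
logLik(fit_alien_lm)    ## value for reference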

Home made LM fitting by ML r .emo("nerd") r .emo("proof")

compute_logLik <- function(vector.param, formula, data) {
  residual_var <- vector.param[1]  ## var for the normal distrib
  if (residual_var < 0) return(NA) ## negative variance not accepted
  useful.data <- model.frame(formula = formula, data = data) ## ditch useless data and flag response
  X <- model.matrix(object = formula, data = useful.data) ## create the design matrix
  fitted_values <- X %*% matrix(vector.param[-1]) ## compute the fitted values given the coef
  response <- model.response(useful.data) ## extract the response
  sum(dnorm(response, mean = fitted_values, sd = sqrt(residual_var), log = TRUE)) ## compute logLik
}

We can test that the function works by computing the logLik at parameter values estimated with lm():

compute_logLik(vector.param = c(sum(residuals(fit_alien_lm)^2)/nrow(Alien), coef(fit_alien_lm)),
               formula = size ~ humans_eaten, data = Alien)
logLik(fit_alien_lm)[1] ## value for reference

Home made LM fitting by ML r .emo("nerd") r .emo("proof")

We can now feed this function to an optimisation procedure that will find parameter values maximising the output:

res_optim_ML <- optim(par = c("residvar" = 10, "intercept" = 0, "slope" = 1),
                      fn = compute_logLik,
                      formula = size ~ humans_eaten,
                      data = Alien,
                      control = list(fnscale = -1)) ## to indicate to maximise and not minimise

Let's look at the output:

res_optim_ML

Fitting methods comparison

Comparing the 3 methods on a REAL dataset r .emo("nerd")

fit_test_lm <- lm(height ~ weight*sex + cigarettes + drink, data = UK)

fit_test_homemade_OLS <- optim(par = rep(0, 8),
                               fn = compute_rss,
                               formula = height ~ weight*sex + cigarettes + drink,
                               data = UK,
                               method = "BFGS") ## better optim method (default is getting lost...)


fit_test_homemade_ML <- optim(par = c(10, rep(0, 8)),
                              fn = compute_logLik,
                              formula = height ~ weight*sex + cigarettes + drink, data = UK,
                              control = list(fnscale = -1), method = "BFGS")
round(rbind(lm = coef(fit_test_lm), OLS = fit_test_homemade_OLS$par, ML = fit_test_homemade_ML$par[-1]), 3)

Comparing the 3 methods on a REAL dataset r .emo("nerd")

Let's now perform a benchmark to compare the computational efficiency of the 3 fitting methods:

UKsmall <- UK[1:500, ] ## we reduce the sample size for saving time
microbenchmark::microbenchmark(
  lm = lm(height ~ weight*sex + cigarettes + drink, data = UKsmall),
  OLS = optim(par = rep(0, 8),
              fn = compute_rss,
              formula = height ~ weight*sex + cigarettes + drink,
              data = UKsmall,
              method = "BFGS"),
  ML = optim(par = c(10, rep(0, 8)),
             fn = compute_logLik,
             formula = height ~ weight*sex + cigarettes + drink, data = UKsmall,
             control = list(fnscale = -1), method = "BFGS"),
  times = 10)

So our homemade functions work, but they are hugely inefficient! Each call to compute_rss() or compute_logLik() rebuilds the model frame and design matrix, and optim() needs many such calls, whereas lm() solves the problem in a single pass.

Fitting LM analytically

What the statistical textbooks tell us r .emo("nerd")

$\widehat{\beta} = (X^\text{T}X)^{-1}X^\text{T}Y$

This gives the best estimates directly (no need for optimisation / trial and error):

Y <- matrix(Alien$size)
X <- model.matrix(fit_alien_lm)
solve(t(X) %*% X) %*% t(X) %*% Y  ## Tip: solve(x) returns the inverse of the matrix x
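A slightly more idiomatic way to write the same computation in R is to solve the normal equations $(X^\text{T}X)\widehat{\beta} = X^\text{T}Y$ without explicitly inverting $X^\text{T}X$:

solve(crossprod(X), crossprod(X, Y)) ## crossprod(X) is t(X) %*% X, computed more efficiently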

... but lm() does not even do that, because explicitly inverting $X^\text{T}X$ is still (somewhat) inefficient and can be numerically unstable.


Geek note: $\widehat{Y} = X \widehat{\beta} = X (X^TX)^{-1}X^TY=HY$, and $H$ is called the hat matrix (it is a projection matrix and it can be used for various purposes).
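A quick numerical check of this identity on our toy fit (computing $H$ explicitly, which one would not do for large $n$):

H <- X %*% solve(t(X) %*% X) %*% t(X) ## the hat matrix (an n * n projection matrix)
all.equal(c(H %*% Y), unname(fitted(fit_alien_lm))) ## TRUE: HY gives the fitted values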

lm()-like fitting using QR decomposition r .emo("nerd")

The goal is simply to decompose the design matrix into a product of two new matrices whose properties allow the maths to be done super efficiently.

qr_list <- qr(X)  ## same as fit_alien_lm$qr
Q <- qr.Q(qr_list, complete = TRUE)  ## orthogonal matrix n * n (transpose = inverse)
R <- qr.R(qr_list, complete = TRUE)  ## upper triangular matrix n * p
all.equal(Q %*% R, X, check.attributes = FALSE)  ## TRUE: Q %*% R is equal to X!!
QTY <- t(Q) %*% Y ## same as fit_alien_lm$effects
backsolve(R, QTY) ## solve B from RB = QTY
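Base R wraps these last steps for you: qr.coef() recovers the estimates directly from a QR object:

qr.coef(qr_list, Y) ## same estimates as backsolve(R, QTY)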

Actual lm() fitting procedure r .emo("nerd")

First, lm() processes information for lm.fit(); it is quite a difficult function to read but here are the important steps:

lm(size ~ humans_eaten, data = Alien)
mf <- model.frame(size ~ humans_eaten, data = Alien)
Y  <- model.response(mf)
X  <- model.matrix(~ humans_eaten, data = Alien)
lm.fit(X, Y)

Then, lm.fit() calls a C function (Cdqrls) that calls a Fortran function (dqrls, which will compute estimates after QR) calling another Fortran function (dqrdc2, which performs the QR decomposition).

C     Dqrdc2 is a *modification* of Linpack's dqrdc ('DQRDC') for R
c
c     dqrdc2 uses householder transformations to compute the qr
c     factorization of an n by p matrix x.  a limited column
c     pivoting strategy based on the 2-norms of the reduced columns
c     moves columns with near-zero norm to the right-hand edge of
c     the x matrix.  this strategy means that sequential one
c     degree-of-freedom effects can be computed in a natural way.
c
c     i am very nervous about modifying linpack code in this way.
c     if you are a computational linear algebra guru and you really
c     understand how to solve this problem please feel free to
c     suggest improvements to this code.
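Note that base R also exposes this fast path directly: .lm.fit() skips the formula and model-frame machinery and works on the design matrix and response (a quick sketch, reusing the X and Y defined earlier):

.lm.fit(X, Y)$coefficients ## bare-bones QR-based fitter: no model frame, no extractors, minimal checks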

Dissecting the output from lm() r .emo("nerd")

names(fit_alien_lm)

We have already seen all non-trivial components!

We have not yet seen:

What you need to remember r .emo("goal")

Table of contents

The Linear Model: LM


[Back to main menu](./Title.html#2)

