library(LM2GLMM) options(width = 110) knitr::opts_chunk$set(error = TRUE, cache = TRUE, cache.path = "./cache_knitr/LM_fitting/", fig.path = "./fig_knitr/LM_fitting/")
r .emo("info")
stats::lm()
& spaMM::fitme()
lm()
workslm()
objects arer .emo("alien")
Let's simulate the size of 12 aliens using the function integrated in this LM2GLMM package:
set.seed(123) (Alien <- simulate_Aliens()) ## by default, N = 12, intercept = 50, slope = 1.5 and sigma2 = 25
stats::lm()
& spaMM::fitme()
package::fn
r .emo("info")
In R, when you call a function it looks for it in the global environment (i.e. where your objects are stored) and in all the packages attached to the so-called search list:
head(search(), n = 10) ## shows first 10 elements
The packages there are those that have been attached by the system or by you (e.g. using library()
).
Using the syntax package::fn()
allows for you to access a function from an installed package that has not been attached which can be useful if you don't want to load a package and save one line of code (e.g. as when we do drat::addRepo("courtiol")
).
Importantly, it is also a way to be explicit about where a function is coming from.
So stats::lm()
means that lm()
comes from the build-in package {stats} and spaMM::fitme()
means that fitme()
comes from the optional package {spaMM}.
r .emo("info")
{spaMM} is an R package developed by François Rousset (a population geneticist from Montpellier, France).
We will use it a lot because it is very flexible and powerful:
stats::lm()
r .emo("practice")
fit_alien_lm <- lm(size ~ humans_eaten, data = Alien)
The object created stores a lot of information that you can retrieve directly or (better) indirectly using lm()
companion functions:
names(fit_alien_lm)
Many functions can take such an object as input and thus will use its inner components:
head(methods(class = class(fit_alien_lm)), n = 18) ## shows first 18 elements (items depend on packages attached)
spaMM::fitme()
r .emo("practice")
library(spaMM) fit_alien_spaMM <- fitme(size ~ humans_eaten, data = Alien)
The object created also stores a lot of information:
names(fit_alien_spaMM)
You can retrieve again directly, or (better) indirectly using fitme()
companion extractor functions (more on that later).
r .emo("practice")
stats::lm
()model.matrix(fit_alien_lm)
r .emo("practice")
spaMM::fitme
()model.matrix(fit_alien_spaMM)
r .emo("practice")
stats::lm
()coef(fit_alien_lm) ## you can also do fit_alien_lm$coefficients but using dedicated ## extractor (if available) is a better idea!
spaMM::fitme
()fixef(fit_alien_spaMM)
r .emo("practice")
stats::lm
()summary(fit_alien_lm)$sigma^2 ## we expect something close to 25
spaMM::fitme
()fit_alien_spaMM$phi * nrow(Alien) / (nrow(model.matrix(fit_alien_spaMM)) - ncol(model.matrix(fit_alien_spaMM)))
Notes:
phi
contains a biased estimate of the residual variance; i.e. denominator = $n$, not $n-p$r .emo("practice")
stats::lm
()residuals(fit_alien_lm)
spaMM::fitme
()residuals(fit_alien_spaMM)
r .emo("practice")
We will soon see what the likelihood is but for now consider that it is something quantifying how well the fit approximate the observations.
stats::lm
()logLik(fit_alien_lm)
spaMM::fitme
()logLik(fit_alien_spaMM)
r .emo("practice")
This is another statistics quantifying how well the fit approximate the observations, which we will also study.
stats::lm
()anova(fit_alien_lm)[2, "Sum Sq"] ## this is the same as sum(residuals(fit_alien_lm)^2)
spaMM::fitme
()sum(residuals(fit_alien_spaMM)^2) ## no extractor I can think of ## (even if in the specific case of LM you may use deviance(fit_alien_spaMM))
stats::lm()
? r .emo("info")
In general, by:
We want to find the $\widehat{\beta}$ (the column vector of model parameter estimates) maximizing the likelihood.
The likelihood of the model given the data is equal to the probability density assumed for those data given those parameter values, that is: $\displaystyle {\mathcal {L}}(\theta \mid x) = P(x \mid \theta)$.
In the case of LM (and not GLM, LMM, GLMM), this is equivalent to performing the fit by:
We want to find the $\widehat{\beta}$ minimizing the Residual Sum of Squares (RSS).
The RSS = $\displaystyle\sum_{i=1}^{n}{\varepsilon_i^2}$.
r .emo("info")
r .emo("info")
r .emo("info")
r .emo("nerd")
r .emo("proof")
We compute a function that compute the RSS given the parameter estimates, the formula and the data:
compute_rss <- function(vector.param, formula, data) { useful.data <- model.frame(formula = formula, data = data) ## ditch useless data and flag response X <- model.matrix(object = formula, data = useful.data) ## create the design matrix fitted_values <- X %*% matrix(vector.param) ## compute the fitted values given the coef response <- model.response(useful.data) ## extract the response sum((response - fitted_values)^2) ## compute and output the RSS }
We can test that the function works by comparing the RSS at parameter values estimated with lm()
:
compute_rss(vector.param = coef(fit_alien_lm), formula = size ~ humans_eaten, data = Alien) anova(fit_alien_lm)[2, "Sum Sq"] ## value for reference
r .emo("nerd")
r .emo("proof")
We can now feed this function to an optimisation procedure that will find parameter values minimising the output by trying many combination of parameters:
res_optim_rss <- optim(par = c("intercept" = 0, "slope" = 1), ## the starting values for the parameters fn = compute_rss, ## the function computing the output that must be minimised formula = size ~ humans_eaten, ## any additional arguments for "fn" data = Alien) ## any additional arguments for "fn"
Let's compare parameter and RSS values obtained to those obtained with lm()
:
res_optim_rss$par - coef(fit_alien_lm) ## very close to what is produced by lm() :-) !!! res_optim_rss$value -anova(fit_alien_lm)[2, "Sum Sq"] ## very close to what is produced by lm() :-) !!!
r .emo("info")
Alien$fitted_values <- fitted.values(fit_alien_lm) Alien$residuals <- residuals(fit_alien_lm) Alien$residual_var <- sum(Alien$residuals^2) / nrow(Alien) ## need biased estimate here!! Alien$density <- dnorm(x = Alien$size, mean = Alien$fitted_values, sd = sqrt(Alien$residual_var)) Alien[1, ] ## we look in details at the first row, but you can look at all of them
r .emo("nerd")
r .emo("proof")
compute_logLik <- function(vector.param, formula, data) { residual_var <- vector.param[1] ## var for the normal distrib if (residual_var < 0) return(NA) ## negative variance not accepted useful.data <- model.frame(formula = formula, data = data) ## ditch useless data and flag response X <- model.matrix(object = formula, data = useful.data) ## create the design matrix fitted_values <- X %*% matrix(vector.param[-1]) ## compute the fitted values given the coef response <- model.response(useful.data) ## extract the response sum(dnorm(response, mean = fitted_values, sd = sqrt(residual_var), log = TRUE)) ## compute logLik }
We can test that the function works by computing the logLik at parameter values estimated with lm()
:
compute_logLik(vector.param = c(sum(residuals(fit_alien_lm)^2)/nrow(Alien), coef(fit_alien_lm)), formula = size ~ humans_eaten, data = Alien) logLik(fit_alien_lm)[1] ## value for reference
r .emo("nerd")
r .emo("proof")
We can now feed this function to an optimisation procedure that will find parameter values maximising the output:
res_optim_ML <- optim(par = c("residvar" = 10, "intercept" = 0, "slope" = 1), fn = compute_logLik, formula = size ~ humans_eaten, data = Alien, control = list(fnscale = -1)) ## to indicate to maximise and not minimise
Let's look at the output:
```{css csssmall, echo=FALSE, chache=FALSE} .prettyprint.lang-small { background-color: transparent; line-height: 13px; font-size: 12px; font: }
```r res_optim_ML
r .emo("nerd")
fit_test_lm <- lm(height ~ weight*sex + cigarettes + drink, data = UK) fit_test_homemade_OLS <- optim(par = rep(0, 8), fn = compute_rss, formula = height ~ weight*sex + cigarettes + drink, data = UK, method = "BFGS") ## better optim method (default is getting lost...) fit_test_homemade_ML <- optim(par = c(10, rep(0, 8)), fn = compute_logLik, formula = height ~ weight*sex + cigarettes + drink, data = UK, control = list(fnscale = -1), method = "BFGS")
round(rbind(lm = coef(fit_test_lm), OLS = fit_test_homemade_OLS$par, ML = fit_test_homemade_ML$par[-1]), 3)
r .emo("nerd")
Let's now perform a benchmark to compare the computational efficiency of the 3 fitting methods:
UKsmall <- UK[1:500, ] ## we reduce the sample size for saving time microbenchmark::microbenchmark( lm = lm(height ~ weight*sex + cigarettes + drink, data = UKsmall), OLS = optim(par = rep(0, 8), fn = compute_rss, formula = height ~ weight*sex + cigarettes + drink, data = UKsmall, method = "BFGS"), ML = optim(par = c(10, rep(0, 8)), fn = compute_logLik, formula = height ~ weight*sex + cigarettes + drink, data = UKsmall, control = list(fnscale = -1), method = "BFGS"), times = 10)
So our homemade functions work, but they are hugely inefficient!
r .emo("nerd")
This gives best estimates directly (no need for optimisation / trials and errors):
Y <- matrix(Alien$size) X <- model.matrix(fit_alien_lm) solve(t(X) %*% X) %*% t(X) %*% Y ## Tip: solve(x) returns the inverse of the matrix x
... but lm()
does not even do that because it is still (somewhat) inefficient.
Geek note: $\widehat{Y} = X \widehat{\beta} = X (X^TX)^{-1}X^TY=HY$, and $H$ is called the hat matrix (it is a projection matrix and it can be used for various purposes).
lm()
-like fitting using QR decomposition r .emo("nerd")
The goal is simply to decompose the design matrix into a product of two new matrices that present nice properties for doing maths super efficiently.
qr_list <- qr(X) ## same as fit_alien_lm$qr Q <- qr.Q(qr_list, complete = TRUE) ## orthogonal matrix n * n (transpose = inverse) R <- qr.R(qr_list, complete = TRUE) ## upper triangular matrix n * p
all.equal(Q %*% R, X, check.attributes = FALSE) ## TRUE: Q %*% R is equal to X!!
QTY <- t(Q) %*% Y ## same as fit_alien_lm$effects backsolve(R, QTY) ## solve B from RB = QTY
QTY
is a useful term that is also used by R to compute sum of squares efficiently (e.g. in anova()
).
lm()
fitting procedure r .emo("nerd")
First, lm()
processes information for lm.fit()
; it is quite a difficult function to read but here are the important steps:
lm(size ~ humans_eaten, data = Alien) mf <- model.frame(size ~ humans_eaten, data = Alien) Y <- model.response(mf) X <- model.matrix(~ humans_eaten, data = Alien) lm.fit(X, Y)
Then, lm.fit()
calls a C function (Cdqrls) that calls a Fortran function (dqrls, which will compute estimates after QR) calling another Fortran function (dqrdc2, which performs the QR decomposition).
C Dqrdc2 is a *modification* of Linpack's dqrdc ('DQRDC') for R c c dqrdc2 uses householder transformations to compute the qr c factorization of an n by p matrix x. a limited column c pivoting strategy based on the 2-norms of the reduced columns c moves columns with near-zero norm to the right-hand edge of c the x matrix. this strategy means that sequential one c degree-of-freedom effects can be computed in a natural way. c c i am very nervous about modifying linpack code in this way. c if you are a computational linear algebra guru and you really c understand how to solve this problem please feel free to c suggest improvements to this code.
lm()
r .emo("nerd")
names(fit_alien_lm)
We have already seen all non-trivial components!
We have not yet seen:
rank
: the number of columns that are linearly independent in the design matrix.assign
: it comes from attr(X, "assign")
; it indicates which parameters belong to each covariate.df.residuals
: nrow(Alien) - rank.xlevels
: here, empty.call
: the clean function call to lm()
.terms
: an object of class terms; i.e. the formula with many attributes (?terms.object
).model
: it comes from model.frame()
; it gives the data used for the fit.r .emo("goal")
stats::lm()
& spaMM::fitme()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.