In courtiol/LM2GLMM: Data science for biologists: Generalized linear modelling with R

library(LM2GLMM)
options(width = 110)
knitr::opts_chunk$set(error = TRUE)

The Linear Model: LM

[Back to main menu](./Title.html#2)

You should learn during this session `r .emo("info")`

how to write a linear model (using 3 different notations)
how to simulate the data according to simple linear models
some key vocabulary (e.g. response variable, fitted values, residuals...)
how to understand the meaning of parameters by investigating the design matrix
how to manipulate design matrices via formula and contrasts
how to compute predictions

Definition and notations

Disclaimer `r .emo("warn")`

There is (unfortunately) no such thing as a standard vocabulary to discuss LM.

The terms I chose to use are not used by all; even simple and widespread terms such as predictor variables and residuals can mean different things for different people.

Statisticians perhaps do not need words as much as we do because they can rely on their understanding of precise mathematical definitions.

As we are not statisticians, for the sake of clear communication, I will try to stick to quite precise terms which I am going to define in my own way.

Keep this in mind if you are looking at other content online or in books.

The important thing is for you to grasp what these words represent, more than the word themselves.

What is a Linear Model (LM)? `r .emo("info")`

What is a statistical model?

A statistical model represents, often in considerably idealized form, the data-generating process. (wikipedia)

What is a linear model?

A data-generating process produced by a linear function: the data are generated from a set of terms by multiplying each term by a constant (a model parameter) and summing them.

Mathematical notation of LM: simple notation `r .emo("info")`

$y_i = \beta_0 + \beta_1 \times x_{1,i} + \beta_2 \times x_{2,i} + \dots + \beta_{p} \times x_{p,i} + \epsilon_i$

$y_i$ = the response variable / dependent variable / observations to explain
$\beta_j$ = the model parameters / regression coefficients
$x_{j,i}$ = regressors / constants derived from the predictor variables (aka predictors for short)
$\epsilon_i$ = the errors / residual errors

This is the notation that biologists should use in their paper/thesis to describe their model.

Note: the regressors are sometimes directly given by the columns from your dataset (i.e. the predictors), but not always (more on that later).

Mathematical notation of LM: matrix notation `r .emo("info")`

$Y = X \beta + \epsilon$

$$ \begin{bmatrix} y_1 \ y_2 \ y_3 \ \dots \ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,p} \ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,p} \ 1 & x_{3,1} & x_{3,2} & \dots & x_{3,p} \ \dots & \dots & \dots & \dots & \dots \ 1 & x_{n,1} & x_{n,2} & \dots & x_{n,p} \end{bmatrix} \begin{bmatrix} \beta_0 \ \beta_1 \ \beta_2 \ \dots \ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \ \epsilon_2 \ \epsilon_3 \ \dots \ \epsilon_n \end{bmatrix} $$

$Y$ = the vector of observations
$X$ = a matrix called the design (or model) matrix
$\beta$ = the vector of model parameters
$\epsilon$ = the vector of errors

This is the notation for statisticians and computer scientists.

Aliens generated by goddess Emerald Lion `r .emo("alien")`

Emerald Lion informed us that she grows aliens of a certain size as follows:

$\texttt{size}i = 50 + 1.5 \times \texttt{humans_eaten}{i} + \epsilon_i$
$\epsilon_i \sim \mathcal{N}(0, \sigma^2 = 5)$

We can thus generate the size of 6 aliens that have eaten between 1 and 3 humans following EL recipe:

set.seed(123L)
Alien <- data.frame(humans_eaten = rep(1:3, each = 2), intercept = 50, slope = 1.5)
Alien$error <- rnorm(n = 6, mean = 0, sd = sqrt(5))
Alien$size <- Alien$intercept + Alien$slope*Alien$humans_eaten + Alien$error

Note: we could also write the last 2 lines as the following line:

# Alien$size <- rnorm(n = 6, mean = 50 + 1.5*Alien$humans_eaten, sd = sqrt(5))

The study of the Aliens `r .emo("practice")`

This is the data we simulated:

Alien

Can you recognise:

the model parameters?
the regressor (here = predictor)?
the errors?
the response variable?

In practice, model parameters are unknown `r .emo("practice")`

So we will have to estimate them (that is the whole point of statistical modelling...)

Fitting the model to the data means estimating the values of the model parameters that maximise the probability of the data (more on that in the next sessions).

(fit_alien_lm <- lm(size ~ humans_eaten, data = Alien))

Note: we did not get $50$ and $1.5$, and only talking to Emerald Lion can get you such population-level values.

R notation for fitting LM fits `r .emo("info")`

lm(var_obs ~ var_a + var_b + ... + var_p, data = somedata)

lm() the function fitting the linear model
somedata the data used for the fitting procedure
var_obs ~ var_a + var_b + ... + var_p the model formula
var_obs = the name (not quoted) of the column with the observations for the response variable
var_a, var_b, ..., var_p = the names (not quoted) of the columns with the observations for the predictor (= independent or explanatory variables)

This is the notation that is good for talking to R and to R users, but it does not describe the assumed data generating process.

Mathematical notation of LM fits `r .emo("info")`

Simple notation

$\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} \times x_{1,i} + \hat{\beta_2} \times x_{2,i} + \dots + \hat{\beta_p} \times x_{p,i}$

$\hat{y_i}$ = the fitted values
$\hat{\beta_j}$ = the (model parameter / regression coefficient) estimates
$x_{j,i}$ = regressors / constants derived from the predictors
$y_i - \hat{y_i} = \varepsilon_i$ = the residuals (i.e. the estimates for the errors / residual errors)

Matrix notation `r .emo("info")`

$\widehat{Y} = X \widehat{\beta}$

$Y = X \widehat{\beta} + \varepsilon = \widehat{Y} + \varepsilon$

$\varepsilon = \widehat{\epsilon} = Y - \widehat{Y}$

Model parameters & model parameters estimates

Model parameters `r .emo("info")`

one dimension: $p \times 1$ vector
fixed non-observable quantities

They mediate the relationship between variables and the data generated: they convert each column from the design matrix into the elements of the response variable.

The meaning of the parameters thus always depends on the design matrix (more on that soon!)

Model parameters estimates `r .emo("practice")`

These are the estimations of the model parameters produced by the fit of the model:

Alien$intercept_est <- coef(fit_alien_lm)[["(Intercept)"]]
Alien$slope_est <- coef(fit_alien_lm)[["humans_eaten"]]
Alien

Response variable, mean response values & fitted values

The response variable `r .emo("info")`

It is the set of observations produced by the data-generating process

In LM: a single continuous variable

Characteristics:

one dimension: $n \times 1$
continuous
no other restriction

The response variable `r .emo("info")`

Alien ## it is the column "size"

### Simple notation * $y_1 = 50 + 1.5 \times `r Alien$humans_eaten[1]` + `r Alien$error[1]`$ * $y_2 = 50 + 1.5 \times `r Alien$humans_eaten[2]` + `r Alien$error[2]`$ * $y_3 = 50 + 1.5 \times `r Alien$humans_eaten[3]` + `r Alien$error[3]`$ * $y_4 = 50 + 1.5 \times `r Alien$humans_eaten[4]` + `r Alien$error[4]`$ * $y_5 = 50 + 1.5 \times `r Alien$humans_eaten[5]` + `r Alien$error[5]`$ * $y_6 = 50 + 1.5 \times `r Alien$humans_eaten[6]` + `r Alien$error[6]`$ ### Matrix notation $$ Y = \begin{bmatrix} 1 & `r Alien$humans_eaten[1]` \\ 1 & `r Alien$humans_eaten[2]` \\ 1 & `r Alien$humans_eaten[3]` \\ 1 & `r Alien$humans_eaten[4]` \\ 1 & `r Alien$humans_eaten[5]` \\ 1 & `r Alien$humans_eaten[6]` \end{bmatrix} \begin{bmatrix} 50 \\ 1.5 \end{bmatrix} + \begin{bmatrix} `r Alien$error[1]`\\ `r Alien$error[2]`\\ `r Alien$error[3]`\\ `r Alien$error[4]`\\ `r Alien$error[5]`\\ `r Alien$error[6]` \end{bmatrix} $$

Mean response values `r .emo("info")`

The mean response values are here the mean sizes of an infinite number of aliens eating the given amounts of humans.

### Simple notation * $\overline{y_1} = 50 + 1.5 \times `r Alien$humans_eaten[1]`$ * $\overline{y_2} = 50 + 1.5 \times `r Alien$humans_eaten[2]`$ * $\overline{y_3} = 50 + 1.5 \times `r Alien$humans_eaten[3]`$ * $\overline{y_4} = 50 + 1.5 \times `r Alien$humans_eaten[4]`$ * $\overline{y_5} = 50 + 1.5 \times `r Alien$humans_eaten[5]`$ * $\overline{y_6} = 50 + 1.5 \times `r Alien$humans_eaten[6]`$ ### Matrix notation $$ \overline{Y} = \begin{bmatrix} 1 & `r Alien$humans_eaten[1]` \\ 1 & `r Alien$humans_eaten[2]` \\ 1 & `r Alien$humans_eaten[3]` \\ 1 & `r Alien$humans_eaten[4]` \\ 1 & `r Alien$humans_eaten[5]` \\ 1 & `r Alien$humans_eaten[6]` \end{bmatrix} \begin{bmatrix} 50 \\ 1.5 \end{bmatrix} $$

Note: it is also the response minus the errors.

Mean response values `r .emo("info")`

They can easily be computed in R using vectors:

Alien$size_mean <- Alien$intercept + Alien$slope * Alien$humans_eaten
# or # Alien$size_mean <- Alien$size - Alien$error
Alien

The fitted values `r .emo("practice")`

These are the estimations of the mean response values produced by the fit of the model:

Alien$size_fitted <- fitted.values(fit_alien_lm)

Alien

The fitted values `r .emo("practice")`

They can easily be computed in R using vectors too:

size_fitted2 <- Alien$intercept_est + Alien$slope_est * Alien$humans_eaten

all.equal(size_fitted2, Alien$size_fitted)

The fitted values `r .emo("practice")`

They can also be computed using matrices (also true for mean response values):

size_fitted3 <- model.matrix(fit_alien_lm) %*% coef(fit_alien_lm)

all.equal(as.numeric(size_fitted3), Alien$size_fitted)

Note : R uses design matrices extensively (internally)!

Visualizing LM concepts `r .emo("info")`

mod_intercept <- coef(fit_alien_lm)[["(Intercept)"]]
mod_slope     <- coef(fit_alien_lm)[["humans_eaten"]]

#Data for mean and fitted lines
line_data <- data.frame(humans_eaten = rep(c(0, 3), times = 2),
                        var = rep(c("Mean response\nvalue", "Fitted value"), each = 2))
line_data$size <- ifelse(line_data$var == "Mean response\nvalue",
                         50 + line_data$humans_eaten*1.5,
                         mod_intercept + line_data$humans_eaten*mod_slope)

#Line and text data
label_data <- data.frame(label = c("Observed value",
                                   "Fitted value",
                                   "Mean response\nvalue"),
                         x = 3.5, xend = 3.05,
                         y = c(59.5, 56.5, 52.5),
                         yend = c(Alien[nrow(Alien), "size"] + 0.2,
                                  Alien[nrow(Alien), "size_fitted"],
                                  Alien[nrow(Alien), "size_mean"]),
                         curvature = c(0.15, -0.15, -0.3))

#Palette
my_palette <- c("#f06543", "#ddd92a", "#7371fc")


library(ggplot2)
ggplot(data = Alien) +
  geom_point(aes(x = humans_eaten, y = size, colour = "Observed value"),
             shape = 4, size = 1, stroke = 2) +
  geom_segment(aes(x = 0, xend = humans_eaten, y = size_mean, yend = size_mean),
               lty = 2) +
  geom_segment(aes(x = humans_eaten, xend = humans_eaten, y = size_mean, yend = 45),
               lty = 2) +
  geom_line(data = line_data, aes(x = humans_eaten, y = size, colour = var),
            size = 1) +
  geom_point(aes(x = humans_eaten, y = size_mean, colour = "Mean response\nvalue"),
             size = 2) +
  geom_point(aes(x = humans_eaten, y = size_fitted, colour = "Fitted value"),
             size = 2) +
  ### ADD CURVED ARROWS THAT POINT TO IMPORTANT VALUES
  geom_curve(data = label_data,
             aes(x = x, xend = xend, y = y, yend = yend, colour = label),
             curvature = 0.15, arrow = arrow(length = unit(0.3, units = "cm"))) +
  ### ADD TEXT TO LABELS
  geom_text(data = label_data,
            aes(x = x, y = y, label = label, colour = label),
            hjust = 0, vjust = 0.75) +
  scale_colour_manual(values = my_palette) +
  scale_x_continuous(limits = c(0, 4.25),
                     breaks = 0:3,
                     labels = c(0, 1, 2, 3),
                     expand = c(0, 0)) +
  scale_y_continuous(limits = c(45, 60),
                     breaks = c(45, mod_intercept, 50, 51.5, 53.0, 54.5, 60),
                     labels = c("45.0",
                                paste0("Intercept estimate --> ", signif(mod_intercept, 3)),
                                "Intercept -->  50.0",
                                "51.5", "53.0", "54.5", "60.0"),
                     expand = c(0, 0)) +
  labs(x = "Humans eaten", y = "Size") +
  theme_classic() +
  theme(legend.position = "none",
        axis.title.y = element_text(vjust = -25),
        plot.background = element_blank(),
        panel.background = element_blank(),
        axis.title = element_text(size = 20, colour = "black"),
        axis.text = element_text(size = 12, colour = "black"),
        axis.ticks = element_line(colour = "black"),
        plot.margin = margin(t = 10, r = 10, b = 10, l = 10))

The design matrix

The design matrix `r .emo("info")`

It is made of the regressors (= adequate representations of the predictors)

You cannot interpret model parameters correctly without understanding the design matrix of a model!

Characteristics:

multi-dimensional: $n \times p$ with $n > p$
deduced from the predictors
known and measured without error (more on that later)
columns must be linearly independent (more on that later)

The design matrix `r .emo("info")`

It is not estimated and is thus the same for both the true data-generating process and the inferred one:

$Y = \color{red}{X} \beta + \epsilon$

$\widehat{Y} = \color{red}{X} \widehat{\beta}$

Design matrix `r .emo("practice")`

The model matrix is automatically produced when fitting a model:

model.matrix(fit_alien_lm)

lm() always estimate one model parameter per column of the design matrix:

names(coef(fit_alien_lm))

Design matrix = formula + the data (+ contrasts) `r .emo("practice")`

To see the design matrix without fitting the model, you can also use formula in model.matrix():

model.matrix(object = ~ humans_eaten, data = Alien)

The term ~ humans_eaten is a formula despite not having a response variable on the left.

The formula tells model.matrix() (direclty or via lm()) how to create a design matrix from the dataset.

Turning a biological problem into a linear model is nothing more that figuring out which design matrix you need and thus which formula to write when calling lm()!

A design matrix without intercept `r .emo("practice")`

model.matrix(object = ~ 0 + humans_eaten, data = Alien)
lm(size ~ 0 + humans_eaten, data = Alien) ## you can also use -1

Can you think of what such fit would look like if plotted?

A design matrix with function calls `r .emo("practice")`

(fit_log_humans_within <- lm(size ~ log(humans_eaten), data = Alien))

You can also apply such transformations in your data frame instead:

Alien_temp <- Alien
Alien_temp$log_humans_eaten <- log(Alien_temp$humans_eaten)
(fit_log_humans_without <- lm(size ~ log_humans_eaten, data = Alien_temp))

A design matrix with function calls `r .emo("info")`

Note: the coefficients did not change because the content of the 2 model matrices is the same:

model.matrix(fit_log_humans_within)

model.matrix(fit_log_humans_without)

A design matrix with polynomials `r .emo("practice")`

(fit_poly_alien <- lm(size ~ poly(humans_eaten, 2), data = Alien))

(fit_poly_raw_alien <- lm(size ~ poly(humans_eaten, 2, raw = TRUE), data = Alien))

A design matrix with polynomials `r .emo("info")`

Note: the coefficients did change because the content of the 2 model matrices is not the same:

model.matrix(fit_poly_alien)

model.matrix(fit_poly_raw_alien)

A design matrix with polynomials `r .emo("info")`

Note: alternative parametrizations do not necessarily impact the fitted values as is the case for the 2 parametrization for polynomials.

data.frame(fitted_poly = fitted.values(fit_poly_alien),
           fitted_poly_raw = fitted.values(fit_poly_raw_alien))

In case you are curious: the default polynomial formulation for the design matrix produces columns in the design matrix which are orthogonal (no correlation), which presents better numerical properties for linear algebra, and which makes tests on them more informative. E.g. poly1 provides information on whether considering a linear relationship is useful or not, irrespective of if there is a quadratic signal or not. Yet, the interpretation of parameter values is easier when polynomials are expressed under their raw form.

(Galactic) Interlude `r .emo("alien")`

Goddess Emerald Lion informed us that Aliens have colonised different exoplanets since their creation and that their age depends on when they arrived on these planets, on their sex (the sex ZZ ages much faster), and on how many times they visited a black hole (due to the time dilatation happening there):

set.seed(123)
Alien2 <- data.frame(planet = factor(c(rep("Chambr", 2), rep("Riplyx", 2), rep("Wickor", 2))),
                     sex = factor(rep(c("Z", "ZZ"), times = 3)),
                     bh_trips = 6:1)
Alien2$age <- 0.5*Alien2$bh_trips
Alien2$age[Alien2$planet == "Chambr"] <- Alien2$age[Alien2$planet == "Chambr"] + 100 
Alien2$age[Alien2$planet == "Riplyx"] <- Alien2$age[Alien2$planet == "Riplyx"] + 1000
Alien2$age[Alien2$planet == "Wickor"] <- Alien2$age[Alien2$planet == "Wickor"] + 10000
Alien2$age[Alien2$sex == "Z"] <- Alien2$age[Alien2$sex == "Z"] + -20
Alien2$age[Alien2$sex == "ZZ"] <- Alien2$age[Alien2$sex == "ZZ"] + 20
Alien2$age <- Alien2$age + rnorm(nrow(Alien2), mean = 0, sd = 4)
Alien2

A design matrix with a factor `r .emo("info")`

model.matrix(object = ~ planet, data = Alien2)

This parametrization of factors such as planet follows the so-called matrix of contrasts called "treatment".

It is the easiest one to interpret and the one used by default in R (but not in all software).

A design matrix with a factor `r .emo("practice")`

(fit_alien_lm2 <- lm(age ~ planet, data = Alien2))

Can you reproduce the fitted values in the following table?

D <- model.matrix(object = ~ planet, data = Alien2) ## same as D <- model.matrix(fit_alien_lm2)
cbind(D, data.frame(fitted = D %*% coef(fit_alien_lm2))) ## same as fitted = fitted.values(fit_alien_lm2)

Interpreting the treatment contrasts `r .emo("info")`

model.matrix(object = ~ planet, data = Alien2)

The intercept represents the mean response value for the baseline (here, reference level = "Chambr").
planetRiplyx represents the mean response value for "Riplyx" minus the mean of the baseline.
planetWickor represents the mean response value for "Wickor" minus the mean of the baseline.

Note: in base R the default level of reference is always the first level of a factor. Which level comes first can be changed, but by default it is the first level when ranked by alphabetical order given the locale (mind that forcats::as_factor() behave differently than as.factor(); the forcats function follows order in which levels appear).

Interpreting the treatment contrasts `r .emo("info")`

Alien2$planet <- factor(Alien2$planet, levels = c("Riplyx", "Chambr", "Wickor")) #Change reference to 'Riplyx'
(fit_alien_lm_revel <- lm(age ~ planet, data = Alien2)) #Different regression coefficients

Regression coefficients are different, but fitted values are unchanged!

D <- model.matrix(object = ~ planet, data = Alien2)
cbind(D, data.frame(fitted = D %*% coef(fit_alien_lm2)))

#Relevel to make sure following code still works
Alien2$planet <- factor(Alien2$planet, levels = c("Chambr", "Riplyx", "Wickor"))

There are alternative parametrizations! `r .emo("warn")`

model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))

Alternative parametrizations of the contrasts can ease convergence in GLM(M) (for continuous variable) and can allow for testing specific hypotheses or fit particular experimental design (for qualitative variables).

It will not change the predictions, the likelihood, the AIC, ...; only the parameter values and their interpretation!

The example of "sum contrasts" `r .emo("nerd")`

model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))

$$ \frac{(\text{Int}+\text{plt1})+(\text{Int}+\text{plt2})+(\text{Int}-\text{plt1}-\text{plt2})}{3}=\text{Int} $$

The intercept thus represents the mean of the mean responses across all groups.

The example of "sum contrasts" `r .emo("nerd")`

model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))

$$ (\text{Int} + \text{plt1}) - \frac{(\text{Int}+\text{plt1})+(\text{Int}+\text{plt2})+(\text{Int}-\text{plt1}-\text{plt2})}{3}=\text{plt1} $$

The intercept thus represents the mean of the mean responses of all groups.
planet1 represents the mean response of the first group minus the intercept.

The example of "sum contrasts" `r .emo("nerd")`

model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))

$$ (\text{Int} + \text{plt2}) - \frac{(\text{Int}+\text{plt1})+(\text{Int}+\text{plt2})+(\text{Int}-\text{plt1}-\text{plt2})}{3}=\text{plt2} $$

The intercept thus represents the mean of the mean responses of all groups.
planet1 represents the mean response of the first group minus the intercept.
planet2 represents the mean response of the second group minus the intercept.

Note: not much used by R users (because not the default) but useful when there is no clear control.

Example of estimates across contrasts `r .emo("nerd")`

lm(age ~ planet, data = Alien2, contrasts = list(planet = "contr.treatment"))
lm(age ~ planet, data = Alien2, contrasts = list(planet = "contr.sum"))

A design matrix with > 1 factors `r .emo("practice")`

model.matrix(object = ~ planet + sex, data = Alien2)

Now the intercept corresponds to observations that have the first level for all factors.

A design matrix with > 1 factors `r .emo("practice")`

lm(age ~ planet + sex, data = Alien2)

A design matrix with interactions `r .emo("practice")`

model.matrix(object = ~ planet + sex + planet:sex, data = Alien2)

Interactions allow for the consideration of the effect of one predictor dependent on the value taken by other(s). If interactions are not considered, effects are assumed to be independent.

A design matrix with interactions `r .emo("practice")`

model.matrix(object = ~ planet*sex, data = Alien2)

The * is a shortcut to + & :

A design matrix with interactions `r .emo("practice")`

Interaction between a quantitative and qualitative predictor:

model.matrix(object = ~ planet * bh_trips, data = Alien2)

A design matrix with interactions `r .emo("practice")`

lm(age ~ planet * bh_trips, data = Alien2)

How do you interpret each parameter estimate?

Errors and residuals

The errors `r .emo("info")`

Characteristics:

one dimension: $n \times 1$ vector
random variable
Gaussian (normally distributed), independent (no autocorrelation) and identically distributed (homoscedastic)

$\epsilon \sim \mathcal{N}(0, \sigma^2 I)$

With the covariance matrix $\sigma^2 I$ being the following $n \times n$ matrix:

$$ \begin{bmatrix} \sigma^2 & 0 & \dots & 0 \ 0 & \sigma^2 & \dots & 0 \ \vdots & \vdots & \ddots & \vdots \ 0 & 0 & \dots & \sigma^2 \ \end{bmatrix} $$

The residuals `r .emo("practice")`

These are the estimations of the errors produced during the fit of the model:

Alien$residuals <- residuals(fit_alien_lm)
Alien

cor(Alien$residuals, Alien$error)

mean(Alien$residuals) ## residuals have a mean of 0 in the sample, while error have a mean 0 in the population

The residuals `r .emo("practice")`

mod_intercept <- coef(fit_alien_lm)[["(Intercept)"]]
mod_slope     <- coef(fit_alien_lm)[["humans_eaten"]]

#Data for mean and fitted lines
line_data <- data.frame(humans_eaten = rep(c(0, 3), times = 2),
                        var = rep(c("Mean response\nvalue", "Fitted value"), each = 2))
line_data$size <- ifelse(line_data$var == "Mean response\nvalue",
                         50 + line_data$humans_eaten*1.5,
                         mod_intercept + line_data$humans_eaten*mod_slope)

#Line and text data
label_data <- data.frame(label = c("Observed value",
                                   "Fitted value",
                                   "Mean response\nvalue",
                                   "Error",
                                   "Residual"),
                         x = c(3.5, 3.5, 3.5, 1.7, 2.3),
                         xend = c(3.05, 3.05, 3.05, 1.9, 2.1),
                         y = c(59.5, 56.5, 52.5, Alien[3, "size"] + 1, Alien[3, "size"] + 1),
                         yend = c(Alien[nrow(Alien), "size"] + 0.2,
                                  Alien[nrow(Alien), "size_fitted"],
                                  Alien[nrow(Alien), "size_mean"],
                                  Alien[3, "size"] - 1,
                                  Alien[3, "size"] - 1))

#Palette
my_palette <- c("#3a405a", "#f06543", "#ddd92a", "#7371fc", "#20BF55")

ggplot(data = Alien) +
  ### ADD POINTS SHOWING OBSERVED VALUES
  geom_point(aes(x = humans_eaten, y = size, colour = "Observed value"),
             shape = 4, size = 1, stroke = 2) +
  geom_segment(aes(x = 0, xend = humans_eaten, y = size_mean, yend = size_mean),
               lty = 2) +
  geom_segment(aes(x = humans_eaten, xend = humans_eaten, y = size_mean, yend = 45),
               lty = 2) +
  geom_line(data = line_data, aes(x = humans_eaten, y = size, colour = var),
            size = 1) +
  geom_point(aes(x = humans_eaten, y = size_mean, colour = "Mean response\nvalue"),
             size = 2) +
  geom_point(aes(x = humans_eaten, y = size_fitted, colour = "Fitted value"),
             size = 2) +
  ### ADD SEGMENTS TO SHOW RESIDUALS V. ERRORS
  #### ERROR SEGMENTS
  geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten - 0.05,
                                      y = size - 0.02, yend = size - 0.02, colour = "Error"),
               size = 1) +
  geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten - 0.05,
                                      y = size_mean + 0.02, yend = size_mean + 0.02, colour = "Error"),
               size = 1) +
  geom_segment(data = Alien[3, ], aes(x = humans_eaten - 0.05, xend = humans_eaten - 0.05, y = size, yend = size_mean,
                                      colour = "Error"),
               lty = 1, size = 1) +
  #### RESIDUAL SEGMENTS
    geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten + 0.05,
                                      y = size - 0.02, yend = size - 0.02, colour = "Residual"),
               size = 1) +
  geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten + 0.05,
                                      y = size_fitted + 0.02, yend = size_fitted + 0.02, colour = "Residual"),
               size = 1) +
  geom_segment(data = Alien[3, ], aes(x = humans_eaten + 0.05, xend = humans_eaten + 0.05, y = size, yend = size_fitted, 
                                      colour = "Residual"),
               lty = 1, size = 1) +
  ### ADD CURVED ARROWS THAT POINT TO IMPORTANT VALUES
  geom_curve(data = label_data[1:3, ],
             aes(x = x, xend = xend, y = y, yend = yend, colour = label),
             curvature = 0.15, arrow = arrow(length = unit(0.3, units = "cm"))) +
  geom_curve(data = label_data[4, ],
             aes(x = x, xend = xend, y = y, yend = yend, colour = label),
             curvature = 0.15, arrow = arrow(length = unit(0.3, units = "cm"))) +
  geom_curve(data = label_data[5, ],
             aes(x = x, xend = xend, y = y, yend = yend, colour = label),
             curvature = -0.15, arrow = arrow(length = unit(0.3, units = "cm"))) +
  ### ADD TEXT TO LABELS
  geom_text(data = label_data[1:3, ],
            aes(x = x, y = y, label = label, colour = label),
            hjust = 0, vjust = 0.75) +
  geom_text(data = label_data[4:5, ],
            aes(x = x, y = y + 0.2, label = label, colour = label),
            hjust = 0.5, vjust = 0.5) +
  scale_colour_manual(values = my_palette) +
  scale_x_continuous(limits = c(0, 4.25),
                     breaks = 0:3,
                     labels = c(0, 1, 2, 3),
                     expand = c(0, 0)) +
  scale_y_continuous(limits = c(45, 60),
                     breaks = c(45, mod_intercept, 50, 51.5, 53.0, 54.5, 60),
                     labels = c("45.0",
                                paste0("Intercept estimate --> ", signif(mod_intercept, 3)),
                                "Intercept -->  50.0",
                                "51.5", "53.0", "54.5", "60.0"),
                     expand = c(0, 0)) +
  labs(x = "Humans eaten", y = "Size") +
  theme_classic() +
  theme(legend.position = "none",
        axis.title.y = element_text(vjust = -25),
        plot.background = element_blank(),
        panel.background = element_blank(),
        axis.title = element_text(size = 20, colour = "black"),
        axis.text = element_text(size = 12, colour = "black"),
        axis.ticks = element_line(colour = "black"),
        plot.margin = margin(t = 10, r = 10, b = 10, l = 10))

Variances for the error & residuals `r .emo("info")`

The variance of the error ($\sigma^2$) cannot be directly observed...

... but it can be estimated from the variance of the residual.

Interlude: estimation of variances `r .emo("recap")`

Do you remember that?

Biased estimation of the variance for $x$

$$\frac{\displaystyle\sum^n_{i = 1}(x_i - \bar{x})^2}{n}$$

Unbiased estimation of the variance for $x$

$$\frac{\displaystyle\sum^n_{i = 1}(x_i - \bar{x})^2}{n\color{red}{-1}}$$

Interlude: estimation of variances `r .emo("proof")`

set.seed(1)
bias_vars_list <- replicate(100000, {
  observations <- rnorm(n = 20, mean = 0, sd = sqrt(5))
  numerator <- sum((observations - mean(observations))^2)
  c(bias_var1    = (numerator/20) - 5,
    bias_var2 = (numerator/19) - 5)
}, simplify = FALSE)
bias_vars_df <- data.frame(do.call("rbind", bias_vars_list)) ## to turn the output as a clean dataframe

apply(bias_vars_df, 2, mean)

The bias corresponds to an underestimation stemming from looking at the variation around the mean estimate: the variation around the mean estimate is always smaller that the variation around the true mean in the population. This is because by construction the estimate of the mean is the value generally the closest to all sampled points.

The bias can be removed by considering $n - p$ in the denominator to correct for the bias when you estimate $p$ parameters ($p = 1$, when there is only one mean).

Estimating the variances for the error `r .emo("info")` `r .emo("practice")`

The variance of the error ($\sigma^2$) cannot be directly observed...

... but it can be estimated from the variance of the residuals ($\varepsilon_i$):

$$\hat{\sigma}^2 = \frac{\displaystyle\sum^n_{i = 1}\varepsilon_i^2}{n\color{red}{-p}}$$

sum(residuals(fit_alien_lm)^2) / (nrow(Alien) - ncol(model.matrix(fit_alien_lm)))
summary(fit_alien_lm)$sigma^2 ## same thing, simpler :-)

Note: this term is (somewhat confusingly) referred to as an unbiased estimate of the residual variance.

Predictions

Predicting the mean response using $X$ `r .emo("info")`

We here want to predict the average of large groups of future observations.

Example: prediction for the average size of Aliens having eaten 1, 1.5, 2, 2.5 and 3 humans.

(new_design <- cbind(Intercept = 1, humans_eaten = seq(1, 3, by = 0.5)))
new_design %*% coef(fit_alien_lm)

Same using `predict()` `r .emo("practice")`

The function predict() saves you the trouble of building a design matrix by hand:

data_for_predict <- data.frame(humans_eaten = seq(1, 3, by = 0.5))
predict(fit_alien_lm, newdata = data_for_predict)

Note: the fact that you must feed the function predict() with predictors and not regressors is particularly useful when you have qualitative variable (see exercises).