library(LM2GLMM) options(width = 110) knitr::opts_chunk$set(error = TRUE)
r .emo("info")
r .emo("warn")
There is (unfortunately) no such thing as a standard vocabulary to discuss LM.
The terms I chose to use are not used by all; even simple and widespread terms such as predictor variables and residuals can mean different things for different people.
Statisticians perhaps do not need words as much as we do because they can rely on their understanding of precise mathematical definitions.
As we are not statisticians, for the sake of clear communication, I will try to stick to quite precise terms which I am going to define in my own way.
Keep this in mind if you are looking at other content online or in books.
The important thing is for you to grasp what these words represent, more than the word themselves.
r .emo("info")
A statistical model represents, often in considerably idealized form, the data-generating process. (wikipedia)
A data-generating process produced by a linear function: the data are generated from a set of terms by multiplying each term by a constant (a model parameter) and summing them.
r .emo("info")
This is the notation that biologists should use in their paper/thesis to describe their model.
Note: the regressors are sometimes directly given by the columns from your dataset (i.e. the predictors), but not always (more on that later).
r .emo("info")
$$ \begin{bmatrix} y_1 \ y_2 \ y_3 \ \dots \ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \dots & x_{1,p} \ 1 & x_{2,1} & x_{2,2} & \dots & x_{2,p} \ 1 & x_{3,1} & x_{3,2} & \dots & x_{3,p} \ \dots & \dots & \dots & \dots & \dots \ 1 & x_{n,1} & x_{n,2} & \dots & x_{n,p} \end{bmatrix} \begin{bmatrix} \beta_0 \ \beta_1 \ \beta_2 \ \dots \ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \ \epsilon_2 \ \epsilon_3 \ \dots \ \epsilon_n \end{bmatrix} $$
This is the notation for statisticians and computer scientists.
r .emo("alien")
Emerald Lion informed us that she grows aliens of a certain size as follows:
We can thus generate the size of 6 aliens that have eaten between 1 and 3 humans following EL recipe:
set.seed(123L) Alien <- data.frame(humans_eaten = rep(1:3, each = 2), intercept = 50, slope = 1.5) Alien$error <- rnorm(n = 6, mean = 0, sd = sqrt(5)) Alien$size <- Alien$intercept + Alien$slope*Alien$humans_eaten + Alien$error
Note: we could also write the last 2 lines as the following line:
# Alien$size <- rnorm(n = 6, mean = 50 + 1.5*Alien$humans_eaten, sd = sqrt(5))
r .emo("practice")
This is the data we simulated:
Alien
Can you recognise:
r .emo("practice")
So we will have to estimate them (that is the whole point of statistical modelling...)
Fitting the model to the data means estimating the values of the model parameters that maximise the probability of the data (more on that in the next sessions).
(fit_alien_lm <- lm(size ~ humans_eaten, data = Alien))
Note: we did not get $50$ and $1.5$, and only talking to Emerald Lion can get you such population-level values.
r .emo("info")
lm(var_obs ~ var_a + var_b + ... + var_p, data = somedata)
lm()
the function fitting the linear modelsomedata
the data used for the fitting procedurevar_obs ~ var_a + var_b + ... + var_p
the model formulavar_obs
= the name (not quoted) of the column with the observations for the response variablevar_a, var_b, ..., var_p
= the names (not quoted) of the columns with the observations for the predictor (= independent or explanatory variables)This is the notation that is good for talking to R and to R users, but it does not describe the assumed data generating process.
r .emo("info")
r .emo("info")
r .emo("info")
The meaning of the parameters thus always depends on the design matrix (more on that soon!)
r .emo("practice")
These are the estimations of the model parameters produced by the fit of the model:
Alien$intercept_est <- coef(fit_alien_lm)[["(Intercept)"]] Alien$slope_est <- coef(fit_alien_lm)[["humans_eaten"]] Alien
r .emo("info")
Characteristics:
r .emo("info")
Alien ## it is the column "size"
r .emo("info")
The mean response values are here the mean sizes of an infinite number of aliens eating the given amounts of humans.
Note: it is also the response minus the errors.
r .emo("info")
They can easily be computed in R using vectors:
Alien$size_mean <- Alien$intercept + Alien$slope * Alien$humans_eaten # or # Alien$size_mean <- Alien$size - Alien$error Alien
r .emo("practice")
These are the estimations of the mean response values produced by the fit of the model:
Alien$size_fitted <- fitted.values(fit_alien_lm)
Alien
r .emo("practice")
They can easily be computed in R using vectors too:
size_fitted2 <- Alien$intercept_est + Alien$slope_est * Alien$humans_eaten
all.equal(size_fitted2, Alien$size_fitted)
r .emo("practice")
They can also be computed using matrices (also true for mean response values):
size_fitted3 <- model.matrix(fit_alien_lm) %*% coef(fit_alien_lm)
all.equal(as.numeric(size_fitted3), Alien$size_fitted)
Note : R uses design matrices extensively (internally)!
r .emo("info")
mod_intercept <- coef(fit_alien_lm)[["(Intercept)"]] mod_slope <- coef(fit_alien_lm)[["humans_eaten"]] #Data for mean and fitted lines line_data <- data.frame(humans_eaten = rep(c(0, 3), times = 2), var = rep(c("Mean response\nvalue", "Fitted value"), each = 2)) line_data$size <- ifelse(line_data$var == "Mean response\nvalue", 50 + line_data$humans_eaten*1.5, mod_intercept + line_data$humans_eaten*mod_slope) #Line and text data label_data <- data.frame(label = c("Observed value", "Fitted value", "Mean response\nvalue"), x = 3.5, xend = 3.05, y = c(59.5, 56.5, 52.5), yend = c(Alien[nrow(Alien), "size"] + 0.2, Alien[nrow(Alien), "size_fitted"], Alien[nrow(Alien), "size_mean"]), curvature = c(0.15, -0.15, -0.3)) #Palette my_palette <- c("#f06543", "#ddd92a", "#7371fc") library(ggplot2) ggplot(data = Alien) + geom_point(aes(x = humans_eaten, y = size, colour = "Observed value"), shape = 4, size = 1, stroke = 2) + geom_segment(aes(x = 0, xend = humans_eaten, y = size_mean, yend = size_mean), lty = 2) + geom_segment(aes(x = humans_eaten, xend = humans_eaten, y = size_mean, yend = 45), lty = 2) + geom_line(data = line_data, aes(x = humans_eaten, y = size, colour = var), size = 1) + geom_point(aes(x = humans_eaten, y = size_mean, colour = "Mean response\nvalue"), size = 2) + geom_point(aes(x = humans_eaten, y = size_fitted, colour = "Fitted value"), size = 2) + ### ADD CURVED ARROWS THAT POINT TO IMPORTANT VALUES geom_curve(data = label_data, aes(x = x, xend = xend, y = y, yend = yend, colour = label), curvature = 0.15, arrow = arrow(length = unit(0.3, units = "cm"))) + ### ADD TEXT TO LABELS geom_text(data = label_data, aes(x = x, y = y, label = label, colour = label), hjust = 0, vjust = 0.75) + scale_colour_manual(values = my_palette) + scale_x_continuous(limits = c(0, 4.25), breaks = 0:3, labels = c(0, 1, 2, 3), expand = c(0, 0)) + scale_y_continuous(limits = c(45, 60), breaks = c(45, mod_intercept, 50, 51.5, 53.0, 54.5, 60), labels = c("45.0", paste0("Intercept estimate --> ", signif(mod_intercept, 3)), "Intercept --> 50.0", "51.5", "53.0", "54.5", "60.0"), expand = c(0, 0)) + labs(x = "Humans eaten", y = "Size") + theme_classic() + theme(legend.position = "none", axis.title.y = element_text(vjust = -25), plot.background = element_blank(), panel.background = element_blank(), axis.title = element_text(size = 20, colour = "black"), axis.text = element_text(size = 12, colour = "black"), axis.ticks = element_line(colour = "black"), plot.margin = margin(t = 10, r = 10, b = 10, l = 10))
r .emo("info")
Characteristics:
r .emo("info")
It is not estimated and is thus the same for both the true data-generating process and the inferred one:
r .emo("practice")
The model matrix is automatically produced when fitting a model:
model.matrix(fit_alien_lm)
lm()
always estimate one model parameter per column of the design matrix:
names(coef(fit_alien_lm))
r .emo("practice")
To see the design matrix without fitting the model, you can also use formula in model.matrix()
:
model.matrix(object = ~ humans_eaten, data = Alien)
The term ~ humans_eaten
is a formula despite not having a response variable on the left.
The formula tells model.matrix()
(direclty or via lm()
) how to create a design matrix from the dataset.
Turning a biological problem into a linear model is nothing more that figuring out which design matrix you need and thus which formula to write when calling lm()
!
r .emo("practice")
model.matrix(object = ~ 0 + humans_eaten, data = Alien) lm(size ~ 0 + humans_eaten, data = Alien) ## you can also use -1
Can you think of what such fit would look like if plotted?
r .emo("practice")
(fit_log_humans_within <- lm(size ~ log(humans_eaten), data = Alien))
You can also apply such transformations in your data frame instead:
Alien_temp <- Alien Alien_temp$log_humans_eaten <- log(Alien_temp$humans_eaten) (fit_log_humans_without <- lm(size ~ log_humans_eaten, data = Alien_temp))
r .emo("info")
Note: the coefficients did not change because the content of the 2 model matrices is the same:
model.matrix(fit_log_humans_within)
model.matrix(fit_log_humans_without)
r .emo("practice")
(fit_poly_alien <- lm(size ~ poly(humans_eaten, 2), data = Alien))
(fit_poly_raw_alien <- lm(size ~ poly(humans_eaten, 2, raw = TRUE), data = Alien))
r .emo("info")
Note: the coefficients did change because the content of the 2 model matrices is not the same:
model.matrix(fit_poly_alien)
model.matrix(fit_poly_raw_alien)
r .emo("info")
Note: alternative parametrizations do not necessarily impact the fitted values as is the case for the 2 parametrization for polynomials.
data.frame(fitted_poly = fitted.values(fit_poly_alien), fitted_poly_raw = fitted.values(fit_poly_raw_alien))
In case you are curious: the default polynomial formulation for the design matrix produces columns in the design matrix which are orthogonal (no correlation), which presents better numerical properties for linear algebra, and which makes tests on them more informative. E.g. poly1
provides information on whether considering a linear relationship is useful or not, irrespective of if there is a quadratic signal or not. Yet, the interpretation of parameter values is easier when polynomials are expressed under their raw form.
r .emo("alien")
Goddess Emerald Lion informed us that Aliens have colonised different exoplanets since their creation and that their age depends on when they arrived on these planets, on their sex (the sex ZZ ages much faster), and on how many times they visited a black hole (due to the time dilatation happening there):
set.seed(123) Alien2 <- data.frame(planet = factor(c(rep("Chambr", 2), rep("Riplyx", 2), rep("Wickor", 2))), sex = factor(rep(c("Z", "ZZ"), times = 3)), bh_trips = 6:1) Alien2$age <- 0.5*Alien2$bh_trips Alien2$age[Alien2$planet == "Chambr"] <- Alien2$age[Alien2$planet == "Chambr"] + 100 Alien2$age[Alien2$planet == "Riplyx"] <- Alien2$age[Alien2$planet == "Riplyx"] + 1000 Alien2$age[Alien2$planet == "Wickor"] <- Alien2$age[Alien2$planet == "Wickor"] + 10000 Alien2$age[Alien2$sex == "Z"] <- Alien2$age[Alien2$sex == "Z"] + -20 Alien2$age[Alien2$sex == "ZZ"] <- Alien2$age[Alien2$sex == "ZZ"] + 20 Alien2$age <- Alien2$age + rnorm(nrow(Alien2), mean = 0, sd = 4) Alien2
r .emo("info")
model.matrix(object = ~ planet, data = Alien2)
This parametrization of factors such as planet
follows the so-called matrix of contrasts called "treatment".
It is the easiest one to interpret and the one used by default in R (but not in all software).
r .emo("practice")
(fit_alien_lm2 <- lm(age ~ planet, data = Alien2))
Can you reproduce the fitted
values in the following table?
D <- model.matrix(object = ~ planet, data = Alien2) ## same as D <- model.matrix(fit_alien_lm2) cbind(D, data.frame(fitted = D %*% coef(fit_alien_lm2))) ## same as fitted = fitted.values(fit_alien_lm2)
r .emo("info")
model.matrix(object = ~ planet, data = Alien2)
"Chambr"
).planetRiplyx
represents the mean response value for "Riplyx"
minus the mean of the baseline.planetWickor
represents the mean response value for "Wickor"
minus the mean of the baseline.Note: in base R the default level of reference is always the first level of a factor. Which level comes first can be changed, but by default it is the first level when ranked by alphabetical order given the locale (mind that forcats::as_factor()
behave differently than as.factor()
; the forcats function follows order in which levels appear).
r .emo("info")
Alien2$planet <- factor(Alien2$planet, levels = c("Riplyx", "Chambr", "Wickor")) #Change reference to 'Riplyx' (fit_alien_lm_revel <- lm(age ~ planet, data = Alien2)) #Different regression coefficients
Regression coefficients are different, but fitted values are unchanged!
D <- model.matrix(object = ~ planet, data = Alien2) cbind(D, data.frame(fitted = D %*% coef(fit_alien_lm2)))
#Relevel to make sure following code still works Alien2$planet <- factor(Alien2$planet, levels = c("Chambr", "Riplyx", "Wickor"))
r .emo("warn")
model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))
Alternative parametrizations of the contrasts can ease convergence in GLM(M) (for continuous variable) and can allow for testing specific hypotheses or fit particular experimental design (for qualitative variables).
It will not change the predictions, the likelihood, the AIC, ...; only the parameter values and their interpretation!
r .emo("nerd")
model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))
$$ \frac{(\text{Int}+\text{plt1})+(\text{Int}+\text{plt2})+(\text{Int}-\text{plt1}-\text{plt2})}{3}=\text{Int} $$
r .emo("nerd")
model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))
$$ (\text{Int} + \text{plt1}) - \frac{(\text{Int}+\text{plt1})+(\text{Int}+\text{plt2})+(\text{Int}-\text{plt1}-\text{plt2})}{3}=\text{plt1} $$
planet1
represents the mean response of the first group minus the intercept.r .emo("nerd")
model.matrix(object = ~ planet, data = Alien2, contrasts.arg = list(planet = "contr.sum"))
$$ (\text{Int} + \text{plt2}) - \frac{(\text{Int}+\text{plt1})+(\text{Int}+\text{plt2})+(\text{Int}-\text{plt1}-\text{plt2})}{3}=\text{plt2} $$
planet1
represents the mean response of the first group minus the intercept.planet2
represents the mean response of the second group minus the intercept.Note: not much used by R users (because not the default) but useful when there is no clear control.
r .emo("nerd")
lm(age ~ planet, data = Alien2, contrasts = list(planet = "contr.treatment")) lm(age ~ planet, data = Alien2, contrasts = list(planet = "contr.sum"))
r .emo("practice")
model.matrix(object = ~ planet + sex, data = Alien2)
Now the intercept corresponds to observations that have the first level for all factors.
r .emo("practice")
lm(age ~ planet + sex, data = Alien2)
r .emo("practice")
model.matrix(object = ~ planet + sex + planet:sex, data = Alien2)
Interactions allow for the consideration of the effect of one predictor dependent on the value taken by other(s). If interactions are not considered, effects are assumed to be independent.
r .emo("practice")
model.matrix(object = ~ planet*sex, data = Alien2)
The *
is a shortcut to +
& :
r .emo("practice")
Interaction between a quantitative and qualitative predictor:
model.matrix(object = ~ planet * bh_trips, data = Alien2)
r .emo("practice")
lm(age ~ planet * bh_trips, data = Alien2)
How do you interpret each parameter estimate?
r .emo("info")
Characteristics:
With the covariance matrix $\sigma^2 I$ being the following $n \times n$ matrix:
$$ \begin{bmatrix} \sigma^2 & 0 & \dots & 0 \ 0 & \sigma^2 & \dots & 0 \ \vdots & \vdots & \ddots & \vdots \ 0 & 0 & \dots & \sigma^2 \ \end{bmatrix} $$
r .emo("practice")
These are the estimations of the errors produced during the fit of the model:
Alien$residuals <- residuals(fit_alien_lm) Alien
cor(Alien$residuals, Alien$error)
mean(Alien$residuals) ## residuals have a mean of 0 in the sample, while error have a mean 0 in the population
r .emo("practice")
mod_intercept <- coef(fit_alien_lm)[["(Intercept)"]] mod_slope <- coef(fit_alien_lm)[["humans_eaten"]] #Data for mean and fitted lines line_data <- data.frame(humans_eaten = rep(c(0, 3), times = 2), var = rep(c("Mean response\nvalue", "Fitted value"), each = 2)) line_data$size <- ifelse(line_data$var == "Mean response\nvalue", 50 + line_data$humans_eaten*1.5, mod_intercept + line_data$humans_eaten*mod_slope) #Line and text data label_data <- data.frame(label = c("Observed value", "Fitted value", "Mean response\nvalue", "Error", "Residual"), x = c(3.5, 3.5, 3.5, 1.7, 2.3), xend = c(3.05, 3.05, 3.05, 1.9, 2.1), y = c(59.5, 56.5, 52.5, Alien[3, "size"] + 1, Alien[3, "size"] + 1), yend = c(Alien[nrow(Alien), "size"] + 0.2, Alien[nrow(Alien), "size_fitted"], Alien[nrow(Alien), "size_mean"], Alien[3, "size"] - 1, Alien[3, "size"] - 1)) #Palette my_palette <- c("#3a405a", "#f06543", "#ddd92a", "#7371fc", "#20BF55") ggplot(data = Alien) + ### ADD POINTS SHOWING OBSERVED VALUES geom_point(aes(x = humans_eaten, y = size, colour = "Observed value"), shape = 4, size = 1, stroke = 2) + geom_segment(aes(x = 0, xend = humans_eaten, y = size_mean, yend = size_mean), lty = 2) + geom_segment(aes(x = humans_eaten, xend = humans_eaten, y = size_mean, yend = 45), lty = 2) + geom_line(data = line_data, aes(x = humans_eaten, y = size, colour = var), size = 1) + geom_point(aes(x = humans_eaten, y = size_mean, colour = "Mean response\nvalue"), size = 2) + geom_point(aes(x = humans_eaten, y = size_fitted, colour = "Fitted value"), size = 2) + ### ADD SEGMENTS TO SHOW RESIDUALS V. ERRORS #### ERROR SEGMENTS geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten - 0.05, y = size - 0.02, yend = size - 0.02, colour = "Error"), size = 1) + geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten - 0.05, y = size_mean + 0.02, yend = size_mean + 0.02, colour = "Error"), size = 1) + geom_segment(data = Alien[3, ], aes(x = humans_eaten - 0.05, xend = humans_eaten - 0.05, y = size, yend = size_mean, colour = "Error"), lty = 1, size = 1) + #### RESIDUAL SEGMENTS geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten + 0.05, y = size - 0.02, yend = size - 0.02, colour = "Residual"), size = 1) + geom_segment(data = Alien[3, ], aes(xend = humans_eaten, x = humans_eaten + 0.05, y = size_fitted + 0.02, yend = size_fitted + 0.02, colour = "Residual"), size = 1) + geom_segment(data = Alien[3, ], aes(x = humans_eaten + 0.05, xend = humans_eaten + 0.05, y = size, yend = size_fitted, colour = "Residual"), lty = 1, size = 1) + ### ADD CURVED ARROWS THAT POINT TO IMPORTANT VALUES geom_curve(data = label_data[1:3, ], aes(x = x, xend = xend, y = y, yend = yend, colour = label), curvature = 0.15, arrow = arrow(length = unit(0.3, units = "cm"))) + geom_curve(data = label_data[4, ], aes(x = x, xend = xend, y = y, yend = yend, colour = label), curvature = 0.15, arrow = arrow(length = unit(0.3, units = "cm"))) + geom_curve(data = label_data[5, ], aes(x = x, xend = xend, y = y, yend = yend, colour = label), curvature = -0.15, arrow = arrow(length = unit(0.3, units = "cm"))) + ### ADD TEXT TO LABELS geom_text(data = label_data[1:3, ], aes(x = x, y = y, label = label, colour = label), hjust = 0, vjust = 0.75) + geom_text(data = label_data[4:5, ], aes(x = x, y = y + 0.2, label = label, colour = label), hjust = 0.5, vjust = 0.5) + scale_colour_manual(values = my_palette) + scale_x_continuous(limits = c(0, 4.25), breaks = 0:3, labels = c(0, 1, 2, 3), expand = c(0, 0)) + scale_y_continuous(limits = c(45, 60), breaks = c(45, mod_intercept, 50, 51.5, 53.0, 54.5, 60), labels = c("45.0", paste0("Intercept estimate --> ", signif(mod_intercept, 3)), "Intercept --> 50.0", "51.5", "53.0", "54.5", "60.0"), expand = c(0, 0)) + labs(x = "Humans eaten", y = "Size") + theme_classic() + theme(legend.position = "none", axis.title.y = element_text(vjust = -25), plot.background = element_blank(), panel.background = element_blank(), axis.title = element_text(size = 20, colour = "black"), axis.text = element_text(size = 12, colour = "black"), axis.ticks = element_line(colour = "black"), plot.margin = margin(t = 10, r = 10, b = 10, l = 10))
r .emo("info")
The variance of the error ($\sigma^2$) cannot be directly observed...
... but it can be estimated from the variance of the residual.
r .emo("recap")
Do you remember that?
r .emo("proof")
set.seed(1) bias_vars_list <- replicate(100000, { observations <- rnorm(n = 20, mean = 0, sd = sqrt(5)) numerator <- sum((observations - mean(observations))^2) c(bias_var1 = (numerator/20) - 5, bias_var2 = (numerator/19) - 5) }, simplify = FALSE) bias_vars_df <- data.frame(do.call("rbind", bias_vars_list)) ## to turn the output as a clean dataframe
apply(bias_vars_df, 2, mean)
The bias corresponds to an underestimation stemming from looking at the variation around the mean estimate: the variation around the mean estimate is always smaller that the variation around the true mean in the population. This is because by construction the estimate of the mean is the value generally the closest to all sampled points.
The bias can be removed by considering $n - p$ in the denominator to correct for the bias when you estimate $p$ parameters ($p = 1$, when there is only one mean).
r .emo("info")
r .emo("practice")
The variance of the error ($\sigma^2$) cannot be directly observed...
... but it can be estimated from the variance of the residuals ($\varepsilon_i$):
sum(residuals(fit_alien_lm)^2) / (nrow(Alien) - ncol(model.matrix(fit_alien_lm))) summary(fit_alien_lm)$sigma^2 ## same thing, simpler :-)
Note: this term is (somewhat confusingly) referred to as an unbiased estimate of the residual variance.
r .emo("info")
We here want to predict the average of large groups of future observations.
Example: prediction for the average size of Aliens having eaten 1, 1.5, 2, 2.5 and 3 humans.
(new_design <- cbind(Intercept = 1, humans_eaten = seq(1, 3, by = 0.5))) new_design %*% coef(fit_alien_lm)
predict()
r .emo("practice")
The function predict()
saves you the trouble of building a design matrix by hand:
data_for_predict <- data.frame(humans_eaten = seq(1, 3, by = 0.5)) predict(fit_alien_lm, newdata = data_for_predict)
Note: the fact that you must feed the function predict()
with predictors and not regressors is particularly useful when you have qualitative variable (see exercises).
r .emo("practice")
When several predictors are involved, predictions can be computed differently...
Option 1: you can set non-focal variable to 0 for quantitative variables (or to the control group for qualitative variables):
fit_apb <- lm(age ~ planet * bh_trips, data = Alien2) (newd0 <- data.frame(planet = c("Chambr", "Riplyx", "Wickor"), bh_trips = 0)) (pred0 <- predict(fit_apb, newdata = newd0))
Note: these predictions are only meaningful when fixing estimates to 0 represent something meaningful.
r .emo("practice")
When several predictors are involved, predictions can be computed differently...
Option 2: you can set non-focal variable to their mean (for quantitative variables) or mode (for qualitative variables) within the set of all sampled observations:
fit_apb <- lm(age ~ planet * bh_trips, data = Alien2) (newd_partial <- data.frame(planet = c("Chambr", "Riplyx", "Wickor"), bh_trips = mean(Alien2$bh_trips))) (pred_partial <- predict(fit_apb, newdata = newd_partial))
r .emo("practice")
When several predictors are involved, predictions can be computed differently...
Option 2: you can set non-focal variable to their mean (for quantitative variables) or mode (for qualitative variables) within the set of all sampled observations:
Predictions computed this way can directly be obtained with pdep_effects()
in {spaMM}:
fit_apb_spaMM <- spaMM::fitme(age ~ planet * bh_trips, data = Alien2) spaMM::pdep_effects(fit_apb_spaMM, focal_var = "planet") ## here no CI because N = P (overfitted model)
Note: those predictions are sometimes called partial-dependence effects.
r .emo("practice")
When several predictors are involved, predictions can be computed differently...
Option 3: you can set non-focal variable to their mean (for quantitative variables) or mode (for qualitative variables) within the set of observations matching each given value for the focal predictor:
fit_apb <- lm(age ~ planet * bh_trips, data = Alien2) (newd_conditional <- data.frame(planet = c("Chambr", "Riplyx", "Wickor"), bh_trips = tapply(Alien2$bh_trips, Alien2$planet, mean))) (pred_conditional <- predict(fit_apb, newdata = newd_conditional))
r .emo("info")
When several predictors are involved, predictions can be computed differently...
And this can make a big difference:
cbind(data.frame(planet = c("Chambr", "Riplyx", "Wickor"), method = c("0", "partial dep", "conditional")), pred0, pred_partial, pred_conditional)
Notes:
r .emo("info")
Sometimes we do not want to predict the mean response (i.e. the average response value for a group of observations) but single future observations.
For this, you need to predict the mean response and then draw a realisation of the error.
Example: prediction for the size of 3 aliens having eaten 1.5 humans:
(pred_mean_response <- predict(fit_alien_lm, newdata = data.frame(humans_eaten = 1.5))) set.seed(123) rnorm(n = 3, mean = pred_mean_response, sd = summary(fit_alien_lm)$sigma)
r .emo("goal")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.