library(LM2GLMM)
library(spaMM)
library(doSNOW)
options(width = 110)
spaMM.options(nb_cores = 4L)
#knitr::knit_hooks$set(webgl = hook_webgl) ## for rgl
knitr::opts_chunk$set(cache = TRUE, cache.path = "./cache_knitr/LM_test_intervals/", fig.path = "./fig_knitr/LM_test_intervals/", error = TRUE,
                      fig.align = "center", fig.asp = 1, dev.args = list(pointsize = 8))

The Linear Model: LM


[Back to main menu](./Title.html#2)

You will learn in this session r .emo("info")

Model fits for this session

Aliens fitted with stats::lm() r .emo("alien")

set.seed(123L)
Alien <- simulate_Aliens()
fit_lm_AL1 <- lm(size ~ humans_eaten, data = Alien)
fit_lm_AL1

Aliens fitted with spaMM::fitme() r .emo("alien")

ML fit:

fit_spaMM_AL1_ML   <- fitme(size ~ humans_eaten, data = Alien)
fit_spaMM_AL1_ML

Note: ML fits are needed whenever we use the likelihood to study the regression coefficients.

Aliens fitted with spaMM::fitme() r .emo("alien")

REML fit:

fit_spaMM_AL1_REML <- fitme(size ~ humans_eaten, data = Alien, method = "REML")
fit_spaMM_AL1_REML

Note: REML fits are needed whenever we use estimate(s) of variance(s).

Britons fitted with stats::lm() r .emo("alien")

fit_lm_UK1 <- lm(height ~ drink + sex*weight, data = UK)
fit_lm_UK1


How do you interpret each parameter estimate?

RECAP: Interpreting regression coefficients r .emo("recap")

#Plot interaction effect of sex:weight at reference level of drink
plot_data <- expand.grid(drink = c("2 to 3 times a week"),
                         sex = c("Girl", "Boy"),
                         weight = c(0, 50))
plot_data$height <- predict(fit_lm_UK1, newdata = plot_data)
library(ggplot2)
ggplot(plot_data) +
  geom_line(aes(x  = weight, y = height, colour = sex), size = 1) +
  geom_text(data = plot_data[plot_data$weight == 50, ],
            aes(x = weight - 4, y = height * c(0.95, 1.01), colour = sex, label = c("\U2640", "\U2642")),
            size = 10) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(limits = c(110, 160),
                     breaks = c(fit_lm_UK1$coefficients["(Intercept)"],
                                fit_lm_UK1$coefficients["(Intercept)"] + fit_lm_UK1$coefficients["sexGirl"],
                                120, 130, 140, 150, 160),
                     labels = c(format(as.numeric(fit_lm_UK1$coefficients["(Intercept)"]), digits = 5),
                                format(as.numeric(fit_lm_UK1$coefficients["(Intercept)"] + fit_lm_UK1$coefficients["sexGirl"]), digits = 5),
                                "120", "130", "140", "150", "160")) +
  scale_colour_manual(values = c("#41658A", "#70A37F")) +
  labs(x = "Weight (kg)",
       y = "Height (cm)",
       title = "Relationship between weight and height for\nthe reference level of 'drink'") +
  theme_classic() +
  theme(legend.position = "none",
        axis.text = element_text(colour = "black"),
        axis.ticks = element_line(colour = "black"))

Britons fitted with spaMM::fitme() r .emo("alien")

ML fit:

fit_spaMM_UK1_ML <- fitme(height ~ drink + sex*weight, data = UK)
fit_spaMM_UK1_ML

Britons fitted with spaMM::fitme() r .emo("alien")

REML fit:

fit_spaMM_UK1_REML <- fitme(height ~ drink + sex*weight, data = UK, method = "REML")
fit_spaMM_UK1_REML

Regression coefficients

Point estimates for regression coefficients r .emo("practice")

Fitting a model leads to point estimates:

coef(fit_lm_AL1)  ## use fixef() for spaMM fits
coef(fit_lm_UK1)  ## use fixef() for spaMM fits

Point estimates are random variables. Thus, to know if these numbers differ from specific values in the population, their distributions must be characterised and accounted for during tests.
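As a quick sketch (reusing simulate_Aliens() from {LM2GLMM}, as elsewhere in this session), refitting the same model on newly simulated datasets yields different estimates each time:

set.seed(123)
t(replicate(3, coef(lm(size ~ humans_eaten, data = simulate_Aliens()))))  ## 3 new datasets -> 3 different pairs of estimates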

Variances of estimates r .emo("info")

Each regression parameter is associated with an estimation variance, and since the parameters are all estimated jointly, there are also covariances between them:

vcov(fit_lm_AL1) ## = summary(fit_lm_AL1)$sigma^2 * solve(t(model.matrix(fit_lm_AL1)) %*% model.matrix(fit_lm_AL1))

The variances are on the diagonal, so you can get the standard errors of the regression coefficients as follows:

sqrt(diag(vcov(fit_lm_AL1)))

These are sometimes referred to as conditional standard errors because they do not account for uncertainty stemming from other sources (which becomes relevant in LMM & GLMM).

Testing regression coefficients with {stats} r .emo("practice")

Regression coefficient estimates are tested using Student's t-test (t value = Estimate / Std. Error):

summary(fit_lm_AL1)$coefficients  ## you can also use simply summary(fit_lm_AL1)!!!
summary(fit_lm_UK1)$coefficients


Note: the null hypothesis for these tests is that the true value of the coefficient is null (= 0).

Testing regression coefficients with {spaMM} r .emo("nerd")

{spaMM} does not compute the p-values for these t-tests, but you can compute them yourself:

ttable <- summary(fit_spaMM_AL1_REML, verbose = FALSE)$beta_table  ## verbose is just for the slide display
cbind(ttable, p = 2*pt(abs(ttable[, "t-value"]),
                       df = nrow(model.matrix(fit_spaMM_AL1_REML)) - ncol(model.matrix(fit_spaMM_AL1_REML)),## n-p
                       lower.tail = FALSE))

Notes:

t-tests and linear models r .emo("info") r .emo("proof")

You can actually replace a simple two-sample t-test with an LM:

t.test(extra ~ group, data = sleep, var.equal = TRUE)
summary(lm(extra ~ group, data = sleep))$coef

Note: the p-value obtained is the same r .emo("party")

Why test LM using t-tests? r .emo("info") r .emo("proof")

This is because, in the case of LM, regression coefficients are normally distributed...

set.seed(123)
new_beta_estimates <- t(replicate(1000, coef(update(fit_lm_AL1, data = simulate_Aliens(N = 100)))))
qqnorm(new_beta_estimates[, 2])  ## draw a qqplot
qqline(new_beta_estimates[, 2], col = "red")  ## draw line of normality

Note: update() refits a model with some of its input changed; you could instead call lm() again, provided you supply all of its arguments.

Why test LM using t-tests? r .emo("info")

This is because, in the case of LM, regression coefficients are normally distributed. However, when the variance of an estimate is itself estimated (i.e. not known), one must use the Student's t distribution (not the Gaussian) to model the distribution of the scaled estimates without bias:


$\widehat{\beta}\sim \mathcal{N}(\beta, \sigma_{\widehat{\beta}})$

$z = \frac{\widehat{\beta} - \beta}{\sigma_{\widehat{\beta}}}\sim \mathcal{N}(0, 1)$

$t = \frac{\widehat{\beta} - \beta}{\widehat{\sigma_{\widehat{\beta}}}}\sim t_{n - p}$ ($t_{n-p}$ = Student's t distribution with $n - p$ degrees of freedom)
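As a small sketch of why this matters when the degrees of freedom are few, the Student's t quantile is wider than the Gaussian one:

qnorm(0.975)  ## Gaussian quantile
qt(0.975, df = fit_lm_AL1$df.residual)  ## Student's t quantile with n - p degrees of freedom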

Testing a null but non-nil hypothesis r .emo("practice")

The t-test compares each regression coefficient to 0, but you may want to test something more specific.

First, you need to reproduce the previous t-tests by hand:

summary(fit_lm_AL1)$coefficients["humans_eaten", ]
estimate <- coef(fit_lm_AL1)["humans_eaten"]
std.error <- sqrt(diag(vcov(fit_lm_AL1))["humans_eaten"])
t.value <- (estimate - 0)/std.error
p.value <- 2*pt(abs(t.value), df = fit_lm_AL1$df.residual, lower.tail = FALSE) ## x2 because the test is two-sided
cbind(estimate, std.error, t.value, p.value)

Testing a null but non-nil hypothesis r .emo("practice")

Then, you simply subtract the hypothesised value in the numerator of the t-value.

E.g. here we test whether the coefficient for humans_eaten differs from 1.5:

estimate <- coef(fit_lm_AL1)["humans_eaten"]
std.error <- sqrt(diag(vcov(fit_lm_AL1))["humans_eaten"])
t.value <- (estimate - 1.5)/std.error
p.value <- 2*pt(abs(t.value), df = fit_lm_AL1$df.residual, lower.tail = FALSE) ## x2 because the test is two-sided
cbind(estimate, std.error, t.value, p.value)

You can alternatively use the package {car} to do that automatically:

car::linearHypothesis(fit_lm_AL1, c("humans_eaten = 1.5"))$"Pr(>F)"[2]

Testing all null but non-nil hypotheses at once: CI r .emo("info")

A more general approach is to identify the entire range of population-level parameter values that would not be considered significantly different from the estimate of interest.

This is what Confidence Intervals (CI) are used for!

There are many methods for building such confidence intervals:


In the case of LM, the method based on the Student's t distribution is best, but we will also look at the alternative methods: they become useful for more complex models, and it is easier to start exploring them in the context of simple models.

Student's t-based confidence intervals r .emo("practice")

This is the method used when calling confint() on an lm() fit:

confint(fit_lm_AL1)

It is easy to compute the same thing by hand:

cbind(coef(fit_lm_AL1) + qt(0.025, fit_lm_AL1$df.residual) * sqrt(diag(vcov(fit_lm_AL1))),
      coef(fit_lm_AL1) + qt(0.975, fit_lm_AL1$df.residual) * sqrt(diag(vcov(fit_lm_AL1))))

Note: qt() extracts the quantiles of the Student's t distribution; the rest unscales the result so that it is expressed on the scale of the response.

Why are there other methods to compute CI? r .emo("info")

A confidence interval with a given nominal level X is considered exact if the probability that the interval contains the true population value $k$ is X for any $k$. We call this probability the coverage.


Example: if you draw 100 datasets from a population, fit 100 models and compute 100 CI95% for a given parameter, you expect 95 of these intervals to include the true parameter value of the population.


Unfortunately, most methods for building confidence intervals are only approximate, i.e. the empirical coverage differs from the given coverage X.


Depending on the type of the statistical model and the data, some methods show a coverage closer to their nominal level than others.

Measuring the coverage of Student's t-based CI r .emo("practice")

You can always simulate data close to your situation to quantify the coverage of a given method:

set.seed(123)
#Remember, we know the true intercept and slope terms generated by `simulate_Aliens()` to be 50 and 1.5
mean(replicate(1000, sum(findInterval(confint(update(fit_lm_AL1, data = simulate_Aliens()))[1, ], 50)) == 1))
mean(replicate(1000, sum(findInterval(confint(update(fit_lm_AL1, data = simulate_Aliens()))[2, ], 1.5)) == 1))


Note:

Confidence intervals by likelihood profiling r .emo("info")

Here the idea is to quantify how much the (log-)likelihood decreases when a parameter value departs from its estimate, and to include in the interval all parameter values for which the (log-)likelihood does not drop below a specific threshold.

Here is how you can obtain the log-likelihood under any parameter values:

logLik(lm(size ~ 0 + offset(50 + 1.5 * humans_eaten), data = Alien))


Note: spaMM::fitme() allows you to fix the variance in addition to the regression parameters, but lm() does not (see later).
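As a sketch, you can nonetheless evaluate the log-likelihood at fully specified parameter values (regression coefficients and residual variance, here the ground-truth values 50, 1.5 and 25 used in this session) by hand with dnorm():

sum(dnorm(Alien$size, mean = 50 + 1.5 * Alien$humans_eaten, sd = sqrt(25), log = TRUE))  ## logLik with all parameters fixed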

Confidence intervals by likelihood profiling r .emo("info")

candidate_slopes <- seq(from = 0, to = 3, by = 0.1)
loglik <- sapply(candidate_slopes, function(slope) logLik(lm(size ~ 1 + offset(slope*humans_eaten), data = Alien)))
plot(loglik ~ candidate_slopes, type = "o")
abline(v = 1.5, col = "green", lwd = 2) ## ground truth (usually unknown)
abline(v = coef(fit_lm_AL1)["humans_eaten"], col = "blue", lwd = 2) ## estimate
abline(h = logLik(fit_lm_AL1)[[1]] - 0.5*qchisq(0.95, df = 1), col = "red", lty = 2, lwd = 2) ## logLik threshold

Confidence intervals by likelihood profiling r .emo("nerd")

You can compute the boundaries of the CI95% more precisely by optimising the following function:

diff_logLik_slope <- function(slope, varname, fit) {
  newdata <- fit$model
  newdata$offset <- slope*newdata[, varname]
  newfit <- lm(size ~ 1 + offset(offset), data = newdata)
  logLik_newfit <- logLik(newfit)[[1]]
  logLik_goal <- logLik(fit)[[1]] - 0.5*qchisq(0.95, df = 1)
  abs(logLik_newfit - logLik_goal)
}
CI_prof_slope <- function(fit, varname) {
  lwr <- optimise(diff_logLik_slope, interval = c(coef(fit)[2] - 1, coef(fit)[2]),
                  varname = varname, fit = fit)
  upr <- optimise(diff_logLik_slope, interval = c(coef(fit)[2], coef(fit)[2] + 1),
                  varname = varname, fit = fit)
  c(lwr = lwr$minimum, upr = upr$minimum)
}
CI_prof_slope(fit = fit_lm_AL1, varname = "humans_eaten")

Confidence intervals by likelihood profiling r .emo("info")



Joint CI by likelihood profiling r .emo("info")

It can also be easily extended to build confidence regions (i.e. a multivariate CI):

candidate_intercepts <- seq(from = 30, to = 70,  by = 0.1)
candidates <- expand.grid(candidate_intercept = candidate_intercepts, 
                          candidate_slope = candidate_slopes)

candidates$logLik <- NA

for (i in 1:nrow(candidates)) { ## could use sapply here too
  fit_temp <- lm(size ~  0 + offset(candidates[i, 1] + candidates[i, 2] * humans_eaten), data = Alien)
  candidates[i, "logLik"] <- logLik(fit_temp)
}

par(las = 1)
with(candidates, contour(candidate_intercepts, candidate_slopes,
  matrix(candidates$logLik, nrow = length(candidate_intercepts)), nlevels = 15,
  ylab = "slope", xlab = "intercept"))
points(fit_spaMM_AL1_ML$fixef["(Intercept)"], fit_spaMM_AL1_ML$fixef["humans_eaten"],
        col = "blue", pch = 4, lwd = 2, cex = 3)
points(50, 1.5, col = "green", pch = 4, lwd = 2, cex = 3)

limit <- c(logLik(fit_lm_AL1) - 0.5*qchisq(0.95, df = 2))
with(candidates, contour(candidate_intercepts, candidate_slopes,
  matrix(candidates$logLik, nrow = length(candidate_intercepts)),
  levels = limit, add = TRUE, col = "red", lwd = 2))

Confidence intervals by parametric bootstrap r .emo("info")

The general aim of bootstrap methods is to approximate the distribution of a statistic, in cases where parameters are not known and where this distribution is not known analytically.

Here, we do that by simulating new data using the fitted model (i.e. considering the parameter estimates as population parameters) and refitting models based on such simulated data.

Thus, we have:


From that, confidence intervals can be computed under the assumption that:

Confidence intervals by parametric bootstrap r .emo("nerd")

For models fitted with stats::lm(), this is a little complicated (even if the package {boot} helps):

CI_boot <- function(fit, param, nboot) {
  newYs <- simulate(fit, nsim = nboot) ## simulate new responses from the fit!!
  res_sim <- sapply(1:nboot, function(i) {
    newdata <- model.frame(fit) ## extract the data
    newdata[, 1] <- newYs[, i] ## replace the response var
    newfit <- update(fit, data = newdata)  ## refit the model
    coef(newfit)[[param]]  ## extract the estimate (replace with fixef() if spaMM fit)
  })
  boot::boot.ci(boot.out = list(R = length(res_sim)),
                t0 = coef(fit)[[param]], ## â
                t  = res_sim,            ## a*
                type = "basic")$basic[, 4:5]  ## more complex types exist!
} ## to use with spaMM fit, just run coef.HLfit <- fixef.HLfit beforehand
CI_boot(fit = fit_lm_AL1, param = "humans_eaten", nboot = 1000)

Note: for more complex models, there are packages that do this for you, but I do not know of one that works with stats::lm() fits (it may exist).
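For what it is worth, here is a sketch of how {boot} could also drive the simulation step itself via its parametric mode (the statistic/ran.gen helpers below are just one possible way to set this up):

boot_out <- boot::boot(data = model.frame(fit_lm_AL1),
                       statistic = function(d) coef(lm(size ~ humans_eaten, data = d))[["humans_eaten"]],
                       R = 1000, sim = "parametric",
                       ran.gen = function(d, fit) {d$size <- simulate(fit)[[1]]; d},  ## simulate new responses from the fit
                       mle = fit_lm_AL1)
boot::boot.ci(boot_out, type = "basic")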

Confidence intervals by parametric bootstrap r .emo("practice")

With {spaMM}, it is much easier:

confint(fit_spaMM_AL1_ML, parm = "humans_eaten", boot_args = list(nsim = 1000))

Comparing the 3 methods of CI construction r .emo("practice")

CI <-   rbind(confint(fit_lm_AL1)["humans_eaten", ],
              CI_prof_slope(fit_lm_AL1, "humans_eaten"),
              CI_boot(fit = fit_lm_AL1, param = "humans_eaten", nboot = 5000))
data.frame(method = c("Student", "Profile", "Parametric bootstrap"),
           lwr = CI[, 1],
           upr = CI[, 2])


Notes:

Testing several estimates at once r .emo("info")

There are two main situations in which you may want to test several estimates at once:

  1. For comparing nested models (e.g. to test the effect of a qualitative variable)
  2. For comparing factor levels


Example:

summary(fit_lm_UK1)$coefficients

Comparing nested models using a F-test r .emo("practice")

You can use anova() to compare pairs of model fits:

fit_lm_UK2 <- lm(height ~ drink, data = fit_lm_UK1$model) ## we use the model frame to keep the same dataset!
anova(fit_lm_UK2, fit_lm_UK1)

Note: here we jointly test the overall effect of sex * weight (i.e. sex, weight and their interaction).

Comparing nested models using a F-test r .emo("info")

The default behaviour of anova() on LM fits is to perform an F-test:

fit_lm_UK2 <- lm(height ~ drink, data = fit_lm_UK1$model) ## we use the model frame to keep the same dataset!
anova(fit_lm_UK2, fit_lm_UK1)

The F-test checks whether the decrease in residual variance achieved by the more complex fit is larger than expected under H0:

RSS_fit1 <- sum(residuals(fit_lm_UK1)^2)
RSS_fit2 <- sum(residuals(fit_lm_UK2)^2)
((RSS_fit1 - RSS_fit2)/(fit_lm_UK1$df.residual - fit_lm_UK2$df.residual)) / (RSS_fit1/fit_lm_UK1$df.residual)

Note: RSS/Res.Df = unbiased estimate of the residual variance
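For completeness, here is a sketch of how to recover the corresponding p-value with pf():

F_stat <- ((RSS_fit1 - RSS_fit2)/(fit_lm_UK1$df.residual - fit_lm_UK2$df.residual)) / (RSS_fit1/fit_lm_UK1$df.residual)
pf(F_stat, df1 = fit_lm_UK2$df.residual - fit_lm_UK1$df.residual, df2 = fit_lm_UK1$df.residual, lower.tail = FALSE)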

Comparing nested models using a LRT r .emo("practice")

It is also possible to use a likelihood ratio test (LRT), similar to what we did when profiling the likelihood:

lmtest::lrtest(fit_lm_UK2, fit_lm_UK1)
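As a sketch, the same LRT can also be reproduced by hand (the $\chi^2$ degrees of freedom equal the difference in the number of regression coefficients):

LRT_stat <- 2 * (logLik(fit_lm_UK1)[[1]] - logLik(fit_lm_UK2)[[1]])
pchisq(LRT_stat, df = length(coef(fit_lm_UK1)) - length(coef(fit_lm_UK2)), lower.tail = FALSE)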

Notes:

Comparing nested models using a LRT r .emo("practice")

It is possible to use anova() to perform an LRT if the model is fitted with {spaMM} (lmtest::lrtest() won't work on such fits):

fit_spaMM_UK2_ML <- fitme(height ~ drink, data = fit_spaMM_UK1_ML$data)
anova(fit_spaMM_UK1_ML, fit_spaMM_UK2_ML)


Notes:

Comparing nested models using a LRT r .emo("practice")

{spaMM} also makes it easy to perform the LRT by parametric bootstrap instead of relying on the $\chi^2$ distribution, which is good if we are far from asymptotic conditions (but it can be slow to run) r .emo("slow"):

LRT(fit_spaMM_UK1_ML, fit_spaMM_UK2_ML, boot.repl = 1000)

Notes:

Comparing nested models to test factors r .emo("practice")

Testing the overall effect of a factor cannot be done with summary() because it only gives you separate test outputs for each factor level. Instead, you should rely on model comparison for this purpose:

fit_lm_UK3 <- lm(height ~ sex * weight, data = fit_lm_UK1$model)
anova(fit_lm_UK3, fit_lm_UK1)


CONCLUSION:

the consumption of alcohol during pregnancy is (statistically) related to the children's height.

F-test vs t-test r .emo("practice")

Testing the effect of humans_eaten can be done with summary() or with model comparison:

summary(fit_lm_AL1)$coefficients["humans_eaten", ]
fit_lm_AL2 <- lm(size ~ 1, data = fit_lm_AL1$model)
anova(fit_lm_AL2, fit_lm_AL1)

Note: when the models differ by a single parameter (as in the case of a quantitative variable), the F-test and the t-test give identical p-values ($t^2 = F$ when the numerator degrees of freedom equal 1).
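You can check this equality directly (a quick sketch):

summary(fit_lm_AL1)$coefficients["humans_eaten", "t value"]^2  ## t^2
anova(fit_lm_AL2, fit_lm_AL1)[2, "F"]                          ## F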

Testing all predictors r .emo("practice")

Instead of doing many model comparisons to test all your predictors one by one, you can use car::Anova() as a shortcut:

car::Anova(fit_lm_UK1)

Anova() performs a so-called Type 2 ANOVA: it tests the effect of each main term assuming the other main terms (but not the interactions) are included, and it tests the effect of each interaction assuming all main terms are included.

Testing all predictors r .emo("practice")

It is better not to use anova() for that, since it performs the comparisons sequentially (i.e. a Type 1 ANOVA, which is quite silly)...

anova(fit_lm_UK1)


Note: sequential comparisons are nonetheless useful to identify multicollinearity issues (more on that later).

Testing against the null model r .emo("practice")

A null model is a model estimating nothing more than an intercept:

fit_lm_UK0 <- lm(height ~ 1, data = fit_lm_UK1$model)
anova(fit_lm_UK0, fit_lm_UK1)


Note: comparing a fit to the fit of a so-called null model should always be performed first, i.e. before testing the effect of predictors individually! r .emo("info")

In the case of LM, this test is also given by the last line of the output produced by summary()!
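Here is a sketch of how to extract that overall F-test from the summary() of an lm() fit:

(Fstat <- summary(fit_lm_UK1)$fstatistic)  ## F value, numerator df, denominator df
pf(Fstat["value"], df1 = Fstat["numdf"], df2 = Fstat["dendf"], lower.tail = FALSE)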

Comparing different levels from a factor r .emo("practice")

After establishing that a qualitative predictor is significant, you may wonder which level is different from which. Yet, comparing many levels leads to an increase in the rate of false positives, so it is best to apply a multiple testing correction. The package {multcomp} can help you do that:

comparing_drink_levels <- multcomp::glht(fit_lm_UK1, linfct = multcomp::mcp(drink = "Tukey"))
summary(comparing_drink_levels)

Comparing different levels from a factor r .emo("practice")

After establishing that a qualitative predictor is significant, you may wonder which level is different from which. Yet, comparing many levels leads to an increase in the rate of false positives, so it is best to apply a multiple testing correction. The package {multcomp} can help you do that:

comparing_drink_levels_spaMM <- multcomp::glht(fit_spaMM_UK1_ML, linfct = multcomp::mcp(drink = "Tukey"),
                                               coef. = fixef.HLfit)  ## works with spaMM fit too!
par(mar = c(4, 20, 2, 2))  ## increase left margin
plot(comparing_drink_levels_spaMM) ## you can plot the outcome!

Comparing different levels from a factor r .emo("practice")

After establishing that a qualitative predictor is significant, you may wonder which level is different from which. Yet, comparing many levels leads to an increase in the rate of false positives, so it is best to apply a multiple testing correction. The package {multcomp} can help you do that.

You can also test specific hypotheses instead of testing all combinations (it increases the power!):

UK$drink2 <- factor(gsub(" ", "_", UK$drink))  ## removes space in levels to avoid issues in multcomp
fit_lm_UK1bis <- lm(height ~ drink2 + sex*weight, data = UK)
summary(multcomp::glht(fit_lm_UK1bis, linfct = multcomp::mcp(drink2 = c("Not_at_all - Once_a_week_or_less == 0",
                                                                        "Not_at_all == 0",  ## compare to intercept as no param for 2_to_3... 
                                                                        "Not_at_all - Most_days == 0"))))

Comparing different levels from a factor r .emo("practice")

To reduce the number of comparisons, you can also carefully choose the reference level of a factor.

UK$drink3 <- relevel(UK$drink, ref = "Not at all")
summary(lm(height ~ drink3 + sex*weight, data = UK))$coefficients


Note: although this is an established practice, it does not correct for multiple testing (which is why the p-values are smaller than on the previous slide).
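If you do want to correct these particular p-values for multiple testing, a simple (if conservative) sketch is to apply a Bonferroni adjustment by hand; this is not equivalent to the single-step adjustment used by {multcomp}, but it illustrates the idea:

coefs <- summary(lm(height ~ drink3 + sex*weight, data = UK))$coefficients
p.adjust(coefs[grepl("^drink3", rownames(coefs)), "Pr(>|t|)"], method = "bonferroni")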

Comparing non nested models with AIC r .emo("practice")

So far all model comparisons we have seen considered nested models. Non-nested models can be compared too, as long as they have been fitted on the exact same response.

For this, we rely on information criteria such as the Akaike Information Criterion:

$\texttt{AIC} = -2 \times (\texttt{logLik} - p)$, with $p$ the number of parameters

Example:

UKselect <- na.omit(UK[, c("height", "drink", "sex", "mother_height", "father_height")])  ## removes NA
AIC(lm(height ~ drink * sex, data = UKselect))
AIC(lm(height ~ mother_height + father_height, data = UKselect))
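As a quick sketch, you can recover the value returned by AIC() from logLik(); for lm() fits, the df attribute of logLik() counts all estimated parameters (including the residual variance), which is the $p$ of the formula above:

ll <- logLik(lm(height ~ drink * sex, data = UKselect))
-2 * (as.numeric(ll) - attr(ll, "df"))  ## same value as AIC() above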

Notes:

Interlude: comparing too many things

Simulate nothingness r .emo("alien")

To study what happens when testing many things, we will first simulate predictors with no effect:

simu_null <- function(n, p) {
  d <- as.data.frame(replicate(p + 1, rnorm(n)))
  colnames(d)[1] <- "y"
  colnames(d)[2:(p + 1)] <- paste0("x", 1:p)
  return(d)
}

set.seed(123)
simu_null(n = 5, p = 3)

Testing nothingness r .emo("info")

p <- 3
set.seed(123)
lm(y ~ ., data = simu_null(n = 5, p = p))
set.seed(123)
(p.values <- car::Anova(lm(y ~ ., data = simu_null(n = 5, p = p)))["Pr(>F)"][1:p, 1])
min(p.values) <= 0.05  ## test if any regression coef is significant

Comparing many estimates increases false $+$ r .emo("info")

set.seed(123)
mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 100, p = 1)))["Pr(>F)"][1, 1]) <= 0.05))

mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 100, p = 3)))["Pr(>F)"][1:3, 1]) <= 0.05))

mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 100, p = 10)))["Pr(>F)"][1:10, 1]) <= 0.05))

mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 1000, p = 50)))["Pr(>F)"][1:50, 1]) <= 0.05))

Why not do backward selection? r .emo("warn")

Some scientists (with poor knowledge of statistics) rely on so-called backward stepwise selection to discard predictors. This is a terrible thing to do if you want to do inference (i.e. test things). Backward stepwise selection is evil, even when you compare the minimal "adequate" fit to a null fit.

Here is why:

backward <- function(p, nsimu = 100) {
  pv.signif.rec <- matrix(rep(NA, 2*nsimu), nrow = 2)
  row.names(pv.signif.rec) <- c("full", "backward")
  for (i in 1:nsimu) {
    d <- simu_null(n = nsimu, p = p)
    mod.full <- lm(y ~ ., data = d)
    mod.null <- lm(y ~ 1, data = d)
    mod.backward <- step(mod.full, direction = "backward", trace = FALSE)
    pv.full <- anova(mod.null, mod.full)$Pr[2]
    pv.backward <- anova(mod.null, mod.backward)$Pr[2]
    pv.signif.rec["full", i] <- ifelse(is.na(pv.full), FALSE, pv.full <= 0.05)
    pv.signif.rec["backward", i] <- ifelse(is.na(pv.backward), FALSE, pv.backward <= 0.05)
    }
  apply(pv.signif.rec, 1, mean)
}

result_backward <- sapply(c(3, 5, 10, 15, 20, 50), function(i) backward(p = i))

Why not do backward selection? r .emo("proof")

plot(result_backward["full", ] ~ c(3, 5, 10, 15, 20, 50),
  type = "b", col = "green", ylim = c(0, 1), xlab = "Initial number of variables",
  ylab = "Proportion of significant models | H0", log = "x")
points(result_backward["backward", ] ~ c(3, 5, 10, 15, 20, 50), type = "b", col = "red")
abline(h = 0.05, lty = 2)
legend("topleft", fill = c("red", "green"), legend = c("step selection (backward)", "full model"), bty = "n")

What shall you do? r .emo("info") r .emo("proof")

mean(replicate(1000, {
  d <- simu_null(n = 1000, p = 50)
  mod <- lm(y ~ ., data = d)
  mod0 <- lm(y ~ 1, data = d)
  anova(mod, mod0)[2, "Pr(>F)"] <= 0.05
}))

Residual variance

Point estimate for the residual variance r .emo("practice")

Remember how to extract the unbiased estimate of the residual variance:

(residvar <- summary(fit_lm_AL1)$sigma^2)
fit_spaMM_AL1_REML$phi


Note: you need to use the REML estimate of the residual variance; otherwise, as we have seen, it is biased.
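A quick check (sketch) of that bias on our Alien fits:

fit_spaMM_AL1_ML$phi    ## ML estimate: smaller, biased downwards (RSS/n)
fit_spaMM_AL1_REML$phi  ## REML estimate: matches summary(fit_lm_AL1)$sigma^2 (RSS/(n - p))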

Testing differences in residual variance r .emo("practice")

Testing whether the residual variance is null makes no sense, but it can make sense to compare the estimate to a given value (here 25, for example). For this, we can use a very general recipe that works for any comparison between nested models:

1- fit a model in which the parameter is constrained

fit_spaMM_AL1_REML_varfixed <- fitme(size ~ humans_eaten, fixed = list(phi = 25), data = Alien, method = "REML")

2- compare the likelihoods of the constrained and unconstrained fits using a likelihood ratio test (LRT):

stats_LRT <- -2 * (logLik(fit_spaMM_AL1_REML_varfixed) - logLik(fit_spaMM_AL1_REML))
pchisq(stats_LRT, df = 1, lower.tail = FALSE)

Note: we did this before automatically using lmtest::lrtest(), but that would not work here.

Testing differences in residual variance r .emo("practice")

For comparing the residual variance of an LM to a given value specifically, there is a better method:

pchisq(sum(residuals(fit_spaMM_AL1_REML)^2)/25,
       df = nrow(fit_spaMM_AL1_REML$data) - length(fixef(fit_spaMM_AL1_REML)),
       lower.tail = FALSE)


This method is better but much less general and thus only applies to LM (not GLM, LMM, GLMM).

Which alternative to choose for testing? r .emo("info")

Sometimes theory helps, but sometimes there is no clear answer.

In such cases, you can use simulations to answer the question!

Statistical tests should meet 2 properties: the rate of false negatives should be as small as possible, and more importantly, the rate of false positives must be close to the nominal level. You can test the latter by checking if the test produces a uniform distribution of p-values under the null hypothesis (it should!):

tests_var_under_H0 <- function(fit, n = nrow(fit$data), residvar_true = 25) {
  d <- simulate_Aliens(N = n, sigma2 = residvar_true)
  fit_phi_est <- update(fit, data = d)
  fit_phi_fixed <- update(fit, fixed = list(phi = residvar_true), data = d)
  stats_LRT <- -2 * (logLik(fit_phi_fixed) - logLik(fit_phi_est))
  c(chisq = pchisq(sum(residuals(fit_phi_est)^2)/residvar_true,
                   df = nrow(fit$data) - length(spaMM::fixef(fit)), lower.tail = FALSE),
    LRT = pchisq(stats_LRT, df = 1, lower.tail = FALSE)[[1]])
}
set.seed(123)
res <- replicate(1000, tests_var_under_H0(fit_spaMM_AL1_REML))

Which alternative to choose for testing? r .emo("info")

plot(ecdf(res["chisq", ]), lwd = 2)
plot(ecdf(res["LRT", ]), add = TRUE, col = "blue", lwd = 2)
abline(0, 1, col = "red", lwd = 2)
legend("topleft", fill = c("black", "blue"), legend = c("Chisq", "LRT"), bty = "n")

Notes:

Distribution of residual variance estimates r .emo("nerd")

In LM, variance estimates follow a Gamma ($\Gamma$) distribution with mean = $\texttt{estimate}$ and var = $2 (\texttt{estimate}^2) / \texttt{df}$.

For $\Gamma$, mean = $\texttt{shape} \times \texttt{scale}$ and var = $\texttt{shape} \times \texttt{scale}^2$ ($\texttt{shape}$ and $\texttt{scale}$ are the two parameters of the $\Gamma$ distribution); thus, a little maths shows that in our context $\texttt{shape} = \texttt{df}/2$ and $\texttt{scale} = 2 \times \texttt{estimate}/\texttt{df}$.

curve(dgamma(x, shape = fit_lm_AL1$df.residual/2, scale = 2*residvar/fit_lm_AL1$df.residual), ## df = n - p
      from = 0, to = 100, xlab = "residual variance", ylab = "probability density", lwd = 3)
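A quick simulation sketch to check these formulas (assuming, as elsewhere in this session, that simulate_Aliens() uses a true residual variance of 25):

set.seed(123)
many_residvar <- replicate(1000, summary(update(fit_lm_AL1, data = simulate_Aliens()))$sigma^2)
c(mean = mean(many_residvar), var = var(many_residvar))  ## empirical
c(mean = 25, var = 2 * 25^2 / fit_lm_AL1$df.residual)    ## theoretical: mean and 2*mean^2/df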

CI for $\widehat{\sigma}^2$ using the $\Gamma$ distribution r .emo("practice")

So, we can rely on the Gamma distribution to build a CI95% for the variance estimate as follows:

CI_residvar_gamma <- function(fit) {
  residvar <- summary(fit)$sigma^2
  1/qgamma(p = c(0.975, 0.025), shape = fit$df.residual/2, scale = 2/(residvar*fit$df.residual))
}

Since the $\chi^2$ is a particular case of the $\Gamma$ distribution, this is equivalent to:

CI_residvar_chi2 <- function(fit) {
 sum(residuals(fit)^2)/qchisq(c(0.975, 0.025), df = fit$df.residual)
}
CI_residvar_gamma(fit_lm_AL1)
CI_residvar_chi2(fit_lm_AL1)

Note: if we did not know better, we could also use likelihood profiling or parametric bootstrap, as we have seen in the context of CIs for regression coefficients. Let's try that anyway, to learn a thing or two...

CI for $\widehat{\sigma}^2$ using likelihood profiling r .emo("nerd")

candidate_residvar <- seq(10, 120, 5)
logLik_profile <- sapply(candidate_residvar, function(residvar) ## for each residvar we re-estimate other parameters
  logLik(update(fit_spaMM_AL1_REML_varfixed, fixed = list(phi = residvar))))
plot(logLik_profile ~ candidate_residvar, type = "o")
abline(v = 25, col = "green", lwd = 2) ## ground truth (usually unknown)
abline(v = fit_spaMM_AL1_REML$phi, col = "blue", lwd = 2) ## estimate
abline(h = logLik(fit_spaMM_AL1_REML) - 0.5*qchisq(0.95, 1), col = "red", lty = 2, lwd = 2) ## logLik threshold

CI for $\widehat{\sigma}^2$ using likelihood profiling r .emo("nerd")

Again, you can compute the boundaries of the CI95% more precisely by optimising a function:

diff_logLik_residvar <- function(residvar, fit) {
  newfit <- update(fit, fixed = list(phi = residvar))
  logLik_newfit <- logLik(newfit)[[1]]
  logLik_goal <- logLik(fit)[[1]] - 0.5*qchisq(0.95, df = 1)
  abs(logLik_newfit - logLik_goal)
}
CI_residvar_prof <- function(fit, range = c(0.1, 1000)) {
  lwr <- optimise(diff_logLik_residvar, interval = c(range[1], fit$phi),
                  fit = fit)
  upr <- optimise(diff_logLik_residvar, interval = c(fit$phi, range[2]),
                  fit = fit)
  c(lwr = lwr$minimum, upr = upr$minimum)
}
CI_residvar_prof(fit = fit_spaMM_AL1_REML)

CI for $\widehat{\sigma}^2$ using parametric bootstrap r .emo("nerd")

Again, you can use the package {boot} to perform a parametric bootstrap:

CI_residvar_boot <- function(fit, nboot = 1000) {
  newYs <- simulate(fit, nsim = nboot) ## simulate new responses from the fit!!
  res_sim <- sapply(1:nboot, function(i) {
    newdata <- model.frame(fit) ## extract the data
    newdata[, 1] <- newYs[, i] ## replace the response var
    newfit <- update(fit, data = newdata)  ## refit the model
    newfit$phi  ## extract the estimate
  })
  boot::boot.ci(boot.out = list(R = length(res_sim)),
                t0 = fit$phi, ## â
                t  = res_sim, ## a*
                type = "basic")$basic[, 4:5]  ## more types exist!
}
set.seed(123)
CI_residvar_boot(fit = fit_spaMM_AL1_REML)

Note: the fact that we obtained a negative lower bound is not great, since variances cannot be negative...

Which method is best to build a CI for $\widehat{\sigma}^2$? r .emo("info") r .emo("proof")

Here we know the answer: the method using the $\Gamma$ (or, equivalently, the $\chi^2$) distribution. But sometimes it is unclear which method is best (either because you don't know, because it is not known, or because there is no general answer). In such cases, you can always test for yourself by investigating the coverage of the different methods on data simulated to be close to yours (as we did earlier for the Student's t-based CI).

Here, we will check the coverage for all 3 methods we have just mentioned:

1- CI using the $\chi^2$

set.seed(123)
mean(replicate(1000,
  sum(findInterval(CI_residvar_gamma(update(fit_lm_AL1, data = simulate_Aliens())), 25)) == 1))  ## for lm fit

Note: as expected, the coverage is very close to the nominal level of 95%.

Which method is best to build CI for $\widehat{\sigma}^2$? r .emo("info") r .emo("proof")

2- CI using likelihood profiling r .emo("slow")

set.seed(123)
mean(replicate(1000,  ## SLOW!
  sum(findInterval(CI_residvar_prof(update(fit_spaMM_AL1_REML, data = simulate_Aliens())), 25)) == 1))
## only designed to work with spaMM REML fit!
## RESULT  = 0.921

3- CI using parametric bootstrap r .emo("slow") r .emo("slow") r .emo("slow")

set.seed(123)
mean(replicate(1000, ## VERY SLOW!
  sum(findInterval(CI_residvar_boot(update(fit_spaMM_AL1_REML, data = simulate_Aliens()), nboot = 1000), 25)) == 1))  ## only designed to work with spaMM REML fit!
## RESULT  = 0.811

Notes:

Predictions

CI(predicted mean response) r .emo("info")

The CI95% on the predicted mean response is a nice way to represent the uncertainty of your fit:

pred <- data.frame(humans_eaten = seq(1, 12, 0.1))
pred <- cbind(pred, predict(fit_lm_AL1, newdata = pred, interval = "confidence"))
plot(size ~ humans_eaten, data = Alien)
points(fit ~ humans_eaten, data = pred, lty = 1, lwd = 2, col = "blue", type = "l")
points(upr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "blue", type = "l")
points(lwr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "blue", type = "l")

CI(predicted mean response) with {stats} r .emo("practice")

This is very easy to get in the case of lm() fits:

predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "confidence")

CI(predicted mean response) with {stats} r .emo("info")

There is however a more general recipe which we will reuse for GLMs:

pred <- predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "confidence", se.fit = TRUE)
rbind(c(pred$fit[1, "fit"] + qt(0.025, 10) * pred$se.fit[1],  ## 10 = fit_lm_AL1$df.residual (n - p)
        pred$fit[1, "fit"] + qt(0.975, 10) * pred$se.fit[1]),
      c(pred$fit[2, "fit"] + qt(0.025, 10) * pred$se.fit[2], 
        pred$fit[2, "fit"] + qt(0.975, 10) * pred$se.fit[2]))

Notes:

CI(predicted mean response) with {spaMM} r .emo("practice")

It is also quite easy to derive the CI for the mean response from models fitted with {spaMM}:

get_intervals(fit_spaMM_AL1_ML, newdata = data.frame(humans_eaten = 4:5), intervals = "fixefVar")

Note: here it does not matter whether you use the ML or the REML fit... ({spaMM} uses the right thing automatically)

CI(new observations) r .emo("info")

CI for new observations (sometimes called Prediction Intervals) are used to predict the limits within which X% of new observations should fall (on average for future samples).

pred <- data.frame(humans_eaten = seq(1, 12, 0.1))
pred <- cbind(pred, predict(fit_lm_AL1, newdata = pred, interval = "prediction"))
plot(size ~ humans_eaten, data = Alien)
points(fit ~ humans_eaten, data = pred, lty = 1, lwd = 2, col = "red", type = "l")
points(upr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "red", type = "l")
points(lwr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "red", type = "l")

CI(new observations) with {stats} r .emo("practice")

This is very easy to get in the case of lm() fits:

predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "prediction")

CI(new observations) with {stats} r .emo("info")

There is however a more general recipe which we will reuse for GLMs:

pred <- predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "prediction", se.fit = TRUE)
pred$fit
se.pred <- sqrt(pred$se.fit^2 + pred$residual.scale^2) ## residual.scale^2 = residual variance
rbind(c(pred$fit[1, "fit"] + qt(0.025, df = 10) * se.pred[1],
        pred$fit[1, "fit"] + qt(0.975, df = 10) * se.pred[1]),
      c(pred$fit[2, "fit"] + qt(0.025, df = 10) * se.pred[2],
        pred$fit[2, "fit"] + qt(0.975, df = 10) * se.pred[2]))

Note: se.pred accounts for both the uncertainty in regression parameters (fixed-effect variance) and the residual variance, so se.pred^2 is called the response variance.

CI(new observations) with {spaMM} r .emo("info")

With {spaMM}:

get_intervals(fit_spaMM_AL1_REML, newdata = data.frame(humans_eaten = 4:5), intervals = "respVar")


Note: here it does not matter if you use the ML or REML fit... {spaMM} knows what to do.

What you need to remember r .emo("goal")

Table of contents

The Linear Model: LM


[Back to main menu](./Title.html#2)

