library(LM2GLMM) library(spaMM) library(doSNOW) options(width = 110) spaMM.options(nb_cores = 4L) #knitr::knit_hooks$set(webgl = hook_webgl) ## for rgl knitr::opts_chunk$set(cache = TRUE, cache.path = "./cache_knitr/LM_test_intervals/", fig.path = "./fig_knitr/LM_test_intervals/", error = TRUE, fig.align = "center", fig.asp = 1, dev.args = list(pointsize = 8))
r .emo("info")
summary()
, anova()
and Anova()
stats::lm()
r .emo("alien")
set.seed(123L) Alien <- simulate_Aliens() fit_lm_AL1 <- lm(size ~ humans_eaten, data = Alien) fit_lm_AL1
spaMM::fitme()
r .emo("alien")
ML fit:
fit_spaMM_AL1_ML <- fitme(size ~ humans_eaten, data = Alien) fit_spaMM_AL1_ML
Note: ML fits are needed for whenever we are using the likelihood to study regression coefficients.
spaMM::fitme()
r .emo("alien")
REML fit:
fit_spaMM_AL1_REML <- fitme(size ~ humans_eaten, data = Alien, method = "REML") fit_spaMM_AL1_REML
Note: REML fits are needed for whenever we are using estimate(s) of variance(s).
stats::lm()
r .emo("alien")
fit_lm_UK1 <- lm(height ~ drink + sex*weight, data = UK) fit_lm_UK1
How do you interpret each parameter estimate?
r .emo("recap")
#Plot interaction effect of sex:weight at reference level of drink plot_data <- expand.grid(drink = c("2 to 3 times a week"), sex = c("Girl", "Boy"), weight = c(0, 50)) plot_data$height <- predict(fit_lm_UK1, newdata = plot_data) library(ggplot2) ggplot(plot_data) + geom_line(aes(x = weight, y = height, colour = sex), size = 1) + geom_text(data = plot_data[plot_data$weight == 50, ], aes(x = weight - 4, y = height * c(0.95, 1.01), colour = sex, label = c("\U2640", "\U2642")), size = 10) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(limits = c(110, 160), breaks = c(fit_lm_UK1$coefficients["(Intercept)"], fit_lm_UK1$coefficients["(Intercept)"] + fit_lm_UK1$coefficients["sexGirl"], 120, 130, 140, 150, 160), labels = c(format(as.numeric(fit_lm_UK1$coefficients["(Intercept)"]), digits = 5), format(as.numeric(fit_lm_UK1$coefficients["(Intercept)"] + fit_lm_UK1$coefficients["sexGirl"]), digits = 5), "120", "130", "140", "150", "160")) + scale_colour_manual(values = c("#41658A", "#70A37F")) + labs(x = "Weight (kg)", y = "Height (cm)", title = "Relationship between weight and height for\nthe reference level of 'drink'") + theme_classic() + theme(legend.position = "none", axis.text = element_text(colour = "black"), axis.ticks = element_line(colour = "black"))
spaMM::fitme()
r .emo("alien")
ML fit:
fit_spaMM_UK1_ML <- fitme(height ~ drink + sex*weight, data = UK) fit_spaMM_UK1_ML
spaMM::fitme()
r .emo("alien")
REML fit:
fit_spaMM_UK1_REML <- fitme(height ~ drink + sex*weight, data = UK, method = "REML") fit_spaMM_UK1_REML
r .emo("practice")
Fitting a model leads to point estimates:
coef(fit_lm_AL1) ## use fixef() for spaMM fits coef(fit_lm_UK1) ## use fixef() for spaMM fits
Point estimates are random variables. Thus, to know if these numbers differ from specific values in the population, their distributions must be characterised and accounted for during tests.
r .emo("info")
Each regression parameter is associated with an estimation variance and since they are all being jointly estimated, there are also covariances between them:
vcov(fit_lm_AL1) ## = summary(fit_lm_AL1)$sigma^2 * solve(t(model.matrix(fit_lm_AL1)) %*% model.matrix(fit_lm_AL1))
The variances are on the diagonals, so you can get the standard errors on the regression coefficients as follows:
sqrt(diag(vcov(fit_lm_AL1)))
There are sometimes referred to as conditional standard errors because these SE do not account for uncertainty stemming from other sources (which happens in LMM & GLMM).
r .emo("practice")
Regression coefficient estimates are tested using Student's t-test (t value = Estimate / Std. Error
):
summary(fit_lm_AL1)$coefficients ## you can also use simply summary(fit_lm_AL1)!!! summary(fit_lm_UK1)$coefficients
Note: the null hypothesis for these tests is that the estimate is null (= 0).
r .emo("nerd")
{spaMM} does not compute the p-values for the t-tests but it is possible to compute them:
ttable <- summary(fit_spaMM_AL1_REML, verbose = FALSE)$beta_table ## verbose is just for the slide display cbind(ttable, p = 2*pt(abs(ttable[, "t-value"]), df = nrow(model.matrix(fit_spaMM_AL1_REML)) - ncol(model.matrix(fit_spaMM_AL1_REML)),## n-p lower.tail = FALSE))
Notes:
REML
fits because the denominator of the t-value (Cond. SE
) are (sqrt) variance estimatesML
in {spaMM} even if you used method = "REML"
r .emo("info")
r .emo("proof")
You can actually substitute simple two samples t-tests with LM:
t.test(extra ~ group, data = sleep, var.equal = TRUE) summary(lm(extra ~ group, data = sleep))$coef
Note: the p-value obtained is the same r .emo("party")
r .emo("info")
r .emo("proof")
This is because, in the case of LM, regression coefficients are normally distributed...
set.seed(123) new_beta_estimates <- t(replicate(1000, coef(update(fit_lm_AL1, data = simulate_Aliens(N = 100))))) qqnorm(new_beta_estimates[, 2]) ## draw a qqplot qqline(new_beta_estimates[, 2], col = "red") ## draw line of normality
Note: update()
refits a model with some of its input updated, but you can instead recall lm()
if you provide the latter with all its arguments.
r .emo("info")
This is because, in the case of LM, regression coefficients are normally distributed and in cases where the variance of the estimate is estimated (i.e not known) one must use the Student's t distribution (not the Gaussian) to model the distribution of scaled estimates with no bias:
$\widehat{\beta}\sim \mathcal{N}(\beta, \sigma_{\widehat{\beta}})$
$z = \frac{\widehat{\beta} - \beta}{\sigma_{\widehat{\beta}}}\sim \mathcal{N}(0, 1)$
$t = \frac{\widehat{\beta} - \beta}{\widehat{\sigma_{\widehat{\beta}}}}\sim \mathcal{t}_{n - p}$ ($t$ = Student's t distribution)
r .emo("practice")
The t-test compare each regression coefficient to 0, but you may want to test something more specific.
First, you need to reproduce the previous t-tests by hand:
summary(fit_lm_AL1)$coefficients["humans_eaten", ]
estimate <- coef(fit_lm_AL1)["humans_eaten"] std.error <- sqrt(diag(vcov(fit_lm_AL1))["humans_eaten"]) t.value <- (estimate - 0)/std.error p.value <- 2*pt(abs(t.value), df = fit_lm_AL1$df.residual, lower.tail = FALSE) ## 2x as bilateral test cbind(estimate, std.error, t.value, p.value)
r .emo("practice")
Then, you can simply remove the hypothetical value from the numerator of the t-value.
E.g. we test if the coefficient for human_eaten
is different from 1.5:
estimate <- coef(fit_lm_AL1)["humans_eaten"] std.error <- sqrt(diag(vcov(fit_lm_AL1))["humans_eaten"]) t.value <- (estimate - 1.5)/std.error p.value <- 2*pt(abs(t.value), df = fit_lm_AL1$df.residual, lower.tail = FALSE) ## 2x as bilateral test cbind(estimate, std.error, t.value, p.value)
You can alternatively use the package {car} to do that automatically:
car::linearHypothesis(fit_lm_AL1, c("humans_eaten = 1.5"))$"Pr(>F)"[2]
r .emo("info")
A more general approach is to identify the entire range of population-level parameter values that would not be considered as significantly different from the estimate of interest.
This is what Confidence Intervals (CI) are used for!
There are many methods for building such confidence intervals:
In the case of LM the method based on the Student's t distribution is best, but we will see the alternative methods because they will become useful for more complex models and it is easier to start exploring these methods in the context of simple models.
r .emo("practice")
This is the method available when calling confint()
on a lm()
fit:
confint(fit_lm_AL1)
It is easy to compute the same thing by hand:
cbind(coef(fit_lm_AL1) + qt(0.025, fit_lm_AL1$df.residual) * sqrt(diag(vcov(fit_lm_AL1))), coef(fit_lm_AL1) + qt(0.975, fit_lm_AL1$df.residual) * sqrt(diag(vcov(fit_lm_AL1))))
Note: qt()
is used to extract the quantiles from the Student's t distribution and the rest is used to unscale the result so it follows the scale of the response.
r .emo("info")
A confidence interval with given nominal level X
is considered exact if the probability that the interval contains the true population value $k$ is X
for any $k$. We call this probability coverage.
Example: if you draw 100 datasets from a population, fit 100 models and compute 100 CI95% around a given parameter estimate. You expect 95 of them to include the true parameter value of the population.
Unfortunately, most methods for building confidence intervals are only approximate, i.e. the empirical coverage differs from the given coverage X
.
Depending on the type of the statistical model and the data, some methods show a coverage closer to their nominal level than others.
r .emo("practice")
You can always simulate data close to your situation to quantify the coverage of a given method:
set.seed(123) #Remember, we know the true intercept and slope terms generated by `simulate_Aliens()` to be 50 and 1.5 mean(replicate(1000, sum(findInterval(confint(update(fit_lm_AL1, data = simulate_Aliens()))[1, ], 50)) == 1)) mean(replicate(1000, sum(findInterval(confint(update(fit_lm_AL1, data = simulate_Aliens()))[2, ], 1.5)) == 1))
Note:
r .emo("party")
r .emo("info")
Here the idea is to quantify the decrease in (log)-likelihood when parameter values depart from its estimate and to consider as part of the interval all the parameter values that do not lead the (log)-likelihood to decrease below a specific threshold.
Here is how you can obtain the log-likelihood under any parameter values:
logLik(lm(size ~ 0 + offset(50 + 1.5 * humans_eaten), data = Alien))
Note: spaMM::fitme()
allows for the fixation of the variance in addition to the fixation of the regression parameters but lm()
does not (see later).
r .emo("info")
candidate_slopes <- seq(from = 0, to = 3, by = 0.1) loglik <- sapply(candidate_slopes, function(slope) logLik(lm(size ~ 1 + offset(slope*humans_eaten), data = Alien))) plot(loglik ~ candidate_slopes, type = "o") abline(v = 1.5, col = "green", lwd = 2) ## ground truth (usually unknown) abline(v = coef(fit_lm_AL1)["humans_eaten"], col = "blue", lwd = 2) ## estimate abline(h = logLik(fit_lm_AL1)[[1]] - 0.5*qchisq(0.95, df = 1), col = "red", lty = 2, lwd = 2) ## logLik threshold
r .emo("nerd")
You can compute the boundaries of the CI95% more precisely by optimising the following function:
diff_logLik_slope <- function(slope, varname, fit) { newdata <- fit$model newdata$offset <- slope*newdata[, varname] newfit <- lm(size ~ 1 + offset(offset), data = newdata) logLik_newfit <- logLik(newfit)[[1]] logLik_goal <- logLik(fit)[[1]] - 0.5*qchisq(0.95, df = 1) abs(logLik_newfit - logLik_goal) }
CI_prof_slope <- function(fit, varname) { lwr <- optimise(diff_logLik_slope, interval = c(coef(fit)[2] - 1, coef(fit)[2]), varname = varname, fit = fit) upr <- optimise(diff_logLik_slope, interval = c(coef(fit)[2], coef(fit)[2] + 1), varname = varname, fit = fit) c(lwr = lwr$minimum, upr = upr$minimum) }
CI_prof_slope(fit = fit_lm_AL1, varname = "humans_eaten")
r .emo("info")
r .emo("info")
It can also be easily extended to build confidence regions (i.e. a multivariate CI):
candidate_intercepts <- seq(from = 30, to = 70, by = 0.1) candidates <- expand.grid(candidate_intercept = candidate_intercepts, candidate_slope = candidate_slopes) candidates$logLik <- NA for (i in 1:nrow(candidates)) { ## could use sapply here too fit_temp <- lm(size ~ 0 + offset(candidates[i, 1] + candidates[i, 2] * humans_eaten), data = Alien) candidates[i, "logLik"] <- logLik(fit_temp) } par(las = 1) with(candidates, contour(candidate_intercepts, candidate_slopes, matrix(candidates$logLik, nrow = length(candidate_intercepts)), nlevels = 15, ylab = "slope", xlab = "intercept")) points(fit_spaMM_AL1_ML$fixef["(Intercept)"], fit_spaMM_AL1_ML$fixef["humans_eaten"], col = "blue", pch = 4, lwd = 2, cex = 3) points(50, 1.5, col = "green", pch = 4, lwd = 2, cex = 3) limit <- c(logLik(fit_lm_AL1) - 0.5*qchisq(0.95, df = 2)) with(candidates, contour(candidate_intercepts, candidate_slopes, matrix(candidates$logLik, nrow = length(candidate_intercepts)), levels = limit, add = TRUE, col = "red", lwd = 2))
r .emo("info")
The general aim of bootstrap methods is to approximate the distribution of a statistic, in cases where parameters are not known and where this distribution is not known analytically.
Here, we do that by simulating new data using the fitted model (i.e. considering the parameter estimates as population parameters) and refitting models based on such simulated data.
Thus, we have:
From that, confidence intervals can be computed under the assumption that:
r .emo("nerd")
For model fitted with stats::lm()
, it is a little complicated (even if the package {boot} helps):
CI_boot <- function(fit, param, nboot) { newYs <- simulate(fit, nsim = nboot) ## simulate new responses from the fit!! res_sim <- sapply(1:nboot, function(i) { newdata <- model.frame(fit) ## extract the data newdata[, 1] <- newYs[, i] ## replace the response var newfit <- update(fit, data = newdata) ## refit the model coef(newfit)[[param]] ## extract the estimate (replace with fixef() if spaMM fit) }) boot::boot.ci(boot.out = list(R = length(res_sim)), t0 = coef(fit)[[param]], ## â t = res_sim, ## a* type = "basic")$basic[, 4:5] ## more complex types exist! } ## to use with spaMM fit, just run coef.HLfit <- fixef.HLfit beforehand CI_boot(fit = fit_lm_AL1, param = "humans_eaten", nboot = 1000)
Notes: for more complex models, there are packages doing that for you, but I do not know one that would work with stats::lm()
(it may exist).
r .emo("practice")
With {spaMM}, it is much easier:
confint(fit_spaMM_AL1_ML, parm = "humans_eaten", boot_args = list(nsim = 1000))
r .emo("practice")
CI <- rbind(confint(fit_lm_AL1)["humans_eaten", ], CI_prof_slope(fit_lm_AL1, "humans_eaten"), CI_boot(fit = fit_lm_AL1, param = "humans_eaten", nboot = 5000))
data.frame(method = c("Student", "Profile", "Parametric bootstrap"), lwr = CI[, 1], upr = CI[, 2])
Notes:
confint()
which uses the Student's t distribution when called on a fit produced with stats::lm()
r .emo("info")
Example:
summary(fit_lm_UK1)$coefficients
r .emo("practice")
You can use anova()
to compare pairs of model fits:
fit_lm_UK2 <- lm(height ~ drink, data = fit_lm_UK1$model) ## we use the model frame to keep same dataset! anova(fit_lm_UK2, fit_lm_UK1)
Note: here we test the overall effect of sex * weight
.
r .emo("info")
The default behaviour for anova()
on LM fit is to perform a F-test:
fit_lm_UK2 <- lm(height ~ drink, data = fit_lm_UK1$model) ## we use model frame to keep same dataset! anova(fit_lm_UK2, fit_lm_UK1)
The F-test is used to check if the decrease in residual variance for the more complex fit is higher than expected under H0:
RSS_fit1 <- sum(residuals(fit_lm_UK1)^2) RSS_fit2 <- sum(residuals(fit_lm_UK2)^2) ((RSS_fit1 - RSS_fit2)/(fit_lm_UK1$df.residual - fit_lm_UK2$df.residual)) / (RSS_fit1/fit_lm_UK1$df.residual)
Note: RSS/Res.Df = unbiased estimate of the residual variance
r .emo("practice")
It is also possible to use a likelihood ratio test (LRT), similarly to what happened when we used likelihood profiling:
lmtest::lrtest(fit_lm_UK2, fit_lm_UK1)
Notes:
r .emo("practice")
It is possible to use anova()
to perform a LRT if the model is fitted with {spaMM} (using lmtest::lrtest()
won't work):
fit_spaMM_UK2_ML <- fitme(height ~ drink, data = fit_spaMM_UK1_ML$data) anova(fit_spaMM_UK1_ML, fit_spaMM_UK2_ML)
Notes:
anova(..., test = "LRT")
exists for model fitted with lm()
but it does not really do a true LRT (unless the residual variances of the 2 fits are identical) r .emo("broken")
(drop1()
does work)r .emo("practice")
{spaMM} makes it also easy to test the LRT by parametric bootstrap instead of using the $\chi^2$, which is good if we are far from asymptotic conditions (but it can be slow to run) r .emo("slow")
:
LRT(fit_spaMM_UK1_ML, fit_spaMM_UK2_ML, boot.repl = 1000)
Notes:
boot.repl
must be small)r .emo("practice")
Testing the overall effect cannot be done with summary()
because it would only give you test outputs separately for each factor level. Instead, you should rely on model comparison for this purpose:
fit_lm_UK3 <- lm(height ~ sex * weight, data = fit_lm_UK1$model) anova(fit_lm_UK3, fit_lm_UK1)
CONCLUSION:
the consumption of alcohol during pregnancy is (statistically) related to the children's height.
r .emo("practice")
Testing the effect of human_eatens
can be done with summary()
or with model comparison:
summary(fit_lm_AL1)$coefficients["humans_eaten", ] fit_lm_AL2 <- lm(size ~ 1, data = fit_lm_AL1$model) anova(fit_lm_AL2, fit_lm_AL1)
Note: when the models differ by one parameter (as in the case of quantitative variable), the F-test and the t-test give an identical p-value ($t^2 = F$ when the numerator degree of freedom is 1).
r .emo("practice")
Instead of doing many model comparisons to test all your predictors one by one, you can use car::Anova()
as a shortcut:
car::Anova(fit_lm_UK1)
Anova()
performs a so-called Type 2 ANOVA: it tests for the effect of simple terms considering other simple terms are included, but not interactions. The function also tests for the effect of interactions assuming all simple terms are included.
r .emo("practice")
Better not use anova()
for that since it performs the comparisons sequentially (i.e. Type 1 ANOVA, which is quite silly)...
anova(fit_lm_UK1)
Note: sequential comparisons are nonetheless useful to identify multicollinearity issues (more on that later).
r .emo("practice")
A null model is a model estimating nothing more than an intercept:
fit_lm_UK0 <- lm(height ~ 1, data = fit_lm_UK1$model) anova(fit_lm_UK0, fit_lm_UK1)
Note: comparing a fit to the fit of a so-called null model should always be performed first, i.e. before testing the effect of predictors individually! r .emo("info")
In the case of LM, it is also given by the last row produced by summary()
!
r .emo("practice")
After establishing that a qualitative predictor is significant, you may wonder which level is different from which. Yet, comparing many levels leads to an increase in the rate of false positives, so it is best to apply a multiple testing correction. The package {multcomp} can help you do that:
comparing_drink_levels <- multcomp::glht(fit_lm_UK1, linfct = multcomp::mcp(drink = "Tukey")) summary(comparing_drink_levels)
r .emo("practice")
After establishing that a qualitative predictor is significant, you may wonder which level is different from which. Yet, comparing many levels leads to an increase in the rate of false positives, so it is best to apply a multiple testing correction. The package {multcomp} can help you do that:
comparing_drink_levels_spaMM <- multcomp::glht(fit_spaMM_UK1_ML, linfct = multcomp::mcp(drink = "Tukey"), coef. = fixef.HLfit) ## works with spaMM fit too! par(mar = c(4, 20, 2, 2)) ## increase left margin plot(comparing_drink_levels_spaMM) ## you can plot the outcome!
r .emo("practice")
After establishing that a qualitative predictor is significant, you may wonder which level is different from which. Yet, comparing many levels leads to an increase in the rate of false positives, so it is best to apply a multiple testing correction. The package {multcomp} can help you do that.
You can also test specific hypotheses instead of testing all combinations (it increases the power!):
UK$drink2 <- factor(gsub(" ", "_", UK$drink)) ## removes space in levels to avoid issues in multcomp fit_lm_UK1bis <- lm(height ~ drink2 + sex*weight, data = UK) summary(multcomp::glht(fit_lm_UK1bis, linfct = multcomp::mcp(drink2 = c("Not_at_all - Once_a_week_or_less == 0", "Not_at_all == 0", ## compare to intercept as no param for 2_to_3... "Not_at_all - Most_days == 0"))))
r .emo("practice")
To reduce the number of comparisons, you can also select carefully the first level of a factor.
UK$drink3 <- relevel(UK$drink, ref = "Not at all") summary(lm(height ~ drink3 + sex*weight, data = UK))$coefficients
Note: although this is an established practice, it does not correct for multiple testing (which is why the p-values are smaller than on the previous slide).
r .emo("practice")
So far all model comparisons we have seen considered nested models. Non-nested models can be compared too, as long as they have been fitted on the exact same response.
For this, we rely on information criterions such as the Akaike Information Criterion:
$\texttt{AIC} = -2 \times (\texttt{logLik} - p)$, with $p$ the number of parameters
Example:
UKselect <- na.omit(UK[, c("height", "drink", "sex", "mother_height", "father_height")]) ## removes NA AIC(lm(height ~ drink * sex, data = UKselect)) AIC(lm(height ~ mother_height + father_height, data = UKselect))
Notes:
r .emo("alien")
To study what happens when testing many things, we will first simulate predictors with no effect:
simu_null <- function(n, p) { d <- as.data.frame(replicate(p + 1, rnorm(n))) colnames(d)[1] <- "y" colnames(d)[2:(p + 1)] <- paste0("x", 1:p) return(d) } set.seed(123) simu_null(n = 5, p = 3)
r .emo("info")
p <- 3 set.seed(123) lm(y ~ ., data = simu_null(n = 5, p = p)) set.seed(123) (p.values <- car::Anova(lm(y ~ ., data = simu_null(n = 5, p = p)))["Pr(>F)"][1:p, 1]) min(p.values) <= 0.05 ## test if any regression coef is significant
r .emo("info")
set.seed(123) mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 100, p = 1)))["Pr(>F)"][1, 1]) <= 0.05)) mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 100, p = 3)))["Pr(>F)"][1:3, 1]) <= 0.05)) mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 100, p = 10)))["Pr(>F)"][1:10, 1]) <= 0.05)) mean(replicate(200, min(car::Anova(lm(y ~ ., data = simu_null(n = 1000, p = 50)))["Pr(>F)"][1:50, 1]) <= 0.05))
r .emo("warn")
Some scientists (with poor knowledge in statistics) rely on so-called backward stepwise selection to discard predictors. This is a terrible thing if you want to do inference (i.e. test things). Backward stepwise selection is evil, even when you compare the minimal "adequate" fit to a null fit.
Here is why:
backward <- function(p, nsimu = 100) { pv.signif.rec <- matrix(rep(NA, 2*nsimu), nrow = 2) row.names(pv.signif.rec) <- c("full", "backward") for (i in 1:nsimu) { d <- simu_null(n = nsimu, p = p) mod.full <- lm(y ~ ., data = d) mod.null <- lm(y ~ 1, data = d) mod.backward <- step(mod.full, direction = "backward", trace = FALSE) pv.full <- anova(mod.null, mod.full)$Pr[2] pv.backward <- anova(mod.null, mod.backward)$Pr[2] pv.signif.rec["full", i] <- ifelse(is.na(pv.full), FALSE, pv.full <= 0.05) pv.signif.rec["backward", i] <- ifelse(is.na(pv.backward), FALSE, pv.backward <= 0.05) } apply(pv.signif.rec, 1, mean) } result_backward <- sapply(c(3, 5, 10, 15, 20, 50), function(i) backward(p = i))
r .emo("proof")
plot(result_backward["full", ] ~ c(3, 5, 10, 15, 20, 50), type = "b", col = "green", ylim = c(0, 1), xlab = "Initial number of variables", ylab = "Proportion of significant models | H0", log = "x") points(result_backward["backward", ] ~ c(3, 5, 10, 15, 20, 50), type = "b", col = "red") abline(h = 0.05, lty = 2) legend("topleft", fill = c("red", "green"), legend = c("step selection (backward)", "full model"), bty = "n")
r .emo("info")
r .emo("proof")
mean(replicate(1000, { d <- simu_null(n = 1000, p = 50) mod <- lm(y ~ ., data = d) mod0 <- lm(y ~ 1, data = d) anova(mod, mod0)[2, "Pr(>F)"] <= 0.05 }))
r .emo("practice")
Remember how to extract the unbiased estimate of the residual variance:
(residvar <- summary(fit_lm_AL1)$sigma^2) fit_spaMM_AL1_REML$phi
Note: you need to use the REML estimate of the residual variance, otherwise we have seen it is biased.
r .emo("practice")
Testing if the residual variance is null makes no sense, but it could make sense to compare the estimate to a given value (e.g., here 25). For this we could use a very general recipe one can use for any comparison between nested models:
1- fit a model in which the parameter is constrained
fit_spaMM_AL1_REML_varfixed <- fitme(size ~ humans_eaten, fixed = list(phi = 25), data = Alien, method = "REML")
2- compare the likelihood between the constrained and the non-constrained fit using a Likelihood Ratio Test (LRT):
stats_LRT <- -2 * (logLik(fit_spaMM_AL1_REML_varfixed) - logLik(fit_spaMM_AL1_REML)) pchisq(stats_LRT, df = 1, lower.tail = FALSE)
Note: we saw that before but did so automatically using lmtest::lrtest()
and that would not work here.
r .emo("practice")
For comparing 2 residual variances in LM specifically, there is a better method:
pchisq(sum(residuals(fit_spaMM_AL1_REML)^2)/25, df = nrow(fit_spaMM_AL1_REML$data) - length(fixef(fit_spaMM_AL1_REML)), lower.tail = FALSE)
This method is better but much less general and thus only applies to LM (not GLM, LMM, GLMM).
r .emo("info")
Sometimes theory helps, but sometimes there is no clear answer.
In such cases, you can use simulations to answer the question!
Statistical tests should meet 2 properties: the rate of false negatives should be as small as possible, and more importantly, the rate of false positives must be close to the nominal level. You can test the latter by checking if the test produces a uniform distribution of p-values under the null hypothesis (it should!):
tests_var_under_H0 <- function(fit, n = nrow(fit$data), residvar_true = 25) { d <- simulate_Aliens(N = n, sigma2 =residvar_true) fit_phi_est <- update(fit, data = d) fit_phi_fixed <- update(fit, fixed = list(phi = residvar_true), data = d) stats_LRT <- -2 * (logLik(fit_phi_fixed) - logLik(fit_phi_est)) c(chisq = pchisq(sum(residuals(fit_phi_est)^2)/residvar_true, df = nrow(fit$data) - length(spaMM::fixef(fit)), lower.tail = FALSE), LRT = pchisq(stats_LRT, df = 1, lower.tail = FALSE)[[1]]) }
set.seed(123) res <- replicate(1000, tests_var_under_H0(fit_spaMM_AL1_REML))
r .emo("info")
plot(ecdf(res["chisq", ]), lwd = 2) plot(ecdf(res["LRT", ]), add = TRUE, col = "blue", lwd = 2) abline(0, 1, col = "red", lwd = 2) legend("topleft", fill = c("black", "blue"), legend = c("Chisq", "LRT"), bty = "n")
Notes:
ecdf()
computes the empirical cumulative distribution functionr .emo("nerd")
In LM, variance estimates follow a Gamma ($\Gamma$) distribution with mean = $\texttt{estimate}$ and var = $2 (\texttt{estimate}^2) / \texttt{df}$.
For $\Gamma$, mean = $\texttt{shape} \times \texttt{scale}$ and var = $\texttt{shape} \times \texttt{scale}^2$ ($\texttt{shape}$ and $\texttt{scale}$ are the two parameters of the $\Gamma$ distribution); thus, little maths show that in our context $\texttt{shape} = \texttt{df}/2$ and $\texttt{scale} = 2 \times\texttt{estimate}/\texttt{df}$.
curve(dgamma(x, shape = fit_lm_AL1$df.residual/2, scale = 2*residvar/fit_lm_AL1$df.residual), ## df = n - p from = 0, to = 100, xlab = "residual variance", ylab = "probability density", lwd = 3)
r .emo("practice")
So, we can rely on the Gamma to build a CI95% of the variance estimate as follows:
CI_residvar_gamma <- function(fit) { residvar <- summary(fit)$sigma^2 1/qgamma(p = c(0.975, 0.025), shape = fit$df.residual/2, scale = 2/(residvar*fit$df.residual)) }
Since the $\chi^2$ is a particular case of the $\Gamma$ distribution, this is equivalent to:
CI_residvar_chi2 <- function(fit) { sum(residuals(fit)^2)/qchisq(c(0.975, 0.025), df = fit$df.residual) }
CI_residvar_gamma(fit_lm_AL1) CI_residvar_chi2(fit_lm_AL1)
Note: if we did not know better, we could also use likelihood profiling or parametric bootstrap as we have seen it in the context of CI for regression coefficients. Let's try that anyway for learning a thing or two...
r .emo("nerd")
candidate_residvar <- seq(10, 120, 5) logLik_profile <- sapply(candidate_residvar, function(residvar) ## for each residvar we re-estimate other parameters logLik(update(fit_spaMM_AL1_REML_varfixed, fixed = list(phi = residvar)))) plot(logLik_profile ~ candidate_residvar, type = "o") abline(v = 25, col = "green", lwd = 2) ## ground truth (usually unknown) abline(v = fit_spaMM_AL1_REML$phi, col = "blue", lwd = 2) ## estimate abline(h = logLik(fit_spaMM_AL1_REML) - 0.5*qchisq(0.95, 1), col = "red", lty = 2, lwd = 2) ## logLik threshold
r .emo("nerd")
Again, you can compute the boundaries of the CI95% more precisely by optimising a function:
diff_logLik_residvar <- function(residvar, fit) { newfit <- update(fit, fixed = list(phi = residvar)) logLik_newfit <- logLik(newfit)[[1]] logLik_goal <- logLik(fit)[[1]] - 0.5*qchisq(0.95, df = 1) abs(logLik_newfit - logLik_goal) }
CI_residvar_prof <- function(fit, range = c(0.1, 1000)) { lwr <- optimise(diff_logLik_residvar, interval = c(range[1], fit$phi), fit = fit) upr <- optimise(diff_logLik_residvar, interval = c(fit$phi, range[2]), fit = fit) c(lwr = lwr$minimum, upr = upr$minimum) }
CI_residvar_prof(fit = fit_spaMM_AL1_REML)
r .emo("nerd")
Again, you can use the package {boot} to perform a parametric bootstrap:
CI_residvar_boot <- function(fit, nboot = 1000) { newYs <- simulate(fit, nsim = nboot) ## simulate new responses from the fit!! res_sim <- sapply(1:nboot, function(i) { newdata <- model.frame(fit) ## extract the data newdata[, 1] <- newYs[, i] ## replace the response var newfit <- update(fit, data = newdata) ## refit the model newfit$phi ## extract the estimate }) boot::boot.ci(boot.out = list(R = length(res_sim)), t0 = fit$phi, ## â t = res_sim, ## a* type = "basic")$basic[, 4:5] ## more types exist! }
set.seed(123) CI_residvar_boot(fit = fit_spaMM_AL1_REML)
Note: that we obtained a negative value is not great since variances cannot be negative...
r .emo("info")
r .emo("proof")
Here we know: the one using the $\Gamma$ (or equivalently $\chi^2$) distribution, but sometime it is unclear (either because you don't know, because it is not known, or because there is no general answer) what the best method is. In such cases, you can always test for yourself by investigating the coverage of different methods by simulating data close to yours (as we did it earlier for Student's based CI).
Here, we will check the coverage for all 3 methods we have just mentioned:
1- CI using the $\chi^2$
set.seed(123) mean(replicate(1000, sum(findInterval(CI_residvar_gamma(update(fit_lm_AL1, data = simulate_Aliens())), 25)) == 1)) ## for lm fit
Note: as expected, the coverage is very close to the nominal level of 95%.
r .emo("info")
r .emo("proof")
2- CI using likelihood profiling r .emo("slow")
set.seed(123) mean(replicate(1000, ## SLOW! sum(findInterval(CI_residvar_prof(update(fit_spaMM_AL1_REML, data = simulate_Aliens())), 25)) == 1)) ## only designed to work with spaMM REML fit! ## RESULT = 0.921
3- CI using parametric boostrap r .emo("slow")
r .emo("slow")
r .emo("slow")
set.seed(123) mean(replicate(1000, ## VERY SLOW! sum(findInterval(CI_residvar_boot(update(fit_spaMM_AL1_REML, data = simulate_Aliens()), nboot = 1000), 25)) == 1)) ## only designed to work with spaMM REML fit! ## RESULT = 0.811
Notes:
r .emo("info")
The CI95% on the predicted mean response is a nice way to represent the uncertainty of your fit:
pred <- data.frame(humans_eaten = seq(1, 12, 0.1)) pred <- cbind(pred, predict(fit_lm_AL1, newdata = pred, interval = "confidence")) plot(size ~ humans_eaten, data = Alien) points(fit ~ humans_eaten, data = pred, lty = 1, lwd = 2, col = "blue", type = "l") points(upr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "blue", type = "l") points(lwr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "blue", type = "l")
r .emo("practice")
This is very easy to get in the case of lm()
fits:
predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "confidence")
r .emo("info")
There is however a more general recipe which we will reuse for GLMs:
pred <- predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "confidence", se.fit = TRUE) rbind(c(pred$fit[1, "fit"] + qt(0.025, 10) * pred$se.fit[1], pred$fit[1, "fit"] + qt(0.975, 10) * pred$se.fit[1]), c(pred$fit[2, "fit"] + qt(0.025, 10) * pred$se.fit[2], pred$fit[2, "fit"] + qt(0.975, 10) * pred$se.fit[2]))
Notes:
se.fit
accounts for the uncertainty in the regression parameterssqrt(diag(X %*% vcov(fit) %*% t(X)))
r .emo("nerd")
r .emo("practice")
It is also quite easy to derive the CI for the mean response from models fitted with {spaMM}:
get_intervals(fit_spaMM_AL1_ML, newdata = data.frame(humans_eaten = 4:5), intervals = "fixefVar")
Note: here it does not matter if you use the ML or REML fit... (here {spaMM} uses the right thing automatically)
r .emo("info")
CI for new observations (sometimes called Prediction Intervals) are used to predict the limits within which X% of new observations should fall (on average for future samples).
pred <- data.frame(humans_eaten = seq(1, 12, 0.1)) pred <- cbind(pred, predict(fit_lm_AL1, newdata = pred, interval = "prediction")) plot(size ~ humans_eaten, data = Alien) points(fit ~ humans_eaten, data = pred, lty = 1, lwd = 2, col = "red", type = "l") points(upr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "red", type = "l") points(lwr ~ humans_eaten, data = pred, lty = 2, lwd = 2, col = "red", type = "l")
r .emo("practice")
This is very easy to get in the case of lm()
fits:
predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "prediction")
r .emo("info")
There is however a more general recipe which we will reuse for GLMs:
pred <- predict(fit_lm_AL1, newdata = data.frame(humans_eaten = 4:5), interval = "prediction", se.fit = TRUE) pred$fit se.pred <- sqrt(pred$se.fit^2 + pred$residual.scale^2) ## residual.scale^2 = residual variance rbind(c(pred$fit[1, "fit"] + qt(0.025, df = 10) * se.pred[1], pred$fit[1, "fit"] + qt(0.975, df = 10) * se.pred[1]), c(pred$fit[2, "fit"] + qt(0.025, df = 10) * se.pred[2], pred$fit[2, "fit"] + qt(0.975, df = 10) * se.pred[2]))
Note: se.pred
accounts for both the uncertainty in regression parameters (fixed-effect variance) and the residual variance, so se.pred^2
is called the response variance.
r .emo("info")
With {spaMM}:
get_intervals(fit_spaMM_AL1_REML, newdata = data.frame(humans_eaten = 4:5), intervals = "respVar")
Note: here it does not matter if you use the ML or REML fit... {spaMM} knows what to do.
r .emo("goal")
summary()
, anova()
and Anova()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.