library(LM2GLMM) knitr::opts_chunk$set(cache = TRUE, fig.align = "center", fig.width = 7, fig.height = 6, cache.path = "./cache_knitr/Exo_GLM_solution/", fig.path = "./fig_knitr/Exo_GLM_solution/") options(width = 100, rmarkdown.html_vignette.check_title = FALSE)
In this vignette I illustrate the key steps required to solve the exercises. By no means I am trying to provide a detailed report of the analysis of each dataset. Any such report may need additional analyses and results would have to be written in good English, not in telegraphic style or lines of codes.
r .emo("goal")
Test whether insecticide type significantly affects the number of insects in agricultural experimental units.
r .emo("goal")
Compare the effect of spray C to all other spray types.
r .emo("goal")
What is the mean number of insects we would predict to find on a plot treated with each insecticide spray. Compare results between the LM approach (with and without BoxCox transformation) to those from the GLM approach.
Note: I will not check the assumptions for LM, nor detail the dataset (and plotting variables) since we did it before (see LM exercises).
We fit the LM:
fit_insect <- lm(count ~ spray, data = InsectSprays)
We fit the LM on the BoxCox transformed response:
InsectSprays$count_bis <- InsectSprays$count + 1 fit_insect_bis <- update(fit_insect, count_bis ~ .) bc <- car::powerTransform(fit_insect_bis) InsectSprays$count_bc <- car::bcPower(InsectSprays$count_bis, bc$lambda) fit_insect_bc <- update(fit_insect_bis, count_bc ~ .)
We fit the GLM:
fit_insect_glm <- glm(count ~ spray, data = InsectSprays, family = poisson())
All assumptions about model structure have been checked before (see exercises for LM).
We still have to test all assumptions about the errors.
library(DHARMa) insect_glm_resid <- simulateResiduals(fit_insect_glm, n = 1000) plot(insect_glm_resid) ## many tests at once
This reveals one issue concerning the dispersion. Let's thus look at this in more details:
testDispersion(insect_glm_resid)
The dispersion value is higher than expected, so there is slight overdispersion.
Let's attempt to solve this issue before testing other assumptions.
Is the overdispersion caused by two many zeros?
table(fit_insect_glm$model$count) ## we count the zeros and other counts testZeroInflation(insect_glm_resid)
nope... the number of observed zeros matches the expectation of the model.
Let's try to fit an alternative model to account for the overdispersion:
library(spaMM) fit_insect_glm_nb <- fitme(count ~ spray, data = InsectSprays, family = spaMM::negbin()) AIC(fit_insect_glm) AIC(fit_insect_glm_nb)
The negative binomial model fits the data much better as shown by its smaller deviance.
Note that the prediction are the same under the 2 models:
plot(predict(fit_insect_glm, type = "response"), fit_insect_glm$model$count, col = "red", lwd = 6) abline(0, 1, col = "red", lwd = 6) points(predict(fit_insect_glm_nb), fit_insect_glm_nb$data$count, col = "blue") abline(0, 1, col = "blue")
The only thing that differs is the residual variance around those predictions (not shown).
Note also that the Gaussian models have a deviance expressed on another scale, which makes the comparison between models particularly difficult when using Gaussian models (but this is not the case here).
Let us stick to the negative binomial model as our best GLM!
insect_glm_nb_resid <- simulateResiduals(fit_insect_glm_nb, n = 1000) testDispersion(insect_glm_nb_resid)
The dispersion issue is solved, but what about the other assumptions?
plot(insect_glm_nb_resid) testTemporalAutocorrelation(insect_glm_nb_resid, time = 1:nrow(model.frame(fit_insect_glm_nb))) testTemporalAutocorrelation(insect_glm_nb_resid, time = order(fitted(fit_insect_glm_nb))) testZeroInflation(insect_glm_nb_resid)
Everything looks great!
anova(fit_insect, update(fit_insect, . ~ 1)) ## F-test for LM anova(fit_insect_bc, update(fit_insect_bc, . ~ 1)) ## F-test for LM with BoxCox transformation lmtest::lrtest(fit_insect_glm, update(fit_insect_glm, . ~ 1)) ## LRT for GLM Poisson anova(fit_insect_glm_nb, update(fit_insect_glm_nb, . ~ 1)) ## LRT for GLM NegBin
No matter the models, the effect of the insecticide is detected!
Let's illustrate how to do that for the two GLMs since they have been fitted with 2 different packages ({stats} and {spaMM} ):
library(multcomp) comp_glm <- glht(fit_insect_glm, linfct = mcp(spray = c("C - A = 0", "C - B = 0", "C - D = 0", "C - E = 0", "C - F = 0"))) summary(comp_glm)
comp_glm_nb <- glht(fit_insect_glm_nb, linfct = mcp(spray = c("C - A = 0", "C - B = 0", "C - D = 0", "C - E = 0", "C - F = 0")), coef. = fixef.HLfit) ## needed for glht to work with spaMM summary(comp_glm_nb) plot(comp_glm_nb)
The sprays C is the most effective one (fewer insects remaining) but we cannot exclude that E is equally effective.
We now compute the predictions and CI for the different models.
data.for.pred <- data.frame(spray = levels(InsectSprays$spray)) pred_lm <- cbind(data.for.pred, predict(fit_insect, newdata = data.for.pred, interval = "confidence")) pred_lm
pred_bc <- cbind(data.for.pred, predict(fit_insect_bc, newdata = data.for.pred, interval = "confidence")) f.inverse <- function(x) car::bcnPowerInverse(x, lambda = bc$lambda, gamma = 0) - 1 ## Note: we removed 1 as we added 1 so to be able to do the BoxCox transformation pred_bc$fit <- f.inverse(x = pred_bc$fit) pred_bc$lwr <- f.inverse(x = pred_bc$lwr) pred_bc$upr <- f.inverse(x = pred_bc$upr) pred_bc
pred_glm_poiss_eta <- predict(fit_insect_glm, newdata = data.for.pred, se.fit = TRUE) pred_glm_poiss <- cbind(data.for.pred, data.frame(fit = family(fit_insect_glm)$linkinv(pred_glm_poiss_eta$fit))) pred_glm_poiss$lwr <- family(fit_insect_glm)$linkinv(pred_glm_poiss_eta$fit + qnorm(0.025)*pred_glm_poiss_eta$se.fit) pred_glm_poiss$upr <- family(fit_insect_glm)$linkinv(pred_glm_poiss_eta$fit + qnorm(0.975)*pred_glm_poiss_eta$se.fit) pred_glm_poiss
pred_glm_nb_table <- cbind(data.for.pred, predict(fit_insect_glm_nb, newdata = data.for.pred), get_intervals(fit_insect_glm_nb, newdata = data.for.pred, intervals = "predVar")) colnames(pred_glm_nb_table)[-1] <- c("fit", "lwr", "upr") pred_glm_nb_table
We combine the predictions so as to be able to plot them easily:
predictions_all <- rbind(cbind(pred_lm, method = "LM"), cbind(pred_bc, method = "LM BoxCox"), cbind(pred_glm_poiss, method = "GLM Poisson"), cbind(pred_glm_nb_table, method = "GLM NegBin")) ## declare method as factor for better display predictions_all$method <- forcats::fct_inorder(predictions_all$method)
library(ggplot2) ggplot() + geom_dotplot(aes(y = count, x = spray), data = InsectSprays, binwidth = 0.5, binaxis = "y", stackdir = "center", fill = "white") + geom_point(aes(y = fit, x = spray, colour = method), data = predictions_all, position = position_dodge(width = 0.5)) + ## for spreading things appart geom_errorbar(aes(y = fit, x = spray, colour = method, ymin = lwr, ymax = upr), data = predictions_all, position = position_dodge(width = 0.5)) + theme_classic()
Conclusions:
r .emo("goal")
Do children value more the type of the present, the cost of the present, or both?
# coplot(happiness ~ price | type, data = Surprise)
r .emo("goal")
Would it be better to limit alcohol consumption or tobacco consumption in order to avoid developing an oesophageal cancer? Use a GLM to find out.
esoph
str(esoph)
The factors are ordered which -- by default -- will select for different contrasts (polynomial contrasts), so we set them as unordered (note: polynomial contrasts are interesting to identify how to best model ordinal variables as quantitative variables):
esoph$agegp_f <- factor(esoph$agegp, ordered = FALSE) esoph$alcgp_f <- factor(esoph$alcgp, ordered = FALSE) esoph$tobgp_f <- factor(esoph$tobgp, ordered = FALSE)
summary(esoph) esoph$cancer_freq <- esoph$ncases / (esoph$ncases + esoph$ncontrols) esoph$cancer_N <- esoph$ncases + esoph$ncontrols summary(esoph$cancer_freq) ## proportion of cancer table(esoph$alcgp_f, esoph$tobgp_f) table(esoph$alcgp_f, esoph$tobgp_f, esoph$agegp_f)
We plot the data:
coplot(cancer_freq ~ tobgp_f | agegp_f, data = esoph, panel = panel.smooth) coplot(cancer_freq ~ alcgp_f | agegp_f, data = esoph, panel = panel.smooth)
Since the outcome is binomial, the only possibilities here are to explore which of the possible link functions fit the data best:
fit_esoph_logit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph, family = binomial(link = "logit")) fit_esoph_cauchit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph, family = binomial(link = "cauchit")) fit_esoph_probit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph, family = binomial(link = "probit")) AIC(fit_esoph_logit, fit_esoph_cauchit, fit_esoph_probit) ## you could also deviance
The probit link function leads to a model fitting the data best, so we will use it.
fit_esoph_probit_H0 <- glm(cbind(ncases, ncontrols) ~ 1, data = esoph, family = binomial(link = "probit")) lmtest::lrtest(fit_esoph_probit_H0, fit_esoph_probit)
The full model fits the data much better than the null model!
Linearity is not an issue since we considered all predictors as factorial variables. Collinearity should also not be an issue due to the neat sampling design.
esoph_probit_resid <- simulateResiduals(fit_esoph_probit, n = 1000) plot(esoph_probit_resid) testTemporalAutocorrelation(esoph_probit_resid, time = 1:nrow(esoph)) testTemporalAutocorrelation(esoph_probit_resid, time = fitted(fit_esoph_probit)) testZeroInflation(esoph_probit_resid) testDispersion(esoph_probit_resid)
Everything seems in order.
car::Anova(fit_esoph_probit)
Let us first try to plot the effects using functions from packages:
library(effects) plot(predictorEffects(fit_esoph_probit)) ## response scale plot(predictorEffects(fit_esoph_probit, transformation = list(link = NULL, inverse = NULL))) ## linear scale
library(visreg) visreg(fit_esoph_probit, scale = "response", ylim = c(0, 1)) visreg2d(fit_esoph_probit, "alcgp_f", "tobgp_f", plot.type = "image", scale = "response") visreg2d(fit_esoph_probit, "alcgp_f", "tobgp_f", plot.type = "persp", scale = "response")
fit_esoph_probit_spaMM <- fitme(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph, family = binomial(link = "probit")) #plot_effects(fit_esoph_probit_spaMM, focal_var = "agegp_f") ## buggy in current spaMM #plot_effects(fit_esoph_probit_spaMM, focal_var = "alcgp_f") #plot_effects(fit_esoph_probit_spaMM, focal_var = "tobgp_f") plot_effects_manual <- function(data) { ggplot(data) + aes(y = pointp, x = forcats::fct_inorder(focal_var), ymax = up, ymin = low) + geom_point() + geom_errorbar(width = 0.1) + scale_y_continuous(limits = c(0, 1)) + theme_classic() } plot_effects_manual(pdep_effects(fit_esoph_probit_spaMM, focal_var = "agegp_f")) plot_effects_manual(pdep_effects(fit_esoph_probit_spaMM, focal_var = "alcgp_f")) plot_effects_manual(pdep_effects(fit_esoph_probit_spaMM, focal_var = "tobgp_f"))
Note: the 3 packages differ in how they compute predictions!
This is not part of the exercise, but let us compare precisely the predictions for the age group 25-34 according to all 3 packages to better characterise and understand the differences.
We first need a little data wrangling to create a table comparing the predictions across methods:
effects_pred <- predictorEffect(fit_esoph_probit, predictor = "agegp_f") effects_pred <- as.data.frame(summary(effects_pred)[c("effect", "lower", "upper")])[1, ] colnames(effects_pred) <- c("fit", "lwr", "upr") visreg_pred <- visreg(fit_esoph_probit, scale = "response", plot = FALSE) visreg_pred <- visreg_pred[[1]]$fit[1, c("visregFit", "visregLwr", "visregUpr")] colnames(visreg_pred) <- c("fit", "lwr", "upr") spaMM_pred <- pdep_effects(fit_esoph_probit_spaMM, focal_var = "agegp_f") spaMM_pred <- spaMM_pred[1, c("pointp", "low", "up")] colnames(spaMM_pred) <- c("fit", "lwr", "upr") rbind(effects = effects_pred, visreg = visreg_pred, spaMM = spaMM_pred)
newdata <- data.frame(agegp_f = "25-34", alcgp_f = model.frame(fit_esoph_probit)$alcgp_f, tobgp_f = model.frame(fit_esoph_probit)$tobgp_f) family(fit_esoph_probit)$linkinv(mean(predict(fit_esoph_probit, newdata = newdata)))
table(esoph$alcgp_f) table(esoph$tobgp_f) newdata <- data.frame(agegp_f = "25-34", alcgp_f = "0-39g/day", tobgp_f = "0-9g/day") predict(fit_esoph_probit, newdata = newdata, type = "response")
pdep_effects()
from {spaMM} (under its default settings) computes partial-dependence effects (i.e. consider all values for non focal predictors, irrespective of the focal predictor, and compute the mean effect) and compute the mean AFTER transforming the data back to the scale of the response:newdata <- data.frame(agegp_f = "25-34", alcgp_f = model.frame(fit_esoph_probit)$alcgp_f, tobgp_f = model.frame(fit_esoph_probit)$tobgp_f) mean(predict(fit_esoph_probit, newdata = newdata, type = "response"))
Note: doing predictions yourself is perhaps the easiest way to know what is actually happening!
Doing things by hand is tedious but it gives you more clarity, control and you are no longer bounded to use a particular package.
Here for example, to plot the partial-dependence effects for all combinations of alcohol and tobacco categories we could do the following:
newdata_focal <- expand.grid(alcgp_f = levels(model.frame(fit_esoph_probit)$alcgp_f), tobgp_f = levels(model.frame(fit_esoph_probit)$tobgp_f)) newdata_nonfocal <- data.frame(agegp_f = model.frame(fit_esoph_probit)$agegp_f) newdata <- merge(newdata_focal, newdata_nonfocal) predicts <- cbind(newdata, fit = predict(fit_esoph_probit, newdata = newdata, type = "response")) d <- aggregate(fit ~ alcgp_f + tobgp_f, data = predicts, mean) d ggplot(d) + aes(y = alcgp_f, x = tobgp_f, fill = fit) + geom_tile() + scale_fill_binned(type = "viridis", breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) + theme_minimal()
To answer the question we may alternatively want to compare extreme situations (fixing all predictors to fixed values):
newdata_esoph_young <- expand.grid(agegp_f = c("25-34"), alcgp_f = c("0-39g/day", "120+"), tobgp_f = c("0-9g/day", "30+")) pred_esoph_young <- predict(fit_esoph_probit, newdata = newdata_esoph_young, type = "link", se.fit = TRUE) newdata_esoph_young$pred <- family(fit_esoph_probit)$linkinv(pred_esoph_young$fit) newdata_esoph_young$lwr <- family(fit_esoph_probit)$linkinv(pred_esoph_young$fit + qnorm(0.025)*pred_esoph_young$se.fit) newdata_esoph_young$upr <- family(fit_esoph_probit)$linkinv(pred_esoph_young$fit + qnorm(0.975)*pred_esoph_young$se.fit) newdata_esoph_young
Same for old people:
newdata_esoph_old <- expand.grid(agegp_f = c("75+"), alcgp_f = c("0-39g/day", "120+"), tobgp_f = c("0-9g/day", "30+")) pred_esoph_old <- predict(fit_esoph_probit, newdata = newdata_esoph_old, type = "link", se.fit = TRUE) newdata_esoph_old$pred <- family(fit_esoph_probit)$linkinv(pred_esoph_old$fit) newdata_esoph_old$lwr <- family(fit_esoph_probit)$linkinv(pred_esoph_old$fit + qnorm(0.025)*pred_esoph_old$se.fit) newdata_esoph_old$upr <- family(fit_esoph_probit)$linkinv(pred_esoph_old$fit + qnorm(0.975)*pred_esoph_old$se.fit) newdata_esoph_old
So again, drinking seems to be a worse than smoking but we are extrapolating a bit since very few individuals correspond to the predicted conditions! (see table()
call above).
One thing is clear, not smoking and not drinking seems to be the way to avoid these cancers.
We could also consider the interaction between alcohol and tobacco.
fit_esoph_probit2a <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f * tobgp_f, data = esoph, family = binomial(link = "probit")) car::Anova(fit_esoph_probit2a) summary(fit_esoph_probit2a)
For the ease of interpretation, when you have 2 factors in interaction you can also directly code their combination:
esoph$drink_smoke <- factor(paste(esoph$alcgp_f, esoph$tobgp_f, sep = "_")) fit_esoph_probit2b <- glm(cbind(ncases, ncontrols) ~ agegp_f + drink_smoke, data = esoph, family = binomial(link = "probit")) summary(fit_esoph_probit2b)
The fits fit_esoph_probit2a
and fit_esoph_probit2b
have the same likelihood and number of degrees of freedom.
They are equivalent, but just use a different parameterisation of the same thing.
The summary table from fit_esoph_probit2b
now directly compares all drink - alcohol combination to the reference group with no cigarette and minimum alcohol consumption.
Ideally we should again check all assumptions (you should do it!).
Let us do the minimum here:
plot(simulateResiduals(fit_esoph_probit2b, n = 1000))
Let us now compare if it is better to drink a lot and not smoke or the reverse:
library(multcomp) posthoc <- glht(fit_esoph_probit2b, linfct = c("`drink_smoke120+_0-9g/day` - `drink_smoke0-39g/day_30+` == 0")) par(oma = c(1, 5, 1, 1)) plot(posthoc) confint(posthoc) summary(posthoc, test = Chisqtest()) ## choose tests the better suited to your case
We have compared two different situations, using a posthoc approach. Drinking a lot seems to be worse than smoking a lot for this particular cancer.
We could also have considered the predictor as quantitative predictors:
fit_esoph_probit3 <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp + tobgp, data = esoph, family = binomial(link = "probit")) summary(fit_esoph_probit3)
The test on the polynomial contrasts (automatic due to the variable being ordered factors) suggest that linear (.L
) effects matter but that quadratic (.Q
) effects don't. For alcohol, there are hints of a slight cubic (.C
) effect, but we will ignore this for the sake of simplicity (the effect is small and in the absence of a quadratic effect, it is a little strange to interpret).
So let's recode the variable as truly quantitative and let's fit linear effects:
esoph$alc_num <- unclass(esoph$alcgp) esoph$tob_num <- unclass(esoph$tobgp) fit_esoph_probit4 <- glm(cbind(ncases, ncontrols) ~ agegp_f + alc_num + tob_num, data = esoph, family = binomial(link = "probit")) plot(simulateResiduals(fit_esoph_probit4, n = 1000)) car::crPlots(fit_esoph_probit4) ## partial residuals (to check linearity, albeit somewhat redundant with polynomial contrasts) AIC(fit_esoph_probit, fit_esoph_probit2b, fit_esoph_probit3, fit_esoph_probit4) confint(fit_esoph_probit4)
This last model seems quite good. Again, our results suggest that alcohol matters most. But do bear in mind that this model now compares the effect of increasing by one category in each predictor.
r .emo("goal")
Find out how the sex, age and passenger class influenced who died during the Titanic disaster of 1912.
head(TitanicSurvival) str(TitanicSurvival) summary(TitanicSurvival) with(data = TitanicSurvival, tapply(survived, sex, function(x) mean(x == "yes"))) with(data = TitanicSurvival, tapply(survived, passengerClass, function(x) mean(x == "yes"))) with(data = TitanicSurvival, table(cut(age, breaks = c(0, 10, 25, 35, 50, Inf), include.lowest = TRUE), ## make groups sex, passengerClass))
coplot(survived ~ age | sex * passengerClass, data = TitanicSurvival, panel = panel.smooth) library(car) scatterplot(survived == "yes" ~ age, data = TitanicSurvival)
The effect of age does not seem linear...
TitanicSurvival$surv <- TitanicSurvival$survived == "yes" TitanicSurvival$age_f <- cut(TitanicSurvival$age, breaks = c(0, 10, 25, 35, 50, Inf), include.lowest = TRUE) table(TitanicSurvival$age_f)
We fit a series of models with different links and considering age as factor or continuous:
fit_logit <- glm(surv ~ age_f + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "logit")) fit_probit <- glm(surv ~ age_f + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "probit")) fit_cauchit <- glm(surv ~ age_f + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "cauchit")) fit_logit2 <- glm(surv ~ age + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "logit")) fit_probit2 <- glm(surv ~ age + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "probit")) fit_cauchit2 <- glm(surv ~ age + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "cauchit")) AIC(fit_logit, fit_probit, fit_cauchit, fit_logit2, fit_probit2, fit_cauchit2)
The cauchit model with age as a factor provides a much better fit than other models, so we will select that one.
library(DHARMa) resid_Titanic <- simulateResiduals(fit_cauchit, n = 1000) plot(resid_Titanic) testTemporalAutocorrelation(resid_Titanic, time = order(fitted(fit_cauchit))) testTemporalAutocorrelation(resid_Titanic, time = 1:nrow(model.frame(fit_cauchit)))
There is a little bit of serial autocorrelation but it is low and hard to understand (observation are sorted by alphabetical order) so we will ignore that for now.
There is no need to test for overdispersion in a binary model!
data.for.pred <- with(data = TitanicSurvival, expand.grid(age_f = levels(age_f), passengerClass = levels(passengerClass), sex = levels(sex))) pred <- predict(fit_cauchit, newdata = data.for.pred, type = "link") data.to.plot <- cbind(data.for.pred, pred = pred) coplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot) library(lattice) ## same, more fancy xyplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot)
Another way to cope with the non-linearity of the age effect would be to do a GAM model:
library(mgcv) fit_cauchit_gam <- gam(surv ~ s(age) + passengerClass + sex, data = TitanicSurvival, family = binomial(link = "cauchit")) AIC(fit_cauchit_gam)
The model is a bit better, we could use it.
cauchit_gam_resid <- simulateResiduals(fit_cauchit_gam) plot(cauchit_gam_resid)
The QQ plot is good (left), but the uniformity plot on the right suggests things are not quite perfect.
Two predictors are factors the third is already modelled quite flexibly using GAM, so the issue is probably not about how we modelled each variable but more about something important we are missing.
Let's see if an interaction between passengerClass
& sex
would improve things:
fit_cauchit_gam2 <- gam(surv ~ s(age) + passengerClass*sex, data = TitanicSurvival, family = binomial(link = "cauchit")) AIC(fit_cauchit_gam2) cauchit_gam_resid2 <- simulateResiduals(fit_cauchit_gam2) plot(cauchit_gam_resid2)
Everything is much better! Ideally we should re-examine what link function is best but let's stick to what we have for simplicity.
For GAM models, we need to use anova()
(which calls mgcv::anova.gam()
and thus does perform an type I anova but a marginal anova -- documentation says type III, so not type II as car::Anova()
unfortunately, but perhaps the documentation is wrong... I did not check):
anova(fit_cauchit_gam2)
data.for.pred2 <- with(data = TitanicSurvival, expand.grid(age_f = levels(TitanicSurvival$age_f), passengerClass = levels(passengerClass), sex = levels(sex))) data.for.pred2$age <- NA data.for.pred2$age[data.for.pred2$age_f == "[0,10]"] <- (10 - 0)/2 data.for.pred2$age[data.for.pred2$age_f == "(10,25]"] <- (25 + 10)/2 data.for.pred2$age[data.for.pred2$age_f == "(25,35]"] <- (25 + 35)/2 data.for.pred2$age[data.for.pred2$age_f == "(35,50]"] <- (35 + 50)/2 data.for.pred2$age[data.for.pred2$age_f == "(50,Inf]"] <- (50 + 81)/2 pred_age_f <- cbind(data.for.pred2, fit = predict(fit_cauchit, newdata = data.for.pred2, type = "response")) pred_age_gam <- cbind(data.for.pred2, fit = predict(fit_cauchit_gam2, newdata = data.for.pred2, type = "response"))
ggplot(pred_age_f) + aes(y = fit, x = age, colour = passengerClass, linetype = sex, shape = sex) + geom_point() + geom_line() + scale_y_continuous(limits = c(0, 1)) + theme_classic() + labs(title = "Age as factor", y = "Predicted survival prob.") -> plot1 ggplot(pred_age_gam) + aes(y = fit, x = age, colour = passengerClass, linetype = sex, shape = sex) + geom_point() + geom_line() + scale_y_continuous(limits = c(0, 1)) + theme_classic() + labs(title = "Age as GAM", y = "Predicted survival prob.") -> plot2 library(patchwork) ## to combine plots and legend plots12 <- plot1 + plot2 & theme(legend.position = "bottom") plots12 + plot_layout(guides = "collect")
Notes:
as the model is not fitted with link = logit
, the odd ratios are not constant with age, so we cannot quantify effects using them.
the two fits plotted also differ by the fact that the one on the right considers an interaction but not the one on the right.
Try to figure out why predictions are so different for the young males from the second class.
r .emo("goal")
What was the probability of an O-ring experiencing thermal distress when the space shuttle Challenger took off on the 28th of January 1986 by 31 degrees F?
str(Challenger) summary(Challenger) table(nb_orings = Challenger$oring_tot, nb_pb_orings = Challenger$oring_dt)
plot(jitter(oring_dt, factor = 0.3) ~ temp, data = Challenger)
Notice the use of jitter!
Challenger$oring_fine <- Challenger$oring_tot - Challenger$oring_dt
fit_binom_logit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger, family = binomial(link = "logit")) fit_binom_probit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger, family = binomial(link = "probit")) fit_binom_cauchit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger, family = binomial(link = "cauchit")) AIC(fit_binom_logit, fit_binom_probit, fit_binom_cauchit)
We could choose the logit or the probit model. It would be easier with the logit to interpret the estimates but we do not need this here, so let us consider the probit model since it is best.
## let's use 2 seeds since it is a small sample size r_probit <- simulateResiduals(fit_binom_probit, n = 1000, seed = 1) plot(r_probit) r_probit2 <- simulateResiduals(fit_binom_probit, n = 1000, seed = 2) plot(r_probit2) testTemporalAutocorrelation(r_probit, time = 1:nrow(Challenger)) testTemporalAutocorrelation(r_probit, time = order(fitted(fit_binom_probit))) testTemporalAutocorrelation(r_probit, time = order(Challenger$temp))
Things seems OK although it is hard to be sure on such a small dataset.
car::Anova(fit_binom_probit) summary(fit_binom_probit)
Let's compare the prediction using both the probit and logit link, since they lead to quite equivalent fits:
pred_probit <- pred_logit <- data.frame(temp = 31:81) ## Predict probit pred_probit_res <- predict(fit_binom_probit, newdata = pred_probit, se.fit = TRUE) pred_probit$fit <- family(fit_binom_probit)$linkinv(pred_probit_res$fit) pred_probit$lwr <- family(fit_binom_probit)$linkinv( pred_probit_res$fit + qnorm(0.025) * pred_probit_res$se.fit) pred_probit$upr <- family(fit_binom_probit)$linkinv( pred_probit_res$fit + qnorm(0.975) * pred_probit_res$se.fit) pred_probit ## Predict logit pred_logit_res <- predict(fit_binom_logit, newdata = pred_logit, se.fit = TRUE) pred_logit$fit <- family(fit_binom_logit)$linkinv(pred_logit_res$fit) pred_logit$lwr <- family(fit_binom_logit)$linkinv( pred_logit_res$fit + qnorm(0.025) * pred_logit_res$se.fit) pred_logit$upr <- family(fit_binom_logit)$linkinv( pred_logit_res$fit + qnorm(0.975) * pred_logit_res$se.fit) pred_logit
Note: we could have done it only for 31 degrees, but computing predictions for a range of temperature allows to better understand the situation.
all_preds <- rbind(cbind(link = "probit", pred_probit), cbind(link = "logit", pred_logit)) ggplot() + geom_line(aes(y = 1 - fit, x = temp, colour = link), data = all_preds) + geom_ribbon(aes(ymin = 1 - upr, ymax = 1 - lwr, x = temp, fill = link), alpha = 0.2, data = all_preds) + geom_jitter(aes(y = oring_dt/oring_tot, x = temp), shape = 1, height = 0.005, data = Challenger) + labs(y = "Probabiltiy that at least one oring will fail") + theme_classic()
Notes:
we do 1 - fit
because our models predict the probability of having no failure.
the uncertainty is large, but both model agree that launching the shuttle at 31 degree F was a very risky...
Perhaps it would be good to test if a non-linear effect of the temperature would fit the data better (using a GAM model) although that may lead to poor extrapolation (the more flexible the function, the more unreliable the extrapolations).
Since a single failure could be fatal, perhaps a binary model would make more sense!
You could try to replicate the original study that excluded all the OK flights! (although this turned out deadly and not just silly).
r .emo("goal")
Try to identify the influential determinants of the smoking behaviour of children.
str(UK) lapply(UK, summary) UK$cigarettes_kid_bin <- UK$cigarettes_kid != "Never" ## we recode the variable to be binary table(UK$cigarettes_kid_bin, UK$cigarettes_friends)
Let's explore the data to see if there are any obvious effects:
scatterplot(cigarettes_kid_bin ~ age + sex, data = UK) scatterplot(cigarettes_kid_bin ~ cigarettes + sex, data = UK)
There could be some small interactions with sex.
It is not clear what model to fit a priori, so let us consider sex, age, and the number of cigarettes smoked by parents and friends.
fit_logit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends), family = binomial(link = "logit"), data = UK) fit_probit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends), family = binomial(link = "probit"), data = UK) fit_cauchit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends), family = binomial(link = "cauchit"), data = UK) AIC(fit_logit, fit_probit, fit_cauchit)
lmtest::lrtest( update(fit_logit, . ~ 1, data = fit_logit$model), fit_logit)
resid_fit_logit <- simulateResiduals(fit_logit, n = 1000) plot(resid_fit_logit) testQuantiles(resid_fit_logit) ## to clarify what's happening in the previous plot on the right. testTemporalAutocorrelation(resid_fit_logit, time = order(fitted(fit_logit))) testTemporalAutocorrelation(resid_fit_logit, time = 1:nrow(model.matrix(fit_logit)))
car::Anova(fit_logit)
Let's test all interactions at once:
lmtest::lrtest(update(fit_logit, . ~ . - sex:(cigarettes + age + cigarettes_friends), data = fit_logit$model), fit_logit)
There is no clear sign of interactions, so we will drop them for simplicity.
fit_logit2 <- update(fit_logit, . ~ . - sex:(cigarettes + age + cigarettes_friends), data = fit_logit$model) summary(fit_logit2) car::Anova(fit_logit2)
All predictors are highly significant but this is not very surprising since the dataset is very large.
Let's refit the model with {spaMM} so as to compute predictions more easily.
fit_logit2_spaMM <- fitme(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends), family = binomial(link = "logit"), data = UK) plot_effects(fit_logit2_spaMM, focal_var = "cigarettes") plot_effects(fit_logit2_spaMM, focal_var = "age")
In the current version of {spaMM} plot_effects()
does not work when the focal variable is a factor, but we can do the plot manually using the function plot_effects_manual()
we wrote above and spaMM::pdep_effects()
for the computation of the predictions:
effect_cigarettes_friends <- pdep_effects(fit_logit2_spaMM, focal_var = "cigarettes_friends") effect_sex <- pdep_effects(fit_logit2_spaMM, focal_var = "sex") plot_effects_manual(effect_cigarettes_friends) plot_effects_manual(effect_sex)
By far, the strongest correlate is whether friends are smoking or not.
You may try different model formulas including different predictors (biological knowledge would help not to go fishing...).
r .emo("goal")
Study the influences of age, body mass index and smoking status upon the probability of a woman being in menopause.
r .emo("goal")
Estimate the proportion of women in menopause in the population, depending on their age (40, 45, 50, 55, 60 yrs).
Note: here, for the sake of simplicity we simply define menopause as the lack of periods in old ages.
str(HSE98women)
We first create the response variable:
HSE98women$menopause <- HSE98women$period == FALSE
We also recode the smoked
variable as a factor (improves some plots):
HSE98women$smoked2 <- factor(HSE98women$smoked, labels = c("non-smoker", "smoker")) table(HSE98women$smoked, HSE98women$smoked2) ## always check you did the right thing!
We fit a first model to help us easily exploring the data used for fitting and not all data (there are many missing values in the dataset).
fit_menop_logit <- glm(menopause ~ age + bmi + smoked2, data = HSE98women, family = binomial()) nrow(HSE98women) ## original number of observations nrow(fit_menop_logit$model) ## number of observations considered in the model
We compare the variables before and after filtering to see if any obvious bias is introduced by the observations we discard:
summary(HSE98women[, c("menopause", "age", "bmi", "smoked2")]) summary(fit_menop_logit$model)
The only big discrepancy between the two datasets is that the minimal age is now 18 yrs old while the original dataset also contains children. That suggests that girls too young to have periods are not being considered, which is not an issue for studying menopause.
Let's check the correlation between our two continuous predictor:
cor.test(HSE98women$age, HSE98women$bmi) cor.test(fit_menop_logit$model$age, fit_menop_logit$model$bmi)
The correlation is not very high, especially in the dataset that is used for the fit, so multicollinearity should not be an issue.
We visualise the relationship between the predictor and the response:
scatterplot(menopause ~ age + smoked2, data = fit_menop_logit$model) scatterplot(menopause ~ bmi + smoked2, data = fit_menop_logit$model)
There don't seem to be any obvious interaction between the age and smoked or between BMI and smoked, so we won't consider any. We will however consider that the effect of age may depend on the BMI. The BMI also seems to exert a non monotonous effect.
fit_menop_logit <- glm(menopause ~ age * bmi + smoked2, data = HSE98women, family = binomial(link = "logit")) fit_menop_probit <- glm(menopause ~ age * bmi + smoked2, data = HSE98women, family = binomial(link = "probit")) fit_menop_cauchit <- glm(menopause ~ age * bmi + smoked2, data = HSE98women, family = binomial(link = "cauchit"))
Because it is very difficult to explore the shape of the relationship between the age and BMI upon the probability of being menopaused, we will also fit general additive models (GAMs), which allow for parameters to vary along the predictors.
fit_menop_logit_gam <- gam(menopause ~ s(age, bmi) + smoked2, data = HSE98women, family = binomial(link = "logit")) fit_menop_probit_gam <- gam(menopause ~ s(age, bmi) + smoked2, data = HSE98women, family = binomial(link = "probit")) fit_menop_cauchit_gam <- gam(menopause ~ s(age, bmi) + smoked2, data = HSE98women, family = binomial(link = "cauchit"))
AIC(fit_menop_logit,
fit_menop_probit,
fit_menop_cauchit,
fit_menop_logit_gam,
fit_menop_probit_gam,
fit_menop_cauchit_gam)
(Note that GAMs have non-integer degrees of freedom. This is normal.)
The GAM logit model seems to be the best model, which will prevent us to express simply the effect of coefficients compared to a usual GLM. That's too bad but since the GAM fits are much better, we will go for those. Among those, we will consider the one with the logit link since it is both more simple to discuss and it here fits the data best.
The best practices for testing effects in GAMs are not quite established.
We are going to test for the effect of the interaction by Likelihood Ratio Test (LRT):
fit_menop_logit_gam_no_int <- gam(menopause ~ s(age) + s(bmi) + smoked2, family = binomial(), data = fit_menop_logit_gam$model) lmtest::lrtest(fit_menop_logit_gam_no_int, fit_menop_logit_gam)
Let's now test the overall effect of BMI by LRT:
fit_menop_logit_gam_no_bmi <- gam(menopause ~ s(age) + smoked2, family = binomial(), data = fit_menop_logit_gam$model) lmtest::lrtest(fit_menop_logit_gam_no_bmi, fit_menop_logit_gam)
Let's now test the overall effect of age by LRT:
fit_menop_logit_gam_no_age <- gam(menopause ~ s(bmi) + smoked2, family = binomial(), data = fit_menop_logit_gam$model) lmtest::lrtest(fit_menop_logit_gam_no_age, fit_menop_logit_gam)
Let's now test the effect of smoking by LRT:
fit_menop_logit_gam_never_smoked <- gam(menopause ~ s(age, bmi), family = binomial(), data = fit_menop_logit_gam$model) lmtest::lrtest(fit_menop_logit_gam_never_smoked, fit_menop_logit_gam)
The package {mgcv} has a plotting function but in this particular case, the output is hard to read in the link scale and the conversion to response scale crashes for some reason....
plot(fit_menop_logit_gam)
Let's thus explore the model output with the package {visreg}, which we saw before and which also works on GAM fits (but remember that it computes predictions with weird meaning):
visreg(fit_menop_logit_gam)
These plots show the patterns identified by the smoothing parameters in the GAM: the effect of age seems to change after 40 yrs and the effect of BMI seem to be changing between 23 and 32 Kg/(m^2). It is difficult to account for such threshold effects using polynomials, so we will have to stick to GAMs here...
Let's do the same plots on the scale of the response variable:
visreg(fit_menop_logit_gam, scale = "response") visreg2d(fit_menop_logit_gam, "age", "bmi", scale = "response") #visreg2d(fit_menop_logit_gam, "age", "bmi", scale = "response", plot.type = "rgl")
We will now plot the predictions for the GLM fit with the logit link, the GLM fit with the cauchit link and the GAM fit with the logit link to explore their differences.
We will predict the influence of age and consider the median BMI and smokers.
age.for.pred <- seq(min(fit_menop_cauchit_gam$model$age), max(fit_menop_cauchit_gam$model$age), length = 100) data.for.pred <- expand.grid( age = age.for.pred, bmi = median(fit_menop_cauchit_gam$model$bmi), smoked2 = "smoker" ) logit.p <- predict(fit_menop_logit, newdata = data.for.pred, se.fit = TRUE) predict_GLM_logit <- data.frame(fit = family(fit_menop_logit)$linkinv(logit.p$fit), upr = family(fit_menop_logit)$linkinv(logit.p$fit + qnorm(0.975) * logit.p$se.fit), lwr = family(fit_menop_logit)$linkinv(logit.p$fit + qnorm(0.025) * logit.p$se.fit)) cauchit.p <- predict(fit_menop_cauchit, newdata = data.for.pred, se.fit = TRUE) predict_GLM_cauchit <- data.frame(fit = family(fit_menop_cauchit)$linkinv(cauchit.p$fit), upr = family(fit_menop_cauchit)$linkinv(cauchit.p$fit + qnorm(0.975) * cauchit.p$se.fit), lwr = family(fit_menop_cauchit)$linkinv(cauchit.p$fit + qnorm(0.025) * cauchit.p$se.fit)) gam.p <- predict(fit_menop_logit_gam, newdata = data.for.pred, se.fit = TRUE) predict_GAM_logit <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p$fit), upr = family(fit_menop_logit_gam)$linkinv(gam.p$fit + qnorm(0.975) * gam.p$se.fit), lwr = family(fit_menop_logit_gam)$linkinv(gam.p$fit + qnorm(0.025) * gam.p$se.fit)) all_predictions_for_age <- rbind( cbind(data.for.pred, method = "GLM_logit", predict_GLM_logit), cbind(data.for.pred, method = "GLM_cauchit", predict_GLM_cauchit), cbind(data.for.pred, method = "GAM_logit", predict_GAM_logit)) head(all_predictions_for_age) ## show the top of the table we created
Let's plot the result:
ggplot(all_predictions_for_age) + aes(y = fit, ymin = lwr, ymax = upr, x = age, colour = method) + geom_ribbon(fill = NA, linetype = "dashed", size = 0.2) + geom_line() + labs(y = "Probability of being in menopause", x = "Age (yrs)") + theme_classic() + theme(legend.position = "top")
Selecting the best model, let's now look at the effect of BMI, still using smokers:
data.for.pred2 <- expand.grid( age = age.for.pred, bmi = c(18.5, 25, 30), smoked2 = "smoker") gam.p2 <- predict(fit_menop_logit_gam, newdata = data.for.pred2, se.fit = TRUE) predict_GAM_logit2 <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p2$fit), upr = family(fit_menop_logit_gam)$linkinv(gam.p2$fit + qnorm(0.975) * gam.p2$se.fit), lwr = family(fit_menop_logit_gam)$linkinv(gam.p2$fit + qnorm(0.025) * gam.p2$se.fit)) predictions_for_age_by_BMI <- cbind(data.for.pred2, predict_GAM_logit2) ggplot(predictions_for_age_by_BMI) + aes(y = fit, ymin = lwr, ymax = upr, x = age, colour = factor(bmi)) + geom_ribbon(fill = NA, linetype = "dashed", size = 0.2) + geom_line() + labs(y = "Probability of being in menopause", x = "Age (yrs)") + theme_classic() + theme(legend.position = "top")
We can see on this plot that obesity seems to accelerate the onset of menopause, but it does not seem to make any difference later on.
We will now predict the proportion of women in menopause at 40, 45, 50, 55 and 60 yrs. To illustrate the effect of the BMI we will perform predictions for a BMI of 21.75 (middle of normal BMI range according to WHO) and 30 (lower limit for obesity) Kg/(m^2). Again, we will perform the predictions considering smoker since we have more data for this category. We will also compute the confidence intervals.
data.for.pred3 <- expand.grid( age = c(40, 45, 50, 55, 60), bmi = c(21.75, 30), smoked2 = "smoker") gam.p3 <- predict(fit_menop_logit_gam, newdata = data.for.pred3, se.fit = TRUE) predict_GAM_logit3 <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p3$fit), upr = family(fit_menop_logit_gam)$linkinv(gam.p3$fit + qnorm(0.975) * gam.p3$se.fit), lwr = family(fit_menop_logit_gam)$linkinv(gam.p3$fit + qnorm(0.025) * gam.p3$se.fit)) predictions_at_specific_ages <- cbind(data.for.pred3, predict_GAM_logit3) predictions_at_specific_ages
These table confirms our expectation concerning the influence of BMI upon the onset of menopause.
For the sake of comparison, let's compare predictions (without CI this time, for simplicity) using the other candidate fits:
data.for.pred4 <- expand.grid(age = c(40, 45, 50, 55, 60), bmi = 21.75, smoked2 = "smoker") pred_GAM4 <- predict(fit_menop_logit_gam, newdata = data.for.pred4, type = "response") pred_logit4 <- predict(fit_menop_logit, newdata = data.for.pred4, type = "response") pred_cauchit4 <- predict(fit_menop_cauchit, newdata = data.for.pred4, type = "response") predictions_at_specific_ages2 <- cbind(data.for.pred4, logit_GAM = pred_GAM4, logit_GLM = pred_logit4, cauchit_GLM = pred_cauchit4) predictions_at_specific_ages2
Let's look at the same proportion in the raw data. We cannot be as strict during selection otherwise we won't get much individuals, so let's enlarge a bit the selection and see that we get a few hundred individuals at least:
tab <- with(HSE98women, data.frame(N = c( sum(age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25, na.rm = TRUE), sum(age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25, na.rm = TRUE), sum(age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25, na.rm = TRUE), sum(age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25, na.rm = TRUE), sum(age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25, na.rm = TRUE) ))) tab
That seems a good selection process (+/- 1 year and the whole WHO normal BMI category), so let's check the proportion of individuals in menopause in each of these categories (keeping both smokers and non smokers as the effect is rather small, see later):
tab <- with(HSE98women, data.frame(observed_freq = c( mean(menopause[age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25], na.rm = TRUE), mean(menopause[age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25], na.rm = TRUE), mean(menopause[age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25], na.rm = TRUE), mean(menopause[age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25], na.rm = TRUE), mean(menopause[age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25], na.rm = TRUE) ))) cbind(predictions_at_specific_ages2, tab)
These results are roughly consistent with the predictions from the GAM model. It confirms that the GLM logit and cauchit are underestimating the prevalence of menopause at old age. All models however may overestimate the prevalence of menaupose at young age (although the problem seems worst in GLM logit). Depending on the goal of the study, other link functions, not implemented in base R, could be tried but it is quite advanced so we won't do that.
We can now turn to the exploration of the smoking effect by comparing predictions between smokers and non smokers at median BMI:
Selecting the best model, let's now look at the effect of BMI, still using smokers:
data.for.pred5 <- expand.grid( age = age.for.pred, bmi = 21.75, smoked2 = (c("smoker", "non-smoker"))) gam.p5 <- predict(fit_menop_logit_gam, newdata = data.for.pred5, se.fit = TRUE) predict_GAM_logit5 <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p5$fit), upr = family(fit_menop_logit_gam)$linkinv(gam.p5$fit + qnorm(0.975) * gam.p5$se.fit), lwr = family(fit_menop_logit_gam)$linkinv(gam.p5$fit + qnorm(0.025) * gam.p5$se.fit)) predictions_for_age_by_smoking <- cbind(data.for.pred5, predict_GAM_logit5) ggplot(predictions_for_age_by_smoking) + aes(y = fit, ymin = lwr, ymax = upr, x = age, colour = factor(smoked2)) + geom_ribbon(fill = NA, linetype = "dashed", size = 0.2) + geom_line() + labs(y = "Probability of being in menopause", x = "Age (yrs)") + theme_classic() + theme(legend.position = "top")
Although significant, the effect of smoking is rather small. Because the interaction between the smoking status and the BMI was not significant, we won't draw predictions to explore that.
We did not compute predictions properly. Ideally instead of considering fixed values for non-focal predictors, we could in some case compute partial-dependence effects. There is no function doing that on GAM fits directly, so it would have to be done by hand...
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.