In courtiol/LM2GLMM: Data science for biologists: Generalized linear modelling with R

library(LM2GLMM)
knitr::opts_chunk$set(cache = TRUE, fig.align = "center", fig.width = 7, fig.height = 6,
                      cache.path = "./cache_knitr/Exo_GLM_solution/", fig.path = "./fig_knitr/Exo_GLM_solution/")
options(width = 100, rmarkdown.html_vignette.check_title = FALSE)

Disclamer

In this vignette I illustrate the key steps required to solve the exercises. By no means I am trying to provide a detailed report of the analysis of each dataset. Any such report may need additional analyses and results would have to be written in good English, not in telegraphic style or lines of codes.

Dataset: InsectSprays (difficulty = 2/5)

r .emo("goal") Test whether insecticide type significantly affects the number of insects in agricultural experimental units.

r .emo("goal") Compare the effect of spray C to all other spray types.

r .emo("goal") What is the mean number of insects we would predict to find on a plot treated with each insecticide spray. Compare results between the LM approach (with and without BoxCox transformation) to those from the GLM approach.

Note: I will not check the assumptions for LM, nor detail the dataset (and plotting variables) since we did it before (see LM exercises).

Fitting the models

We fit the LM:

fit_insect <- lm(count ~ spray, data = InsectSprays)

We fit the LM on the BoxCox transformed response:

InsectSprays$count_bis <- InsectSprays$count + 1
fit_insect_bis <- update(fit_insect, count_bis ~ .)
bc <- car::powerTransform(fit_insect_bis)
InsectSprays$count_bc <- car::bcPower(InsectSprays$count_bis, bc$lambda)
fit_insect_bc <- update(fit_insect_bis, count_bc ~ .)

We fit the GLM:

fit_insect_glm <- glm(count ~ spray, data = InsectSprays, family = poisson())

Checking the assumptions for GLM

All assumptions about model structure have been checked before (see exercises for LM).

We still have to test all assumptions about the errors.

library(DHARMa)
insect_glm_resid <- simulateResiduals(fit_insect_glm, n = 1000)
plot(insect_glm_resid) ## many tests at once

This reveals one issue concerning the dispersion. Let's thus look at this in more details:

testDispersion(insect_glm_resid)

The dispersion value is higher than expected, so there is slight overdispersion.

Let's attempt to solve this issue before testing other assumptions.

Solving the overdispersion issue

Is the overdispersion caused by two many zeros?

table(fit_insect_glm$model$count) ## we count the zeros and other counts
testZeroInflation(insect_glm_resid)

nope... the number of observed zeros matches the expectation of the model.

Let's try to fit an alternative model to account for the overdispersion:

library(spaMM)
fit_insect_glm_nb <- fitme(count ~ spray, data = InsectSprays, family = spaMM::negbin())
AIC(fit_insect_glm)
AIC(fit_insect_glm_nb)

The negative binomial model fits the data much better as shown by its smaller deviance.

Note that the prediction are the same under the 2 models:

plot(predict(fit_insect_glm, type = "response"), fit_insect_glm$model$count, col = "red", lwd = 6)
abline(0, 1, col = "red", lwd = 6)
points(predict(fit_insect_glm_nb), fit_insect_glm_nb$data$count, col = "blue")
abline(0, 1, col = "blue")

The only thing that differs is the residual variance around those predictions (not shown).

Note also that the Gaussian models have a deviance expressed on another scale, which makes the comparison between models particularly difficult when using Gaussian models (but this is not the case here).

Let us stick to the negative binomial model as our best GLM!

Re-testing assumptions for the fit of the negative binomial model

insect_glm_nb_resid   <- simulateResiduals(fit_insect_glm_nb, n = 1000)
testDispersion(insect_glm_nb_resid)

The dispersion issue is solved, but what about the other assumptions?

plot(insect_glm_nb_resid)
testTemporalAutocorrelation(insect_glm_nb_resid, time = 1:nrow(model.frame(fit_insect_glm_nb)))
testTemporalAutocorrelation(insect_glm_nb_resid, time = order(fitted(fit_insect_glm_nb)))
testZeroInflation(insect_glm_nb_resid)

Everything looks great!

Tests comparing the fitted models to their respective null models

anova(fit_insect, update(fit_insect, . ~ 1))  ## F-test for LM
anova(fit_insect_bc, update(fit_insect_bc, . ~ 1))  ## F-test for LM with BoxCox transformation
lmtest::lrtest(fit_insect_glm, update(fit_insect_glm, . ~ 1)) ## LRT for GLM Poisson
anova(fit_insect_glm_nb, update(fit_insect_glm_nb, . ~ 1)) ## LRT for GLM NegBin

No matter the models, the effect of the insecticide is detected!

Comparing effect of spray C to others

Let's illustrate how to do that for the two GLMs since they have been fitted with 2 different packages ({stats} and {spaMM} ):

GLM Poisson

library(multcomp)
comp_glm <- glht(fit_insect_glm, linfct = mcp(spray = c("C - A = 0",
                                                        "C - B = 0",
                                                        "C - D = 0",
                                                        "C - E = 0",
                                                        "C - F = 0")))
summary(comp_glm)

GLM NegBin

comp_glm_nb <- glht(fit_insect_glm_nb, linfct = mcp(spray = c("C - A = 0",
                                                              "C - B = 0",
                                                              "C - D = 0",
                                                              "C - E = 0",
                                                              "C - F = 0")),
                    coef. = fixef.HLfit) ## needed for glht to work with spaMM
summary(comp_glm_nb)
plot(comp_glm_nb)

The sprays C is the most effective one (fewer insects remaining) but we cannot exclude that E is equally effective.

Predictions and method comparison

We now compute the predictions and CI for the different models.

LM

data.for.pred <- data.frame(spray = levels(InsectSprays$spray))
pred_lm <- cbind(data.for.pred,
                 predict(fit_insect, newdata = data.for.pred, interval = "confidence"))
pred_lm

LM with BoxCox

pred_bc <- cbind(data.for.pred,
                 predict(fit_insect_bc, newdata = data.for.pred, interval = "confidence"))
f.inverse <- function(x) car::bcnPowerInverse(x, lambda = bc$lambda, gamma = 0) - 1
## Note: we removed 1 as we added 1 so to be able to do the BoxCox transformation
pred_bc$fit <- f.inverse(x = pred_bc$fit)
pred_bc$lwr <- f.inverse(x = pred_bc$lwr) 
pred_bc$upr <- f.inverse(x = pred_bc$upr)
pred_bc

GLM Poisson

pred_glm_poiss_eta <- predict(fit_insect_glm, newdata = data.for.pred, se.fit = TRUE)
pred_glm_poiss <- cbind(data.for.pred,
                     data.frame(fit = family(fit_insect_glm)$linkinv(pred_glm_poiss_eta$fit)))
pred_glm_poiss$lwr <- family(fit_insect_glm)$linkinv(pred_glm_poiss_eta$fit +
                                                     qnorm(0.025)*pred_glm_poiss_eta$se.fit)
pred_glm_poiss$upr <- family(fit_insect_glm)$linkinv(pred_glm_poiss_eta$fit +
                                                     qnorm(0.975)*pred_glm_poiss_eta$se.fit)
pred_glm_poiss

GLM NegBin

pred_glm_nb_table <- cbind(data.for.pred,
                           predict(fit_insect_glm_nb, newdata = data.for.pred),
                           get_intervals(fit_insect_glm_nb, newdata = data.for.pred,
                                         intervals = "predVar"))
colnames(pred_glm_nb_table)[-1] <- c("fit", "lwr", "upr")
pred_glm_nb_table

Combining predicitions

We combine the predictions so as to be able to plot them easily:

predictions_all <- rbind(cbind(pred_lm, method = "LM"),
                         cbind(pred_bc, method = "LM BoxCox"),
                         cbind(pred_glm_poiss, method = "GLM Poisson"),
                         cbind(pred_glm_nb_table, method = "GLM NegBin"))

## declare method as factor for better display
predictions_all$method <- forcats::fct_inorder(predictions_all$method)

Plot

library(ggplot2)
ggplot() +
  geom_dotplot(aes(y = count, x = spray),
               data = InsectSprays,
               binwidth = 0.5,
               binaxis = "y", stackdir = "center",
               fill = "white") +
  geom_point(aes(y = fit, x = spray, colour = method),
             data = predictions_all,
             position = position_dodge(width = 0.5)) + ## for spreading things appart
  geom_errorbar(aes(y = fit, x = spray, colour = method, ymin = lwr, ymax = upr),
                data = predictions_all, 
                position = position_dodge(width = 0.5)) +
  theme_classic()

Conclusions:

GLM and LM not BoxCoxed fits yield similar point predictions.
GLM and LM BoxCoxed fits yield similar CI.

Dataset: LM2GLMM::Surprise (difficulty = 0/5)

r .emo("goal") Do children value more the type of the present, the cost of the present, or both?

Solution

# coplot(happiness ~ price | type, data = Surprise)

Dataset: esoph (difficulty = 3/5)

r .emo("goal") Would it be better to limit alcohol consumption or tobacco consumption in order to avoid developing an oesophageal cancer? Use a GLM to find out.

Exploration of the data

esoph
str(esoph)

The factors are ordered which -- by default -- will select for different contrasts (polynomial contrasts), so we set them as unordered (note: polynomial contrasts are interesting to identify how to best model ordinal variables as quantitative variables):

esoph$agegp_f <- factor(esoph$agegp, ordered = FALSE)
esoph$alcgp_f <- factor(esoph$alcgp, ordered = FALSE)
esoph$tobgp_f <- factor(esoph$tobgp, ordered = FALSE)

summary(esoph)
esoph$cancer_freq <- esoph$ncases / (esoph$ncases + esoph$ncontrols)
esoph$cancer_N <- esoph$ncases + esoph$ncontrols
summary(esoph$cancer_freq) ## proportion of cancer
table(esoph$alcgp_f, esoph$tobgp_f)
table(esoph$alcgp_f, esoph$tobgp_f, esoph$agegp_f)

We plot the data:

coplot(cancer_freq ~ tobgp_f | agegp_f, data = esoph, panel = panel.smooth)
coplot(cancer_freq ~ alcgp_f | agegp_f, data = esoph, panel = panel.smooth)

We fit the models

Since the outcome is binomial, the only possibilities here are to explore which of the possible link functions fit the data best:

fit_esoph_logit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
                       family = binomial(link = "logit"))
fit_esoph_cauchit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
                         family = binomial(link = "cauchit"))
fit_esoph_probit <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f, data = esoph,
                        family = binomial(link = "probit"))
AIC(fit_esoph_logit, fit_esoph_cauchit, fit_esoph_probit) ## you could also deviance

The probit link function leads to a model fitting the data best, so we will use it.

Comparison to null model

fit_esoph_probit_H0 <- glm(cbind(ncases, ncontrols) ~ 1, data = esoph,
                           family = binomial(link = "probit"))
lmtest::lrtest(fit_esoph_probit_H0, fit_esoph_probit)

The full model fits the data much better than the null model!

Testing the model assumptions

Linearity is not an issue since we considered all predictors as factorial variables. Collinearity should also not be an issue due to the neat sampling design.

esoph_probit_resid <- simulateResiduals(fit_esoph_probit, n = 1000)
plot(esoph_probit_resid)
testTemporalAutocorrelation(esoph_probit_resid, time = 1:nrow(esoph))
testTemporalAutocorrelation(esoph_probit_resid, time = fitted(fit_esoph_probit))
testZeroInflation(esoph_probit_resid)
testDispersion(esoph_probit_resid)

Everything seems in order.

Testing the model effects

car::Anova(fit_esoph_probit)

Plotting the effects

Let us first try to plot the effects using functions from packages:

library(effects)
plot(predictorEffects(fit_esoph_probit)) ## response scale
plot(predictorEffects(fit_esoph_probit, transformation = list(link = NULL, inverse = NULL))) ## linear scale

library(visreg)
visreg(fit_esoph_probit, scale = "response", ylim  = c(0, 1))
visreg2d(fit_esoph_probit, "alcgp_f", "tobgp_f", plot.type = "image", scale = "response")
visreg2d(fit_esoph_probit, "alcgp_f", "tobgp_f", plot.type = "persp", scale = "response")

fit_esoph_probit_spaMM <- fitme(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f + tobgp_f,
                                data = esoph,
                                family = binomial(link = "probit"))
#plot_effects(fit_esoph_probit_spaMM, focal_var = "agegp_f") ## buggy in current spaMM
#plot_effects(fit_esoph_probit_spaMM, focal_var = "alcgp_f")
#plot_effects(fit_esoph_probit_spaMM, focal_var = "tobgp_f")

plot_effects_manual <- function(data) { 
  ggplot(data) +
     aes(y = pointp, x = forcats::fct_inorder(focal_var), ymax = up, ymin = low) +
     geom_point() +
     geom_errorbar(width = 0.1) +
     scale_y_continuous(limits = c(0, 1)) +
     theme_classic()
}

plot_effects_manual(pdep_effects(fit_esoph_probit_spaMM, focal_var = "agegp_f"))
plot_effects_manual(pdep_effects(fit_esoph_probit_spaMM, focal_var = "alcgp_f"))
plot_effects_manual(pdep_effects(fit_esoph_probit_spaMM, focal_var = "tobgp_f"))

Note: the 3 packages differ in how they compute predictions!

This is not part of the exercise, but let us compare precisely the predictions for the age group 25-34 according to all 3 packages to better characterise and understand the differences.

We first need a little data wrangling to create a table comparing the predictions across methods:

effects_pred <- predictorEffect(fit_esoph_probit, predictor = "agegp_f")
effects_pred <- as.data.frame(summary(effects_pred)[c("effect", "lower", "upper")])[1, ]
colnames(effects_pred) <- c("fit", "lwr", "upr")

visreg_pred <- visreg(fit_esoph_probit, scale = "response", plot = FALSE)
visreg_pred <- visreg_pred[[1]]$fit[1, c("visregFit", "visregLwr", "visregUpr")]
colnames(visreg_pred) <- c("fit", "lwr", "upr")

spaMM_pred <- pdep_effects(fit_esoph_probit_spaMM, focal_var = "agegp_f")
spaMM_pred <- spaMM_pred[1, c("pointp", "low", "up")]
colnames(spaMM_pred) <- c("fit", "lwr", "upr")

rbind(effects = effects_pred,
      visreg = visreg_pred,
      spaMM = spaMM_pred)

{effects} (under its default settings) computes partial-dependence effects (i.e. consider all values for non focal predictors, irrespective of the focal predictor, and compute the mean effect) and compute the mean BEFORE transforming the data back to the scale of the response:

newdata <- data.frame(agegp_f = "25-34",
                      alcgp_f = model.frame(fit_esoph_probit)$alcgp_f,
                      tobgp_f = model.frame(fit_esoph_probit)$tobgp_f)
family(fit_esoph_probit)$linkinv(mean(predict(fit_esoph_probit, newdata = newdata)))

{visreg} (under its default settings) selects the level with the highest number of observation for the non focal factor (and when there are equal group sizes, it seems to pick the first level that comes out of the candidate ones; I did not check if that follows the order of the levels or their position in the data...).

table(esoph$alcgp_f)
table(esoph$tobgp_f)
newdata <- data.frame(agegp_f = "25-34",
                      alcgp_f = "0-39g/day",
                      tobgp_f = "0-9g/day")
predict(fit_esoph_probit, newdata = newdata, type = "response")

the function pdep_effects() from {spaMM} (under its default settings) computes partial-dependence effects (i.e. consider all values for non focal predictors, irrespective of the focal predictor, and compute the mean effect) and compute the mean AFTER transforming the data back to the scale of the response:

newdata <- data.frame(agegp_f = "25-34",
                      alcgp_f = model.frame(fit_esoph_probit)$alcgp_f,
                      tobgp_f = model.frame(fit_esoph_probit)$tobgp_f)
mean(predict(fit_esoph_probit, newdata = newdata, type = "response"))

Note: doing predictions yourself is perhaps the easiest way to know what is actually happening!

Predictions by hand

Doing things by hand is tedious but it gives you more clarity, control and you are no longer bounded to use a particular package.

Here for example, to plot the partial-dependence effects for all combinations of alcohol and tobacco categories we could do the following:

newdata_focal <- expand.grid(alcgp_f = levels(model.frame(fit_esoph_probit)$alcgp_f),
                             tobgp_f = levels(model.frame(fit_esoph_probit)$tobgp_f))

newdata_nonfocal <- data.frame(agegp_f = model.frame(fit_esoph_probit)$agegp_f)

newdata <- merge(newdata_focal, newdata_nonfocal)

predicts <- cbind(newdata, fit = predict(fit_esoph_probit, newdata = newdata, type = "response"))
d <- aggregate(fit ~ alcgp_f + tobgp_f, data = predicts, mean)
d

ggplot(d) +
  aes(y = alcgp_f, x = tobgp_f, fill = fit) +
  geom_tile() +
  scale_fill_binned(type = "viridis", breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) +
  theme_minimal()

To answer the question we may alternatively want to compare extreme situations (fixing all predictors to fixed values):

newdata_esoph_young <- expand.grid(agegp_f = c("25-34"),
                                   alcgp_f = c("0-39g/day", "120+"),
                                   tobgp_f = c("0-9g/day", "30+"))
pred_esoph_young <- predict(fit_esoph_probit, newdata = newdata_esoph_young,
                            type = "link", se.fit = TRUE)
newdata_esoph_young$pred <- family(fit_esoph_probit)$linkinv(pred_esoph_young$fit)
newdata_esoph_young$lwr <- family(fit_esoph_probit)$linkinv(pred_esoph_young$fit +
                            qnorm(0.025)*pred_esoph_young$se.fit)
newdata_esoph_young$upr <- family(fit_esoph_probit)$linkinv(pred_esoph_young$fit +
                            qnorm(0.975)*pred_esoph_young$se.fit)
newdata_esoph_young

Same for old people:

newdata_esoph_old <- expand.grid(agegp_f = c("75+"),
                                 alcgp_f = c("0-39g/day", "120+"),
                                 tobgp_f = c("0-9g/day", "30+"))
pred_esoph_old <- predict(fit_esoph_probit, newdata = newdata_esoph_old,
                          type = "link", se.fit = TRUE)
newdata_esoph_old$pred <- family(fit_esoph_probit)$linkinv(pred_esoph_old$fit)
newdata_esoph_old$lwr <- family(fit_esoph_probit)$linkinv(pred_esoph_old$fit +
                            qnorm(0.025)*pred_esoph_old$se.fit)
newdata_esoph_old$upr <- family(fit_esoph_probit)$linkinv(pred_esoph_old$fit +
                            qnorm(0.975)*pred_esoph_old$se.fit)
newdata_esoph_old

So again, drinking seems to be a worse than smoking but we are extrapolating a bit since very few individuals correspond to the predicted conditions! (see table() call above).

One thing is clear, not smoking and not drinking seems to be the way to avoid these cancers.

Interaction alcohol - tobacco?

We could also consider the interaction between alcohol and tobacco.

fit_esoph_probit2a <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp_f * tobgp_f,
                          data = esoph,
                          family = binomial(link = "probit"))
car::Anova(fit_esoph_probit2a)
summary(fit_esoph_probit2a)

For the ease of interpretation, when you have 2 factors in interaction you can also directly code their combination:

esoph$drink_smoke <- factor(paste(esoph$alcgp_f, esoph$tobgp_f, sep = "_"))
fit_esoph_probit2b <- glm(cbind(ncases, ncontrols) ~ agegp_f + drink_smoke, data = esoph,
                          family = binomial(link = "probit"))
summary(fit_esoph_probit2b)

The fits fit_esoph_probit2a and fit_esoph_probit2b have the same likelihood and number of degrees of freedom. They are equivalent, but just use a different parameterisation of the same thing.

The summary table from fit_esoph_probit2b now directly compares all drink - alcohol combination to the reference group with no cigarette and minimum alcohol consumption.

Ideally we should again check all assumptions (you should do it!).

Let us do the minimum here:

plot(simulateResiduals(fit_esoph_probit2b, n = 1000))

Let us now compare if it is better to drink a lot and not smoke or the reverse:

library(multcomp)
posthoc <- glht(fit_esoph_probit2b,
                linfct = c("`drink_smoke120+_0-9g/day` - `drink_smoke0-39g/day_30+` == 0"))
par(oma = c(1, 5, 1, 1))
plot(posthoc)
confint(posthoc)
summary(posthoc, test = Chisqtest()) ## choose tests the better suited to your case

We have compared two different situations, using a posthoc approach. Drinking a lot seems to be worse than smoking a lot for this particular cancer.

Alternative: effects considered as linear

We could also have considered the predictor as quantitative predictors:

fit_esoph_probit3 <- glm(cbind(ncases, ncontrols) ~ agegp_f + alcgp + tobgp, data = esoph,
                         family = binomial(link = "probit"))
summary(fit_esoph_probit3)

The test on the polynomial contrasts (automatic due to the variable being ordered factors) suggest that linear (.L) effects matter but that quadratic (.Q) effects don't. For alcohol, there are hints of a slight cubic (.C) effect, but we will ignore this for the sake of simplicity (the effect is small and in the absence of a quadratic effect, it is a little strange to interpret).

So let's recode the variable as truly quantitative and let's fit linear effects:

esoph$alc_num <- unclass(esoph$alcgp)
esoph$tob_num <- unclass(esoph$tobgp)
fit_esoph_probit4 <- glm(cbind(ncases, ncontrols) ~ agegp_f + alc_num + tob_num, data = esoph,
                         family = binomial(link = "probit"))
plot(simulateResiduals(fit_esoph_probit4, n = 1000))
car::crPlots(fit_esoph_probit4) ## partial residuals (to check linearity, albeit somewhat redundant with polynomial contrasts)
AIC(fit_esoph_probit,
    fit_esoph_probit2b,
    fit_esoph_probit3,
    fit_esoph_probit4)
confint(fit_esoph_probit4)

This last model seems quite good. Again, our results suggest that alcohol matters most. But do bear in mind that this model now compares the effect of increasing by one category in each predictor.

More?

One could also do a model with only age and tobacco, one with only age and alcohol and compare those by parametric bootstrap to get a p-value. But let us stop here for this exercise.
One could explore interactions with the quantitative predictors.

Dataset: TitanicSurvival (difficulty = 3/5)

r .emo("goal") Find out how the sex, age and passenger class influenced who died during the Titanic disaster of 1912.

Exploration of the data

head(TitanicSurvival)
str(TitanicSurvival)
summary(TitanicSurvival)
with(data = TitanicSurvival,
     tapply(survived, sex, function(x) mean(x == "yes")))
with(data = TitanicSurvival,
     tapply(survived, passengerClass, function(x) mean(x == "yes")))
with(data = TitanicSurvival,
     table(cut(age, breaks = c(0, 10, 25, 35, 50, Inf), include.lowest = TRUE), ## make groups
           sex, passengerClass))

Visualization of the data

coplot(survived ~ age | sex * passengerClass, data = TitanicSurvival, panel = panel.smooth)
library(car)
scatterplot(survived == "yes" ~ age, data = TitanicSurvival)

The effect of age does not seem linear...

We recode some variables

TitanicSurvival$surv <-  TitanicSurvival$survived == "yes"
TitanicSurvival$age_f <- cut(TitanicSurvival$age,
                             breaks = c(0, 10, 25, 35, 50, Inf), include.lowest = TRUE)
table(TitanicSurvival$age_f)

Fitting the models

We fit a series of models with different links and considering age as factor or continuous:

fit_logit <-   glm(surv ~ age_f + passengerClass + sex, data = TitanicSurvival,
                   family = binomial(link = "logit"))
fit_probit <-  glm(surv ~ age_f + passengerClass + sex, data = TitanicSurvival,
                   family = binomial(link = "probit"))
fit_cauchit <- glm(surv ~ age_f + passengerClass + sex, data = TitanicSurvival,
                   family = binomial(link = "cauchit"))

fit_logit2 <-   glm(surv ~ age + passengerClass + sex, data = TitanicSurvival,
                   family = binomial(link = "logit"))
fit_probit2 <-  glm(surv ~ age + passengerClass + sex, data = TitanicSurvival,
                   family = binomial(link = "probit"))
fit_cauchit2 <- glm(surv ~ age + passengerClass + sex, data = TitanicSurvival,
                   family = binomial(link = "cauchit"))

AIC(fit_logit, fit_probit, fit_cauchit, fit_logit2, fit_probit2, fit_cauchit2)

The cauchit model with age as a factor provides a much better fit than other models, so we will select that one.

Checking the assumptions

library(DHARMa)
resid_Titanic <- simulateResiduals(fit_cauchit, n = 1000)
plot(resid_Titanic)
testTemporalAutocorrelation(resid_Titanic, time = order(fitted(fit_cauchit)))
testTemporalAutocorrelation(resid_Titanic, time = 1:nrow(model.frame(fit_cauchit)))

There is a little bit of serial autocorrelation but it is low and hard to understand (observation are sorted by alphabetical order) so we will ignore that for now.

There is no need to test for overdispersion in a binary model!

We look at predictions

data.for.pred <- with(data = TitanicSurvival,
                 expand.grid(age_f = levels(age_f),
                             passengerClass = levels(passengerClass),
                             sex = levels(sex)))
pred <- predict(fit_cauchit, newdata = data.for.pred, type = "link")
data.to.plot <- cbind(data.for.pred, pred = pred)
coplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot)
library(lattice) ## same, more fancy
xyplot(pred ~ unclass(age_f) | passengerClass + sex, data = data.to.plot)

GAM?

Another way to cope with the non-linearity of the age effect would be to do a GAM model:

library(mgcv)
fit_cauchit_gam <- gam(surv ~ s(age) + passengerClass + sex,
                       data = TitanicSurvival,
                       family = binomial(link = "cauchit"))
AIC(fit_cauchit_gam)

The model is a bit better, we could use it.

GAM model assumptions:

cauchit_gam_resid <- simulateResiduals(fit_cauchit_gam)
plot(cauchit_gam_resid)

The QQ plot is good (left), but the uniformity plot on the right suggests things are not quite perfect.

Two predictors are factors the third is already modelled quite flexibly using GAM, so the issue is probably not about how we modelled each variable but more about something important we are missing.

Let's see if an interaction between passengerClass & sex would improve things:

fit_cauchit_gam2 <- gam(surv ~ s(age) + passengerClass*sex,
                        data = TitanicSurvival,
                        family = binomial(link = "cauchit"))
AIC(fit_cauchit_gam2)
cauchit_gam_resid2 <- simulateResiduals(fit_cauchit_gam2)
plot(cauchit_gam_resid2)

Everything is much better! Ideally we should re-examine what link function is best but let's stick to what we have for simplicity.

Tests

For GAM models, we need to use anova() (which calls mgcv::anova.gam() and thus does perform an type I anova but a marginal anova -- documentation says type III, so not type II as car::Anova() unfortunately, but perhaps the documentation is wrong... I did not check):

anova(fit_cauchit_gam2)

Predictions

data.for.pred2 <- with(data = TitanicSurvival,
                      expand.grid(age_f = levels(TitanicSurvival$age_f),
                                  passengerClass = levels(passengerClass),
                                  sex = levels(sex)))
data.for.pred2$age <- NA
data.for.pred2$age[data.for.pred2$age_f == "[0,10]"] <- (10 - 0)/2
data.for.pred2$age[data.for.pred2$age_f == "(10,25]"] <- (25 + 10)/2
data.for.pred2$age[data.for.pred2$age_f == "(25,35]"] <- (25 + 35)/2
data.for.pred2$age[data.for.pred2$age_f == "(35,50]"] <- (35 + 50)/2
data.for.pred2$age[data.for.pred2$age_f == "(50,Inf]"] <- (50 + 81)/2

pred_age_f <- cbind(data.for.pred2,
                    fit = predict(fit_cauchit, newdata = data.for.pred2,
                                  type = "response"))
pred_age_gam <- cbind(data.for.pred2,
                      fit = predict(fit_cauchit_gam2, newdata = data.for.pred2,
                                    type = "response"))

ggplot(pred_age_f) +
  aes(y = fit, x = age, colour = passengerClass, linetype = sex, shape = sex) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits = c(0, 1)) +
  theme_classic() +
  labs(title = "Age as factor", y = "Predicted survival prob.") -> plot1

ggplot(pred_age_gam) +
  aes(y = fit, x = age, colour = passengerClass, linetype = sex, shape = sex) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits = c(0, 1)) +
  theme_classic() +
  labs(title = "Age as GAM", y = "Predicted survival prob.") -> plot2

library(patchwork) ## to combine plots and legend

plots12 <- plot1 + plot2 & theme(legend.position = "bottom")
plots12 + plot_layout(guides = "collect")

Notes:

as the model is not fitted with link = logit, the odd ratios are not constant with age, so we cannot quantify effects using them.
the two fits plotted also differ by the fact that the one on the right considers an interaction but not the one on the right.

More?

Try to figure out why predictions are so different for the young males from the second class.

Dataset: Challenger (difficulty = 2/5)

r .emo("goal") What was the probability of an O-ring experiencing thermal distress when the space shuttle Challenger took off on the 28th of January 1986 by 31 degrees F?

Let's explore the data

str(Challenger)
summary(Challenger)
table(nb_orings = Challenger$oring_tot, nb_pb_orings = Challenger$oring_dt)

Let's visualize the data

plot(jitter(oring_dt, factor = 0.3) ~ temp, data = Challenger)

Notice the use of jitter!

We recode the data

Challenger$oring_fine <- Challenger$oring_tot - Challenger$oring_dt

We fit the models

fit_binom_logit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
                       family = binomial(link = "logit"))
fit_binom_probit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
                        family = binomial(link = "probit"))
fit_binom_cauchit <- glm(cbind(oring_fine, oring_dt) ~ temp, data = Challenger,
                         family = binomial(link = "cauchit"))
AIC(fit_binom_logit, fit_binom_probit, fit_binom_cauchit)

We could choose the logit or the probit model. It would be easier with the logit to interpret the estimates but we do not need this here, so let us consider the probit model since it is best.

Let's check assumptions

## let's use 2 seeds since it is a small sample size
r_probit <- simulateResiduals(fit_binom_probit, n = 1000, seed = 1)
plot(r_probit)
r_probit2 <- simulateResiduals(fit_binom_probit, n = 1000, seed = 2)
plot(r_probit2)
testTemporalAutocorrelation(r_probit, time = 1:nrow(Challenger))
testTemporalAutocorrelation(r_probit, time = order(fitted(fit_binom_probit)))
testTemporalAutocorrelation(r_probit, time = order(Challenger$temp))

Things seems OK although it is hard to be sure on such a small dataset.

Tests

car::Anova(fit_binom_probit)
summary(fit_binom_probit)

Predictions

Let's compare the prediction using both the probit and logit link, since they lead to quite equivalent fits:

pred_probit <- pred_logit <- data.frame(temp = 31:81)

## Predict probit
pred_probit_res <- predict(fit_binom_probit, newdata = pred_probit, se.fit = TRUE)

pred_probit$fit <- family(fit_binom_probit)$linkinv(pred_probit_res$fit)
pred_probit$lwr <- family(fit_binom_probit)$linkinv(
  pred_probit_res$fit + qnorm(0.025) * pred_probit_res$se.fit)
pred_probit$upr <- family(fit_binom_probit)$linkinv(
  pred_probit_res$fit + qnorm(0.975) * pred_probit_res$se.fit)
pred_probit

## Predict logit
pred_logit_res <- predict(fit_binom_logit, newdata = pred_logit, se.fit = TRUE)

pred_logit$fit <- family(fit_binom_logit)$linkinv(pred_logit_res$fit)
pred_logit$lwr <- family(fit_binom_logit)$linkinv(
  pred_logit_res$fit + qnorm(0.025) * pred_logit_res$se.fit)
pred_logit$upr <- family(fit_binom_logit)$linkinv(
  pred_logit_res$fit + qnorm(0.975) * pred_logit_res$se.fit)
pred_logit

Note: we could have done it only for 31 degrees, but computing predictions for a range of temperature allows to better understand the situation.

Plotting predictions

all_preds <- rbind(cbind(link = "probit", pred_probit),
                   cbind(link = "logit", pred_logit))

ggplot() +
  geom_line(aes(y = 1 - fit, x = temp, colour = link), data = all_preds) +
  geom_ribbon(aes(ymin = 1 - upr, ymax = 1 - lwr, x = temp, fill = link), alpha = 0.2, data = all_preds) +
  geom_jitter(aes(y = oring_dt/oring_tot, x = temp), shape = 1, height = 0.005, data = Challenger) +
  labs(y = "Probabiltiy that at least one oring will fail") +
  theme_classic()

Notes:

we do 1 - fit because our models predict the probability of having no failure.
the uncertainty is large, but both model agree that launching the shuttle at 31 degree F was a very risky...

More?

Perhaps it would be good to test if a non-linear effect of the temperature would fit the data better (using a GAM model) although that may lead to poor extrapolation (the more flexible the function, the more unreliable the extrapolations).
Since a single failure could be fatal, perhaps a binary model would make more sense!
You could try to replicate the original study that excluded all the OK flights! (although this turned out deadly and not just silly).

Dataset: UK (difficulty = 2/5)

r .emo("goal") Try to identify the influential determinants of the smoking behaviour of children.

Smoking behaviour

Looking at the data

str(UK)
lapply(UK, summary)
UK$cigarettes_kid_bin <- UK$cigarettes_kid != "Never" ## we recode the variable to be binary
table(UK$cigarettes_kid_bin, UK$cigarettes_friends)

Let's explore the data to see if there are any obvious effects:

scatterplot(cigarettes_kid_bin ~ age + sex, data = UK)
scatterplot(cigarettes_kid_bin ~ cigarettes + sex, data = UK)

There could be some small interactions with sex.

Fiting the model

It is not clear what model to fit a priori, so let us consider sex, age, and the number of cigarettes smoked by parents and friends.

fit_logit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
                 family = binomial(link = "logit"), data = UK)
fit_probit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
                  family = binomial(link = "probit"), data = UK)
fit_cauchit <- glm(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
                   family = binomial(link = "cauchit"), data = UK)
AIC(fit_logit, fit_probit, fit_cauchit)

Comparison to null model

lmtest::lrtest(
  update(fit_logit, . ~ 1, data = fit_logit$model),
  fit_logit)

Test of the model assumptions

resid_fit_logit <- simulateResiduals(fit_logit, n = 1000)
plot(resid_fit_logit)
testQuantiles(resid_fit_logit) ## to clarify what's happening in the previous plot on the right.
testTemporalAutocorrelation(resid_fit_logit, time = order(fitted(fit_logit)))
testTemporalAutocorrelation(resid_fit_logit, time = 1:nrow(model.matrix(fit_logit)))

Tests

car::Anova(fit_logit)

Let's test all interactions at once:

lmtest::lrtest(update(fit_logit, . ~ . - sex:(cigarettes + age + cigarettes_friends),
                      data = fit_logit$model),
               fit_logit)

There is no clear sign of interactions, so we will drop them for simplicity.

fit_logit2 <- update(fit_logit, . ~ . - sex:(cigarettes + age + cigarettes_friends),
                     data = fit_logit$model)
summary(fit_logit2)
car::Anova(fit_logit2)

All predictors are highly significant but this is not very surprising since the dataset is very large.

Predictions

Let's refit the model with {spaMM} so as to compute predictions more easily.

fit_logit2_spaMM <- fitme(cigarettes_kid_bin ~ sex*(cigarettes + age + cigarettes_friends),
                    family = binomial(link = "logit"), data = UK)
plot_effects(fit_logit2_spaMM, focal_var = "cigarettes")
plot_effects(fit_logit2_spaMM, focal_var = "age")

In the current version of {spaMM} plot_effects() does not work when the focal variable is a factor, but we can do the plot manually using the function plot_effects_manual() we wrote above and spaMM::pdep_effects() for the computation of the predictions:

effect_cigarettes_friends <- pdep_effects(fit_logit2_spaMM, focal_var = "cigarettes_friends")
effect_sex <- pdep_effects(fit_logit2_spaMM, focal_var = "sex")
plot_effects_manual(effect_cigarettes_friends)
plot_effects_manual(effect_sex)

By far, the strongest correlate is whether friends are smoking or not.

More?

You may try different model formulas including different predictors (biological knowledge would help not to go fishing...).

Dataset: HSE98women, exercise menopause (difficulty = 2/5)

r .emo("goal") Study the influences of age, body mass index and smoking status upon the probability of a woman being in menopause.

r .emo("goal") Estimate the proportion of women in menopause in the population, depending on their age (40, 45, 50, 55, 60 yrs).

Note: here, for the sake of simplicity we simply define menopause as the lack of periods in old ages.

Data preparation and exploration

str(HSE98women)

We first create the response variable:

HSE98women$menopause <- HSE98women$period == FALSE

We also recode the smoked variable as a factor (improves some plots):

HSE98women$smoked2 <- factor(HSE98women$smoked, labels = c("non-smoker", "smoker"))
table(HSE98women$smoked, HSE98women$smoked2) ## always check you did the right thing!

We fit a first model to help us easily exploring the data used for fitting and not all data (there are many missing values in the dataset).

fit_menop_logit <- glm(menopause ~ age + bmi + smoked2, data = HSE98women, family = binomial())
nrow(HSE98women) ## original number of observations
nrow(fit_menop_logit$model)  ## number of observations considered in the model

We compare the variables before and after filtering to see if any obvious bias is introduced by the observations we discard:

summary(HSE98women[, c("menopause", "age", "bmi", "smoked2")])
summary(fit_menop_logit$model)

The only big discrepancy between the two datasets is that the minimal age is now 18 yrs old while the original dataset also contains children. That suggests that girls too young to have periods are not being considered, which is not an issue for studying menopause.

Let's check the correlation between our two continuous predictor:

cor.test(HSE98women$age, HSE98women$bmi)
cor.test(fit_menop_logit$model$age, fit_menop_logit$model$bmi)

The correlation is not very high, especially in the dataset that is used for the fit, so multicollinearity should not be an issue.

We visualise the relationship between the predictor and the response:

scatterplot(menopause ~ age + smoked2, data = fit_menop_logit$model)
scatterplot(menopause ~ bmi + smoked2, data = fit_menop_logit$model)

There don't seem to be any obvious interaction between the age and smoked or between BMI and smoked, so we won't consider any. We will however consider that the effect of age may depend on the BMI. The BMI also seems to exert a non monotonous effect.

Fitting the models

fit_menop_logit   <- glm(menopause ~ age * bmi + smoked2,
                         data = HSE98women, family = binomial(link = "logit"))
fit_menop_probit  <- glm(menopause ~ age * bmi + smoked2,
                         data = HSE98women, family = binomial(link = "probit"))
fit_menop_cauchit <- glm(menopause ~ age * bmi + smoked2,
                         data = HSE98women, family = binomial(link = "cauchit"))

Because it is very difficult to explore the shape of the relationship between the age and BMI upon the probability of being menopaused, we will also fit general additive models (GAMs), which allow for parameters to vary along the predictors.

fit_menop_logit_gam   <- gam(menopause ~ s(age, bmi) + smoked2,
                             data = HSE98women, family = binomial(link = "logit"))
fit_menop_probit_gam  <- gam(menopause ~ s(age, bmi) + smoked2,
                             data = HSE98women, family = binomial(link = "probit"))
fit_menop_cauchit_gam <- gam(menopause ~ s(age, bmi) + smoked2,
                             data = HSE98women, family = binomial(link = "cauchit"))

AIC(fit_menop_logit,
    fit_menop_probit,
    fit_menop_cauchit,
    fit_menop_logit_gam,
    fit_menop_probit_gam,
    fit_menop_cauchit_gam)

(Note that GAMs have non-integer degrees of freedom. This is normal.)

The GAM logit model seems to be the best model, which will prevent us to express simply the effect of coefficients compared to a usual GLM. That's too bad but since the GAM fits are much better, we will go for those. Among those, we will consider the one with the logit link since it is both more simple to discuss and it here fits the data best.

Tests

The best practices for testing effects in GAMs are not quite established.

We are going to test for the effect of the interaction by Likelihood Ratio Test (LRT):

fit_menop_logit_gam_no_int <- gam(menopause ~ s(age) + s(bmi) + smoked2,
                                  family = binomial(), data = fit_menop_logit_gam$model)
lmtest::lrtest(fit_menop_logit_gam_no_int, fit_menop_logit_gam)

Let's now test the overall effect of BMI by LRT:

fit_menop_logit_gam_no_bmi <- gam(menopause ~ s(age) + smoked2,
                                  family = binomial(), data = fit_menop_logit_gam$model)
lmtest::lrtest(fit_menop_logit_gam_no_bmi, fit_menop_logit_gam)

Let's now test the overall effect of age by LRT:

fit_menop_logit_gam_no_age <- gam(menopause ~ s(bmi) + smoked2,
                                  family = binomial(), data = fit_menop_logit_gam$model)
lmtest::lrtest(fit_menop_logit_gam_no_age, fit_menop_logit_gam)

Let's now test the effect of smoking by LRT:

fit_menop_logit_gam_never_smoked <- gam(menopause ~ s(age, bmi),
                                        family = binomial(), data = fit_menop_logit_gam$model)
lmtest::lrtest(fit_menop_logit_gam_never_smoked, fit_menop_logit_gam)

Predictions

The package {mgcv} has a plotting function but in this particular case, the output is hard to read in the link scale and the conversion to response scale crashes for some reason....

plot(fit_menop_logit_gam)

Let's thus explore the model output with the package {visreg}, which we saw before and which also works on GAM fits (but remember that it computes predictions with weird meaning):

visreg(fit_menop_logit_gam)

These plots show the patterns identified by the smoothing parameters in the GAM: the effect of age seems to change after 40 yrs and the effect of BMI seem to be changing between 23 and 32 Kg/(m^2). It is difficult to account for such threshold effects using polynomials, so we will have to stick to GAMs here...

Let's do the same plots on the scale of the response variable:

visreg(fit_menop_logit_gam, scale = "response")
visreg2d(fit_menop_logit_gam, "age", "bmi", scale = "response")
#visreg2d(fit_menop_logit_gam, "age", "bmi", scale = "response", plot.type = "rgl")

We will now plot the predictions for the GLM fit with the logit link, the GLM fit with the cauchit link and the GAM fit with the logit link to explore their differences.

We will predict the influence of age and consider the median BMI and smokers.

age.for.pred <- seq(min(fit_menop_cauchit_gam$model$age),
                    max(fit_menop_cauchit_gam$model$age), length = 100)

data.for.pred <- expand.grid(
  age = age.for.pred,
  bmi = median(fit_menop_cauchit_gam$model$bmi),
  smoked2 = "smoker"
  )

logit.p <- predict(fit_menop_logit, newdata = data.for.pred, se.fit = TRUE)

predict_GLM_logit <- data.frame(fit = family(fit_menop_logit)$linkinv(logit.p$fit),
                                upr = family(fit_menop_logit)$linkinv(logit.p$fit + 
                                        qnorm(0.975) * logit.p$se.fit),
                                lwr = family(fit_menop_logit)$linkinv(logit.p$fit +
                                        qnorm(0.025) * logit.p$se.fit))

cauchit.p <- predict(fit_menop_cauchit, newdata = data.for.pred, se.fit = TRUE)

predict_GLM_cauchit <- data.frame(fit = family(fit_menop_cauchit)$linkinv(cauchit.p$fit),
                                  upr = family(fit_menop_cauchit)$linkinv(cauchit.p$fit +
                                        qnorm(0.975) * cauchit.p$se.fit),
                                  lwr = family(fit_menop_cauchit)$linkinv(cauchit.p$fit + 
                                        qnorm(0.025) * cauchit.p$se.fit))

gam.p <- predict(fit_menop_logit_gam, newdata = data.for.pred, se.fit = TRUE)

predict_GAM_logit <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p$fit),
                                upr = family(fit_menop_logit_gam)$linkinv(gam.p$fit +
                                        qnorm(0.975) * gam.p$se.fit),
                                lwr = family(fit_menop_logit_gam)$linkinv(gam.p$fit + 
                                        qnorm(0.025) * gam.p$se.fit))

all_predictions_for_age <- rbind(
  cbind(data.for.pred, method = "GLM_logit", predict_GLM_logit),
  cbind(data.for.pred, method = "GLM_cauchit", predict_GLM_cauchit),
  cbind(data.for.pred, method = "GAM_logit", predict_GAM_logit))

head(all_predictions_for_age) ## show the top of the table we created

Let's plot the result:

ggplot(all_predictions_for_age) +
  aes(y = fit, ymin = lwr, ymax = upr, x = age, colour = method) +
  geom_ribbon(fill = NA, linetype = "dashed", size = 0.2) +
  geom_line() +
  labs(y = "Probability of being in menopause", x = "Age (yrs)") +
  theme_classic() +
  theme(legend.position = "top")

Selecting the best model, let's now look at the effect of BMI, still using smokers:

data.for.pred2 <- expand.grid(
  age = age.for.pred,
  bmi = c(18.5, 25, 30),
  smoked2 = "smoker")

gam.p2 <- predict(fit_menop_logit_gam, newdata = data.for.pred2, se.fit = TRUE)

predict_GAM_logit2 <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p2$fit),
                                 upr = family(fit_menop_logit_gam)$linkinv(gam.p2$fit +
                                         qnorm(0.975) * gam.p2$se.fit),
                                 lwr = family(fit_menop_logit_gam)$linkinv(gam.p2$fit + 
                                         qnorm(0.025) * gam.p2$se.fit))

predictions_for_age_by_BMI <- cbind(data.for.pred2, predict_GAM_logit2)

ggplot(predictions_for_age_by_BMI) +
  aes(y = fit, ymin = lwr, ymax = upr, x = age, colour = factor(bmi)) +
  geom_ribbon(fill = NA, linetype = "dashed", size = 0.2) +
  geom_line() +
  labs(y = "Probability of being in menopause", x = "Age (yrs)") +
  theme_classic() +
  theme(legend.position = "top")

We can see on this plot that obesity seems to accelerate the onset of menopause, but it does not seem to make any difference later on.

We will now predict the proportion of women in menopause at 40, 45, 50, 55 and 60 yrs. To illustrate the effect of the BMI we will perform predictions for a BMI of 21.75 (middle of normal BMI range according to WHO) and 30 (lower limit for obesity) Kg/(m^2). Again, we will perform the predictions considering smoker since we have more data for this category. We will also compute the confidence intervals.

data.for.pred3 <- expand.grid(
  age = c(40, 45, 50, 55, 60),
  bmi = c(21.75, 30),
  smoked2 = "smoker")

gam.p3 <- predict(fit_menop_logit_gam, newdata = data.for.pred3, se.fit = TRUE)

predict_GAM_logit3 <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p3$fit),
                                 upr = family(fit_menop_logit_gam)$linkinv(gam.p3$fit +
                                         qnorm(0.975) * gam.p3$se.fit),
                                 lwr = family(fit_menop_logit_gam)$linkinv(gam.p3$fit + 
                                         qnorm(0.025) * gam.p3$se.fit))

predictions_at_specific_ages <- cbind(data.for.pred3, predict_GAM_logit3)
predictions_at_specific_ages

These table confirms our expectation concerning the influence of BMI upon the onset of menopause.

For the sake of comparison, let's compare predictions (without CI this time, for simplicity) using the other candidate fits:

data.for.pred4 <- expand.grid(age = c(40, 45, 50, 55, 60), bmi = 21.75, smoked2 = "smoker")

pred_GAM4     <- predict(fit_menop_logit_gam, newdata = data.for.pred4, type = "response")
pred_logit4   <- predict(fit_menop_logit, newdata = data.for.pred4, type = "response")
pred_cauchit4 <- predict(fit_menop_cauchit, newdata = data.for.pred4, type = "response")

predictions_at_specific_ages2 <- cbind(data.for.pred4,
                                       logit_GAM = pred_GAM4,
                                       logit_GLM = pred_logit4,
                                       cauchit_GLM = pred_cauchit4)
predictions_at_specific_ages2

Let's look at the same proportion in the raw data. We cannot be as strict during selection otherwise we won't get much individuals, so let's enlarge a bit the selection and see that we get a few hundred individuals at least:

tab <- with(HSE98women,
     data.frame(N = c(
       sum(age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
       sum(age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
       sum(age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
       sum(age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25, na.rm = TRUE),
       sum(age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25, na.rm = TRUE)
           )))
tab

That seems a good selection process (+/- 1 year and the whole WHO normal BMI category), so let's check the proportion of individuals in menopause in each of these categories (keeping both smokers and non smokers as the effect is rather small, see later):

tab <- with(HSE98women, 
     data.frame(observed_freq = c(
       mean(menopause[age >= 39 & age <= 41 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 44 & age <= 46 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 49 & age <= 51 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 54 & age <= 56 & bmi > 18.5 & bmi < 25], na.rm = TRUE),
       mean(menopause[age >= 59 & age <= 61 & bmi > 18.5 & bmi < 25], na.rm = TRUE)
     )))

cbind(predictions_at_specific_ages2, tab)

These results are roughly consistent with the predictions from the GAM model. It confirms that the GLM logit and cauchit are underestimating the prevalence of menopause at old age. All models however may overestimate the prevalence of menaupose at young age (although the problem seems worst in GLM logit). Depending on the goal of the study, other link functions, not implemented in base R, could be tried but it is quite advanced so we won't do that.

We can now turn to the exploration of the smoking effect by comparing predictions between smokers and non smokers at median BMI:

Selecting the best model, let's now look at the effect of BMI, still using smokers:

data.for.pred5 <- expand.grid(
  age = age.for.pred,
  bmi = 21.75,
  smoked2 = (c("smoker", "non-smoker")))

gam.p5 <- predict(fit_menop_logit_gam, newdata = data.for.pred5, se.fit = TRUE)

predict_GAM_logit5 <- data.frame(fit = family(fit_menop_logit_gam)$linkinv(gam.p5$fit),
                                 upr = family(fit_menop_logit_gam)$linkinv(gam.p5$fit +
                                         qnorm(0.975) * gam.p5$se.fit),
                                 lwr = family(fit_menop_logit_gam)$linkinv(gam.p5$fit + 
                                         qnorm(0.025) * gam.p5$se.fit))

predictions_for_age_by_smoking <- cbind(data.for.pred5, predict_GAM_logit5)

ggplot(predictions_for_age_by_smoking) +
  aes(y = fit, ymin = lwr, ymax = upr, x = age, colour = factor(smoked2)) +
  geom_ribbon(fill = NA, linetype = "dashed", size = 0.2) +
  geom_line() +
  labs(y = "Probability of being in menopause", x = "Age (yrs)") +
  theme_classic() +
  theme(legend.position = "top")

Although significant, the effect of smoking is rather small. Because the interaction between the smoking status and the BMI was not significant, we won't draw predictions to explore that.

More?

We did not compute predictions properly. Ideally instead of considering fixed values for non-focal predictors, we could in some case compute partial-dependence effects. There is no function doing that on GAM fits directly, so it would have to be done by hand...

courtiol/LM2GLMM documentation built on Oct. 10, 2024, 7:22 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

courtiol/LM2GLMM Data science for biologists: Generalized linear modelling with R

In courtiol/LM2GLMM: Data science for biologists: Generalized linear modelling with R

Disclamer

Dataset: InsectSprays (difficulty = 2/5)

Fitting the models

Checking the assumptions for GLM

Solving the overdispersion issue

Re-testing assumptions for the fit of the negative binomial model

Tests comparing the fitted models to their respective null models

Comparing effect of spray C to others

GLM Poisson

GLM NegBin

Predictions and method comparison

LM

LM with BoxCox

GLM Poisson

GLM NegBin

Combining predicitions

Plot

Dataset: LM2GLMM::Surprise (difficulty = 0/5)

Solution

Dataset: esoph (difficulty = 3/5)

Exploration of the data

We fit the models

Comparison to null model

Testing the model assumptions

Testing the model effects

Plotting the effects

Predictions by hand

Interaction alcohol - tobacco?

Alternative: effects considered as linear

More?

Dataset: TitanicSurvival (difficulty = 3/5)

Exploration of the data

Visualization of the data

We recode some variables

Fitting the models

Checking the assumptions

We look at predictions

GAM?

GAM model assumptions:

Tests

Predictions

More?

Dataset: Challenger (difficulty = 2/5)

Let's explore the data

Let's visualize the data

We recode the data

We fit the models

Let's check assumptions

Tests

Predictions

Plotting predictions

More?

Dataset: UK (difficulty = 2/5)

Smoking behaviour

Looking at the data

Fiting the model

Comparison to null model

Test of the model assumptions

Tests

Predictions

More?

Dataset: HSE98women, exercise menopause (difficulty = 2/5)

Data preparation and exploration

Fitting the models

Tests

Predictions

More?

R Package Documentation

Browse R Packages

We want your feedback!

courtiol/LM2GLMM
Data science for biologists: Generalized linear modelling with R