knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)

# necessary to render tutorial correctly
library(learnr)
library(htmltools)

# easystats
library(datawizard)
library(insight)
library(modelbased)
library(parameters)
library(performance)

# tidyverse
library(dplyr)
library(forcats)
library(ggplot2)

# non tidyverse/easystats
library(robustbase) # students don't use
library(knitr)

source("./www/discovr_helpers.R")

# Read data files needed for the tutorial
santa_tib <- discovr::santas_log
r user_visor() A Christmas disaster [(B)]{.lbl}Let's begin with a Christmas tale. A year ago Santa was resting in his workshop studying his nice and naughty lists. He noticed a name on the naughty list in bold, upper case letters. It said ANDY FIELD OF UNIVERSITY OF SUSSEX. He went to look up the file of this Andy Field character. He stared into his snow globe, and as the mists cleared he saw a sad, lonely, friendless character walking across campus. Under one arm was a box of chocolates, under the other a small pink Hippo. As he walked the campus he enticed the young students around him to follow him by offering chocolate. Like the Pied Piper, he led them to a large hall. Once inside, the boys' and girls' eyes glistened in anticipation of more chocolate. Instead he unleashed a monologue about the general linear model of such fearsome tedium that Santa began to wonder how anyone could have grown to be so soulless and cruel.
Santa dusted off his sleigh and whizzed through the night sky to the Sussex campus. Once there he confronted the evil fiend that he had seen in his globe. "You've been a naughty boy," he said. "I give you a choice. Give up teaching statistics, or I will be forced to let the Krampus pay you a visit."
Andy looked sad, "But I love statistics," he said to Santa, "It's cool."
Santa pulled out a candy cane and from it emerged a screen. Just as he was about to instruct the screen to call the Krampus, an incoming message appeared. It was Eddie the Elf. Eddie said that last year 10.6% of presents weren't delivered.
What was Santa to do? How could he find out what determines whether presents get delivered or not? He panicked.
Just then, Santa heard a sad little voice. It said, "I can help you".
"How?" replied Santa.
"My students," he replied, "they can save Christmas. All they need are data."
With that, Santa looked into his candy screen at the elves who had called him, and turned to Andy. "Tell them what you need."
Andy discovered that to deliver presents Santa uses a large team of elves, and that at each house they usually consume treats. The treats might be Christmas pudding, or sometimes mulled wine. He also discovered that they consume different quantities. Sometimes nothing is left, but other times there might be 1, 2, 3 or even 4 pieces of pudding or glasses of mulled wine. The Elves transmitted a log file of 400 of the previous year's deliveries. It was called [santa_tib]{.alt}.
id: Name of the elf doing the delivery
quantity: How many treats the elf ate before attempting the delivery
treat: which kind of treats were consumed (Christmas pudding or Mulled wine)
delivered: were the presents delivered (Delivered or not delivered)
Andy hypothesised that
H~1~: The probability that presents get delivered is affected by the treat consumed and that this effect is moderated by how many treats are consumed.
As ever, we will follow our 5-step process for fitting models. Ultimately, we are working towards a model that includes treat, quantity and their interaction as predictors. We will build the model sequentially starting by predicting the log odds of delivery from treat:
$$ \ln\left(\frac{P(\text{delivery})}{1- P(\text{delivery})}\right) = \hat{b}_0 + \hat{b}_1\text{treat}_i + e_i $$
We will use this model to help us get to grips with odds ratios before moving onto adding quantity as a predictor:
$$ \ln\left(\frac{P(\text{delivery})}{1- P(\text{delivery})}\right) = \hat{b}_0 + \hat{b}_1\text{treat}_i + \hat{b}_2\text{quantity}_i + e_i $$
and finally, adding the interaction:
$$ \ln\left(\frac{P(\text{delivery})}{1- P(\text{delivery})}\right) = \hat{b}_0 + \hat{b}_1\text{treat}_i + \hat{b}_2\text{quantity}_i + \hat{b}_3\left(\text{treat} \times \text{quantity}\right)_i + e_i $$
In other words, we predict the log odds of presents being delivered from the type of treat consumed, the quantity consumed, and the interaction of those two variables. We can re-arrange this equation to express the outcome as a probability:
$$
P(\text{delivery}) = \frac{1}{1+ e^{-\left(\hat{b}_0 + \hat{b}_1\text{treat}_i + \hat{b}_2\text{quantity}_i + \hat{b}_3\left(\text{treat} \times \text{quantity}\right)_i + e_i\right)}}
$$
In this version we predict the probability of delivery.
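To build intuition for how this equation maps log odds onto probabilities, here's a minimal sketch. (The log odds value of 0.5 is made up for illustration; it is not from Santa's data.)

```r
# The logistic (inverse logit) function maps any log odds value onto a
# probability between 0 and 1. R's built-in plogis() does the same job.
logit_to_prob <- function(log_odds) {
  1 / (1 + exp(-log_odds))
}

log_odds <- 0.5                   # hypothetical log odds of delivery

logit_to_prob(log_odds)           # ~ 0.62
plogis(log_odds)                  # same value via the built-in

# qlogis() is the inverse: it converts a probability back to log odds
qlogis(logit_to_prob(log_odds))   # 0.5
```

Note that larger log odds push the probability towards 1, smaller (more negative) log odds push it towards 0, and log odds of 0 correspond to a probability of exactly 0.5.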
r bmu() Step 1: summarize [(A)]{.lbl}r alien() Alien coding challengeView the data in [santa_tib]{.alt}.
santa_tib
Note that there are four variables:
- id: a character variable (note the <chr> under the name) containing the name of the elf.
- quantity: a numeric variable (note the <dbl> under the name) indicating how many treats the elf ate before attempting the delivery.
- treat: a factor variable (note the <fct> under the name) indicating the type of treat(s) that the elf consumed (Christmas [pudding]{.alt} or [Mulled wine]{.alt}).
- delivered: a factor variable (note the <fct> under the name) indicating whether the presents were delivered ([Delivered]{.alt} or [Not delivered]{.alt}).

The variables treat and delivered are factors (categorical variables), so having read the data file and converted these variables to factors it's a good idea to check the order of the levels for each one.
r alien() Alien coding challengeUsing what you've learnt in previous tutorials check the order of the levels of the variables treat and delivered.
# use this function: levels()
# Remember that to access a variable you use: name_of_tibble$name_of_variable
# solution:
levels(santa_tib$treat)
levels(santa_tib$delivered)
You'll find that the factor levels are:
- For treat, the baseline category is 'pudding'.
- For delivered, the factor levels are ordered 'not delivered' then 'delivered', so an 'increase' in the outcome corresponds to delivery becoming more probable.

These category orders are as they are because I set the data up within this tutorial. Working outside of the tutorial you might need to manually set the order of levels for each factor using fct_relevel() as described in the [Data]{.alt} part of this tutorial.
r robot() Code exampleOne predictor (treat) is categorical, so it might be useful to look at a contingency table of both the raw scores and proportions like we did in the previous chapter. We can do this using the data_tabulate() function within [datawizard]{.pkg} (see discovr_18):
The code to create a basic contingency table is
santa_xtbl <- santa_tib |> data_tabulate(select = "treat", by = "delivered", remove_na = TRUE)
This code pipes the data into data_tabulate() within which we specify the variables that define the rows and columns of the table, and [remove_na = TRUE]{.alt} prevents missing values from being tabulated. We can view this table using display():
santa_xtbl |> display()
r alien() Alien coding challengeUse the code box to create a table of frequencies for the variables of treat and delivered.
# Create the table
santa_xtbl <- santa_tib |>
  data_tabulate(select = "treat", by = "delivered", remove_na = TRUE)

# View it
santa_xtbl |> display()
We can use these frequencies to look at the [odds]{.kt} and [odds ratio]{.kt}.
r user_visor() Odds and the odds ratio [(B)]{.lbl}The odds of an event (e.g. delivery) is the ratio of the number of times an event occurs compared to the number of times it doesn't occur:
$$ \begin{aligned} \text{odds} &= \frac{\text{number of times event occurs}}{\text{number of times event does not occur}} \\ \text{odds}_\text{delivery} &= \frac{\text{number of times presents are delivered}}{\text{number of times presents are not delivered}} \end{aligned} $$
The contingency table (which we generated in the previous section) shows the number of successful and unsuccessful deliveries split by whether pudding or mulled wine was consumed (we're ignoring quantity to keep things simple).
santa_tib |>
  data_tabulate(select = "treat", by = "delivered", remove_na = TRUE) |>
  display()
quiz(
  caption = "Odds quiz (level 2)",
  question("Using the contingency table, the odds of delivery are?",
    answer("1.67", correct = TRUE),
    answer("0.6", message = "These are the odds of non-delivery. You've got the equation upside down!"),
    answer("0.82", message = "These are the odds of delivery after wine, not overall"),
    answer("5.36", message = "These are the odds of delivery after pudding, not overall"),
    incorrect = "Try again!",
    allow_retry = TRUE,
    random_answer_order = TRUE
  )
)
The odds ratio expresses the change in odds. In this case, we could compute the change in odds as we change from wine as a treat to pudding. To do this we would compute the odds of delivery in the wine condition, and then do the same for the pudding condition. The resulting odds ratio would be:
$$ \text{odds ratio} = \frac{\text{odds}_\text{delivery after wine}}{\text{odds}_\text{delivery after pudding}} $$
quiz(
  caption = "Odds ratio quiz (level 2)",
  question("Using the contingency table, what are the odds of delivery after wine?",
    answer("1.67", message = "These are the odds of delivery overall, not for wine only"),
    answer("1.22", message = "These are the odds of non-delivery after wine. You've got the equation upside down!"),
    answer("0.82", correct = TRUE),
    answer("5.36", message = "These are the odds of delivery after pudding, not wine"),
    incorrect = "Try again!",
    allow_retry = TRUE,
    random_answer_order = TRUE
  ),
  question("Using the contingency table, what are the odds of delivery after pudding?",
    answer("1.67", message = "These are the odds of delivery overall, not for wine only"),
    answer("0.19", message = "These are the odds of non-delivery after pudding. You've got the equation upside down!"),
    answer("0.82", message = "These are the odds of delivery after wine, not pudding"),
    answer("5.36", correct = TRUE),
    incorrect = "Try again!",
    allow_retry = TRUE,
    random_answer_order = TRUE
  ),
  question("Using the answers to the previous two questions, what is the odds ratio for wine compared to pudding?",
    answer("6.18", message = "The odds ratio is the odds in one condition divided by the odds in the other."),
    answer("6.67", message = "This is the odds ratio for pudding relative to wine not wine relative to pudding. You've got the equation upside down!"),
    answer("0.82", message = "The odds ratio is the odds in one condition divided by the odds in the other."),
    answer("0.15", correct = TRUE),
    incorrect = "Try again!",
    allow_retry = TRUE,
    random_answer_order = TRUE
  )
)
We will interpret these values as we encounter them later on.
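As a quick numerical check, the sketch below reproduces odds and an odds ratio from a 2 × 2 table of counts. The cell counts are back-calculated from the quiz answers purely to illustrate the arithmetic; they are an assumption, not a printout of [santa_tib]{.alt} itself:

```r
# Cell counts implied by the quiz answers (assumed, for illustration only)
pud_del  <- 150; pud_not  <- 28    # Christmas pudding: delivered / not delivered
wine_del <- 100; wine_not <- 122   # mulled wine: delivered / not delivered

# odds = number of times the event occurs / number of times it does not
odds_pudding <- pud_del / pud_not                              # ~ 5.36
odds_wine    <- wine_del / wine_not                            # ~ 0.82
odds_overall <- (pud_del + wine_del) / (pud_not + wine_not)    # ~ 1.67

# odds ratio for wine relative to pudding
odds_wine / odds_pudding                                       # ~ 0.15
```

Dividing the odds the other way around (pudding relative to wine) would give the reciprocal, which is why getting the equation 'upside down' produces a very different-looking number.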
r bmu() Step 2: Visualize [(A)]{.lbl}For this example, the most useful visualisation is a plot of the predicted probabilities for combinations of treat and quantity. To get these, we need to first fit the models, so we'll cycle back to visualisations after we have fitted the model.
r user_visor() Step 3: Fit the model [(B)]{.lbl}r user_visor() The glm() function [(B)]{.lbl}To fit models with categorical outcomes we use the glm() function, which stands for 'generalized linear model'. The function takes the same form as lm(), which we have used many times before. It differs in that it allows you to specify the distribution of the errors in the model. When the outcome is a dichotomy (categorical with two categories) we need to specify the distribution as binomial(), giving us a general form of the function as follows:
glm(outcome ~ predictor(s), data = my_tib, family = binomial(), na.action = na.fail)
In which you replace [outcome ~ predictor(s)]{.alt} with the formula for your model in exactly the same way as you would for lm(). For our initial model, we want to predict delivered from treat so our formula is [delivered ~ treat]{.alt}. We replace [my_tib]{.alt} with the name of our tibble (in this case [santa_tib]{.alt}). The argument [na.action]{.alt} determines how missing values are treated. By default it is set to [na.fail]{.alt}, which means that the model fails (it is not fitted) when there are missing values. If you have missing values then set this option to [na.omit]{.alt}, which removes cases, or [na.exclude]{.alt}, which excludes cases from the model (but does not remove them).
In addition to these familiar arguments, there is [family]{.alt}, which sets the assumed distribution of the model errors. By default it is set to gaussian() (a posh word for 'normal') and the function will fit a standard linear model. By changing it to [family = binomial()]{.alt} we get a logistic linear model.

r robot() Code exampleWe're going to build up the model sequentially:
- the intercept alone (delivered ~ 1)
- the treat consumed (delivered ~ treat)
- the treat consumed and the quantity of treats (delivered ~ treat + quantity)
- the treat consumed, the quantity of treats and their interaction (delivered ~ treat + quantity + treat:quantity or, equivalently, delivered ~ treat*quantity)

We could fit these models by executing:
int_glm <- glm(delivered ~ 1, data = santa_tib, family = binomial())
treat_glm <- glm(delivered ~ treat, data = santa_tib, family = binomial())
quantity_glm <- glm(delivered ~ treat + quantity, data = santa_tib, family = binomial())
santa_glm <- glm(delivered ~ treat*quantity, data = santa_tib, family = binomial())
However, it's useful to use the update() function instead (see the box for a refresher):
int_glm <- glm(delivered ~ 1, data = santa_tib, family = binomial())
treat_glm <- update(int_glm, .~. + treat)
quantity_glm <- update(treat_glm, .~. + quantity)
santa_glm <- update(quantity_glm, .~. + treat:quantity)
r alien() Alien coding challengeUse the code box below to create all of the models just described.
int_glm <- glm(delivered ~ 1, data = santa_tib, family = binomial())
treat_glm <- update(int_glm, .~. + treat)
quantity_glm <- update(treat_glm, .~. + quantity)
santa_glm <- update(quantity_glm, .~. + treat:quantity)
We can use the test_lrt() function from [performance]{.pkg}, which we have used before (e.g., discovr_14), to compare these models.
r alien() Alien coding challengeUse the code box below to compare the models we have just created.
int_glm <- glm(delivered ~ 1, data = santa_tib, family = binomial())
treat_glm <- update(int_glm, .~. + treat)
quantity_glm <- update(treat_glm, .~. + quantity)
santa_glm <- update(quantity_glm, .~. + treat:quantity)
test_lrt(int_glm, treat_glm, quantity_glm, santa_glm) |> display()
int_glm <- glm(delivered ~ 1, data = santa_tib, family = binomial())
treat_glm <- update(int_glm, .~. + treat)
quantity_glm <- update(treat_glm, .~. + quantity)
santa_glm <- update(quantity_glm, .~. + treat:quantity)

santa_lrt <- test_lrt(int_glm, treat_glm, quantity_glm, santa_glm)
santa_fit <- compare_performance(int_glm, treat_glm, quantity_glm, santa_glm, metrics = "common")

santa_par <- model_parameters(santa_glm)
santa_exp <- model_parameters(santa_glm, exponentiate = TRUE)
santa_slopes <- estimate_slopes(santa_glm, trend = "quantity", by = "treat", predict = "link", transform = exp)
r user_visor() Step 4: Evaluate the model [(B)]{.lbl}r robot() Code exampleHaving fitted the models we can compare their fit statistics using compare_performance() from the [performance]{.pkg} package. We put the models into the function and use display() to format the table when we render our document. We include [metrics = "common"]{.alt} to limit the output to the key metrics.
compare_performance(int_glm, treat_glm, quantity_glm, santa_glm, metrics = "common") |> display()
r alien() Alien coding challengeUse the code box below to compare fit statistics for the models.
compare_performance(int_glm, treat_glm, quantity_glm, santa_glm, metrics = "common") |> display()
r alien() Alien coding challengeAs with any linear model, we can use the check_model() function to produce diagnostic plots. Use the code box to do this.
check_model(santa_glm)
Top left we have a check of whether the predicted values from the model match the observed values of the outcome. With binary outcomes we get our two outcome categories on the x-axis and, within each, the observed counts and the counts predicted by the model (along with a prediction interval). If the observed values fall within the corresponding prediction intervals (which they do) then the model is capturing what we observed. So, all is fine here.
The top right is the equivalent of the residual vs predicted values plot for models with continuous outcomes. We're looking for the residuals and their intervals to fall within the error bounds, which they do. Again, no problems here.
The middle-left plot shows cases that might be influencing the model. It plots standardized residuals against leverage (an indicator of influence). Again, no problems here because all points fall within the contour lines.
The middle-right plot checks for collinearity between predictors. The interaction has moderate collinearity but we'd expect this because it is the product of the two other predictors. In any case, all values are below 10, so all is fine.
The bottom-left plot shows the distribution of residuals and we're looking for the dots to fall along the diagonal line (which they do), so no problems here either.
r user_visor() Step 5: Interpret the model [(B)]{.lbl}Before we interpret the final model, we will take a step back and interpret the model with only treat as a predictor ([treat_glm]{.alt}). This is purely for pedagogic reasons as it will help us to understand what the parameter estimates in logistic regression models represent.
r alien() Alien coding challengeUse the code box below to obtain the table of parameters from [treat_glm]{.alt} using the same code you have used throughout these tutorials.
model_parameters(treat_glm) |> display()
treat_par <- model_parameters(treat_glm)
treat_exp <- model_parameters(treat_glm, exponentiate = TRUE)

delivery_xtab <- xtabs(~ treat + delivered, data = santa_tib)
odds_p <- delivery_xtab[1, 2]/delivery_xtab[1, 1]
odds_w <- delivery_xtab[2, 2]/delivery_xtab[2, 1]
We can see that the intercept was r report_pe(treat_par, row = 1, symbol = "\\hat{b}_0"), which means that in the condition coded as zero (i.e. the pudding condition) the log odds of delivery were r value_from_ez(treat_par, row = 1). What does this mean? Earlier we calculated the odds of delivery after pudding as being r report_value(odds_p). The log odds is the natural logarithm of this value, ln(r report_value(odds_p)) = r report_value(log(odds_p)).
The effect of treat, r report_pe(treat_par, row = 2), tells us how the log odds in the baseline (pudding) group change as the variable treat changes by 1 unit. A change of 1 unit in this case is a change from the pudding category to the mulled wine category. The value of r value_from_ez(treat_par, row = 2) tells us that as we move from pudding to mulled wine the log odds of delivery change by r value_from_ez(treat_par, row = 2). In other words, the log odds of delivery go down (they are worse after wine than pudding).
The problem that most of us have is that our brain cannot think in terms of logs - it's difficult enough to think in terms of odds let alone log odds. Consequently, it makes life (a little) easier if we convert the log odds back into odds by taking the exponent of them. For example, instead of trying to interpret r value_from_ez(treat_par, row = 2), the change in the log odds, we instead interpret e^r value_from_ez(treat_par, row = 2)^ = r value_from_ez(treat_exp, row = 2), the change in the odds.
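The two scales are linked by log() and exp(). As a small sketch, using the odds of delivery after pudding that we calculated earlier (approximately 5.36):

```r
odds_pudding <- 5.36            # odds of delivery after pudding (from earlier)

# The log odds is the natural logarithm of the odds...
log_odds <- log(odds_pudding)   # ~ 1.68

# ...and exponentiating the log odds recovers the odds
exp(log_odds)                   # 5.36
```

So exponentiating a parameter estimate simply undoes the logarithm that the model applied, returning us to the (slightly) more interpretable odds scale.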
r robot() Code exampleWe can get model_parameters() to do this conversion for us by including the argument [exponentiate = TRUE]{.alt}.
model_parameters(treat_glm, exponentiate = TRUE)
r alien() Alien coding challengeUse model_parameters() to view the exponentiated model parameters and display() to round to 2 decimal places.
model_parameters(treat_glm, exponentiate = TRUE) |> display()
The output has changed: it now contains the model parameters expressed as odds rather than log odds. For the intercept, then, the odds of delivery in the baseline (pudding) group were r value_from_ez(treat_exp, row = 1). We calculated this value earlier: it means that r value_from_ez(treat_exp, row = 1) times more presents were delivered than not after consuming Christmas pudding.
The parameter estimate for the variable treat is $\hat{b}$ = r value_from_ez(treat_exp, row = 2), and this matches the value of the odds ratio that we calculated earlier. Look back to that section and you'll see that this value means that the odds of delivery after wine are r value_from_ez(treat_exp, row = 2) times the odds of delivery after pudding. Put another way, the odds of delivery are about 1/r value_from_ez(treat_exp, row = 2) = r report_value(1/round(value_from_ez(treat_exp, row = 2, as_is = T), 2)) times smaller after wine than pudding.
Assuming the current sample is one of the 95% for which the confidence interval contains the true value, then the population value of the odds ratio for treat lies between r value_from_ez(treat_exp, row = 2, value = "CI_low") and r value_from_ez(treat_exp, row = 2, value = "CI_high"). However, our sample could be one of the 5% that produces a confidence interval that 'misses' the population value. The important thing is that the interval does not contain 1 (both values are less than 1). The value of 1 is important because it is the threshold at which the direction of the effect changes. Think about what the odds ratio represents: values greater than 1 mean that as the predictor variable increases, so do the odds of (in this case) delivery, but values less than 1 mean that as the predictor variable increases, the odds of delivery decrease. If the value is 1 it means that the odds of delivery are identical for pudding and wine. In other words, there is no effect at all of treat.
If the confidence interval contains 1 then the population value might be one that suggests that the type of treat increases the odds of delivery, or decreases it or doesn't change it. For our confidence interval, the fact that both limits are below 1 suggests that the direction of the relationship that we have observed reflects that of the population (i.e., it's likely that the odds of delivery after wine really are worse than after pudding). If the upper limit had been above 1 then it would tell us that there is a chance that in the population the odds of delivery are actually higher after wine than pudding, or that the type of treat makes no difference at all.
Now we have an intuition about what the parameters mean, let's now interpret the final model.
r alien() Alien coding challengeUse model_parameters() to view the model parameters for the full model ([santa_glm]{.alt}).
model_parameters(santa_glm) |> display()
From the resulting table we can see
- The intercept is $\hat{b}_0$ = r value_from_ez(santa_par, row = 1), which means that when all predictors are zero (i.e. the situation where zero puddings were consumed, and the interaction term is zero too) the log odds of delivery were r value_from_ez(santa_par, row = 1).
- The parameter estimates for treat and quantity are both quite close to zero, indicating very small effects.
- The interaction effect is r report_pe(santa_par, row = 4). The $\hat{b}$-value suggests that as the combined effect of treat and quantity increases by one unit, the log odds of delivery change by r value_from_ez(santa_par, row = 4) units. Whatever that means.

To make (more) sense of these effects let's convert the parameters back to odds.

r alien() Alien coding challengeView the exponentiated model parameters.
model_parameters(santa_glm, exponentiate = TRUE) |> display()
From the resulting table we can see that:
The intercept is $\hat{b}_0$ = r value_from_ez(santa_exp, row = 1), which means that when all predictors are zero (i.e. the situation where zero puddings were consumed, and the interaction term is zero too) the odds of delivery were r value_from_ez(santa_exp, row = 1). That is r value_from_ez(santa_exp, row = 1) times more presents were delivered than not.
For the effects of treat and quantity the exponents of $\hat{b}$ (the odds ratios) are close to 1: for treat the odds ratio is r value_from_ez(santa_exp, row = 2), and for quantity it is r value_from_ez(santa_exp, row = 3). Remember that an odds ratio of 1 represents 'no effect', so these effect sizes suggest that when considered alone, the type of treat and the quantity consumed did not have a large impact on the deliveries. This conclusion is backed up by the non-significant p-values of r value_from_ez(santa_exp, row = 2, value = "p") and r value_from_ez(santa_exp, row = 3, value = "p") respectively.
The odds ratio for the interaction effect is much less than 1 ($\widehat{OR}$ = r value_from_ez(santa_exp, row = 4)) and the confidence interval for it ranges from r value_from_ez(santa_exp, row = 4, value = "CI_low") to r value_from_ez(santa_exp, row = 4, value = "CI_high"). It is important that this interval doesn't contain 1 because it suggests that (assuming this sample is one of the 95% for which the confidence interval contains the population value) the population value is not 1. In other words the true effect is a decrease in the odds of delivery as the interaction increases. To unpick what this value of r value_from_ez(santa_exp, row = 4) means, we'll do two things:
- plot the interaction effect
- estimate the effect of quantity alone separately for mulled wine and Christmas pudding (simple slopes).

r user_visor() Plotting the interaction [(B)]{.lbl}A useful way to visualise the interaction is to plot the average predicted probabilities for all combinations of values of the predictors. To get these values, we use the estimate_means() function from [modelbased]{.pkg}, which we've used many times before. If we store these average predicted probabilities then we can use the plot() function from the [see]{.pkg} package, which is another of our friends from [easystats]{.pkg}.
my_probs <- estimate_means(my_model, by = c("predictor 1", "predictor 2" ... "predictor 3"))
plot(my_probs)
in which [my_probs]{.alt} is the name you're giving for the saved predicted probabilities from the model. You replace [my_model]{.alt} with the name of the model from which you want predicted values, and ["predictor 1", "predictor 2" ... "predictor 3"]{.alt} is replaced with a list of predictors across which you want predicted values. Having stored these values, we put the object we've just created into plot(). The output of plot() is a [ggplot2]{.pkg} object, so we can augment it with any functions and themes from [ggplot2]{.pkg} by adding them as layers to the plot in the usual way.
r robot() Code exampleTo get the predicted probabilities from our final model we'd use
santa_probs <- estimate_means(santa_glm, by = c("quantity", "treat"))
plot(santa_probs)
This will plot quantity on the x-axis and display treat as different coloured lines.
We can jazz up the plot by using the viridis palette, which is good for accessibility, for both the colour (scale_colour_viridis_d(begin = 0.3, end = 0.85)) and fill (scale_fill_viridis_d(begin = 0.3, end = 0.85)) attributes of the plot, supply labels for the axes and legend, and apply a theme.
plot(santa_probs) +
  scale_colour_viridis_d(begin = 0.3, end = 0.85) +
  scale_fill_viridis_d(begin = 0.3, end = 0.85) +
  labs(x = "Quantity of treats", y = "Probability of delivery", colour = "Treat", fill = "Treat") +
  theme_minimal()
r alien() Alien coding challengePlot the treat × quantity interaction effect.
# Get the means
santa_probs <- estimate_means(xxxx, xxxx = c("xxxx", "xxxx"))
# Get the means
santa_probs <- estimate_means(santa_glm, by = c("quantity", "treat"))

# Now plot the predicted values
# Get the means
santa_probs <- estimate_means(santa_glm, by = c("quantity", "treat"))

# Now plot the predicted values
plot(xxxx)
# Get the means
santa_probs <- estimate_means(santa_glm, by = c("quantity", "treat"))

# Now plot the predicted values
plot(santa_probs)

# you can do better than that! Jazz up the plot with labels, themes and accessible colours
santa_probs <- estimate_means(santa_glm, by = c("quantity", "treat"))

plot(santa_probs) +
  scale_colour_viridis_d(begin = 0.3, end = 0.85) +
  scale_fill_viridis_d(begin = 0.3, end = 0.85) +
  labs(x = "Quantity of treats", y = "Probability of delivery", colour = "Treat", fill = "Treat") +
  theme_minimal()
The plot very clearly shows that for Christmas pudding the probability of delivery doesn't change much as the quantity of treats increases (the blue line is flat). However, for mulled wine, as the quantity consumed goes up the probability of delivery decreases: basically, the more wine the elves drink the lower the probability of delivery. This effect really starts to kick in at a quantity of two.
r user_visor() Simple slopes [(B)]{.lbl}To get simple slopes we can use the estimate_slopes() function from the [modelbased]{.pkg} package, which was introduced in discovr_10 and which we have used numerous times since. With logistic models there are a couple of extra arguments we need to think about. In general:
estimate_slopes(model = my_model, trend = predictor_variable, by = moderator_variable, predict = "link", transform = exp)
The two arguments we haven't seen before are:

- [predict = "link"]{.alt}: estimates the slopes on the scale of the link function, that is, as log odds.
- [transform = exp]{.alt}: exponentiates the estimates, converting them from log odds to odds ratios.
r alien() Alien coding challengeUse estimate_slopes() to view the simple slopes of quantity within each treat. Remember to exponentiate the results.
estimate_slopes(santa_glm, trend = "quantity", by = "treat", predict = "link", transform = exp) |> display()
quiz(
  caption = "Odds ratio check (level 2)",
  question(sprintf("How would you interpret the odds ratio for the effect of `quantity` for those that had pudding?"),
    answer("As the quantity increases by 1 standard deviation, deliveries increased by 1 standard deviation"),
    answer("As the quantity increased by 1 unit, deliveries increased by 1 unit"),
    answer("As the quantity increased by 1 unit, the odds of delivery changed by 0.92", correct = TRUE),
    answer("As the quantity increased by 1 unit, the log odds of delivery changed by 0.92", message = "This is the interpretation of the raw parameter estimate not the exponentiated one"),
    correct = "Correct - well done!",
    random_answer_order = TRUE,
    allow_retry = TRUE
  )
)
The output shows the effect of the quantity of treats consumed on the delivery of presents, separately for each treat. For those consuming pudding, the odds ratio is close to 1, r report_ss(santa_slopes, row = 1, symbol = "\\widehat{OR}"). The effect is also not significant, all of which suggests that quantity has no real effect when the treat consumed is pudding. This mirrors what was shown by the blue line in the plot: the line is fairly flat, suggesting that the quantity of pudding consumed doesn't affect the probability of delivery.
In contrast, for those consuming mulled wine, the odds ratio is well below 1, r report_ss(santa_slopes, row = 2, symbol = "\\widehat{OR}"), and the effect is significant. These results suggest that quantity has an effect on deliveries when the treat consumed is mulled wine. This mirrors what was shown by the green line on the plot we made earlier: as quantity increases the probability of delivery rapidly declines.
The interaction effect, therefore, reflects the fact that the effect of quantity on delivery is significantly different for Christmas pudding and mulled wine. As a rough approximation, it means that in the plot the blue and green lines have different slopes.
r alien() Alien coding challengeAs an optional exercise that might help you understand the interaction, use the code box to view the raw parameter estimates for the simple slopes (Hint: remove [transform = exp]{.alt}.)
estimate_slopes(santa_glm, trend = "quantity", by = "treat", predict = "link") |> display()
santa_ss <- estimate_slopes(santa_glm, trend = "quantity", by = "treat", predict = "link")
b_pud <- value_from_ez(santa_ss, row = 1, value = "Slope", as_is = T)
b_wine <- value_from_ez(santa_ss, row = 2, value = "Slope", as_is = T)
The interaction effect reflects the fact that the effect of quantity on delivery is significantly different for Christmas pudding and mulled wine. From the outputs, the effect of quantity on the log odds of delivery is r report_value(b_pud) for Christmas pudding and r report_value(b_wine) for mulled wine. The significant interaction means that r report_value(b_pud) is significantly different from r report_value(b_wine).
Interestingly, if we calculate the difference between the $\hat{b}$s we get:
r report_value(b_wine) - (r report_value(b_pud)) = r report_value(b_wine-b_pud), which is the parameter estimate for the interaction effect. And if we take the exponent of this value we get the odds ratio for the interaction effect: r paste0("$e^{", report_value(b_wine-b_pud), "}$") = r report_value(exp(b_wine-b_pud)).
So, the odds ratio for the interaction is the odds ratio for the difference in the effect of one predictor across levels of the other. In this case, it's the odds ratio for the difference between the effect of quantity of Christmas pudding on delivery and the effect of the quantity of mulled wine on delivery.
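If you want to convince yourself of this relationship, the sketch below uses made-up log-odds slopes (not the actual model estimates) to show that exponentiating a difference in log odds is the same as taking the ratio of the corresponding odds ratios:

```r
# Hypothetical log-odds slopes (illustrative values, not the model estimates)
b_pud  <- -0.05   # slope of quantity for Christmas pudding (log odds)
b_wine <- -0.90   # slope of quantity for mulled wine (log odds)

# The interaction parameter is the difference between the simple slopes
b_int <- b_wine - b_pud

# Exponentiating that difference gives the interaction odds ratio,
# which equals the ratio of the two simple-effect odds ratios
exp(b_int)                 # odds ratio for the interaction
exp(b_wine) / exp(b_pud)   # the same value, computed as a ratio of ORs
```

This works because $e^{a-b} = e^a/e^b$, so a difference on the log-odds scale always becomes a ratio on the odds scale.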
r user_visor() Robust logistic regression [(B)]{.lbl}It is also possible to create a robust model using the glmrob() function from the robustbase package. This takes the same form as the glm() function but returns robust estimates of parameters and associated standard errors and p-values.
r robot() Code exampleTo create a robust variant of the model we would execute:
santa_rob <- robustbase::glmrob(delivered ~ treat*quantity, data = santa_tib, family = binomial())
model_parameters(santa_rob) |> display()
which creates an object [santa_rob]{.alt} that contains the robust model. We can again summarise it with model_parameters().
r alien() Alien coding challengeCreate a robust model of present deliveries [santa_rob]{.alt} using the code above and summarise it with model_parameters().
santa_rob <- robustbase::glmrob(delivered ~ treat*quantity, data = santa_tib, family = binomial())
model_parameters(santa_rob) |> display()
santa_rob <- robustbase::glmrob(delivered ~ treat*quantity, data = santa_tib, family = binomial())
santa_rob_par <- model_parameters(santa_rob)
santa_rob_exp <- model_parameters(santa_rob, exponentiate = TRUE)
Compare the resulting model to the one you created earlier. Basically, the parameter estimates ($\hat{b}$s) don't change much and the conclusions of the model stay largely the same. For example, the parameter estimate for the interaction term was $\hat{b}$ = r value_from_ez(santa_par, row = 4) for the non-robust model but $\hat{b}$ = r value_from_ez(santa_rob_par, row = 4) for the robust model.
r alien() Alien coding challengeNow view the model parameters as odds rather than log odds.
santa_rob <- robustbase::glmrob(delivered ~ treat*quantity, data = santa_tib, family = binomial())
model_parameters(santa_rob, exponentiate = TRUE) |> display()
As you'd expect based on the raw parameter estimate, the robust odds ratio of $\widehat{OR}$ = r value_from_ez(santa_rob_exp, row = 4) is virtually identical to the one from the non-robust model, which was $\widehat{OR}$ = r value_from_ez(santa_exp, row = 4). The robust model gives us confidence that we can trust the original model. If the parameters were very different we'd report the robust model.
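If you'd like to see the two sets of estimates side by side rather than flicking between outputs, one option (a sketch, assuming the [santa_glm]{.alt} and [santa_rob]{.alt} objects created earlier are in your workspace) is compare_parameters() from the parameters package:

```r
# Place the non-robust and robust estimates in one table (as odds ratios)
parameters::compare_parameters(santa_glm, santa_rob, exponentiate = TRUE) |>
  display()
```

Each row of the resulting table shows a parameter with its estimate and confidence interval from each model, making it easy to spot any differences.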
**A message from Mae Jemstone:**
Mae stared at the box. She hadn't slept - the excitement had been too much. As everyone knew, December 24th was the one night of the year that the Sanza from the Solstice Nebula left their part of the galaxy to visit the rest of the universe. Mae loved December 25th and the anticipation that the Sanza had visited. As everyone knew, if you had tried to be the best version of 'you', the Sanza left you a gift. You didn't have to have behaved well, or acted perfectly, or made no mistakes. But you did have to try your best. You did have to try to choose love and empathy even when it was hardest to do. Mae always tried to choose love and, like every other year of her life, she woke to find a small, plain cardboard cube, stamped with the characteristic *S* of the Sanza. The gift of the Sanza was to truly feel, for one short day, your place in the universe and the ripples your effort creates. By opening the box, every good deed, every positive thought, every attempt to be kind, was reflected back at you. She opened the box and was overwhelmed. Despite everything, it had been a good year.