knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

#necessary to render tutorial correctly
library(learnr) 
library(htmltools)
#tidyverse
library(dplyr)
library(forcats)
library(ggplot2)
library(tidyr)
#non tidyverse
library(broom)
library(interactions)
library(knitr)
library(robustbase)
#students don't use
library(gt)

source("./www/discovr_helpers.R")

# Read data files needed for the tutorial

santa_tib <- discovr::santas_log
# Create bib file for R packages
here::here("inst/tutorials/discovr_20/packages.bib") |>
  knitr::write_bib(c('here', 'tidyverse', 'dplyr', 'readr', 'forcats', 'tibble', 'tidyr', 'knitr', 'gt', 'broom', 'interactions', 'robustbase'), file = _)

discovr: Categorical outcomes (Logistic regression)

Overview

discovr package hex sticker, female space pirate with gun. Gunsmoke forms the letter R. **Usage:** This tutorial accompanies [Discovering Statistics Using R and RStudio](https://www.discovr.rocks/) [@field_discovering_2023] by [Andy Field](https://en.wikipedia.org/wiki/Andy_Field_(academic)). It contains material from the book so there are some copyright considerations but I offer them under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). Tl;dr: you can use this tutorial for teaching and non-profit activities but please don't meddle with it or claim it as your own work.

r cat_space(fill = blu) Welcome to the discovr space pirate academy

Hi, welcome to discovr space pirate academy. Well done on embarking on this brave mission to planet r rproj()s, which is a bit like Mars, but a less red and more hostile environment. That's right, more hostile than a planet without water. Fear not though, the fact you are here means that you can master r rproj(), and before you know it you'll be as brilliant as our pirate leader Mae Jemstone (she's the badass with the gun). I am the space cat-det, and I will pop up to offer you tips along your journey.

On your way you will face many challenges, but follow Mae's system to keep yourself on track:

It's not just me that's here to help though, you will meet other characters along the way:

Also, use hints and solutions to guide you through the exercises (Figure 1).

Each codebox has a hints or solution button that activates a popup window containing code and text to guide you through each exercise.
Figure 1: In a code exercise click the hints button to guide you through the exercise.

By for now and good luck - you'll be amazing!

Workflow

Packages

This tutorial uses the following packages:

It also uses these tidyverse packages [@R-tidyverse; @tidyverse2019]: dplyr [@R-dplyr], forcats [@R-forcats], ggplot2 [@wickhamGgplot2ElegantGraphics2016], readr [@R-readr], and tidyr [@R-tidyr].

Coding style

There are (broadly) two styles of coding:

  1. Explicit: Using this style you declare the package when using a function: package::function(). For example, if I want to use the mutate() function from the package dplyr, I will type dplyr::mutate(). If you adopt an explicit style, you don't need to load packages at the start of your Quarto document (although see below for some exceptions).

  2. Concise: Using this style you load all of the packages at the start of your Quarto document using library(package_name), and then refer to functions without their package. For example, if I want to use the mutate() function from the package dplyr, I will use library(dplyr) in my first code chunk and type the function as mutate() when I use it subsequently.

Coding style is a personal choice. The Google r rproj() style guide and tidyverse style guide recommend an explicit style, and I use it in teaching materials for two reasons (1) it helps you to remember which functions come from which packages, and (2) it prevents clashes resulting from using functions from different packages that have the same name. However, even with this style it makes sense to load tidyverse because the dplyr and ggplot2 packages contain functions that are often used within other functions and in these cases explicit code is difficult to read. Also, no-one wants to write ggplot2:: before every function from ggplot2.

You can use either style in this tutorial because all packages are pre-loaded. If working outside of the tutorial, load the tidyverse package (and any others if you're using a concise style) at the beginning of your Quarto document:

library(tidyverse)

Data

To work outside of this tutorial you need to download the following data files:

Set up an r rstudio() project in the way that I recommend in this tutorial, and save the data files to the folder within your project called [data]{.alt}. Place this code in the first code chunk in your Quarto document:

santa_tib <- here::here("data/santas_log.csv") |>
  readr::read_csv() |>
   dplyr::mutate(
    treat = forcats::as_factor(treat) |>
      forcats::fct_relevel("Pudding", "Mulled wine"),
    delivered = forcats::as_factor(delivered) |>
      forcats::fct_relevel("Not delivered", "Delivered")
  )

This code reads in the data and converts the variables treat and delivered to be factors (categorical variable). It also uses fct_relevel() to set the order of the levels of each factor.

r bmu() A Christmas disaster [(2)]{.alt}

Let's begin with a Christmas tale. A year ago Santa was resting in his workshop studying his nice and naughty lists. He noticed a name on the naughty list in bold, upper case letters. It said ANDY FIELD OF UNIVERSITY OF SUSSEX. He went to look up the file of this Andy Field character. He stared into his snow globe, and as the mists cleared he saw a sad, lonely, friend-less character walking across campus. Under one arm a box of chocolates, under the other a small pink Hippo. As he walked the campus he enticed the young students around him to follow him by offering chocolate. Like the Pied Piper, he led them to a large hall. Once inside, the boys and girls' eyes glistened in anticipation of more chocolate. Instead he unleashed a monologue about the general linear model of such fearsome tedium that Santa began to wonder how anyone could have grown to be so soulless and cruel.

Santa dusted off his sleigh and whizzed through the night sky to the Sussex campus. Once there he confronted the evil fiend that he had seen in his globe. "You've been a naughty boy," he said. "I give you a choice. Give up teaching statistics, or I will be forced to let the Krampus pay you a visit."

Andy looked sad, "But I love statistics," he said to Santa, "It's cool."

Santa pulled out a candy cane, from it emerged a screen. Just as he was about to instruct the screen to call the Krampus, an incoming message appeared. It was Eddie the Elf. Eddie said that last year 10.6% of presents weren't delivered.

What was Santa to do? How could he find out what determines whether presents get delivered or not? He panicked.

Just then, Santa heard a sad little voice. It said, "I can help you".

"How? replied Santa.

"My students," he replied, "they can save Christmas. All they need are data."

With that, Santa looked into his candy screen at the elves who had called him, and turned to Andy. "Tell them what you need."

Andy discovered that to deliver presents Santa uses a large team of elves, and that at each house they usually consume treats. The treats might be Christmas pudding, or sometimes mulled wine. He also discovered that they consume different quantities. Sometimes nothing is left, but other times there might be 1, 2, 3 or even 4 pieces of pudding or glasses of mulled wine. The Elves transmitted a log file of 400 of the previous year's deliveries. It was called [santa_tib]{.alt}.

r alien() Alien coding challenge

View the data in [santa_tib]{.alt}.


santa_tib

Note that there are four variables:

The variables treat and delivered are factors (categorical variable), so having read the data file and converted these variables to factors it's a good idea to check the order of the levels for each one.

r alien() Alien coding challenge

Using what you've learnt in previous tutorials check the order of the levels of the variables treat and delivered.


# use this function:
levels()
# Remember that to access a variable you use:

name_of_tibble$name_of_variable
# solution:

levels(santa_tib$treat)
levels(santa_tib$delivered)

You'll find that the factor levels are:

These category orders are as they are because I set the data up within this tutorial. Working outside of the tutorial you might need to manually set the order of levels for each factor using forcats::fct_relevel() as described in the [Data]{.alt} part of this tutorial.

r bmu() The model [(2)]{.alt}

We'll start by looking solely at the effect of the type of treat. The model is described by the following equation:

$$ P(\text{delivery}) = \frac{1}{1+ e^{-(\hat{b}_0 + \hat{b}_1\text{treat}_i+e_i)}}
$$

In other words, we predict the probability of presents being delivered from the type of treat consumed. We can re-arrange this equation to express it differently. It's exactly the same equation, but re-arranged:

$$ \ln\bigg(\frac{P(\text{delivery})}{1- P(\text{delivery})}\bigg) = \hat{b}0 + \hat{b}_1\text{treat}{i} + e_i $$

In this version we predict the log odds of delivery from the standard equation for a linear model.

Odds and the odds ratio

Odds

The odds of an event (e.g. delivery) is the ratio of the number of times an event occurs compared to the number of times it doesn't occur:

$$ \begin{aligned} \text{odds} &= \frac{\text{number of times event occurs}}{\text{number of times even does not occur}} \ \text{odds}_\text{delivery} &= \frac{\text{number of times presents are delivered}}{\text{number of times presents are not delivered}} \end{aligned} $$

Table 1 shows the number of successful and unsuccessful deliveries split by whether pudding or mulled wine was consumed (we're ignoring quantity to keep things simple).

santa_tib |> 
  dplyr::group_by(treat, delivered) |> 
  dplyr::summarize(n = n()) |> 
  tidyr::pivot_wider(
    id_cols = "treat",
    names_from = "delivered",
    values_from = "n"
  ) |>
  dplyr::mutate(
    Total = `Delivered` + `Not delivered`
  ) |> 
  dplyr::ungroup() |>
  gt::gt(rowname_col = "treat") |> 
  gt::grand_summary_rows(
    fns = list(label = "Total", fn = "sum"),
    decimals = 0
  )
quiz(caption = "Odds quiz (level 2)",
     question("Using Table 1, the odds of delivery are?",
         answer("1.67", correct = TRUE),
         answer("0.6", message = "These are the odds of non-delivery. You've got the equation upside down!"),
         answer("0.82", message = "These are the odds of delivery after wine, not overall"),
         answer("5.36", message = "These are the odds of delivery after pudding, not overall"),
         incorrect = "Try again!",
         allow_retry = T,
         random_answer_order = T
         )
     )

The odds ratio

The odds ratio expresses the change in odds. In this case, we could compute the change in odds as we change from wine as a treat to pudding. To do this we would compute the odds of delivery in the wine condition, and then do the same for the pudding condition. The resulting odds ratio would be:

$$ \text{odds ratio} = \frac{\text{odds}\text{delivery after wine}}{\text{odds}\text{delivery after pudding}} $$

quiz(caption = "Odds ratio quiz (level 2)",
  question("Using Table 1, what are the odds of delivery after wine?",
         answer("1.67", message = "These are the odds of delivery overall, not for wine only"),
         answer("1.22", message = "These are the odds of non-delivery after wine. You've got the equation upside down!"),
         answer("0.82", correct = TRUE),
         answer("5.36", message = "These are the odds of delivery after pudding, not wine"),
         incorrect = "Try again!",
         allow_retry = T,
         random_answer_order = T
         ),
  question("Using Table 1, what are the odds of delivery after pudding?",
         answer("1.67", message = "These are the odds of delivery overall, not for wine only"),
         answer("0.19", message = "These are the odds of non-delivery after pudding. You've got the equation upside down!"),
         answer("0.82", message = "These are the odds of delivery after wine, not pudding"),
         answer("5.36", correct = TRUE),
         incorrect = "Try again!",
         allow_retry = T,
         random_answer_order = T
         ),
  question("Using the answers to the previous two questions, what is the odds ratio for wine compared to pudding?",
         answer("6.18", message = "The odds ratio is the odds in one condition divided by the odds in the other."),
         answer("6.67", message = "This is the odds ratio for pudding relative to wine not wine relative to pudding. You've got the equation upside down!"),
         answer("0.82", message = "The odds ratio is the odds in one condition divided by the odds in the other."),
         answer("0.15", correct = TRUE),
         incorrect = "Try again!",
         allow_retry = T,
         random_answer_order = T
         )
)

We will interpret these values as we encounter them later on.

r user_visor() Exploring the data [(2)]{.alt}

Our outcome variable is categorical so to summarise it we want to look at frequencies, or counts (that is, how many cases fall into different categories). A simple but crude way to get a crosstabulation table of frequencies is the xtabs() function, but we can also use what we know about tidyverse too. We used both of these methods in discovr_19, and we'll revisit them here.

r bmu() Using xtabs() [(1)]{.alt}

The xtabs() function is part of base r rproj(). When we used it in discovr_19 we applied it to a data set that contained the frequencies rather than the raw data, and we saw that it takes this form.

xtabs(frequency ~ row_variable + column_variable + table_variable, data = tibble)

When applying xtabs() to raw data we can omit the [frequencies]{.alt} (because they don't exist) and instead the function takes this general form:

xtabs(~ row_variable + column_variable + table_variable, data = tibble)

Essentially we replace [row_variable]{.alt} and [column_variable]{.alt} with the variables in the data that you would like to form the rows and columns of the contingency table respectively. In more complex situations you can specify a third categorical variable as the [table_variable]{.alt} and separate contingency tables will be created for each category of that third variable.

r robot() Code example

To look at the table of frequencies for the variables of treat and delivered we'd execute:

delivery_xtab <- xtabs(~ treat + delivered, data = santa_tib)

r alien() Alien coding challenge

Use the code box to create a table of frequencies for the variables of treat and delivered. You should get a table that shows the four frequencies in Table 1 (ignoring the column and row labelled 'Total').


delivery_xtab <- xtabs(~ treat + delivered, data = santa_tib)
delivery_xtab

r user_visor() Using tidyverse [(2)]{.alt}

A different approach is to use all of the tidyverse skills we have acquired. Specifically, we can use code that we learnt about in discovr_19. To remind you, in that tutorial we used the following dplyr functions:

To count the frequencies of scores we need to do two things:

  1. Tell r rproj() which categories we want to count. For example, we could use group_by(treat) to tell r rproj() that we want to conduct any future commands on the 'Pudding' and 'Mulled wine' categories separately. However, we want to count across all combinations of treat and delivered so we need to include both variables in the function: group_by(treat, delivered).
  2. Count how many scores fall into each 'category'. This is done using tally(), which counts how many cases are in each group created by group_by(). This variable will, therefore, contain the number of times each combination of the values of treat and delivered occur.

r robot() Code example

This might all make more sense if we have a go. I'll show you the pipe and then explain it line by line:

santa_xtab <- santa_tib |> 
  dplyr::group_by(treat, delivered) |> 
  dplyr::tally()

Let's look at each line:

  1. The first line creates an object santa_xtab. We begin by passing the santa_tib tibble into the pipe.
  2. The second line groups the data by all combinations of treat and delivered. It effectively creates 4 cells: 2 categories of treat (pudding, mulled wine) × 2 categories of delivery (delivered, not delivered). To be explicit it creates these combinations: [{pudding, not delivered}]{.alt}, [{pudding, delivered}]{.alt}, [{mulled wine, not delivered}]{.alt} and [{mulled wine, delivered}]{.alt}
  3. The final line uses the tally() function to count the cases. The function creates a variable called n and, because in the previous line we grouped the data by all combinations of variables, counts the number of observations in each of the 4 combinations of categories. In effect, we have calculated the cell frequencies for all combinations of treat and delivered.

r alien() Alien coding challenge

Create a table of frequencies for the variables of treat and delivered called [santa_xtab]{.alt} and view it. (Remember that to view an object you execute its name.)


santa_xtab <- santa_tib |> 
  dplyr::group_by(treat, delivered) |> 
  dplyr::tally()
santa_xtab # This line displays the object we created above

You should see a tibble with 4 rows representing the different combinations of treat and delivered, and a variable (n) containing their cell counts. The values should match Table 1.

If we want the table to look like Table 1, where values for 'delivered' and 'not delivered' are spread across columns, we use pivot_wider (discovr_06, discovr_09 and discovr_19) to spread the values for the variable delivered across columns of the table. To remind you, the pivot_wider() restructures rows into columns. It's general form is:

tidyr::pivot_wider(
  data = tibble,
  id_cols = variables_that_you_do_not_want_to_restructure,
  names_from = "variable_containing_the_names_of_columns",
  values_from = " variable_containing_the_scores",
)

If we want to spread values associated with 'delivered' and 'not delivered' across columns then we want to get the column [names_from]{.alt} the variable delivered. The frequencies are stored in the variable n, so we'd want to get [values_from]{.alt} from that variable. Finally, we want to leave categories of treat in rows so we'd select treat as the [id_cols]{.alt}.

r robot() Code example

Assuming we've already created the object [santa_xtab]{.alt} as above, we could use this code:

santa_xtab <- santa_xtab |> 
  tidyr::pivot_wider(
    id_cols = "treat",
    names_from = "delivered",
    values_from = "n"
  )

r alien() Alien coding challenge

Use pivot_wider() to spread the values for the categories of delivered across columns.

santa_xtab <- santa_tib |> 
  dplyr::group_by(treat, delivered) |> 
  dplyr::tally()

santa_xtab <- santa_xtab |> 
  tidyr::pivot_wider(
    id_cols = "treat",
    names_from = "delivered",
    values_from = "n"
  )
santa_xtab # This line displays the object we created above

r user_visor() Fitting the model [(2)]{.alt}

r user_visor() The glm() function [(2)]{.alt}

To fit the model we use the function glm() which stands for 'generalized linear model'. The function takes the same form as lm(), which we have used many times before. However, it differs in that it allows you to specify a specific distribution for the errors in the model. When the outcome is a dichotomy (categorical with two categories) we need to specify the family as binomial() giving us a general form of the function as follows:

glm(outcome ~ predictor(s),
    data = my_tib,
    family = binomial(),
    na.action = na.fail)

In which you replace [outcome ~ predictor(s)]{.alt} with the formula for your model in exactly the same you would for lm(). In this case, we want to predict delivered from treat so our formula is [delivered ~ treat]{.alt}. We replace [my_tib]{.alt} with the name of our tibble (in this case [santa_tib]{.alt}). Alternatively we can pipe the tibble into glm() and use [data = .]{.alt}. The argument [na.action = na.fail]{.alt} determines how to treat missing values. By default it is set to [na.fail]{.alt} which means that the model fails (it is not fit). If you have missing values then set this option to [na.omit]{.alt} which removes cases, or [na.exclude]{.alt} which excludes cases from the model (but does not remove them).

In addition to these familiar arguments, there is

r robot() Code example

We could fit our model by executing:

santa_glm <- glm(delivered ~ treat, data = santa_tib, family = binomial())

We can use the methods in the broom package with the resulting object, which means that we can obtain the model fit using glance() and the model parameters using tidy():

broom::glance(santa_glm)
broom::tidy(santa_glm, conf.int = TRUE)

We will focus on using tidy() to view the model parameters because with logistic models you don't get overall significance tests for fit (although we look at ways to obtain these statistics in due course).

r alien() Alien coding challenge

Fit a model predicting present deliveries from the type of treat and view the model parameters to 2 decimal places.


# fit the model
santa_glm <- glm(xxxxxx ~ xxxxxx, data = xxxxx, family = xxxxxx)
# fit the model
santa_glm <- glm(delivered ~ treat, data = santa_tib, family = binomial())

# view the model parameters
broom::tidy(xxxxxxx, xxxx.xxxx = xxxxx)
santa_glm <- glm(delivered ~ treat, data = santa_tib, family = binomial())
broom::tidy(santa_glm, conf.int = TRUE) |> 
  knitr::kable(digits = 2)
santa_glm <- glm(delivered ~ treat, data = santa_tib, family = binomial())
santa_tbl <- broom::tidy(santa_glm, conf.int = TRUE)
santa_exp <- broom::tidy(santa_glm, conf.int = TRUE, exponentiate = TRUE)
delivery_xtab <- xtabs(~ treat + delivered, data = santa_tib)
odds_p <- delivery_xtab[1, 2]/delivery_xtab[1, 1]
odds_w <- delivery_xtab[2, 2]/delivery_xtab[2, 1]

We can see that $\hat{b}_0$ = r get_par(santa_tbl, row = 1), which means that in the condition coded as zero (i.e. the pudding condition) the log odds of delivery were r get_par(santa_tbl, row = 1). What does this mean? Earlier we calculated the odds of delivery after pudding as being r sprintf("%.2f", odds_p). The log odds, is the natural logarithm of this value, ln(r sprintf("%.2f", odds_p)) = r sprintf("%.2f", log(odds_p)).

The $\hat{b}$ for the effect of treat tells us how the log odds in the baseline (pudding) group change as the variable treat changes by 1 unit. A change of 1 unit in this case would be a change from the pudding category to the mulled wine category. The value of r get_par(santa_tbl, row = 2) tells us that as we move from pudding to mulled wine the log odds of delivery decreases by r get_par(santa_tbl, row = 2). In other words, the odds of delivery are going down (they are worse after wine than pudding).

The problem that most of us have is that our brain cannot think in terms of logs - it's difficult enough to think in terms of odds let alone log odds. Consequently, it makes life (a little) easier if we convert the log odds back into odds by taking the exponent of them. For example, instead of trying to interpret r get_par(santa_tbl, row = 2), the change in the log odds, we instead interpret e^r get_par(santa_tbl, row = 2)^ = r get_par(santa_exp, row = 2), the change in the odds.

r robot() Code example

We can get tidy() to do this conversion for us by including the argument [exponentiate = TRUE]{.alt}.

broom::tidy(santa_glm, conf.int = TRUE, exponentiate = TRUE)

r alien() Alien coding challenge

Use tidy() to view the exponentiated model parameters and kable() to round to 2 decimal places.

santa_glm <- glm(delivered ~ treat, data = santa_tib, family = binomial())

broom::tidy(santa_glm, conf.int = TRUE, exponentiate = TRUE) |> 
  knitr::kable(digits = 2)

Th output has changed: it now contains the model parameters expressed as odds rather than log odds. For $\hat{b}_0$, then, the odds of delivery in the baseline (pudding) group were r get_par(santa_exp, row = 1). We calculated this value earlier, it means that r get_par(santa_exp, row = 1) times more presents were delivered than not after consuming Christmas pudding.

The parameter estimate for the variable treat is $\hat{b}$ = r get_par(santa_exp, row = 2), and this matches the value of the odds ratio that we calculated earlier. Look back to that section and you'll see that this value means that the odds of delivery after wine are r get_par(santa_exp, row = 2) the odds of delivery after pudding. Put another way, the odds of delivery are about 1/r get_par(santa_exp, row = 2) = r sprintf("%.2f", 1/round(santa_exp$estimate[2], 2)) times smaller after wine than pudding.

Assuming the current sample is one of the 95% for which the confidence interval contains the true value, then the population value of the odds ratio for treat lies between r get_par(santa_exp, row = 2, col = "conf.low") and r get_par(santa_exp, row = 2, col = "conf.high"). However, our sample could be one of the 5% that produces a confidence interval that 'misses' the population value. The important thing is that the interval does not contain 1 (both values are less than 1). The value of 1 is important because it is the threshold at which the direction of the effect changes. Think about what the odds ratio represents: values greater than 1 mean that as the predictor variable increases, so do the odds of (in this case) delivery, but values less than 1 mean that as the predictor variable increases, the odds of delivery decrease. If the value is 1 it means that the odds of delivery are identical for pudding and wine. In other words, there is no effect at all of treat.

If the confidence interval contains 1 then the population value might be one that suggests that the type of treat increases the odds of delivery, or decreases it or doesn't change it. For our confidence interval, the fact that both limits are below 1 suggests that the direction of the relationship that we have observed reflects that of the population (i.e., it's likely that the odds of delivery after wine really are worse than after pudding). If the upper limit had been above 1 then it would tell us that there is a chance that in the population the odds of delivery are actually higher after wine than pudding, or that the type of treat makes no difference at all.

Now we have the basic understanding of what the parameters mean, consider working through the optional section on looking at the overall model fit.

`r pencil()` **Report**`r rproj()` A logistic regression model was fit predicting delivery of presents from the type of treat offered. The odds of delivery in those receiving pudding as a treat was significantly different to 1, odds = `r report_pars(santa_exp, row = 1)`. The odds ratio for the type of treat was also significant, $\widehat{OR}$ = `r report_pars(santa_exp, row = 2)`. The odds of delivery after wine were `r get_par(santa_exp, row = 2)` the size of the odds of delivery after pudding.

r user_astronaut() Assessing overall fit [(3)]{.alt}

To get test of overall fit we need to build up models step-by_step and use anova() to compare them. In this example, with only a single predictor this would mean fitting a model with only the intercept, then fitting a model that includes the predictor and comparing the two models:

santa_int <- glm(delivered ~ 1, data = santa_tib, family = binomial())
santa_treat <- glm(delivered ~ treat, data = santa_tib, family = binomial())
anova(santa_int, santa_treat, test = "Chisq")

The first line of code specifies a model that predicts deliveries from the intercept, the second line fits a model that predicts deliveries from both the intercept and the variable treat. The final line compares the two previously created model and applies a Chi square-test that tests whether the change in the deviance statistic is significantly different from zero (in other words has the fit of the model significantly improved because of including treat as a predictor).

r alien() Alien coding challenge

Use the code box to try the code above. Round the output to 3 decimal places.


santa_int <- glm(delivered ~ 1, data = santa_tib, family = binomial())
santa_treat <- glm(delivered ~ treat, data = santa_tib, family = binomial())
anova(santa_int, santa_treat, test = "Chisq") |>  
knitr::kable(digits = 3)
santa_int <- glm(delivered ~ 1, data = santa_tib, family = binomial())
santa_treat <- glm(delivered ~ treat, data = santa_tib, family = binomial())
santa_aov <- anova(santa_int, santa_treat, test = "Chisq") |> 
  tibble::as_tibble() |> 
  dplyr::rename(
    resid_dev = `Resid. Dev`
  )

The residual deviance statistic [Resid. Dev]{.alt} is a measure of model fit (think of it like the total error - large values mean poorer fit). Note that with the intercept only model the residual deviance is r sprintf("%.2f", santa_aov$resid_dev[1]) but when we include treat it reduces to r sprintf("%.2f", santa_aov$resid_dev[2]) : a reduction of r sprintf("%.2f", santa_aov$Deviance[2]) (the value in the table under [Deviance]{.alt}) The fact that the deviance got smaller means that the fit has improved. The Chi-square test tells us whether this reduction of r sprintf("%.2f", santa_aov$Deviance[2]) is significantly different from zero (i.e., no reduction). If this change in the residual deviance is significantly different from 0 then the model fit has improved, which is the case here. Put another way, a significant deviance shows that by including treat as a predictor the fit of the model significantly improves compared to when it wasn't included.

r user_visor() Multiple predictors [(2)]{.alt}

It seems like the type of treat affects the odds of delivery. However, the elves don't want to be banned from the mulled wine so they ask Santa whether the issue might actually be the quantity of treats they consume? We can extend the model to include the quantity of treats as a predictor and also the interaction between the quantity of treats consumed and the type of treat consumed.

r user_visor() The model [(2)]{.alt}

The model is described by the following equation:

$$ P(\text{delivery}) = \frac{1}{1+ e^{-(\hat{b}_0 + \hat{b}_1\text{treat} + \hat{b}_2\text{quantity}_i + \hat{b}_3\text{treat} \times \text{quantity}_i + e_i)}}
$$ In other words, we predict the probability of presents being delivered from the type of treat consumed, the quantity consumed, and the interaction of those two variables. We can re-arrange this equation to express it differently. It's exactly the same equation, but re-arranged:

$$ \ln\bigg(\frac{P(\text{delivery})}{1- P(\text{delivery})}\bigg) = \hat{b}_0 + \hat{b}_1\text{treat} + \hat{b}_2\text{quantity}_i + \hat{b}_3\text{treat} \times \text{quantity}_i + e_i $$

In this version we predict the log odds of delivery from the standard equation for a linear model.

r user_astronaut() Hierarchical variable entry [(3)]{.alt}

This optional section shows how we could build up the model step-by-step and look at the overall change in the fit at each step. This is useful in two situations. First if you want to do hierarchical variable entry (that is, build up the model in a theory-driven way) this would be the way to do it. Second, it is useful to quantify the overall significance of categorical predictors made up of several categories (i.e., that are split across several dummy variables).

We can use the same code as before but add two new models that add in quantity and the interaction term. A quicker way to add (or subtract) things to a model is to use the update() function, which takes this form:

update(model_to_update, .~. + new_predictor)

Basically, this function takes any existing model placed within it and updates it according to any changes you specify. So, we'd replace [model_to_update]{.alt} with the name of the model we want to update. The [.~.]{.alt} tells it to retain the existing outcome and predictors (incidentally [.~]{.alt} would tell it to retain the outcome variable but drop all predictors). We replace [new_predictor]{.alt} with the name of the predictor we want to add.

Let's look at an example. Earlier on we used this code to specify two models:

santa_int <- glm(delivered ~ 1, data = santa_tib, family = binomial())
santa_treat <- glm(delivered ~ treat, data = santa_tib, family = binomial())

Using update we'd use

santa_int <- glm(delivered ~ 1, data = santa_tib, family = binomial())
santa_treat <- update(santa_int, .~. + treat)

The second line now creates [santa_treat]{.alt} by updating [santa_int]{.alt} to include treat as a predictor. We can subsequently update this model to include quantity as a predictor and then update that model to include the interaction:

santa_int <- glm(delivered ~ 1, data = santa_tib, family = binomial())
santa_treat <- update(santa_int, .~. + treat)
santa_quant <- update(santa_treat, .~. + quantity)
santa_full <- update(santa_quant, .~. + treat:quantity)

Note that each time we change the name of the model being updated to the model at the previous step, and also note that the interaction is specified using [treat:quantity]{.alt} (see the tip in the next section)

We can now compare these models as before using:

anova(santa_int, santa_treat, santa_quant, santa_full, test = "Chisq")

r alien() Alien coding challenge

Fit the models described above and compare them using anova(). Round the output to 3 decimal places.


santa_int <- glm(delivered ~ 1, data = santa_tib, family = binomial())
santa_treat <- update(santa_int, .~. + treat)
santa_quant <- update(santa_treat, .~. + quantity)
santa_full <- update(santa_quant, .~. + treat:quantity)
anova(santa_int, santa_treat, santa_quant, santa_full, test = "Chisq")  |> 
  knitr::kable(digits = 3)

Note that for each model we add one new predictor so the degrees of freedom decrease by 1 at each step. Also note that each model significantly improves the fit compared to the previous model (i.e., each model significantly decreases the residual deviance) suggesting that each predictor is significant.

As I mentioned, this step-by-step process is useful if (1) you want to do hierarchical variable entry (that is, build up the model in a theory-driven way), or (2) you have a predictor with multiple categories and you want to know its overall effect.

r user_visor() Fitting the model in one step [(2)]{.alt}

We can extend the model using glm() but with an updated formula.

`r cat_space()` **Tip** Remember from `discovr_10` that `*` is used to specify all main effects and interactions whereas `:` is used to specify only an interaction. Therefore we can update the formula for the model that predicts Christmas deliveries by either specifying the effects of **treat**, *quantity* and the **treat × quantity** interaction implicitly with wzxhzdk:49 or by specifying the three effects explicitly with wzxhzdk:50

r alien() Alien coding challenge

Fit a model predicting present deliveries from the type of treat, the quantity of treats consumed and their interaction. Call the model [santa_full_glm]{.alt} View the raw model parameters (i.e. do not exponentiate them).


# fit the model
santa_full_glm <- glm(xxxxxx ~ xxxxxx*xxxxxx, data = xxxxx, family = xxxxxx)
# fit the model
santa_full_glm <- glm(delivered ~ treat*quantity, data = santa_tib, family = binomial())

# view the model parameters
broom::tidy(xxxxxxx, xxxx.xxxx = xxxxx)
santa_full_glm <- glm(delivered ~ treat*quantity, data = santa_tib, family = binomial())
broom::tidy(santa_full_glm, conf.int = TRUE) |> 
  knitr::kable(digits = 2)
santa_full_glm <- glm(delivered ~ treat*quantity, data = santa_tib, family = binomial())
santa_full_tbl <- broom::tidy(santa_full_glm, conf.int = TRUE)
santa_full_exp <- broom::tidy(santa_full_glm, conf.int = TRUE, exponentiate = T)

From the resulting table we can see that $\hat{b}_0$ is r get_par(santa_full_tbl, row = 1), which means that when all predictors are zero (i.e. the situation where zero puddings were consumed, and the interaction term is zero too) the log odds of delivery were r get_par(santa_full_tbl, row = 1). The $\hat{b}$s for the main effects of treat and quantity are both quite close to zero indicating very small effects. However, the interaction effect is significant (p < 0.001), and the $\hat{b}$-value suggests that as combined effect of treat and quantity increases by one unit, the log odds of delivery change by r get_par(santa_full_tbl, row = 4) units. Whatever that means. To make (more) sense of these effects lets convert the parameters back to odds.

r alien() Alien coding challenge

View the exponentiated model parameters.

santa_full_glm <- glm(delivered ~ treat*quantity, data = santa_tib, family = binomial())

broom::tidy(santa_full_glm, conf.int = TRUE, exponentiate = TRUE) |> 
  knitr::kable(digits = 2)

From the resulting table we can see that $\hat{b}_0$ is r get_par(santa_full_exp, row = 1), which means that when all predictors are zero (i.e. the situation where zero puddings were consumed, and the interaction term is zero too) the odds of delivery were r get_par(santa_full_exp, row = 1). That is r get_par(santa_full_exp, row = 1) times more presents were delivered than not.

For the effects of treat and quantity the exponents of $\hat{b}$ (the odds ratios) are close to 1: for treat the odds ratio is r get_par(santa_full_exp, row = 2), and for quantity it is r get_par(santa_full_exp, row = 3). Remember that an odds ratio of 1 represents 'no effect', so these effect sizes suggest that when considered alone, the type of treat and the quantity consumed did not have an large impact on the deliveries. This conclusion is backed up by the non-significant p-values of r get_par(santa_full_exp, row = 2, col = "p.value") and r get_par(santa_full_exp, row = 3, col = "p.value") respectively.

The odds ratio for the interaction effect is much less than 1 ($\widehat{OR}$ = r get_par(santa_full_exp, row = 4)) and the confidence interval for it ranges from r get_par(santa_full_exp, row = 4, col = "conf.low") and r get_par(santa_full_exp, row = 4, col = "conf.high"). It is important that this interval doesn't contain 1 because it suggests that (assuming this sample is one of the 95% for which the confidence interval contains the population value) the population value is not 1. In other words the true effect relates to a decrease in the odds of delivery as the interaction increases. To unpick what this value of r get_par(santa_full_exp, row = 4) means, we'll do two things:

  1. Plot the interaction
  2. Fit models that predict deliveries from quantity alone separately for mulled wine and Christmas pudding (simple slopes).

r user_visor() Plotting the interaction [(2)]{.alt}

A useful way to visualise the interaction is to plot it. The interact_plot() function from interactions will plot interaction effects from logistic models. It takes the general form:

interactions::interact_plot(my_model, pred = my_predictor, modx = my_moderator)

Basically you replace [my_model]{.alt} with the name of your model, [my_predictor]{.alt} with the name of the predictor variable (in this case quantity makes sense) and [my_moderator]{.alt} with the name of the moderator variable (in this case treat makes sense). We looked at moderation in discovr_10, but in this context what matters is that the predictor is plotted on the x-axis whereas the moderator is used to split the data. If we specify quantity as the predictor and treat as the moderator we'll get the effect of quantity on delivery and different lines for Christmas pudding and mulled wine.

The output is a ggplot2 object, which means that you can augment it with any functions and themes from ggplot2 by adding them as layers to the plot in the usual way.

r robot() Code example

To plot the treat × quantity interaction effect use this code:

interactions::interact_plot(santa_full_glm, pred = quantity, modx = treat)

r alien() Alien coding challenge

Plot the treat × quantity interaction effect. Use what you know about ggplot2 to add labels and a minimal theme.


#Basic plot
interactions::interact_plot(xxxxx, pred = xxxx, modx = xxxxx)
#Basic plot
interactions::interact_plot(santa_full_glm, pred = quantity, modx = treat)
# Now use 'labs()` to add x and y label axes
# Now use 'labs()` to add x and y label axes
interactions::interact_plot(santa_full_glm, pred = quantity, modx = treat) + labs(x = "Quantity of treats consumed", y = "Probability of delivery", fill = "Treat")
# now add a minimal theme
interactions::interact_plot(santa_full_glm, pred = quantity, modx = treat) +
  labs(x = "Quantity of treats consumed", y = "Probability of delivery", fill = "Treat") +
  theme_minimal()

The plot very clearly shows that for Christmas pudding the probability of delivery doesn't change much as the quantity of treats increases (the blue line is flat). However, for mulled wine, as the quantity consumed goes up the probability of delivery decreases: basically, the more wine the elves drink the lower the probability of delivery. This effect really starts to kick in at a quantity of two.

r user_visor() Breaking down the interaction [(2)]{.alt}

Another approach is to look at what the odds ratio of r get_par(santa_full_exp, row = 4) for the interaction means, by fitting separate models for mulled wine and Christmas pudding that predict deliveries from quantity alone. There's various ways we could do this. We could, for example, use filter() to select only cases of mulled wine before piping into the glm() function. However, the glm() function contains an argument [subset]{.alt} that allows you to subset the data directly from within the function. In general

glm(outcome ~ predictor(s),
    data = my_tib,
    subset = filtering_instructions,
    family = binomial(),
    na.action = na.fail)

Notice that I have included the subset argument (which I omitted before to keep things simple). We need to replace [filtering_instructions]{.alt} with instructions like we would use in filter(). So, if we wanted to fit a model only to elves who ate Christmas pudding we would replace [filtering_instructions]{.alt} with [treat == "Pudding"]{.alt} giving us:

r robot() Code example

pudding_glm <- glm(delivered ~ quantity,
                   data = santa_tib,
                   subset = treat == "Pudding",
                   family = binomial())

Note that the model formula has changed to [delivered ~ quantity]{.alt} because we want to predict deliveries by quantity alone, and we have included [subset = treat == "Pudding"]{.alt} so that this model is applied only to cases where the variable treat has the value ["Pudding"]{.alt}.

r robot() Code example

Similarly, we can fit a model to the elves who had mulled wine by setting the [subset]{.alt} argument to [subset = treat == "Mulled wine"]{.alt}.

wine_glm <- glm(delivered ~ quantity,
                data = santa_tib,
                subset = treat == "Mulled wine",
                family = binomial())

r alien() Alien coding challenge

Create the models [pudding_glm]{.alt} using the code above. View the exponentiated parameter estimates of both models.


# Specify the pudding model
pudding_glm <- glm(xxxx ~ xxxxx, data = santa_tib, subset = xxxxx == "xxxxxx", family = binomial())
# Specify the pudding model
pudding_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Pudding", family = binomial())

# View the model
broom::tidy(xxxx, conf.int = TRUE, exponentiate = XXXX)
pudding_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Pudding", family = binomial())

broom::tidy(pudding_glm, conf.int = TRUE, exponentiate = TRUE) |> 
  knitr::kable(digits = 2)
pudding_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Pudding", family = binomial())

wine_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Mulled wine", family = binomial())
pud_tbl <- broom::tidy(pudding_glm, conf.int = TRUE)
wine_tbl <- broom::tidy(wine_glm, conf.int = TRUE)
pud_exp <- broom::tidy(pudding_glm, conf.int = TRUE, exponentiate = TRUE)
wine_exp <- broom::tidy(wine_glm, conf.int = TRUE, exponentiate = TRUE)
quiz(caption = "Odds ratio check (level 2)",
  question(sprintf("How would you interpret the odds ratio for the effect of **quantity**?"),
    answer("As the quantity increases by 1 standard deviation, deliveries increased by 1 standard deviation"),
    answer("As the quantity increased by 1 unit, deliveries increased by 1 unit"),
    answer("As the quantity increased by 1 unit, the odds of delivery change by 0.92", correct = TRUE),
    answer("As the quantity increase by 1 unit, the log odds of delivery change by 0.92", message = "This is the interpretation of the raw parameter estimate not the exponentiated one"),
    correct = "Correct - well done!",
    random_answer_order = TRUE,
    allow_retry = T
  )
)

The output shows the effect of the quantity of Christmas pudding consumed on delivery of presents. The the odds ratio is close to 1 ($\widehat{OR}$ = r get_par(pud_exp, row = 2)), and the p = r get_par(pud_exp, row = 2, col = "p.value"), which all suggest that quantity has no real effect when the treat consumed is pudding. This mirrors what was shown by the blue line in the plot: the line is fairly flat suggesting that quantity of pudding consumed doesn't affect the probability of delivery.

r alien() Alien coding challenge

Create the models [wine_glm]{.alt} using the code above. View the exponentiated parameter estimates of both models.


# Specify the wine model
wine_glm <- glm(xxxxx ~ xxxxx, data = santa_tib, subset = xxxxx == "xxxxx xxxx", family = binomial())
# Specify the wine model
wine_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Mulled wine", family = binomial())
# View the model
broom::tidy(xxxx, conf.int = TRUE, exponentiate = XXXX)
wine_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Mulled wine", family = binomial())

broom::tidy(wine_glm, conf.int = TRUE, exponentiate = TRUE) |> 
  knitr::kable(digits = 2)

The output shows the effect of the quantity of mulled wine consumed on delivery of presents. The odds ratio is closer to 0 than 1 ($\widehat{OR}$ = r get_par(wine_exp, row = 2)), and the p < 0.001, which suggests that quantity has a negative effect on deliveries: as quantity of mulled wine increases the odds of delivery decrease. This mirrors what was shown by the orange line on the plot we made earlier: as quantity increases the probability of delivery rapidly declines.

The interaction effect, therefore, reflects the fact that the effect of quantity on delivery is significantly different for Christmas pudding and mulled wine. As a rough approximation it means that in the plot, the blue and orange lines have different slopes.

`r pencil()` **Report**`r rproj()` A logistic regression model was fit predicting delivery of presents from the type of treat offered, the quantity of treats and their interaction. Table 2 shows the parameter estimates, their 95% confidence intervals and significance tests. The main effects of Treat and Quantoity were not significant but the interaction was, suggesting that the effect of quantity was moderated by the type of treat (and vice versa). To tease apart the significant interaction, a model predicting delivery from the quantity of treats was fitted separately for Christmas pudding and mulled wine. When the treat was pudding, the quantity of treats had no significant effect on delivery, $\widehat{OR}$ = `r report_pars(pud_exp, row = 2, glm = T)`. However, the treat was mulled wine, the quantity of treats significantly affected on delivery, $\widehat{OR}$ = `r report_pars(wine_exp, row = 2, glm = T)`. Specifically, for each additional treat consumed, the odds of delivery were `r get_par(wine_exp, row = 2)` times the odds of delivery had that additional treat not been consumed. wzxhzdk:79

r alien() Alien coding challenge

As an optional exercise that might help you understand the interaction, use the code box to view the raw parameter estimates for [pudding_glm]{.alt} and [wine_glm]{.alt} to three decimal places. (Hint: use tidy() but setting [exponentiate = F]{.alt}.)

wine_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Mulled wine", family = binomial())
pudding_glm <- glm(delivered ~ quantity, data = santa_tib, subset = treat == "Pudding", family = binomial())

broom::tidy(pudding_glm, conf.int = TRUE) |> 
  knitr::kable(digits = 2)
broom::tidy(wine_glm, conf.int = TRUE) |> 
  knitr::kable(digits = 2)

The interaction effect reflects the fact that the effect of quantity on delivery is significantly different for Christmas pudding and mulled wine. From the outputs, the log odds of delivery for Christmas pudding are r get_par(pud_tbl, row = 2) and for mulled wine are r get_par(wine_tbl, row = 2). The significant interaction means that r get_par(pud_tbl, row = 2) is significantly different from r get_par(wine_tbl, row = 2).

Interestingly, if we calculate the difference between the $\hat{b}$s we get: $−1.11−(−0.08) = −1.03$, which was the parameter estimate for the interaction effect. And if we take the exponent of this value we get the odds ratio for the interaction effect: $e^{-1.03}$ = r sprintf("%.2f", exp(-1.03)).

So, the odds ratio for the interaction is the odds ratio for the difference in the effect of one predictor across levels of the other. In this case, it's the odds ratio for the difference between the effect of quantity of Christmas pudding on delivery and the effect of the quantity of mulled wine on delivery.

r user_visor() Robust logistic regression [(2)]{.alt}

It is also possible to create a robust model using the glmrob() function from the robustbase package. This takes the same form as the glm() function but returns robust estimates of parameters and associated standard errors and p-values.

r robot() Code example

To create a robust variant of the model we would execute:

santa_rob <- robustbase::glmrob(delivered ~ treat*quantity,
                                data = santa_tib,
                                family = binomial())

broom::tidy(santa_rob)

which creates an object [santa_rob]{.alt} that contains robust model parameters. We can again summarise it with broom::tidy() but we cannot use [exponentiate = TRUE]{.alt} with robust models.

r alien() Alien coding challenge

Creates a robust model of present deliveries [santa_rob]{.alt} using the code above and summarise it with broom::tidy().


santa_rob <- robustbase::glmrob(delivered ~ treat*quantity, data = santa_tib, family = binomial())
broom::tidy(santa_rob, conf.int = TRUE) |> 
  knitr::kable(digits = 2)
santa_rob_tbl <- robustbase::glmrob(delivered ~ treat*quantity, data = santa_tib, family = binomial()) |> 
  broom::tidy() |> 
  dplyr::mutate(
    OR = exp(estimate)
  )

Compare the resulting model to the one you created earlier. Basically, the parameter estimates ($\hat{b}$s) don't change much and the conclusions of the model stay largely the same. For example, the parameter estimate for the interaction term was $\hat{b}$ = r get_par(santa_full_tbl, row = 4) for the non-robust model but $\hat{b}$ = r get_par(santa_rob_tbl, row = 4) for the robust model.

As I mentioned, [exponentiate = TRUE]{.alt} doesn't work in tidy() with robust models, but we can manually convert the robust $\hat{b}$ to an odds ratio by remembering that the output of tidy() is tibble. Therefore, we can use mutate to add a column that is the exponent (exp()) of the parameter estimate:

broom::tidy(santa_rob, conf.int = TRUE) |> 
  dplyr::mutate(
    OR = exp(estimate)
  )

This code pipes the output of tidy() into the mutate function where we create a variable called OR which is the exponent of the column called estimate.

r alien() Alien coding challenge

Add the odds ratio to the output of your robust model.

santa_rob <- robustbase::glmrob(delivered ~ treat*quantity, data = santa_tib, family = binomial())      

broom::tidy(santa_rob, conf.int = TRUE) |> 
  dplyr::mutate(
    OR = exp(estimate)
  ) |> 
  knitr::kable(digits = 2)

Scroll to the far right of the table to view the new column in the table that you created (labelled OR). As you'd expect based on the raw parameter estimate, the robust odds ratio of $\widehat{OR}$ = r get_par(santa_rob_tbl, row = 4, col = "OR") is virtually identical to the one from the non-robust model, which was r get_par(santa_full_exp, row = 4). The robust model gives us confidence that we can trust the original model. If the parameters were very different we'd report the robust model.

discovr package hex sticker, female space pirate with gun. Gunsmoke forms the letter R. **A message from Mae Jemstone:** Mae stared at the box. She hadn't slept - the excitement had been too much. As everyone knew knew, December 24th was the one night of the year that the Sanza from the Solstice Nebula left their part of the galaxy to visit the rest of the universe. Mae loved December 25th and the anticipation that the Sanza had visited. As everyone knew, if you had tried to be the best version of 'you', the Sanza left you a gift. You didn't have to have behaved well, or acted perfectly, or made no mistakes. But you did have to try your best. You did have to try to choose love and empathy even when it was hardest to do. Mae always tried to choose love and like every other year of her life, she woke to find a small, plain cardboard cube, stamped with the characteristic *S* of the Sanza. The gift of the Sanza was to truly feel, for one short day, your place in the universe and the ripples your effort creates. By opening the box, every good deed, every positive thought, every attempt to be kind, was reflected back at you. She opened the box and was overwhelmed. Despite everything, it had been a good year.

Resources {data-progressive=FALSE}

Statistics

r rproj()

Acknowledgement

I'm extremely grateful to Allison Horst for her very informative blog post on styling learnr tutorials with CSS and also for sending me a CSS template file and allowing me to adapt it. Without Allison, these tutorials would look a lot worse (but she can't be blamed for my colour scheme).

References



profandyfield/discovr documentation built on May 4, 2024, 4:32 p.m.