library(learnr) library(tidyverse) library(forcats) library(robust) library(effects) knitr::opts_chunk$set(echo = FALSE, warning = FALSE) tutorial_options(exercise.cap = "Exercise") hint_text <- function(text, text_color = "#E69F00"){ hint <- paste("<font color='", text_color, "'>", text, "</font>", sep = "") return(hint) } #Read dat files needed for the tutorial recovery_tib <- adventr::recovery_dat #setup objects for code blocks
recovery_tib <- recovery_tib %>% mutate( id = forcats::as_factor(id), treatment = forcats::as_factor(treatment) %>% forcats::fct_relevel("No toggle switch"), recovered = forcats::as_factor(recovered) %>% forcats::fct_relevel("Not recovered") ) recovery_freq <- recovery_tib %>% dplyr::group_by(treatment, dose, recovered) %>% dplyr::summarize(n = n()) %>% dplyr::group_by(treatment, dose) %>% mutate( rel_freq = n / sum(n), perc = rel_freq*100 ) recovery_mod<- glm(recovered ~ treatment*dose, data = recovery_tib, family = binomial())
This tutorial is one of a series that accompanies An Adventure in Statistics [@RN10163] by me, Andy Field. These tutorials contain abridged sections from the book so there are some copyright considerations but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, ^[Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.]
Because these tutorials accompany my book An adventure in statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don't have the book I begin each tutorial with a précis of the story. If you're not interested then fair enough - click past this section.
It is the future. Zach, a rock musician and Alice, a geneticist, who have been together since high school live together in Elpis, the ‘City of Hope’.
Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.
But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.
Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions
Everyone has a star, a limitless space on which to store their digital world.
Zach and Alice are Clocktarians. Their technology consists mainly of:
Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. Arriving home from band practice, unusually, she already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, that involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.
On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice has doesn’t want to be found, someone else doesn’t want her to be found or she never existed.
Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.
Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.
Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.
This tutorial uses the following packages:
effects
[@RN11414] to get plots for logistic regression modelsforcats
[@RN11418] for processing categorical variablesHmisc
[@RN11417] to compute confidence intervalstidyverse
[@RN11407] for general data processingThese packages are automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing install.packages("package_name")
, where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing library(package_name)
, where package_name is the name of the package.
This tutorial has the data files pre-loaded so you shouldn't need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:
You can load the file in several way:
recovery_tib
by executing:recovery_tib <- readr::read_csv("../data/ais_ch03_ha.csv")
recovery_tib <- readr::read_csv("http://www.discoveringstatistics.com/repository/ais_data/ais_c03_ha.csv")
At the end of the story it is revealed that Alice used her C-gene therapy to restore the code 1318 workers to a human state. This dataset relates to her final attempt. It contains data from 400 code 1318 workers. Workers were randomly assigned to two arms of a trial (C-gene therapy and C-gene therapy with a toggle switch). They were also given one of five different doses of C-gene: 0 (no dose) or 1, 2, 3, or 4 doses. The outcome was whether (or not) the worker had fully recovered after the treatment. The data are in the tibble recovery_tib
, which has five variables:
The model we're fitting is described by the following equation:
$$ \begin{aligned} p(Y_i) &= \frac{1}{1 + e^{-(b_0 + b_1X_{1i} + b_2X_{2i} + b_3X_{1i}X_{2i})}} \ p(\text{recovered}_i) &= \frac{1}{1 + e^{-(b_0 + b_1\text{treatment}_i + b_2\text{dose}_i + b_3\text{treatment}\times\text{dose}_i)}} \ \end{aligned} $$
The data are already in tidy format, so we don't need to restructure them. Use the code box to inspect the recovery_tib
tibble.
recovery_tib
However, the variables id, treatment, and recovered currently have the data type of character. This is fine (because R is fairly intelligent about how it treats character variables). However, it does obscure how dummy coding will be applied. It's generally good practice to be explicit about which category you want to act as a baseline (i.e. be coded as zero). One way to achieve this is to convert these variables to factors and relevel them such that the baseline category is first. In previous tutorials we have done this using the factor()
function, but for variety, we can also use the as_factor()
function from the forcats
package (which has a more tidyverse flavour). We also want to specify the baseline category for each factor. We can combine the tasks using a pipe:
library(forcats); library(tidyverse) #Not nessesary inside this tutorial recovery_tib <- recovery_tib %>% dplyr::mutate( id = forcats::as_factor(id), treatment = forcats::as_factor(treatment) %>% forcats::fct_relevel("No toggle switch"), recovered = forcats::as_factor(recovered) %>% forcats::fct_relevel("Not recovered") )
Outside of the tutorial we'd load the libraries forcats
and tidyverse
. The first line creates recovery_tib
from itself. We then use the mutate()
function to over-write each categorical variable with a version of itself that is converted to a factor. Within mutate()
, the first line re-creates id from itself but as a factor. The remaining lines do the same for treatment and recovered but also use fct_relevel()
to specify the category to be used as the baseline (first) category. In the case of treatment we're asking to use the No toggle switch condition as the first level. Finally, we do much the same for recovered specifying "Not recovered" as the first category. (Be careful to check the spelling and use of upper case letters when you specify the first category - it needs to exactly match what is in the variable.)
Try the above code in the box and then use the levels()
function (e.g., levels(recovery_tib$recovered)
) on treatment and recovered to see how R has assigned 'levels' of the factors.
recovery_tib <- recovery_tib %>% dplyr::mutate( id = forcats::as_factor(id), treatment = forcats::as_factor(treatment) %>% forcats::fct_relevel("No toggle switch"), recovered = forcats::as_factor(recovered) %>% forcats::fct_relevel("Not recovered") ) levels(recovery_tib$treatment) levels(recovery_tib$recovered)
Note that levels have been assigned as requested: "No toggle switch" is the first category for treatment and "Not recovered" is the base category for recovered.
Our outcome variable is categorical so there is little sense in looking at means. Instead we want to look at frequencies, or counts (that is, how many cases fall into different categories). We can do this using the xtabs()
function built into R. It has the following form:
xtabs(~ row_variable + column_variable + table_variable, data = tibble)
Essentially we list any variables that we want to decompose the data by, and specify the tibble containing the data. The order in which we list variables determines whether they form rows, columns or different tables. For example, to look at the table of frequencies for the variables of recovered and treatment we'd execute:
recovery_xtab <- xtabs(~ treatment + recovered, data = recovery_tib)
The result would be a 2 × 2 contingency tables with the rows representing different levels of treatment and the columns the different levels of recovered. We could create 5 different versions of this 2 × 2 table, one for each level of dose, by including it as a third variable in the formula:
recovery_full_xtab <- xtabs(~ treatment + recovered + dose, data = recovery_tib)
Try both of these commands in the code box, and inspect the results. You could try changing the order of variables to see how this affects which variables are presented in the rows and columns of the tables.
recovery_xtab <- xtabs(~ treatment + recovered, data = recovery_tib) recovery_full_xtab <- xtabs(~ treatment + recovered + dose, data = recovery_tib) recovery_xtab recovery_full_xtab
A different approach is to use the dplyr
functions from tidyverse
and a pipe. This is good practice for seeing the power of the pipe, and allows us to compute relative frequencies (which are useful to plot). However, it is quite complicated so feel free to skip ahead. I'll show you the pipe and then explain it line by line:
recovery_freq <- recovery_tib %>% dplyr::group_by(treatment, dose, recovered) %>% dplyr::summarize(n = n()) %>% dplyr::group_by(treatment, dose) %>% dplyr::mutate( rel_freq = n / sum(n), perc = rel_freq*100 )
Let's look at each line:
recovery_freq
. We begin by passing the recovery_tib
tibble into the pipe.summarize()
function to create a variable called n using the function n()
. Because in the previous line we grouped the data by all combinations of variables, line 3, in effect counts the number of observations in each of the 20 cells. In effect, up to this point we have calculated the cell frequencies for all combinations of treatment, dose and recovered. It can be instructive to execute the pipe only up to this point to see what has been done so far.sum(n)
. What is sum(n)
? It is all the values of n added up. However, because in the previous line we grouped by treatment and dose, the sum will be computed separately for each combination of treatment and dose. In other words, sum(n)
adds together the value of n for the recovered cells and the value of n for the Not recovered cell. The new variable freq is, therefore, expressing each value of n as the relative frequency of recovered to not recovered cases. When you inspect the final product, you'll see, for example, when the dose was 0 and there was no toggle switch, there were 27 not recovered cases and 3 recovered. The sum of these (sum(n)
) will be 27 + 3 = 30. The resulting frequencies are for recovered cases 27/30 = 0.9, or 90%, and 3/30 = 0.1, or 10%, for recovered cases. In other words, for those with a 0 dose of C-gene and no toggle switch 10% recovered but 90% did not.mutate()
function.As I said, it can be useful to execute parts of the pipe to see how it builds up, or even to break it into separate smaller pipes to help you understand what it is doing. Use the code box to execute the pipe and inspect the resulting tibble.
r hint_text("Tip: If you're doing this outside of the tutorial remember to load the package tidyverse (or dplyr)")
recovery_freq <- recovery_tib %>% dplyr::group_by(treatment, dose, recovered) %>% dplyr::summarize(n = n()) %>% dplyr::group_by(treatment, dose) %>% dplyr::mutate( rel_freq = n / sum(n), perc = rel_freq*100 ) recovery_freq
You should see a tibble with 20 rows representing the different combinations of treatment, dose and recovered, a variable (n) containing their cell counts, a variable called rel_freq expressing the relative frequency of recovered to non-recovered cases for each combination of treatment and dose, and a variable perc that expressed that relative frequency as a percentage.
The first plot we'll do uses the original data and plots the number recovered over the top of the number not recovered. We'll plot the dose along the x-axis and create different panels for the two treatment conditions. This is the code:
recovery_plot <- ggplot2::ggplot(recovery_tib, aes(dose, fill = recovered, colour = recovered)) recovery_plot + geom_histogram(binwidth = 1, position = "identity", alpha = 0.6) + scale_colour_manual(values = c("#DF4738", "#009E73")) + facet_wrap(~ treatment) + labs(x = "Dose of C-gene", y = "Frequency", fill = "Recovery status", colour = "Recovery status") + theme_bw()
Let's explore this command line by line:
recovery_plot
using the ggplot()
function. Within this function we specify the data as the tibble called recovery_tib
, then within aes()
we set dose along the x-axis, and then set the fill and colour to both vary by the variable recovered. This will ensure we get different coloured/filled bars for recovered and non-recovered cases.recovery_plot
object.facet_wrap()
to create separate plots for the toggle switch and no toggle switch conditions.fill = "Recovery status", colour = "Recovery status"
and making their labels identical we get a single legend representing both the fill and colour.Use the code box to draw this plot. Try deleting lines and observing the effect it has on the plot.
r hint_text("Tip: If you're doing this outside of the tutorial remember to load the tidyverse (or ggplot2) package")
recovery_plot <- ggplot2::ggplot(recovery_tib, aes(dose, fill = recovered, colour = recovered)) recovery_plot + geom_histogram(binwidth = 1, position = "identity", alpha = 0.6) + scale_colour_manual(values = c("#DF4738", "#009E73")) + facet_wrap(~ treatment) + labs(x = "Dose of C-gene", y = "Frequency", fill = "Recovery status", colour = "Recovery status") + theme_bw()
An alternative plot that I used in the lecture was one that plotted the percentage recovered by dose and treatment. To do this plot we need to use the recovery_freq
object we created earlier which has the percentages stored. However, we want to ignore all of the non-recovered cases (because we're just plotted the percentage recovered). First, then, we might create a new tibble recover_percent
that is based on the tibble we created earlier that contains the relative frequencies and percentages (recovery_freq
) but filters out the not recovered cases by including only cases where the variable recovered has the value "Recovered":
recover_percent <- recovery_freq %>% filter(recovered == "Recovered")
We plot data from this tibble:
percent_plot <- ggplot2::ggplot(recover_percent, aes(dose, perc, colour = treatment)) percent_plot + geom_line(size = 1) + geom_point(size = 3, alpha = 0.7) + scale_colour_manual(values = c("#DF4738", "#009E73")) + coord_cartesian(ylim = c(0, 100)) + scale_y_continuous(breaks = seq(0, 100, 10)) + labs(x = "Dose of C-gene", y = "Percent recovered", colour = "Treatment group") + theme_bw()
Let's explore this command line by line:
percent_plot
using the ggplot()
function. Within this function we specify the data as the tibble called recover_percent
, then within aes()
we set dose along the x-axis, and then set the colour to vary by the variable treatment. This will ensure we get different coloured lines/points for the toggle switch and no toggle switch groups.percent_plot
object.coord_cartesian()
to set the limits of the y-axis to 0 and 100 (because we're plotting percentages).scale_y_continuous()
to set the breaks on the y-axis to be a sequence from 0 to 100 in steps of 10 (i.e. 0, 10, 20, 30 40 ...).fill = "Recovery status", colour = "Recovery status"
and making their labels identical we get a single legend representing both the fill and colour.Use the code box to draw this plot. Try deleting lines and observing the effect it has on the plot.
r hint_text("Tip: Remember that you need to create the recover_percent tibble first")
recover_percent <- recovery_freq %>% dplyr::filter(recovered == "Recovered") percent_plot <- ggplot2::ggplot(recover_percent, aes(dose, perc, colour = treatment)) percent_plot + geom_line(size = 1) + geom_point(size = 3, alpha = 0.7) + scale_colour_manual(values = c("#DF4738", "#009E73")) + coord_cartesian(ylim = c(0, 100)) + scale_y_continuous(breaks = seq(0, 100, 10)) + labs(x = "Dose of C-gene", y = "Percent recovered", colour = "Treatment group") + theme_bw()
glm()
functionTo fit the model we use the function glm()
which stands for 'generalized linear model'. The function takes the same form as lm()
, which we have used many times before, except that it also allows you to specify a specific distribution for the errors in the model. When the outcome is a dichotomy (categorical with two categories) we need to specify the family as binomial()
giving us a general form of the function as follows:
glm(outcome ~ predictor(s), data, subset, family = binomial(), na.action = na.fail)
Let's remind ourselves of these arguments:
outcome ~ predictor(s)
: the way we specify the model is basically to write out the model equation but ignoring the b values and using a tilde (~) instead of =. For example, if we want the model to predict recovered from the variable treatment we write the model as recovered ~ treatment
. If, we want to add dose into the mix as a second predictor then we could fit this model by specifying recovered ~ treatment + dose
. If we want to include the interaction term as well, we could specify recovered ~ treatment + dose + treatment:dose
or use the equivalent expression of recovered ~ treatment*dose
. The model specification maps onto the equation describing the model we're fitting.data
: you use this argument to specify the tibble containing the data (e.g., data = recovery_tib
). If you're using glm()
within a pipe where you've previously input a tibble then remember you use data = .
subset
: this argument allows you to specify a condition by which you subset data (a bit like you might use in the filter()
function). For example, we will fit separate models for the toggle switch and no toggle switch groups using subset = treatment == "Toggle switch"
. Of course, you can also achieve the same thing using a pipe (recovery_tib %>% treatment == "Toggle switch" %>% glm(recovered ~ treatment*dose, data = ., family = binomial())
)na.action
: this option determines how to treat missing values. By default it is set to na.fail
which means that the model fails (it is not fit). If you have missing values then set this option to na.omit
which removes cases, or na.exclude
which excluded cases from the model (but does not remove them). See adventr_14 for more detail.family = guassian()
determines the distribution of errors. By default the family is set to guassian()
(a posh word for 'normal') and the function will fit a standard linear model. However, we have changed it to binomial()
because we want a logistic linear model.We could fit our model by executing:
recovery_mod<- glm(recovered ~ treatment*dose, data = recovery_tib, family = binomial())
and summarize it using the summary()
function to get the model parameters and anova()
to get deviance statistics for each step of the model. Try these commands in the code box below.
recovery_mod<- glm(recovered ~ treatment*dose, data = recovery_tib, family = binomial()) summary(recovery_mod) anova(recovery_mod)
We can see that the main effect of treatment, b = -0.20, SE = 0.52, z = -0.38, p = 0.701, was not significant, nor was the main effect of dose, b = 0.08, SE = 0.17, z = 0.48, p = 0.629. However, the treatment × dose interaction was significant, b = 1.03, SE = 0.23, z = 4.45, p < 0.001. The raw b-values are difficult to interpret because they tell us about the change in the log odds of the outcome associated with a unit change in the predictor. People's brains just don't work in terms of log odds. Odds are hard enough. Log odds, forget it. Instead, it is better to convert these model parameters to odds ratios, which we'll do in a minute. But first ...
The deviance statistics (the -2LL) can be used to tell us whether adding predictors to the model improves the fit. If the residual deviance is decreasing then the fit is improving. The column in the output labelled Deviance shows the change in the residual deviance at each step. For example, the residual deviance for the null model is 529.25, when treatment is added as a predictor this decreases to 460.49, which is a change of 529.25 - 460.49 = 68.76. The corresponding change in the residual degrees of freedom is 399 - 398 = 1, which is the value in the column Df. When dose is added the residual deviance decreases to 410.70, which is a change of 460.49 - 410.7 = 49.79 with a corresponding degrees of freedom of 1 (398 - 397). And so on for the interaction term.
We can test whether these changes are significant using the pchisq(x, df, lower.tail = TRUE)
function, which returns the value of x from the probability density function of the chi-square distribution with degrees of freedom equal to df. In other words, you enter into the function a chi-square value, and a value for the degrees of freedom, and (by default) it returns the probability of obtaining a value as large as the value entered or smaller (i.e. the lower tail of the distribution, hence lower.tail = TRUE
). Remember that the p-value is the probability of getting a value as big as the one observed or larger so to convert the output of pchisq()
to a p-value we need to include lower.tail = FALSE
to return the upper tail of the distribution.
The question is, what values do we enter into the function? Notice that the object returned by anova(recovery_mod)
is a table with a column called Deviance containing the change in the deviance statistic at each stage of the model, and a column called Df
that contains the corresponding degrees of freedom. We can use these columns as the x and df that we input into pchisq()
. We could do this in a single pipe:
model_dev <- recovery_mod %>% anova() %>% tibble::as_tibble() %>% dplyr::mutate( chi_p = pchisq(Deviance, Df, lower.tail = FALSE) ) model_dev
First we feed the recovery_mod
object into the anova()
function to get the table of deviances, which we then feed into the as_tibble()
function (from the tibble
package within tidyverse
) to convert the table to a tibble. We then use mutate()
from dplyr
to add a column to this tibble. I call this column chi_p
and the values in it are created by entered the value from the Deviance variable (already in the tibble) as the chi-square value, the corresponding value from the Df variable (also already in the tibble) as the df value, and asking for the probability from the upper tail by setting lower.tail = FALSE
. The result will be the same table that the anova()
function returned but with an additional column containing the p-values. I store the resulting tibble in an object called model_dev
. Try the code above in the code box.
model_dev <- recovery_mod %>% anova() %>% tibble::as_tibble() %>% dplyr::mutate( chi_p = pchisq(Deviance, Df, lower.tail = FALSE) ) model_dev
To convert the model parameters to odds ratios we can use the function exp()
, which returns the exponent, $e^x$ (the reverse of the natural log). The resulting value tells us about the change in the odds (rather than the log odds) of the outcome associated with a unit change in the predictor. In other words it returns the odds ratio for the predictor. As with many things in R there are a few ways we could go about this. Before we get into this, here are the functions we will use:
coef()
: returns the b-values for a model (i.e. it returns the values in the column labelled Estimate in the output from summary()
confint()
: returns the confidence intervals for a model (we've used this function before in adventr_14).cbind()
or bind_cols()
: both bind columns togetherPossibly the quickest way to get the odds ratios and their confidence intervals would be to execute something like this (you could do this as a single line of code by nesting the right-hand side of the first line inside exp()
but I'm trying to avoid too many of functions nested within other functions):
b_ci <- cbind(coef(recovery_mod), confint(recovery_mod)) exp(b_ci)
The first line creates an object called b_ci
which used coef()
to extract the b-values from recovery_mod
, and confint()
to get the corresponding confidence intervals, and then binds these columns together (to effectively create a table). The second line returns the exponent of all of the values in b_ci
. In other words it will return the odds ratios and their confidence intervals. Try this out in the code box:
b_ci <- cbind(coef(recovery_mod), confint(recovery_mod)) exp(b_ci)
An alternative might be to do something a little more 'tidyverse' and use a pipe. Some people find this easier to get to grips with than nested functions Again, we could do the below as a single pipe, but I've broken it down to make things simpler:
b_values <- coef(recovery_mod) %>% as_tibble() b_ci <- confint(recovery_mod) %>% as_tibble() b_values %>% bind_cols(b_ci) %>% exp() %>% rename("odds ratio" = value)
The first line creates an object called b_values
by using coef()
to extract the b-values from recovery_mod
and then converting the results to a tibble using as_tibble()
. The second line does the same to create a tibble called b_ci
by using confint()
to extract the 95% confidence intervals for the b-values from recovery_mod
. The final line takes the object b_values
and binds the columns from b_ci
to it, which gives us a table containing the values of the bs and their confidence intervals. These are then passed into exp()
to convert them to odds ratios. The b-values are stored in a variable called value so to make the tibble a little more friendly we use rename()
to rename the variable value as odds ratio. Try this out in the code box:
b_values <- coef(recovery_mod) %>% as_tibble() b_ci <- confint(recovery_mod) %>% as_tibble() b_values %>% bind_cols(b_ci) %>% exp() %>% rename("odds ratio" = value)
If the value of the odds ratio is greater than 1 then it indicates that as the predictor increases, the odds of the outcome occurring increase. Conversely, a value less than 1 indicates that as the predictor increases, the odds of the outcome occurring decrease. The important thing about the confidence intervals is whether they cross 1. This is important because values greater than 1 mean that as the predictor variable increases, so do the odds of (in this case) being cured. Values less than 1 mean the opposite: as the predictor variable increases, the odds of being cured decrease. Assuming our confidence intervals are from the 95% of samples that contain the population value, then if the interval contains 1 it means that the population value could reflect an effect in one direction, but could also reflect an effect in the opposite direction, or even no effect at all (an odds ratio of 1).
The odds ratio for the main effect of treatment is 0.82 [0.30, 2.30], which means that the odds of a patient who has a toggle switch being cured are 0.82 times higher than those of a patient who did not get the toggle switch. This value is close to 1 indicating that the odds of being cured are basically the same in the no toggle switch or toggle switch groups.
quiz( question(sprintf("How would you interpret the odds ratio for the effect of **dose**?"), answer("As the dose increase by 1 standard deviation, recovery increased by 1 standard deviation"), answer("As the dose increase by 1 unit, recovery increased by 1 unit"), answer("As the dose increase by 1 unit, the odds of recovery change by 1.08", correct = TRUE), answer("As the dose increase by 1 unit, the log odds of recovery change by 1.08"), correct = "Correct - well done!", random_answer_order = TRUE, allow_retry = T ) )
For the interaction, which was significant, the odds ratio is 2.79 [1.79, 4.43], which means that the odds ratiob*-value for dose is 2.79 times higher in the intervention group than the control. To see what this value of 2.79 represents, let's use the subset argument to run the same model separately for the no toggle switch or toggle switch groups.
no_toggle_mod <- glm(recovered ~ dose, data = recovery_tib, family = binomial(), subset = treatment == "No toggle switch") toggle_mod <- glm(recovered ~ dose, data = recovery_tib, family = binomial(), subset = treatment == "Toggle switch") no_tog_dose_or <- no_toggle_mod %>% coef() %>% exp() tog_dose_or <- toggle_mod %>% coef() %>% exp() tog_dose_or/no_tog_dose_or
The first line creates an object called no_toggle_mod
which predicts recovery from only dose and uses subset = treatment == "No toggle switch"
to restrict the model to the group that received no toggle switch. Line 2 does the same but creates an object called toggle_mod
to fit the same model to the group that did receive a toggle switch.
The next two lines create objects called no_tog_dose_or
and tog_dose_or
respectively, which take the models created above, extract the b-values using coef()
and then converts them to odds ratios by applying exp()
. So within these objects we have the odds ratios for the effect of dose in the no toggle switch group and the toggle switch group. If you inspect the objects you will find that in the no toggle switch group, the odds ratio for dose is 1.084, and in the toggle switch group it is 3.030. The final line takes the odds ratio for the toggle switch group and divides it by the odds ratio for the no toggle switch group. The result is 3.030/1.084 = 2.79, the odds ratio for the interaction! So, the interaction odds ratio tells us that the odds ratio for dose is 2.79 times larger in the toggle switch group than the no toggle switch group. Try the above code in the code box:
no_toggle_mod <- glm(recovered ~ dose, data = recovery_tib, family = binomial(), subset = treatment == "No toggle switch") toggle_mod <- glm(recovered ~ dose, data = recovery_tib, family = binomial(), subset = treatment == "Toggle switch") no_tog_dose_or <- no_toggle_mod %>% coef() %>% exp() tog_dose_or <- toggle_mod %>% coef() %>% exp() tog_dose_or/no_tog_dose_or
We can plot the interaction effect using the allEffects()
function from the package effects
[@RN11414]. The code box shows a pipe to make the plot. First we put the model we created earlier recovery_mod
into the allEffects()
function to create the data we want to plot, we then push this data into the plot()
function. The resulting plot shows that in the no toggle switch group the effect of dose is basically non-existent (the line is flat) which we know already from the odds ratio of 1 that we computed in the previous section. In the toggle switch group, however, the effect of dose is positive (the line slopes upwards to indicate that higher dose is associated with a greater probability of recovery) which we also knew from the odds ratio of 3 that we computed in the previous section.
library(effects) recovery_mod %>% effects::allEffects() %>% plot()
library(effects) recovery_mod %>% effects::allEffects() %>% plot()
It is also possible to create a robust model using the glmRob()
function from the robust
package. This takes the same form as the glm()
function but returns robust estimates of parameters and associated standard errors and p-values. To create a robust variant of the model we would execute:
recovery_rob <- robust::glmRob(recovered ~ treatment*dose, data = recovery_tib, family = binomial())
which creates an object recovery_rob
that contains robust model parameters. We can summarise it with summary()
. Try this in the code box below and compare the resulting model to the one you created earlier. Basically, the parameter estimates (bs) don't change much and the conclusions of the model stay largely the same. Having generated a robust model you would, of course, compute odds ratios (but not confidence intervals). Try doing this too by adapting the code you used earlier.
recovery_rob<- robust::glmRob(recovered ~ treatment*dose, data = recovery_tib, family = binomial()) summary(recovery_rob) coef(recovery_rob) %>% exp()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.