adventr: growth models

library(forcats)
library(learnr)
library(tidyverse)

library(nlme)

knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
tutorial_options(exercise.cap = "Exercise")

hint_text <- function(text, text_color = "#E69F00"){
  hint <- paste("<font color='", text_color, "'>", text, "</font>", sep = "")
  return(hint)
}

#Read dat files needed for the tutorial

rehab_growth_tib <- adventr::rehab_growth_dat

#setup objects for code blocks
rehab_growth_tib <- rehab_growth_tib %>% 
  mutate(
    id = forcats::as_factor(id),
    intervention = forcats::as_factor(intervention) %>% fct_relevel("Wait list"),
    time = forcats::as_factor(time) %>% fct_relevel("t12", after = Inf)
  )

An Adventure in R: Growth models

Overview

This tutorial is one of a series that accompanies An Adventure in Statistics [@RN10163] by me, Andy Field. These tutorials contain abridged sections from the book so there are some copyright considerations but I offer them under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, ^[Basically you can use this tutorial for teaching and non-profit activities but do not meddle with it or claim it as your own work.]

Story précis

Why a précis?

Because these tutorials accompany my book An adventure in statistics, which uses a fictional narrative to teach the statistics, some of the examples might not make sense unless you know something about the story. For those of you who don't have the book I begin each tutorial with a précis of the story. If you're not interested then fair enough - click past this section.

General context for the story

It is the future. Zach, a rock musician and Alice, a geneticist, who have been together since high school live together in Elpis, the ‘City of Hope’.

Zach and Alice were born in the wake of the Reality Revolution which occurred after a Professor Milton Gray invented the Reality Prism – a transparent pyramid worn on the head – that brought honesty to the world. Propaganda and media spin became unsustainable, religions collapsed, advertising failed. Society could no longer be lied to. Everyone could know the truth about anything that they could look at. A gift, some said, to a previously self-interested, self-obsessed society in which the collective good had been eroded.

But also a curse. For, it soon became apparent that through this Reality Prism, people could no longer kid themselves about their own puffed-up selves as they could see what they were really like – by and large, pretty ordinary. And this caused mass depression. People lost faith in themselves. Artists abandoned their pursuits, believing they were untalented and worthless.

Zach and Alice have never worn a Reality Prism and have no concept of their limitations. They were born after the World Governance Agency (WGA) destroyed all Reality Prisms, along with many other pre-revolution technologies, with the aim of restoring community and well-being. However, this has not been straightforward and in this post-Prism world, society has split into pretty much two factions

Everyone has a star, a limitless space on which to store their digital world.

Zach and Alice are Clocktarians. Their technology consists mainly of:

Main Protagonists

How Zach's adventure begins

Alice has been acting strangely, on edge for weeks, disconnected and uncommunicative, as if she is hiding something and Zach can’t get through to her. Arriving home from band practice, unusually, she already home and listening to an old album that the two of them enjoyed together, back in a simpler, less complicated time in their relationship. During an increasingly testy evening, that involves a discussion with the Head about whether or not a Proteus causes brain cancer, Alice is interrupted by an urgent call which she takes in private. She returns looking worried and is once again, distracted. She tells Zach that she has ‘a big decision to make’. Before going to bed, Zach asks her if he can help with the decision but she says he ‘already has’, thanking him for making ‘everything easier.’ He has no idea what she means and goes to sleep, uneasy.

On waking, Zach senses that something is wrong. And he is right. Alice has disappeared. Her clothes, her possessions and every photo of them together have gone. He can’t get hold of any of her family or friends as their contact information is stored on her Proteus, not on his diePad. He manages to contact the Beimeni Centre but is told that no one by the name of Alice Nightingale has ever worked there. He logs into their constellation but her star has gone. He calls her but finds that her number never existed. She has, thinks Zach, been ‘wiped from the planet.’ He summons The Head but he can’t find her either. He tells Zach that there are three possibilities: Alice has doesn’t want to be found, someone else doesn’t want her to be found or she never existed.

Zach calls his friend Nick, fellow band member and fan of the WGA-installed Repositories, vast underground repositories of actual film, books, art and music. Nick is a Chipper – solely for the purpose of promoting the band using memoryBank – and he puts the word out to their fans about Alice missing.

Thinking as hard as he can, Zach recalls the lyrics of the song she’d been playing the previous evening. Maybe they are significant? It may well be a farewell message and the Head is right. In searching for clues, he comes across a ‘memory stone’ which tells him to read what’s on there. File 1 is a research paper that Zach can’t fathom. It’s written in the ‘language of science’ and the Head offers to help Zach translate it and tells him that it looks like the results of her current work were ‘gonna blow the world’. Zach resolves to do ‘something sensible’ with the report.

Zach doesn’t want to believe that Alice has simply just left him. Rather, that someone has taken her and tried to erase her from the world. He decides to find her therapist, Dr Murali Genari and get Alice’s file. As he breaks into his office, Dr Genari comes up behind him and demands to know what he is doing. He is shaking but not with rage – with fear of Zach. Dr Genari turns out to be friendly and invites Zach to talk to him. Together they explore the possibilities of where Alice might have gone and the likelihood, rating her relationship satisfaction, that she has left him. During their discussion Zach is interrupted by a message on his diePad from someone called Milton. Zach is baffled as to who he is and how he knows that he is currently discussing reverse scoring. Out of the corner of his eye, he spots a ginger cat jumping down from the window ledge outside. The counsellor has to go but suggests that Zach and ‘his new friend Milton’ could try and work things out.

Packages and data

Packages

This tutorial uses the following packages:

These packages are automatically loaded within this tutorial. If you are working outside of this tutorial (i.e. in RStudio) then you need to make sure that the package has been installed by executing install.packages("package_name"), where package_name is the name of the package. If the package is already installed, then you need to reference it in your current session by executing library(package_name), where package_name is the name of the package.

Data

This tutorial has the data files pre-loaded so you shouldn't need to do anything to access the data from within the tutorial. However, if you want to play around with what you have learnt in this tutorial outside of the tutorial environment (i.e. in a stand-alone RStudio session) you will need to download the data files and then read them into your R session. This tutorial uses the following file:

You can load the file in several ways:

The model

At the end of the book it is revealed that Alice used her C-gene therapy to restore the code 1318 workers to a human state. This dataset relates to her second attempt. She obtained data from 141 code 1318 workers measured at four time points (baseline and 1, 6, and 12 month follow-up). Workers were randomly assigned to two arms of the trial (wait list vs. C-gene therapy) and the outcome was how much they resembled their pre-zombie state (as a percentage). Alice predicted that resemblance scores would increase over time in the gene therapy group relative to those on the wait list (because this would indicate that those workers' appearances would more closely resemble their pre-zombification state). The data are in the tibble rehab_growth_tib which has 564 rows (141 participants measured at each of 4 time points) and 5 variables:

The data has a hierarchical structure because resemblance scores are nested within participants (Figure 1). We, therefore, need to model the individual differences in resemblance scores (random intercept) and the variance in the change in resemblance scores over time (random slopes). We also want to include the effect of treatment condition and the interaction it has with the change in resemblance over time.

Figure 1: The structure of thelongitudinal rehabilitation data

The model we're fitting (expressed in composite form) is described by the following equation:

$$ \begin{aligned} \text{resemblance}{ij} = &\big[\gamma{00} + \gamma_{10}\text{time}{ij} + \gamma{01}\text{intervention}i + \gamma{11}(\text{intervention}i \times \text{time}{ij})\big] + \ & \big[\zeta_{0i} +\zeta_{1i}\text{time}{ij} + \epsilon{ij}\big] \end{aligned} $$

Resemblance scores within participants (i) at times (j) are predicted from time, intervention group and their interaction, but we model the following too:

$$ \begin{aligned} \gamma_{00} &= \text{average baseline resemblance when intervention=0 (wait list)} \ \gamma_{10} &= \text{average rate of change in resemblance when intervention=0 (wait list)} \ \gamma_{01} &= \text{baseline difference between wait list and gene therapy} \ \gamma_{11} &= \text{difference in rate of change in resemblance between wait list and gene therapy groups} \ \zeta_{0i} &= \text{deviation of individual's baseline resemblance from group average} \ \zeta_{1i} &= \text{deviation of individual's rate of change in resemblance from group average} \ \epsilon_{ij} &= \text{portion individual’s resemblance score that is unpredicted at time } j \end{aligned} $$

Exploring the data

Preparing the data

The data are already in tidy format, so we don't need to restructure them. Use the code box to inspect the rehab_growth_tib tibble.


rehab_growth_tib   

However, the variables id, intervention, and time currently have the data type of character. This is fine for a lot of what we want to do (because R is fairly intelligent about how it treats character variables). However, for some things (like assigning contrasts) we need these variables to be defined as factors. We can do this by using the mutate() and as_factor() functions, which we have met before. For example, we could execute:

rehab_growth_tib <- rehab_growth_tib %>% 
  dplyr::mutate(
    id = forcats::as_factor(id),
    intervention = forcats::as_factor(intervention),
    time = forcats::as_factor(time)
  )

Try this in the code box and then use the levels() function (e.g., levels(rehab_growth_tib$time)) on the three variables to see how R has assigned 'levels' of the factors.


rehab_growth_tib <- rehab_growth_tib %>% 
  dplyr::mutate(
    id = forcats::as_factor(id),
    intervention = forcats::as_factor(intervention),
    time = forcats::as_factor(time)
  )
levels(rehab_growth_tib$id)
levels(rehab_growth_tib$intervention)
levels(rehab_growth_tib$time)

Note that levels have been assigned as they occur in the data So, for intervention the first level is Wait list and the second is Gene therapy. Similarly, the levels of time have been ordered as t0, t1, t6 and t12. These orders are what we want because we need the times in the correct order and for intervention it makes the interpretation of the model parameter (b) straightforward if the base/first category is the wait list not the gene therapy group. If the levels were ordered differently we could use fct_relevel() from the forcats package to change the order (see adventr_02).

Descriptive statistics

Using what you've learnt in previous tutorials, create a tibble called growth_summary containing the mean resemblance scores (and their confidence intervals) at each time point and within each treatment condition.

r hint_text("Tip: Use the variable called time, for time, and if you're doing this outside of the tutorial remember to load the packages tidyverse and Hmisc")


growth_summary <-  rehab_growth_tib %>%
  dplyr::group_by(intervention, time) %>%
  dplyr::summarize(
    mean = mean(resemblance),
    ci_low = ggplot2::mean_cl_normal(resemblance)$ymin,
    ci_upp = ggplot2::mean_cl_normal(resemblance)$ymax
)
growth_summary              

Plotting the data

Use the code box below to create a plot with time_num on the x-axis and resemblance on the y-axis, and a line that summarizes the linear trend over time for each intervention condition as a separate colour. Some tips to help you out:


growth_plot <- ggplot2::ggplot(rehab_growth_tib, aes(time_num, resemblance, colour = intervention, fill = intervention))
growth_plot +
  geom_point(size = 1, position = position_jitter(width = 0.1, height = 0.1), alpha = 0.6) + 
  geom_smooth(method = "lm", alpha = 0.3) +
  scale_x_continuous(breaks = c(0, 1, 6, 12), labels = c("0", "1", "6", "12")) +
  scale_y_continuous(breaks = seq(0, 90, 10)) +
  coord_cartesian(ylim = c(0, 90), xlim = c(0,12)) + 
  labs(x = "Time from baseline (months)", y = "Resemblance (%)", colour = "Intervention", fill = "Intervention") +
  theme_bw()

Now let's imagine we're treating time as a categorical variable (yuk!). Use the code box below to create a plot with time on the x-axis and resemblance on the y-axis, with error bars representing means and their confidence intervals. Some tips to help you out:


growth_mean_plot <- ggplot2::ggplot(rehab_growth_tib, aes(time, resemblance, colour = intervention))
growth_mean_plot +
  stat_summary(fun.data = "mean_cl_normal", size = 1, position = position_dodge(width = 0.5), alpha = 0.6) +
  scale_y_continuous(breaks = seq(20, 60, 5)) +
  coord_cartesian(ylim = c(20, 60)) + 
  labs(x = "Time from baseline (months)", y = "Resemblance (%)", colour = "Intervention") +
  theme_bw()

Fitting a growth model

Modelling time

Traditionally many people would treat the design of this study as, what's known as, a mixed design. It is so-called because the predictors (independent variables) are a 'mix' of repeated measures and independent measures. Specifically:

The RM-ANOVA approach can be used to analyse these 'mixed' designs. Like the repeated-measures examples we've seen before (e.g., adventr_16_rm) the RM-ANOVA approach is a restricted model in which:

When we're dealing with change over time, it also means that we're treating time points as equally spaced. That is, we treat time as a factor where different time points are different categories. Therefore, in our data we'd have categories of t0, t1, t6 and t12. The fact that t0 and t1 were 1 month apart, but t6 and t12 were 6 months apart is ignored. The categories are assumed to be equally spaced. Clearly it's better if we represent time along a continuum that represents the actual temporal spacing of events than to lose information by converting this to categories. The variable time_num represents time in this way as months from baseline. As such it contains values of 0 (baseline), 1 (1 month form baseline), 6 (6 months from baseline) and 12 (12 months from baseline). Growth models allow us to treat time in this flexible way by using a numeric variable to represent time rather than a factor (i.e. categories).

Modelling time in RStudio

To fit the model we use the lme() function from nlme, because we want to model the dependency between scores over time. This function is explained in the tutorial adventr_mlm, just to recap it takes the following general form (I've retained only the key options):

new_object <- lme(outcome ~ predictors, random = formula, data = tibble, method = "REML", na.action = na.fail)

We might want to test formally whether the effect of time varies by individuals (i.e., a random slope of time), in which case we can specify a random intercept model, then add the random slope, and compare the two. We can do this as:

rehab_ri <- nlme::lme(resemblance ~ time_num, random = ~ 1|id, data = rehab_growth_tib, method = "ML")
rehab_rs <- update(rehab_ri, random = ~ time_num|id)
anova(rehab_ri,rehab_rs)

The first line creates an object called rehab_ri which predicts resemblance scores from the effect of time_num and with intercepts allowed to vary across participants (I appended _ri to the name to remind me it is a random intercepts). The second line uses the update() function to update the model we just created (rehab_ri) to include a random slope of time (we did this by changing the random part of the model to be random = ~ time_num|id). I named this model rehab_rs with the _rs to remind me it is a random slopes model. The final line uses the anova() function to compare the two models. Try executing these commands in the code box.


rehab_ri <- nlme::lme(resemblance ~ time_num, random = ~ 1|id, data = rehab_growth_tib, method = "ML")
rehab_rs <- update(rehab_ri, random = ~ time_num|id)
anova(rehab_ri,rehab_rs)

The output shows that adding the random slope of time_num (i.e., allowing the change in resemblance scores over time to differ across participants) significantly improves the fit of the model, $\chi^2(2) = 36.52, p < 0.0001$. This finding should not surprise us given that we'd expect different changes for zombies who had gene therapy to those on the wait list.

We can also model more complex changes over time, such as a second-order polynomial or quadratic trend (i.e. a curvilinear trend). To do this, we can replace the predictor time_num with poly(time_num, 2). This action creates a power polynomial up to the order of 2 (because we put 2 in the function). In effect this creates two variables that represent linear (order 1) and quadratic (order 2) trends over time. The advantage of this method is that the resulting predictors for the two trends are independent from each other. We can again use update() to create a new model (rehab_quad) that adds the quadratic term to the previous, random-slope, model (rehab_rs):

rehab_quad <- update(rehab_rs, .~ poly(time_num, 2))
anova(rehab_ri,rehab_rs, rehab_quad)

The first line creates an object called rehab_quad using the update() function to update the random slope model (rehab_rs) to include the quadratic trend of time. We did this by changing the formula so that it included the previous outcome but not any of the previous predictors (that's what the .~ does, and note that there is no dot after the tilde, which means that we're excluding all predictors from the model we're updating). We add the linear and quadratic trends using the poly() function (poly(time_num, 2)). I named this model rehab_quad with the _quad to remind me it contains the quadratic trend of time. The second line uses the anova() function to compare the three models we have created. Try executing these commands in the code box.

rehab_ri <- nlme::lme(resemblance ~ time_num, random = ~ 1|id, data = rehab_growth_tib, method = "ML")
rehab_rs <- nlme::lme(resemblance ~ time_num, random = ~  time_num|id, data = rehab_growth_tib, method = "ML")

rehab_quad <- update(rehab_rs, .~ poly(time_num, 2))
anova(rehab_ri,rehab_rs, rehab_quad)

The output shows that adding the quadratic trend of time_num (i.e., allowing a curvilinear change in resemblance scores over time significantly improves the fit of the model, $\chi^2(1) = 8.67, p = 0.0032$. This finding suggests we should retain this curvilinear trend.

Modelling the effect of the intervention

We expect the change over time to be moderated by the intervention condition (we'd predict greater change for those in the gene therapy group than the wait list. In effect we're predicting an interaction between intervention and time_num. To test this interaction we need to add the fixed effect of intervention as a predictor and also its interaction with time_num. Remember that we have two terms representing time so we need to include both of their interactions with intervention. To specify an interaction in R we use a colon. For example, to specify the interaction between time_num and intervention we would use time_num:intervention. We, therefore, need to add these terms to the model:

We could add each term individually and compare to previous models, but this doesn't make sense because to evaluate the interactions we need the main effect of intervention in the model, so let's add them all in one hit.

rehab_mod <- update(rehab_quad, .~. + intervention + poly(time_num, 2):intervention)
anova(rehab_ri,rehab_rs, rehab_quad, rehab_mod)

The first line creates an object called rehab_mod using the update() function to update the quadratic trend model (rehab_quad) to include the effects of intervention and its interactions with time. We did this by changing the formula so that it included all previous outcomes and predictors (that's what the .~. does) but adding in the new effects (+ intervention + poly(time_num, 2):intervention). I named this model rehab_mod. The second line uses the anova() function to compare the three models we have created. Try executing these commands in the code box.

rehab_ri <- nlme::lme(resemblance ~ time_num, random = ~ 1|id, data = rehab_growth_tib, method = "ML")
rehab_rs <- nlme::lme(resemblance ~ time_num, random = ~  time_num|id, data = rehab_growth_tib, method = "ML")
rehab_quad <- nlme::lme(resemblance ~ poly(time_num, 2), random = ~  time_num|id, data = rehab_growth_tib, method = "ML")

rehab_mod <- update(rehab_quad, .~. + intervention + poly(time_num, 2):intervention)
anova(rehab_ri,rehab_rs, rehab_quad, rehab_mod)

The output shows that adding the main effect of intervention and its interactions with the trends over time (i.e., allowing the change in resemblance scores over time to be moderated by the treatment condition) significantly improves the fit of the model, $\chi^2(3) = 49.37, p < 0.0001$. We can inspect the model by using anova() to get F-statistics for each fixed effect, summary() to see the model parameters and their significance tests, and intercals() to get the confidence intervals for the parameters for the fixed effects. We have used these functions in adventr_mlm.

anova(rehab_mod)
summary(rehab_mod)
intervals(rehab_mod, which = "fixed")

Try these commands in the code box.

rehab_mod <- nlme::lme(resemblance ~ poly(time_num, 2) + intervention + poly(time_num, 2):intervention, random = ~  time_num|id, data = rehab_growth_tib, method = "ML")

anova(rehab_mod)
summary(rehab_mod)
intervals(rehab_mod, which = "fixed")

The results of the anova() command, show significant effects for the overall linear and quadratic change over time, F(2, 419) = 7.30, p = 0.0008 and the interaction between intervention condition and the change over time, F(1, 419) = 28.03, p < 0.001. The main effect of intervention was not significant, F(1, 419) = 0.22, p = 0.641. As such the main hypothesis was supported that the change over time would be different for those in the gene therapy group compared to the wait list. The graph (see earlier) tells us that resemblance scores increased over time in the gene therapy group by decreased for the wait list.

The model summary and confidence intervals shows that:

Optional learning exercise

This section is entirely unnecessary other than as a learning exercise to demonstrate how the multilevel model approach to repeated measures designs is a more flexible version of the analysis of variance approach (which, I'll refer to as RM-ANOVA). Skip to the end if you're not interested.

Anyway, in case, you come across a mixed design and you're unconvinced by using a multilevel model (or you have a small sample) I'll show you how to analyse the current example using RM-ANOVA. We'd execute (trust me):

rehab_aov <- aov(resemblance ~ time*intervention + Error(id), data = rehab_growth_tib)
summary(rehab_aov)

Note that we use the aov() function (which is essentially the lm() function but spews the model information out in a more F-statistic based way). We specify the model as resemblance ~ time*intervention, which means that we're predicting resemblance scores from the effects of time, intervention and the interaction between the two time:intervention. We also include + Error(id) in the formula to tell R to compute an error term from within the id variable. Execute the code in the box and look at the output.

To fit the same model using a multilevel approach let's start with a random slopes model. To mimic what the ANOVA does (i.e. treat time as a categorical variable) we'll use time rather than time_num (which we used in our growth models). We'd define this model as in the code box:

rehab_lme <- nlme::lme(resemblance ~ time*intervention, random = ~ intervention|id, data = rehab_growth_tib)
anova(rehab_lme)

Execute the code in the box and see that the results of the F-statistics and their significance values are not too dissimilar to the RM-ANOVA model, but not exactly the same. The growth model allows the effect of intervention to vary across participants, whereas RM-ANOVA does not. So, let's restrict the model so that it retains a random intercept (overall recall can vary across participants) but not a random slope (the effect of intervention is constant across participants). We achieve this restriction by changing random = ~ intervention|id to random = ~ 1|id (I've also added the suffix _ri for random intercept):

rehab_lme_ri <- nlme::lme(resemblance ~ time*intervention, random = ~ 1|id, data = rehab_growth_tib)
anova(rehab_lme_ri)

Execute the code and see that the results are basically the same as the RM-ANOVA now (probably because there wasn't a major deviation from the assumption of compound symmetry). The second restriction is to assume that covariances across conditions are equal. We can add this restriction by including the statement correlation = corCompSymm(form = ~ intervention|id) in the model above. This argument sets the covariance structure to have compound symmetry across the intervention variable (which is nested within participants). I'll change the suffix to this model to _cs (for compound symmetry):

rehab_lme_cs<- nlme::lme(resemblance ~ time*intervention, random = ~ 1|id, data = rehab_growth_tib, correlation = corCompSymm(form = ~ intervention|id))
anova(rehab_lme_cs)

Execute the code and see that in this case this additional restriction hasn't affected the F-statistics and their significance values, which again match the RM-ANOVA.

Other resources

Statistics

R

References



Try the adventr package in your browser

Any scripts or data that you put into this service are public.

adventr documentation built on July 1, 2020, 11:50 p.m.