Mixed Model Reanalysis of RT data"

req_suggested_packages <- c("emmeans", "multcomp", 
                            "dplyr", "tidyr","ggplot2")
pcheck <- lapply(req_suggested_packages, requireNamespace, 
                 quietly = TRUE)
if (any(!unlist(pcheck))) {
   message("Required package(s) for this vignette are not available/installed and code will not be executed.")
   knitr::opts_chunk$set(eval = FALSE)
op <- options(width = 90, dplyr.summarise.inform = FALSE)
load(system.file("extdata/", "output_mixed_vignette.rda", package = "afex"))


This documents reanalysis response time data from an Experiment performed by Freeman, Heathcote, Chalmers, and Hockley (2010) using the mixed model functionality of afex implemented in function mixed followed by post-hoc tests using package emmeans (Lenth, 2017). After a brief description of the data set and research question, the code and results are presented.

Description of Experiment and Data

The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants presented in Freeman et al. (2010). The 300 items in each stimulus condition were selected to form a balanced $2 \times 2$ design with factors neighborhood density (low versus high) and frequency (low versus high). The task was a between subjects factor: 25 participants worked on the lexical decision task and 20 participants on the naming task. After excluding erroneous responses each participants responded to between 135 and 150 words and between 124 and 150 nonwords. We analyzed log RTs which showed an approximately normal picture.

Data and R Preparation

We start with loading some packages we will need throughout this example. For data manipulation we will be using the dplyr and tidyr packages from the tidyverse. A thorough introduction to these packages is beyond this example, but well worth it, and can be found in 'R for Data Science' by Wickham and Grolemund. For plotting we will be using ggplot2, also part of the tidyverse.

After loading the packages, we will load the data (which comes with afex), remove the errors, and take a look at the variables in the data.

library("afex") # needed for mixed() and attaches lme4 automatically.
library("emmeans") # emmeans is needed for follow-up tests 
library("multcomp") # for advanced control for multiple testing/Type 1 errors.
library("dplyr") # for working with data frames
library("tidyr") # for transforming data frames from wide to long and the other way round.
library("ggplot2") # for plots
theme_set(theme_bw(base_size = 15) + 
                  panel.grid.major.x = element_blank()))

data("fhch2010") # load 
fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors
str(fhch2010) # structure of the data

To make sure our expectations about the data match the data we use some dplyr magic to confirm the number of participants per condition and items per participant.

## are all participants in only one task?
fhch %>% group_by(id) %>%
  summarise(task = n_distinct(task)) %>%
  as.data.frame() %>% 
  {.$task == 1} %>%

## participants per condition:
fhch %>% group_by(id) %>%
  summarise(task = first(task)) %>%
  ungroup() %>%
  group_by(task) %>%
  summarise(n = n())

## number of different items per participant:
fhch %>% group_by(id, stimulus) %>%
  summarise(items = n_distinct(item)) %>%
  ungroup() %>%
  group_by(stimulus) %>%
  summarise(min = min(items), 
            max = max(items), 
            mean = mean(items))

Before running the analysis we should make sure that our dependent variable looks roughly normal. To compare rt with log_rt within the same figure we first need to transform the data from the wide format (where both rt types occupy one column each) into the long format (in which the two rt types are combined into a single column with an additional indicator column). To do so we use tidyr::pivot_longer. Then we simply call ggplot with geom_histogram and facet_wrap(vars(rt_type)) on the new tibble. The plot shows that log_rt looks clearly more normal than rt, although not perfectly so. An interesting exercise could be to rerun the analysis below using a transformation that provides an even better 'normalization'.

fhch_long <- fhch %>% 
  pivot_longer(cols = c(rt, log_rt), names_to = "rt_type", values_to = "rt")
ggplot(fhch_long, aes(rt)) +
  geom_histogram(bins = 100) +
  facet_wrap(vars(rt_type), scales = "free_x")

Descriptive Analysis

The main factors in the experiment were the between-subjects factor task (naming vs. lexdec), and the within-subjects factors stimulus (word vs. nonword), density (low vs. high), and frequency (low vs. high). Before running an analysis it is a good idea to visually inspect the data to gather some expectations regarding the results. Should the statistical results dramatically disagree with the expectations this suggests some type of error along the way (e.g., model misspecification) or at least encourages a thorough check to make sure everything is correct. We first begin by plotting the data aggregated by-participant.

In each plot we plot the raw data in the background. To make the individual data points visible we use ggbeeswarm::geom_quasirandom and alpha = 0.5 for semi-transparency. On top of this we add a (transparent) box plot as well as the mean and standard error.

agg_p <- fhch %>% 
  group_by(id, task, stimulus, density, frequency) %>%
  summarise(mean = mean(log_rt)) %>%

ggplot(agg_p, aes(x = interaction(density,frequency), y = mean)) +
  ggbeeswarm::geom_quasirandom(alpha = 0.5) +
  geom_boxplot(fill = "transparent") +
  stat_summary(colour = "red") +
  facet_grid(cols = vars(task), rows = vars(stimulus))

Now we plot the same data but aggregated across items:

agg_i <- fhch %>% group_by(item, task, stimulus, density, frequency) %>%
  summarise(mean = mean(log_rt)) %>%

ggplot(agg_i, aes(x = interaction(density,frequency), y = mean)) +
  ggbeeswarm::geom_quasirandom(alpha = 0.3) +
  geom_boxplot(fill = "transparent") +
  stat_summary(colour = "red") +
  facet_grid(cols = vars(task), rows = vars(stimulus))

These two plots show a very similar pattern and suggest several things:

Model Setup

To set up a mixed model it is important to identify which factors vary within which grouping factor generating random variability (i.e., grouping factors are sources of stochastic variability). The two grouping factors are participants (id) and items (item). The within-participant factors are stimulus, density, and frequency. The within-item factor is task. The 'maximal model' (Barr, Levy, Scheepers, and Tily, 2013) therefore is the model with by-participant random slopes for stimulus, density, and frequency and their interactions and by-item random slopes for task.

It is rather common that a maximal model with a complicated random effect structure, such as the present one, does not converge successfully. The best indicator of this is a "singular fit" warning. A model with a singular fit warning should not be reported or used. Instead, one should make sure that qualitatively the same results are also observed with a model without singular fit warnings. If the maximal model that does not converge and a reduced model without a singular fit warning (i.e., the final model) diverge in their results, results should only be interpreted cautiously.

In case of a singular fit or another indicator of a convergence problem, the usual first step is removing the correlations among the random terms. In our example, there are two sets of correlations, one for each random effect grouping variable. Consequently, we can build four model that have the maximal structure in terms of random-slopes and only differ in which correlations among random terms are calculated:

  1. With all correlations.
  2. No correlation among by-item random effects (i.e., no correlation between random intercept and task random slope).
  3. No correlation among by-participant random effect terms (i.e., no correlation among random slopes themselves and between the random slopes and the random intercept).
  4. No correlation among either random grouping factor.

The next decision to be made is which method to use for obtaining $p$-values. The best control against anti-conservative results is provided by method = "KR" (=Kenward-Roger). However, KR needs quite a lot of RAM, especially with complicated random effect structures and large data sets. In this case we have both, relatively large data (i.e., many levels on each random effect, especially the item random effect) and a complicated random effect structure. Consequently, it seems a reasonable decision to choose another method. The second 'best' method (in terms of controlling for Type I errors) is the 'Satterthwaite' approximation, method = 'S'. It provides a similar control of Type I errors as the Kenward-Roger approximation and needs less RAM. The Satterthwaite method is currently also the default method for calculating $p$-values so does not need to be set explicitly.

Finding the Final Model

The following code fits the four models using the Satterthwaite method. To suppress random effects we use the || notation. Note that it is necessary to set expand_re = TRUE when suppressing random effects among variables that are entered as factors and not as numerical variables (all independent variables in the present case are factors). Also note that mixed automatically uses appropriate contrast coding if factors are included in interactions (contr.sum) in contrast to the R default (which is contr.treatment). To make sure the estimation does not end prematurely we set the allowed number of function evaluations to a very high value (using lmerControl).

However, because fitting the models in R might take quite a while, you should also be able to load the fitted binary files them from this url and then load them in R with load("freeman_models.rda").

m1s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task|item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)))
message("boundary (singular) fit: see ?isSingular")
m2s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task||item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
message("boundary (singular) fit: see ?isSingular")
m3s <- mixed(log_rt ~ task*stimulus*density*frequency +
               (task|item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
message("boundary (singular) fit: see ?isSingular")
m4s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task||item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
message("boundary (singular) fit: see ?isSingular")

We can see that fitting each of these models emits the "singular fit" message (it is technically a message and not a warning) indicating that the model is over-parameterized and did not converge successfully. In particular, this message indicates that the model is specified with more random effect parameters than can be estimated given the current data.

It is instructive to see that even with a comparatively large number of observations, 12960, a set of seven random slopes for the by-participant term and one random slope for the by-item term cannot be estimated successfully. And this holds even after removing all correlations. Thus, it should not be surprising that similar sized models regularly do not converge with smaller numbers of observations. Furthermore, we are here in the fortunate situation that each factor has only two levels. A factor with more levels corresponds to more parameters of the random effect terms.

Before deciding what to do next, we take a look at the estimated random effect estimates. We do so for the model without any correlations. Note that for afex models we usually do not want to use the summary method as it prints the results on the level of the model coefficients and not model terms. But for the random effects we have to do so. However, we are only interested in the random effect terms so we only print those using summary(model)$varcor.

cat(outp_m4s_vc$output, sep = "\n")

The output shows that the estimated SDs of the random slopes of the two-way interactions are all zero. However, because we cannot remove the random slopes for the two way interaction while retaining the three-way interaction, we start by removing the three-way interaction first.

m5s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task||item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
message("boundary (singular) fit: see ?isSingular")
cat(outp_m5s_vc$output, sep = "\n")

Not too surprisingly, this model also produces a singular fit. Inspection of the estimates shows that the two-way interaction of the slopes are still estimated to be zero. So we remove those in the next step.

m6s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task||item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), 
             expand_re = TRUE)
message("boundary (singular) fit: see ?isSingular")

This model still shows a singular fit warning. The random effect estimates below show a potential culprit.

cat(outp_m6s_vc$output, sep = "\n")

As in m4s above, the random effect SD for the density term is estimated to be zero. Thus, we remove this as well in the next step.

m7s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task||item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)

This model finally does not emit a singular fit warning. Is this our final model? Before deciding on this, we see whether we can add the correlation terms again without running into any problems. We begin by adding the correlation to the by-participant term.

m8s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task||item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)
warning(fit_m8s$warnings, call. = FALSE)

This model does not show a singular fit message but emits another warning. Specifically, a warning that the absolute maximal gradient at the final solution is too high. This warning is not necessarily critical (i.e., it can be a false positive), but can also indicate serious problems. Consequently, we try adding the correlation between the by-item random terms instead:

m9s <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task|item), fhch, 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), expand_re = TRUE)

This model also does not show any warnings. Thus, we have arrived at the end of the model selection process.

Results of Maximal and Final Model

We now have the following two relevant models.

Robust results are those that hold regardless across maximal and final (i.e., reduced) model. Therefore, let us compare the pattern of significant and non-significant effects.

left_join(nice(m1s), nice(m9s), by = "Effect", 
          suffix = c("_full", "_final")) 
cat(outp_comb_anova$output, sep = "\n")

What this shows is that the pattern of significant and non-significant effect is the same for both models. The only significant effect for which the evidence is not that strong is the 3-way interaction between stimulus:density:frequency. It is only just below .05 for the full model and has a somewhat lower value for the final model.

We can also see that one of the most noticeable differences between the maximal and the final model is the number of denominator degrees of freedom. This is highly influenced by the random effect structure and thus considerable larger in the final (i.e., reduced) model. The difference in the other statistics is lower.

LRT Results

It is instructive to compare those results with results obtained using the comparatively 'worst' method for obtaining $p$-values implemented in afex::mixed, likelihood ratio tests. Likelihood ratio-tests should in principle deliver reasonable results for large data sets. A common rule of thumb is that the number of levels for each random effect grouping factor needs to be large, say above 50. Here, we have a very large number of items (600), but not that many participants (45). Thus, qualitative results should be the very similar, but it still is interesting to see exactly what happens. We therefore fit the final model using method='LRT'.

m9lrt <- mixed(log_rt ~ task*stimulus*density*frequency + 
               (task|item), fhch, method = "LRT", 
             control = lmerControl(optCtrl = list(maxfun = 1e6)), 
             expand_re = TRUE)
cat(outp_m9lrt$output, sep = "\n")

The results in this case match the results of the Satterthwaite method. With lower numbers of levels of the grouping factor (e.g., less participants) this would not necessarily be expected.

Summary of Results

Fortunately, the results from all models converged on the same pattern of significant and non-significant effects providing a high degree of confidence in the results. This might not be too surprising given the comparatively large number of total data points and the fact that each random effect grouping factor has a considerable number of levels (above 30 for both participants and items). In the following we focus on the final model using the Satterthwaite method, m9s.

In terms of the significant findings, there are many that seem to be in line with the descriptive results described above. For example, the highly significant effect of task:stimulus:frequency with $F(1, 578.91) = 124.16$, $p < .001$, appears to be in line with the observation that the frequency effect appears to change its sign depending on the task:stimulus cell (with nonword and naming showing the opposite patterns than the other three conditions). Consequently, we start by investigating this interaction further below.

Follow-Up Analyses

Before investigating the significant interaction in detail it is a good idea to remind oneself what a significant interaction represents on a conceptual level; that one or multiple of the variables in the interaction moderate (i.e., affect) the effect of the other variable or variables. Consequently, there are several ways to investigate a significant interaction. Each of the involved variables can be seen as the moderating variables and each of the variables can be seen as the effect of interest. Which one of those possible interpretations is of interest in a given situation highly depends on the actual data and research question and multiple views can be 'correct' in a given situation.

In addition to this conceptual issue, there are also multiple technical ways to investigate a significant interaction. One approach not followed here is to split the data into subsets according to the moderating variables and compute the statistical model again for the subsets with the effect variable(s) as remaining fixed effect. This approach, also called simple effects analysis, is, for example, recommended by Maxwell and Delaney (2004) as it does not assume variance homogeneity and is faithful to the data at each level. The approach taken here is to simply perform the test on the estimated final model. This approach assumes variance homogeneity (i.e., that the variances in all groups are homogeneous) and has the added benefit that it is computationally relatively simple. In addition, it can all be achieved using the framework provided by emmeans (Lenth, 2017).

task:stimulus:frequency Interaction

Our interest in the beginning is on the effect of frequency by task:stimulus combination. So let us first look at the estimated marginal means of this effect. In emmeans parlance these estimated means are called 'least-square means' because of historical reasons, but because of the lack of least-square estimation in mixed models we prefer the term estimated marginal means, or EMMs for short. Those can be obtained in the following way. To prevent emmeans from calculating the df for the EMMs (which can be quite costly), we use asymptotic dfs (i.e., $z$ values and tests). emmeans requires to first specify the variable(s) one wants to treat as the effect variable(s) (here frequency) and then allows to specify condition variables.

emm_options(lmer.df = "asymptotic") # also possible: 'satterthwaite', 'kenward-roger'
emm_i1 <- emmeans(m9s, "frequency", by = c("stimulus", "task"))
message("NOTE: Results may be misleading due to involvement in interactions")
cat(emm_o1$output, sep = "\n")

The returned values are in line with our observation that the nonword and naming condition diverges from the other three. But is there actual evidence that the effect flips? We can test this using additional emmeans functionality. Specifically, we first use the pairs function which provides us with a pairwise test of the effect of frequency in each task:stimulus combination. Then we need to combine the four tests within one object to obtain a family-wise error rate correction which we do via update(..., by = NULL) (i.e., we revert the effect of the by statement from the earlier emmeans call) and finally we select the holm method for controlling for family wise error rate (the Holm method is uniformly more powerful than the Bonferroni).

update(pairs(emm_i1), by = NULL, adjust = "holm")
message("NOTE: Results may be misleading due to involvement in interactions")
cat(emm_o2$output, sep = "\n")

We could also use a slightly more powerful method than the Holm method, method free from package multcomp. This method takes the correlation of the model parameters into account. However, results do not differ much here:

summary(as.glht(update(pairs(emm_i1), by = NULL)), test = adjusted("free"))
cat(emm_o3$output, sep = "\n")

We see that the results are exactly as expected. In the nonword and naming condition we have a clear negative effect of frequency while in the other three conditions it is clearly positive.

We could now also use emmeans and re-transform the estimates back onto the original RT response scale. For this, we can again update the emmeans object by using tran = "log" to specify the transformation and then indicating we want the means on the response scale with type = "response". These values might be used for plotting.

emm_i1b <- update(emm_i1, tran = "log", type = "response", by = NULL)
cat(emm_o4$output, sep = "\n")

A more direct approach for plotting the interaction is via afex_plot. For a plot that is not too busy it makes sense to specify across which grouping factor the individual level data should be aggregated. We use the participant variable "id" here. We also use ggbeeswarm::geom_quasirandom as the geom for the data in the background following the example in the afex_plot vignette.

afex_plot(m9s, "frequency", "stimulus", "task", id = "id",
          data_geom = ggbeeswarm::geom_quasirandom, 
          data_arg = list(
            dodge.width = 0.5,  ## needs to be same as dodge
            cex = 0.8,
            color = "darkgrey"))

task:stimulus:density:frequency Interaction

As the last example, let us take a look at the significant four-way interaction of task:stimulus:density:frequency, $F(1, 578.77) = 11.72$, $p < .001$. Here we might be interested in a slightly more difficult question namely whether the density:frequency interaction varies across task:stimulus conditions. If we again look at the figures above, it appears that there is a difference between low:low and high:low in the nonword and lexdec condition, but not in the other conditions.

Looking at the 2-way interaction of density:frequency by the task:stimulus interaction can be done using emmeans using the joint_test function. We simply need to specify the appropriate by variables and get conditional tests this way.

joint_tests(m9s, by = c("stimulus", "task"))
cat(emm_o5$output, sep = "\n")

This test indeed shows that the density:frequency interaction is only significant in the nonword and lexdec condition. Next, let's see if we can unpack this interaction in a meaningful manner. For this we compare low:low and high:low in each of the four groups. And just for the sake of making the example more complex, we also compare low:high and high:high.

To do so, we first need to setup a new set of EMMs. Specifically, we get the EMMs of the two variables of interest, density and frequency, using the same by specification as the joint_test call. We can then setup custom contrasts that tests our hypotheses.

emm_i2 <- emmeans(m2s, c("density", "frequency"), by = c("stimulus", "task"))
cat(emm_o6$output, sep = "\n")

These contrasts can be specified via a list of custom contrasts on the EMMs (or reference grid in emmeans parlance) which can be passed to the contrast function. The contrasts are a list where each element should sum to one (i.e., be a proper contrast) and be of length equal to the number of EMMs (although we have 16 EMMs in total, we only need to specify it for a length of four due to conditioning by c("stimulus", "task")).

To control for the family wise error rate across all tests, we again use update(..., by = NULL) on the result to revert the effect of the conditioning.

# desired contrats:
des_c <- list(
  ll_hl = c(1, -1, 0, 0),
  lh_hh = c(0, 0, 1, -1)
update(contrast(emm_i2, des_c), by = NULL, adjust = "holm")
cat(emm_o7$output, sep = "\n")

In contrast to our expectation, the results show two significant effects and not only one. In line with our expectations, in the nonword and lexdec condition the EMM of low:low is smaller than the EMM for high:low, $z = -5.63$, $p < .0001$. However, in the nonword and naming condition we found the opposite pattern; the EMM of low:low is larger than the EMM for high:low, $z = 3.53$, $p = .003$. For all other effects $|z| < 1.4$, $p > .98$. In addition, there is no difference between low:high and high:high in any condition.



Try the afex package in your browser

Any scripts or data that you put into this service are public.

afex documentation built on April 30, 2022, 1:05 a.m.