In aineito/stats.VWP: Analysing visual world eye-tracking data in R

rm(list = ls())
knitr::opts_chunk$set(echo = TRUE)

Set-up

Load the following packages. If you haven't installed them yet, do so first (e.g. install.packages("learnr")). If you haven't installed the stats.VWP package (course materials) yet, install it using this command remotes::install_github("aineito/stats.VWP").

require(learnr)
require(Rmisc)
require(tidyverse)
require(ggplot2)
require(lme4)
require(stats.VWP)

We will also use gazer package for this tutorial. Install and load it using the commands below.

remotes::install_github("dmirman/gazer")

require(gazer)

This is optional: If you run the command below, R will display very large or very small numbers in a plain format (e.g. 100000 instead of 1e+05, 0.000001 instead of 1e-06).
If you like the indices format, skip the command below.

options(scipen=999)

Look at the data

We will use fix.binom.50bin data in the stats.VWP package for this tutorial.
The data is from Ito, Pickering & Corley (2018, JML). Let's look at the summary.

summary(fix.binom.50bin)

This data set contains the following data.

Column |Description :-------------|:---------------------------------------------------------- Subject |Subject ID Trial |Trial number Bin |Time bin ID Time |Time relative to the target word onset (Time -1000 contains 50 ms from the time -1000 ms) Count |Right-eye sample count on the critical object Condition |Condition (Targ=target, Eng=English competitor, Jap=Japanese competitor, Unr=unrelated) Item |Item ID Lang |Language group (L1=native English speakers, L2=native Japanese, non-native English speakers)

The details/description of the variables can also be found in the help page.

?fix.binom.50bin

We will use the data from -800 ms to 0 ms relative to the target word onset, and drop the 'Jap' condition from Condition.

fix.binom.pred = fix.binom.50bin %>% filter(Time >= -800, Time < 0, Condition %in% c('Targ','Eng','Unr')) %>% droplevels()

In R, you want to use an informative name for the data set. Let's use fix.binom.pred for this tutorial.

We need to create a second-order polynomial based on the Time values in the range of the selected time.
We will use the code_poly command from the gazer package.

poly.order=2 indicates that we need a second-order polynomial.
orthogonal=T indicates that we want the polynomial to be orthogonal (if you set it to FALSE, you will get natural polynomial).
draw.poly=T indicates that we want to create a graph showing transformed polynomial predictor values.

fix.binom.pred = code_poly(df=fix.binom.pred, predictor="Time", poly.order=2, orthogonal=T, draw.poly=T)

If we check the data set again, we can see that we now have poly1 and poly2 columns added to the original data set.

summary(fix.binom.pred)

poly1 corresponds to the linear term, and poly2 corresponds to the quadratic term.

Now, let's code our categorical variables.

We want to use dummy-coding (or treatment-coding) for Condition because we want to treat the 'Unr' condition as the baseline condition (reference level).
We will use sum-coding for Lang.

First, let's check the default coding for Condition. The default coding scheme for R is dummy-coding. We can check the contrasts by running the command below:

contrasts(fix.binom.pred$Condition)
contrasts(fix.binom.pred$Lang)

R treats the first level (i.e. 'Targ' and 'L1') as the reference level.
We want to change the matrix so that the 'Unr' condition becomes the reference level.
For sum-coding, we can use contr.sum function. We put the number 2 because we have 2 levels for Lang.

contrasts(fix.binom.pred$Condition) = matrix(c(1,0,0,0,1,0), ncol=2)
contrasts(fix.binom.pred$Lang) = contr.sum(2)

Now, the contrasts look good.

contrasts(fix.binom.pred$Condition)
contrasts(fix.binom.pred$Lang)

Let's doublecheck the data again.

summary(fix.binom.pred)

Fit the model

Now, we're ready to fit the model.

Let's construct a simple random-effects structure model just with the by-subject and by-item intercepts (no random slopes).

This is a GLMM testing an interaction of Condition by Lang and effects of Condition and Lang. It also includes 2-way interactions of Condition by poly1, Condition by poly2, Lang by poly1 and Lang by poly2, and 3-way interactions of Condition by Lang by poly1 and Condition by Lang by poly2.

Note: We will be excluding random slopes here to save time. Hence, the results will be partly inconsistent with what is reported in the paper. When you create a model, you need to choose the best random-effects structure for your design/data.

fix.binom.pred.m.cond.lang = glmer(Count ~ (poly1+poly2) * Condition * Lang + (1|Subject) + (1|Item), family=binomial, data=fix.binom.pred)

Plot the model fit

Before looking at the summary of the model, let's plot the model fit so that we can compare the model output with the graph.

We will add the model fit to the data set we used for the analysis.
The command below will create a new column mfit.

fix.binom.pred$mfit = fitted(fix.binom.pred.m.cond.lang)

ggplot(fix.binom.pred, aes(Time, Count, color=Condition, lty=Condition)) + 
  facet_wrap(~Lang) + theme_bw() + 
  stat_summary(fun=mean,geom="point") +
  stat_summary(aes(y=mfit),fun=mean,geom="line") +
  stat_summary(fun.data=mean_se, geom="pointrange") + 
  labs(y="Fixation Proportion", x="Time relative to target word onset (ms)") + 
  scale_color_manual('Condition',breaks=c('Targ','Eng','Unr'),labels=c("Target","English competitor","Unrelated"),values=c("red","blue","darkgrey")) +
  scale_linetype_manual('Condition',breaks=c('Targ','Eng','Unr'),labels=c("Target","English competitor","Unrelated"),values=c("solid","longdash","twodash")) +
  theme(text=element_text(size=12),legend.key.height=unit(.3,"in"),legend.key.width=unit(.6,"in"))

Good. Now let's check the summary of the model.

summary(fix.binom.pred.m.cond.lang)

At the top, you can find the model's formula and the name of the data set you used.
You can find the effects and interactions of the predictors in the fixed effects section.

Let's look at the effect of Condition first. In the most left column, you can see Condition1 and Condition2. The p-values for both (in the Pr(>|z|) column) are smaller than .05.

Quiz:

question_checkbox("What do they suggest? Hint: Do you remember the contrasts for 'Condition'? (see below)",
  answer("The target object was significantly more likely to be fixated than the English competitor object, and the English competitor object was significantly more likely to be fixated than the unrelated object"),
  answer("The target object and the English competitor object were both significantly more likely to be fixated than the unrelated object", correct=T),
  answer("The fixation proportion difference between the target condition and the unrelated condition was significantly larger than the fixation proportion difference between the English competitor condition and the unrelated condition"),
  allow_retry = T
)

contrasts(fix.binom.pred$Condition)

Now, if we look at the 2-way interactions of Condition by poly1 and Condition by poly2, the p-value for poly1:Condition1 is significant.
This suggests that the effect of condition (Target vs. Unrelated) interacts with the linear term.
If you look at the graph, you can see that the fixation proportion for the Target condition increases over time, whereas the fixation proportion for the Unrelated condition stays around 0.2 (imagine a linear line on these plots).
This time-course difference is similar in L1 and L2 groups (no 3-way interaction, see poly1:Condition1:Lang1).

Additionally, there are significant 2-way interactions of Condition1:Lang1 and Condition2:Lang1 on the intercept term and significant 3-way interactions of poly2:Condition1:Lang1 and poly2:Condition2:Lang1.

We can resolve these interactions by running the model separately for each group.

fix.binom.pred.m.cond.L1 = glmer(Count ~ (poly1+poly2) * Condition + (1|Subject) + (1|Item), family=binomial, data=fix.binom.pred[fix.binom.pred$Lang=='L1',]) # model for L1 group

summary(fix.binom.pred.m.cond.L1)

fix.binom.pred.m.cond.L2 = glmer(Count ~ (poly1+poly2) * Condition + (1|Subject) + (1|Item), family=binomial, data=fix.binom.pred[fix.binom.pred$Lang=='L2',]) # model for L2 group

summary(fix.binom.pred.m.cond.L2)

We can see that there is a significant effect of Condition1 in both groups, but the effect is larger in the L1 group than in the L2 group. Thus, both L1 and L2 groups were more likely to look at the target object over the unrelated object, but this difference was larger in the L1 group than in the L2 group.

The effect of Condition2 is significant in the L1 group but not in the L2 group. Thus, the L1 group was more likely to look at the English competitor object over the unrelated object, but the L2 group was not.

Moving on to the 3-way interactions of poly2:Condition1:Lang1 and poly2:Condition2:Lang1, we see that the poly2:Condition1 interaction is significant in both groups. That is, the curvature for the Target condition is different from that for the Unrelated condition in both groups. The curvature is shallower for the Unrelated condition in the L1 group, whereas it is shallower for the Target condition in the L2 group.

The poly2:Condition2 interaction is significant in the L1 group. The curvature is shallower for the Unrelated condition compared to the English competitor condition. There is no such difference in the L2 group.

Evaluate the need for higher-order polynomial(s)

In the previous section, we fitted the model with linear and quadratic terms.
Let's check if the model with linear and quadratic terms is better than the model without a quadratic term.

To do so, we need to construct another model without a quadratic term.

For this illustration, we will use the L1 group's data. As we can see, we just dropped poly2 from the original model.

fix.binom.pred.m.cond.lang.linear = glmer(Count ~ poly1 * Condition * Lang + (1|Subject) + (1|Item), family=binomial, data=fix.binom.pred)

summary(fix.binom.pred.m.cond.lang.linear)

Let's plot the model fit again.
We will add the model fit with a new name mfit_linear.

fix.binom.pred$mfit_linear = fitted(fix.binom.pred.m.cond.lang.linear)

ggplot(fix.binom.pred, aes(Time, Count, color=Condition, lty=Condition)) + 
  facet_wrap(~Lang) + theme_bw() + 
  stat_summary(fun=mean,geom="point") +
  stat_summary(aes(y=mfit_linear),fun=mean,geom="line") +
  stat_summary(fun.data=mean_se, geom="pointrange") + 
  labs(y="Fixation Proportion", x="Time relative to target word onset (ms)") + 
  scale_color_manual('Condition',breaks=c('Targ','Eng','Unr'),labels=c("Target","English competitor","Unrelated"),values=c("red","blue","darkgrey")) +
  scale_linetype_manual('Condition',breaks=c('Targ','Eng','Unr'),labels=c("Target","English competitor","Unrelated"),values=c("solid","longdash","twodash")) +
  theme(text=element_text(size=12),legend.key.height=unit(.3,"in"),legend.key.width=unit(.6,"in"))

If you compare this graph with the graph plotting a model fit for the model with linear and quadratic terms, you can see that the model fit for the model without a quadratic term looks worse.

Quick Q: How can you tell?

Let's do a model comparison to doublecheck.
We can use the anova() function to compare two (or more) models.

anova(fix.binom.pred.m.cond.lang, fix.binom.pred.m.cond.lang.linear)

The model comparison result suggests that the model with linear and quadratic terms is significantly better than the model without a quadratic term.

Report the results

Once you get the results from the best model, you can write a summary of the findings.

There are two rules of thumb for reporting growth curve analysis results (from Dan Mirman's website):

Clearly describe each of the three key components of the model.

The functional form (third-order orthogonal polynomial), the fixed effects (effect of Condition on all time terms), and the random effects (effect of Subject on each of the time terms and nested effects of Subject-by-Condition on each of the time terms except the cubic). Depending on the circumstances and complexity of the model, you may want to include additional information about the factors and why they were included or not. It's also a good idea to report which method was used for computing p-values.

For key findings, report parameter estimates and standard errors along with significance tests.

In some cases the model comparison is going to be enough, but for key findings, the readers should want to see the parameter estimates. The parameter estimate standard errors are critical for interpreting the estimates, so those should be reported as well. The t-values are not critical to report (they are just Estimate divided by the Std Error, so they can always be computed from the reported estimates and standard errors). If there are many estimated parameters, it may be a good idea to focus the main text discussion on the most important ones and report the full set in a table or appendix.

See here for an example.

Exercise

In this tutorial, our model included linear and quadratic terms, and it looked like this:

fix.binom.pred.m.cond.L1 = glmer(Count ~ (poly1+poly2) * Condition + (1|Subject) + (1|Item), family=binomial, data=fix.binom.pred[fix.binom.pred$Lang=='L1',]) # model for L1 group

Let's assume (hypothetically) you have another dependent variable logFix in the fix.binom.50bin dataset, which is a log-transformed fixation proportion, and you want to test whether logFix is significantly different between the target vs. unrelated conditions and between the English competitor vs. unrelated conditions in L1 speakers in the time window from -800 to 200 ms relative to the target word onset. You also want to include a cubic term in the model, in addition to the linear and quadratic terms.

Write a code for this analysis (but don't run it). Click Solution to see the solution.

fix.pred = fix.binom.50bin %>% filter(Time >= -800, Time < 200, Condition %in% c('Targ','Eng','Unr')) %>% droplevels()

fix.pred = code_poly(df=fix.pred, predictor="Time", poly.order=3, orthogonal=T, draw.poly=T)

fix.pred.m.cond.L1 = lmer(logFix ~ (poly1+poly2+poly3) * Condition + (1|Subject) + (1|Item), data=fix.pred[fix.pred$Lang=='L1',]) 

summary(fix.pred.m.cond.L1)

Solution

For this exercise, you should have changed the following from the tutorial code:

change the time window filter(Time >= -800, Time < 200,
change poly.order specification to include a cubic term poly.order=3
change the model syntax to include the cubic term logFix ~ (poly1+poly2+poly3)
change glmer to lmer and remove family=binomial because your dependent variable is no longer binomial
change the dependent variable to logFix