In dmbates/RePsychLing: Data sets from Psychology and Linguistics experiments

library(RePsychLing)
library(knitr)
library(rstan)
library(parallel)
library(xtable)
opts_chunk$set(comment=NA)
options(width=92,
        show.signif.stars = FALSE)

Structure of the data

The data from @Kronmuller:Barr:2007 is available as the kb07 data frame in the RePsychLing package for R.

str(kb07)

As is common with factorial designs the experimental factors and their interactions are given short names: a single character (S,P,C) for the main effects, two-character names for the two-factor interactions and a three-character name for the three-factor interaction.

The data are from 56 subjects who responded to 32 iterms. Specifically, subjects had to select one of several objects presented on a monitor with a cursor. The manipulations involved

auditory instructions that maintained or broke a precedent of reference for the objects established over prior trials, (P)
with the instruction being presented by the speaker who established the precedent (i.e., an old speaker) or a new speaker (S), and
whether the task had to be performed without or with a cognitive load consisting of six random digits. (C)

All factors were expressed using a ±1 encoding which ensures that the scale of the interactions is the same as the scale of the main effects and of the intercept term. The columns of the full-factorial model matrix, X

X <- unname(model.matrix(~ 1+S+P+C+SP+SC+PC+SPC, kb07))
attr(X,"assign") <- NULL
str(X)

have the same length and are nearly orthogonal.

crossprod(X)   # X'X

The response, RTtrunc, is the reaction time after truncating some unusually large values. The minimum value is 336 and the maximum 5144 ms.

summary(kb07$RTtrunc)
hist(kb07$RTtrunc,freq=FALSE,xlab="msec",
     main="Distribution of RTtrunc")

Details of the linear mixed models

We fit two models, the maximal model, and the final model that was settled on in the accompanying paper [@BatesEtAlParsimonious]. We attempt to provide more detail than usual in order to help researchers unfamiliar with the matrix notation of linear mixed models; for a more simplified introduction to fitting LMMs in Stan, please see the tutorial article @SorensenVasishth.

Maximal model

The model in matrix form is:

$$\mathrm{y} = X\beta + Z_s b_s + Z_i b_i + \epsilon$$

Here, $X$ is the $N\times p$ model matrix (p=8 since we have eight predictors, including the intercept), $\beta$ is a $p\times 1$ vector of fixed effects parameters, $Z_s$ and $Z_i$ are the subject and item model matrices ($N\times p$), and $b_s$ and $b_i$ are the by-subject and by-item adjustments to the fixed effects estimates; these are often referred to as the Best Linear Unbiased Predictors (BLUPs) in the linear mixed modeling literature. $\epsilon$ refers to the residual error ($N\times 1$).

We assume that $\epsilon \sim N(0,\sigma^2)$, and that $b_s \sim N(0,\Sigma_s)$, and
$b_i \sim N(0,\Sigma_i)$. These are assumed to be mutually independent. $\Sigma_s$ and $\Sigma_i$ are $8\times 8$ variance-covariance matrices for subject and item random effects respectively; the diagonals have the variance estimates for each predictor, and the off-diagonals have the respective pairwise covariances. For example, a $2\times 2$ variance covariance matrix for two random variables $X_1$ and $X_2$ would have the form

$$ \Sigma_u = \left[ \begin{array}{cc} \sigma_1^2 & \rho_{12} \, \sigma_{1} \sigma_{2} \ \rho_{12} \, \sigma_{1} \sigma_{2} & \sigma_{2}^2 \end{array} \right] $$

Here, $\rho_{12} \, \sigma_{1} \sigma_{2}$ is the covariance $Cov(X_1,X_2)$, and $\rho_{12}$ is the correlation between the two random variables. Notice that we can construct a $2\times 2$ correlation matrix for the above covariance matrix as shown below; the diagonals will always have 1's in them as a random variable is perfectly correlated with itself. This becomes relevant in the next subsection on prior specificiation.

$$ \left[ \begin{array}{cc} 1 & \rho_{12} \ \rho_{12} & 1 \end{array} \right] $$

In a Bayesian analysis, we have to specify a prior distribution for each parameter. We discuss this point next.

Prior specification

The parameters in the maximal model are: eight $\beta$ coefficients, the variance of the residual error $\sigma^2$, eight variance components for subject, another eight for item, and 28 subject and 28 item correlations.

We define the following priors:

$\beta$ have a flat uniform improper prior.
$\tau_s \sim Cauchy(0,2.5)$
$\tau_i \sim Cauchy(0,2.5)$
$\sigma$ has a uniform prior distribution with a lower bound of 0.
The prior for the correlations needs some explanation. First, note that a correlation matrix can be decomposed into something analogous to a square root of a matrix, using the Cholesky decomposition. For example, given a matrix like

C<-matrix(c(1,.1,.5,.1,1,-.3,.5,-.3,1),
                byrow=TRUE,ncol=3)
C

We can decompose this matrix as follows:

L<-chol(C)
L

This decomposition gives a kind of square root of the matrix: we can recover the correlation matrix by squaring $L$:

## Gives back original matrix C:
t(L)%*%L

In Stan, we define priors on $L$, the Cholesky decomposition of the correlation matrix, using the so-called LKJ priors, with parameter $\eta=2$ (see the Stan manual for details). Once we have a prior on $L$, we can (a) compute the posterior estimates of the correlation matrix, and (b) generate the adjustments (BLUPs) $b_s$ and $b_i$ shown in the equation above.

The procedure for generating$b_s$ is:

Premultiply the diagonalized by-subject variance vector $\tau_s$ with the Cholesky decomposition $L$ to get a matrix $\Lambda_s$.
Generate values from a random variable $u_s$ that has a N(0,1) distribution.
Multiply $\Lambda_s$ with $u_s$ to obtain the correlated random variables $b_s$. The procedure for generating $b_i$ is analogous.

As an illustration of how this procedure works, consider the simple case where we have two random variables, which implies two variance components. Let the correlation between the random variables be $-0.6$. We can generate five pairs of correlated $b_s$ values as follows:

## given two variance components:
sigmas<-c(1,10)
## and a correlation of -0.6:
corrs<-c(1,-.6)
C<-matrix(c(corrs,corrs[c(2,1)]),
ncol=2,byrow=T)
C
## Cholesky decomposition:
L<-chol(C)
L
## generate 5 pairs of N(0,1) random variables:
z<-rnorm(2*5)
z<-matrix(z,ncol=2)
t(z)
## generate correlated random variables:
sigmas<-sigmas*diag(2)
Lambda<-sigmas%*%L
b_s<-Lambda%*%t(z)
b_s
## check that the random variables have
## the expected correlation:
cor(t(b_s))

This completes the explanation for how priors are defined. We now turn to the implementation details of the Stan model.

Formulating the maximal Stan model

Stan is a system for creating Markov Chain Monte Carlo (MCMC) samplers for statistical models [@stan-manual:2014].

In the model specification the data, including dimensions of arrays, are described first. Consistent with lme4 terminology, we will use the terms fixed effect and random effect, although note that in the Bayesian setting this distinction disappears (this is discussed below).

standat <- '
data {
  int<lower=0>  N; // num observations
  int<lower=1>  K; // length of fixed-effects vector
  int<lower=0>  M; // num subjects
  int<lower=1>  J; // length of subj vector-valued random effects
  int<lower=0>  L; // num items
  int<lower=1>  I; // length of item vector-values random effects
  int<lower=1,upper=M> subj[N]; // subject indicator
  int<lower=1,upper=L> item[N]; // item indicator
  row_vector[K] X[N]; // model matrix for fixed-effects parameters
  row_vector[J] Zs[N]; // generator model matrix for subj random effects
  row_vector[I] Zi[N]; // generator model matrix for item random effects 
  vector[N]     y; // response vector (reaction time)
}
'

The response vector, y, is expressed as a vector of N elements. The integer vectors subj and item are the indicators of subject and of item for each response.

The fixed-effects model matrix, X, the subject-specific random effects model matrix, Zs, and the item-specific random effects model matrix, Zi, are stored as vectors of N row_vector's. The reason for storing the model matrices like this is to provide fast access to individual rows when formulating the model.

For generality the number of columns in these matrices is specified separately as K (number of fixed effects), J (number of subject random effects) and I (number of item random effects). For the maximal model these will all be eight. The numbers of subjects and items are M and L, respectively.

Model parameters

In a Bayesian formulation all the random variables are considered model parameters; there is no distinction between the random effects and the fixed-effects coefficients. As discussed above, the covariance matrices for the within-subject and within-item random effects are specified according to the Cholesky factor of the correlation matrix, a cholesky_factor_corr type and the standard deviations, which are called taus and taui, respectively.

The spherical random effects, often written as $u$, are called us and ui respectively.
These are matrices that are stored as vectors of vectors.

(A note regarding the phrase "spherical random effect": The unconditional distribution of the random variable U is multivariate normal with mean 0 and covariance matrix $\sigma^2I$. Because the contours of constant probability density of such a distribution are spheres centered at the origin, it is called a "spherical normal" distribution and so we call U the "spherical random effects".)

stanpars <- '
parameters {
  cholesky_factor_corr[J] Ls; // Cholesky factor of subj r.e. correlations
  cholesky_factor_corr[I] Li; // Cholesky factor of item r.e. correlations
  vector<lower=0>[J] taus; // standard deviations of unconditional subj r.e. dist
  vector<lower=0>[I] taui; // standard deviations of unconditional item r.e. dist
  vector[J] us[M];     // spherical subj random effects
  vector[I] ui[L];     // spherical item random effects
  vector[K] beta;      // fixed-effects
  real<lower=0> sigma; // standard deviation of response given random effects
}

'

The correlation matrices formed from the Cholesky factors are considered transformed parameters

stantrans <- '
transformed parameters {
  matrix[J,J] corrs;
  matrix[I,I] corri;
  corrs <- tcrossprod(Ls);  // for monitoring subj correlations
  corri <- tcrossprod(Li);  // for monitoring item correlations
}

'

Finally, the model is expressed in terms of square matrices Lambdas and Lambdai that are the Cholesky factors of the covariance matrices for the subject-specific and item-specific random effects, respectively.
The prior distributions for the parameters (taus, taui, Ls, Li, us, and ui) are also specified at this point. No prior distribution is provided for beta, implying that this vector has a "flat" or "locally uniform" improper prior distribution.
It could be given, say, a diffuse multivariate Gaussian distribution, but doing so has no effect on inferences.

stanmod <- '
model {
  matrix[J,J] Lambdas; 
  vector[J] bs[M];
  matrix[I,I] Lambdai; 
  vector[I] bi[L];
  taus ~ cauchy(0,2.5);
  taui ~ cauchy(0,2.5);
  Ls ~ lkj_corr_cholesky(2);
  Li ~ lkj_corr_cholesky(2);
  Lambdas <- diag_pre_multiply(taus,Ls);
  Lambdai <- diag_pre_multiply(taui,Li);
  for (m in 1:M) {
    us[m] ~ normal(0,1);
    bs[m] <- Lambdas * us[m];
  }
  for (l in 1:L) {
    ui[l] ~ normal(0,1);
    bi[l] <- Lambdai * ui[l];
  }
  for (n in 1:N)
    y[n] ~ normal(X[n] * beta + Zs[n] * bs[subj[n]] + Zi[n] * bi[item[n]], sigma);
}

'

In the last loop the linear predictor for y[n] is evaluated as the sum of three products of row_vectors and vectors. A row_vector multiplied by a vector is a scalar, which is the reason for storing X, Zs and Zi as vectors of row_vectors.

These pieces are concatenated to form the model

model <- paste(standat, stanpars, stantrans, stanmod)

Compiling the model

The model is compiled via a call to stan that includes the model and the data. The data must be specified as a list or as an environment.

For the maximal model, X, Zs and Zi are the model matrix for the full factorial design.

maxdat <- 
  within(list(), {
    N <- nrow(X)
    K <- J <- I <- ncol(X)
    M <- length(levels(kb07$subj))
    L <- length(levels(kb07$item))
    X <- Zs <- Zi <- unname(X)
    y <- kb07$RTtrunc
    subj <- as.integer(kb07$subj)
    item <- as.integer(kb07$item)
    }
    )
str(maxdat)

Initially, we set the number of chains to zero to check that the model compiles properly.

maxmodel <- stan(model_name="maxmodel", model_code=model, data=maxdat, chains=0)

Creating the chains.

We use mclapply from the parallel package to generate the chains in parallel.

system.time(KB07_stan <-
  sflist2stanfit(
    mclapply(1:4, mc.cores = 4,    # adjust number of cores to suit 
      function(i) stan(fit = maxmodel, 
                       data = maxdat,
                       iter=2000,
                       chains = 1, 
                       chain_id = i, 
                       refresh = -1))
    )
  )

We see that the elapsed time is considerably less than the user time. This is because four processes are running in parallel.

A close examination of the timing shows that much more time is spent in the "warmup" phase than in actually generating samples. Stan defaults to using a form of Hamiltonian Monte Carlo (HMC) sampling called a "No U-Turn Sampler" (NUTS) and it is tuning these samplers that is taking most of the time.

KB07_results<- summary(KB07_stan,
                       pars=c("beta", "sigma",
                              "taus","taui",
                              "corrs","corri"),
                       probs = c(0.025,  0.975), digits_summary = 3)
rownames(KB07_results$summary)

Note that all of the correlation matrix elements are monitored even though the diagonal elements are, by definition, unity and the matrix must be symmetric. An arduous extraction provides the table

print(xtable(KB07_results$summary[c(1:25,27:33,36:41,45:49,55:57,62:64,81,91:97,100:105),c(1,4,5)]), type="html")

Note that most of the correlations, especially those for the item-specific random effects, have a mean close to zero and upper and lower limits that are more-or-less symmetric. This is strong evidence that these could be zero.

Because of the way the priors are defined the taus and taui values cannot become zero. However, many of these values are very close to zero. The only standard deviations that are substantion are the by-subject intercept and the by-item intercept and coefficient for P.

maxres<-KB07_results$summary[c(1:25,27:33,36:41,45:49,55:57,62:64,81,91:97,100:105),c(1,4,5)]

stanfixefmax<-maxres[1:8,]
stanvarcompmax<-maxres[c(9,10:25),]
stanvarcorrsmax<-maxres[c(26:50),]
stanvarcorrimax<-maxres[c(51:63),]

Final model

The final, reduced model has a single column (the intercept) in Zs and two columns (intercept and main-effect for P) in Zi.

Notice that the model specification is the same as the one used above for the maximal model. The only thing that has changed is the specification in the data of the different variables. In particular, notice that J and L have changed to reflect the number of subject and item random effects, respectively, and the Zs and Zi matrices are reduced versions of the original matrices used in the maximal model above.

finaldat <- 
  within(list(), {
    N <- nrow(X)
    K <- ncol(X)
    J <- 1L
    I <- 2L
    M <- length(levels(kb07$subj))
    L <- length(levels(kb07$item))
    X <- X
    Zs <- X[, 1, drop=FALSE]
    Zi <- X[, 1:2]
    y <- kb07$RTtrunc
    subj <- as.integer(kb07$subj)
    item <- as.integer(kb07$item)
    }
    )
str(finaldat)

system.time(KB07_finalstan <-
  sflist2stanfit(
    mclapply(1:4, mc.cores = 4,    # adjust number of cores to suit 
      function(i) stan(fit = maxmodel, 
                       data = finaldat,
                       iter=2000,
                       chains = 1, 
                       chain_id = i, 
                       refresh = -1))
    )
  )