knitr::opts_chunk$set(
  collapse = TRUE,
  out.width = "100%",
  fig.width = 5,
  fig.height = 3,
  dpi = 144
)
set.seed(8675309) # Jenny, I've got your number

This tutorial will work through how to simulate multilevel data from a cross-classified design, and use this simulated data to calculate power.

Setup

We'll use four packages. The tidyverse is a collection of packages (we'll be using functions from dplyr, ggplot2 and purrr), and afex is just being used as a wrapper for lme4 that gives you p-values.

library(tidyverse)   # for data wrangling, pipes, and good dataviz
library(afex)        # for mixed effect models with p-values (uses lme4)
library(broom.mixed) # for getting tidy data tables from mixed models
library(faux)        # for simulating correlated variables

options(digits = 4, scipen = 10)

Simulate Data

We'll generate data for a Stroop task where people (sub) say the colour of colour words (stim) shown in each of two versions (congruent and incongruent). Subjects are in one of two conditions (easy and hard). The dependent variable is reaction time (rt).

We expect people to have faster reaction times for congruent stimuli than incongruent stimuli (main effect of version) and to be faster in the easy condition than the hard condition (main effect of condition). We'll look at some different interaction patterns below.

Random Samples

First, we need to set up the number and pattern of the random samples. You may choose the number of subjects or stimuli based on convention or availability. If your power analyses are being used to determine the required sample size for your desired power, start with numbers that seem reasonable to you and we can vary them later. Set these as variables so they are easy to change in only one place in the code.

# numbers
sub_n  <- 200 # number of subjects in this simulation
stim_n  <- 50 # number of stimuli in this simulation

Set up the random sample using the add_random() function.

You can simulate both nested and crossed designs, such as students nested in grades and grades crossed with districts. You can name the random levels (e.g., district) or just give an integer (e.g., grade) and they will be named automatically. Nested groups can be given a vector of integers (or names) to have a different number in each "nest", such as the students below, with 1 student in A1, 2 in A2, 3 in B1 and 4 in B2.

add_random(district = c("A", "B")) %>%
  add_random(grade = 2) %>%
  add_random(student = c(1, 2, 3, 4), 
             .nested_in = c("district", "grade"))

Our design is cross-classified, with each subject seeing each stimulus.

dat <- add_random(sub = sub_n) %>%
  add_random(stim = stim_n)

Fixed Effects

What are your factors, and are they between or within each of your random groups? Condition is a between-subjects factor, as each subject in the study is only in one condition. It is within-stimuli, though, as the same stimuli are used in the easy and hard conditions in this design (although we could have made a different design decision). Version is a within-subjects and within-stimulus factor; all subjects see both congruent and incongruent stimuli, and all stimuli are shown in both congruent and incongruent forms.

# between-subjects, within-stimuli factor
cond_levels <- c("easy", "hard")

# within-subjects, within-stimuli factor
vers_levels <- c("con", "incon")

Now add in the fixed effects. There are separate functions for between and within factors. For a between factor, you need to specify which random group it is between (with the .by argument).

If we had chosen a different design with different words in the easy and hard conditions, so that condition was both a between-subjects and a between-stimuli factor, you would add a between factor for both subjects and stimuli, then filter the result to keep only the observations (rows) where they match. You can use this method to set up counterbalanced versions, too.

add_random(sub = 4) %>%
  add_random(stim = 4) %>%
  add_between(.by = "sub", cond_sub = cond_levels) %>%
  add_between(.by = "stim", cond_stim = cond_levels) %>%
  filter(cond_sub == cond_stim)

Our design, however, has a between-subject factor of condition and a within-subject/stimulus factor of version.

dat <- add_random(sub = sub_n) %>%
  add_random(stim = stim_n) %>%
  add_between(.by = "sub", cond = cond_levels) %>%
  add_within(vers = vers_levels)

head(dat)

Contrast Coding

Now that we have columns for the fixed effects, we need to add the contrast codes. See the faux contrast coding vignette for further explanation of different coding schemes.

Your coding scheme is absolutely crucial to simulating the data correctly and interpreting the results correctly. You need to use the same coding for pilot analyses that generate parameters as for the data simulation.

For 2-level variables, the most common options are treatment coding (0 and 1), sum coding (-1 and +1), or ANOVA-coding (-0.5 and +0.5). If your hypothesised effects are meant to be positive, set the predicted lower level to the smaller code and the predicted higher level to the larger code.
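
To see what these codes look like, here's a minimal sketch (the column names cond.tr, cond.sum and cond.aov are arbitrary, and I'm assuming add_contrast() accepts "treatment" and "sum" as well as "anova" for the contrast argument):

add_random(sub = 2) %>%
  add_between(.by = "sub", cond = c("easy", "hard")) %>%
  # three common codings for the same 2-level factor
  add_contrast(col = "cond", contrast = "treatment", colnames = "cond.tr") %>%
  add_contrast(col = "cond", contrast = "sum",       colnames = "cond.sum") %>%
  add_contrast(col = "cond", contrast = "anova",     colnames = "cond.aov")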

People most commonly treatment-code factors with 3 or more levels, which generates one fewer coded columns than there are levels, each representing a non-baseline level. My advice is to avoid factors with more than 2 levels as much as you can.

The add_contrast() function's default coding scheme is ANOVA coding, which lets you interpret the main effects and interactions similarly to an ANOVA. The new coded version will be saved as a new column (or columns in the case of factors with 3+ levels) with names that reflect how you should interpret or calculate them, but you can change this with the colnames argument. Here, we'll set the coded versions to cond.a and vers.a.

dat <- add_random(sub = sub_n) %>%
  add_random(stim = stim_n) %>%
  add_between(.by = "sub", cond = cond_levels) %>%
  add_within(vers = vers_levels) %>%
  add_contrast(col = "cond", 
               contrast = "anova", 
               colnames = "cond.a") %>%
  add_contrast(col = "vers", 
               contrast = "anova", 
               colnames = "vers.a")

head(dat)

Random Effects

Random effects describe how the fixed effects vary among the sampled random groups. For example, each subject will have slightly faster or slower reaction times on average; this is their random intercept. We'll sample it from a normal distribution with a mean of 0 and an SD of 100ms, so we set sub_i_sd = 100 below.

You can get these parameters from existing or pilot data. Depending on your study design, these can be fairly complex. For now, you can just use the values below. We'll learn how to extract values from pilot analyses later.

sub_i_sd   <- 100 # SD for the subjects' random intercept
stim_i_sd  <- 50  # SD for the stimuli's random intercept
error_sd <- 25  # residual (error) SD

Now we can simulate the random effects. The specific random effects will change for each new sample of subjects and stimuli, so we can only describe how they tend to vary, and sample their values from a normal distribution with a mean of 0 and the specified SD.

You need a separate add_ranef() function for each random group, so here we need one for subjects and one for stimuli. For our first model, we're just going to add random intercepts for subjects and stimuli (we'll get to random slopes later). You also need to add a separate add_ranef() for the residual error.

Every time you run this code, the subject and stimulus random intercepts will be re-sampled, so their values will differ between simulations (unless you set a seed), but will be sampled from the same population distribution.

dat <- add_random(sub = sub_n) %>%
  add_random(stim = stim_n) %>%
  add_between(.by = "sub", cond = cond_levels) %>%
  add_within(vers = vers_levels) %>%
  add_contrast(col = "cond", 
               contrast = "anova", 
               colnames = "cond.a") %>%
  add_contrast(col = "vers", 
               contrast = "anova", 
               colnames = "vers.a") %>%
  add_ranef(.by = "sub", sub_i = sub_i_sd) %>%
  add_ranef(.by = "stim", stim_i = stim_i_sd) %>%
  add_ranef(error = error_sd)

head(dat)

Effect Sizes

The fixed effect sizes can be set from pilot data, meta-analyses, or by setting a smallest effect size of interest.

# fixed effects
grand_i       <- 400 # overall mean DV
cond_eff      <- 50  # mean difference between conditions: hard - easy
vers_eff      <- 50  # mean difference between versions: incongruent - congruent
cond_vers_eff <-  0  # interaction between version and condition

Now we're ready to use all of these values to calculate the DV. Our DV is reaction time (rt) and is the sum of all the intercepts (grand_i + sub_i + stim_i), plus the coded condition (cond.a) multiplied by the condition effect (cond_eff), plus the coded version (vers.a) multiplied by the version effect (vers_eff), plus cond.a times vers.a times cond_vers_eff, plus the residual (trial-level) error.

Use the mutate() function to create this new column.

dat <- add_random(sub = sub_n) %>%
  add_random(stim = stim_n) %>%
  add_between(.by = "sub", cond = cond_levels) %>%
  add_within(vers = vers_levels) %>%
  add_contrast(col = "cond", 
               contrast = "anova", 
               colnames= "cond.a") %>%
  add_contrast(col = "vers", 
               contrast = "anova", 
               colnames= "vers.a") %>%
  add_ranef(.by = "sub", sub_i = sub_i_sd) %>%
  add_ranef(.by = "stim", stim_i = stim_i_sd) %>%
  add_ranef(error = error_sd) %>%
  mutate(rt = 
    grand_i + sub_i + stim_i + # intercepts
    (cond_eff) * cond.a + # condition effect
    (vers_eff) * vers.a + # version effect
    (cond_vers_eff) * cond.a * vers.a + # condition * version interaction
    error # residual (trial-level) error
  )

head(dat)

Plot to check

Plot to check that all of the effects are simulated in the right direction. The pattern we intended to simulate was that the RTs for the congruent versions are lower than the incongruent versions and the RTs for the easy condition are lower than the RTs for the hard condition.

ggplot(dat, aes(cond, rt, color = vers)) +
  geom_hline(yintercept = grand_i) +
  geom_violin(alpha = 0.5) +
  stat_summary(position = position_dodge(width = 0.9))

Data Function

You can put the code above in a function so you can run it more easily and change the parameters. You're unlikely to change cond_levels or vers_levels, or change the contrast coding, so these don't have to be arguments in the function, but can be hard-coded in the function. However, put anything you might want to vary as an argument.

sim_data <- function(sub_n         = 200,
                     stim_n        =  50,
                     sub_i_sd      = 100,
                     stim_i_sd     =  50,
                     error_sd        =  25,
                     grand_i       = 400,
                     cond_eff      =  50,
                     vers_eff      =  50,
                     cond_vers_eff =   0) {
  cond_levels <- c("easy", "hard")
  vers_levels <- c("con", "incon")

  dat <- add_random(sub = sub_n) %>%
    add_random(stim = stim_n) %>%
    add_between(.by = "sub", cond = cond_levels) %>%
    add_within(vers = vers_levels) %>%
    add_contrast(col = "cond", 
                 contrast = "anova", 
                 colnames= "cond.a") %>%
    add_contrast(col = "vers", 
                 contrast = "anova", 
                 colnames= "vers.a") %>%
    add_ranef(.by = "sub", sub_i = sub_i_sd) %>%
    add_ranef(.by = "stim", stim_i = stim_i_sd) %>%
    add_ranef(error = error_sd) %>%
    mutate(rt = 
      grand_i + sub_i + stim_i + # intercepts
      (cond_eff) * cond.a + # condition effect
      (vers_eff) * vers.a + # version effect
      (cond_vers_eff) * cond.a * vers.a + # condition * version interaction
      error # residual (trial-level) error
    )

  return(dat)
}

Now you can easily simulate data with a different pattern of fixed effects, such as no effects.

null_dat <- sim_data(cond_eff = 0, 
        vers_eff = 0, 
        cond_vers_eff = 0)

ggplot(null_dat, aes(cond, rt, color = vers)) +
  geom_hline(yintercept = grand_i) +
  geom_violin(alpha = 0.5) +
  stat_summary(position = position_dodge(width = 0.9))

Interactions

If you want to simulate an interaction, it can be tricky to figure out what to set the main effects and interaction effect to. It can be easier to think about the simple main effects for each cell. Create four new variables and set them to the deviations from the overall mean you'd expect for each condition (so they should add up to 0). Here, we're simulating a small effect of version in the hard condition (50ms difference) and double that effect of version in the easy condition (100ms difference).

# set variables to use in calculations below
hard_congr <- -25
hard_incon <- +25
easy_congr <- -50
easy_incon <- +50

Use the code below to transform the simple main effects above into main effects and interactions for use in the equations below.

# mean difference between conditions: hard - easy
new_cond_eff      <- (hard_congr + hard_incon)/2 -
                     (easy_congr + easy_incon)/2

# mean difference between incongruent and congruent versions
new_vers_eff      <- (hard_incon + easy_incon)/2 - 
                     (hard_congr + easy_congr)/2

# interaction between version and condition
new_cond_vers_eff <- (hard_incon - hard_congr) -
                     (easy_incon - easy_congr)

Then generate the DV the same way we did above.

dat_ixn <- sim_data(cond_eff = new_cond_eff, 
                    vers_eff = new_vers_eff,
                    cond_vers_eff = new_cond_vers_eff)
ggplot(dat_ixn, aes(cond, rt, color = vers)) +
  geom_hline(yintercept = grand_i) +
  geom_violin(alpha = 0.5) +
  stat_summary(position = position_dodge(width = 0.9))
group_by(dat_ixn, cond, vers) %>%
  summarise(m = round(mean(rt) - grand_i, 1),
            .groups = "drop") %>%
  spread(vers, m)

Analysis

Now we'll run a linear mixed effects model with lmer and look at the summary. Simulate a new data set with the default parameters (50ms main effects and no interaction).

dat <- sim_data()
mod <- lmer(rt ~ cond.a * vers.a +
              (1 | sub) + 
              (1 | stim),
            data = dat)

mod.sum <- summary(mod)

mod.sum

Sense checks

First, check that your groups make sense.

mod.sum$ngrps

Next, look at the random effects.

mod.sum$varcor

Finally, look at the fixed effects.

mod.sum$coefficients

Random effects

Plot the subject intercepts from our code above (dat$sub_i) against the subject intercepts calculated by lmer (ranef(mod)$sub).

model_i <- ranef(mod)$sub %>%
  as_tibble(rownames = "sub") %>%
  rename(mod_sub_i = `(Intercept)`)

data_i <- dat %>%
  select(sub, sub_i) %>%
  distinct()

left_join(model_i, data_i, by = "sub") %>%
  ggplot(aes(sub_i,mod_sub_i)) +
  geom_point() +
  geom_smooth(method = lm, formula = y~x) +
  xlab("Simulated random intercepts (sub_i)") +
  ylab("Modeled random intercepts")

Plot the stimulus intercepts from our code above (dat$stim_i) against the stimulus intercepts calculated by lmer (ranef(mod)$stim).

model_i <- ranef(mod)$stim %>%
  as_tibble(rownames = "stim") %>%
  rename(mod_stim_i = `(Intercept)`)

data_i <- dat %>%
  select(stim, stim_i) %>%
  distinct()

left_join(model_i, data_i, by = "stim") %>%
  ggplot(aes(stim_i,mod_stim_i)) +
  geom_point() +
  geom_smooth(method = lm, formula = y~x) +
  xlab("Simulated random intercepts (stim_i)") +
  ylab("Modeled random intercepts")

Analysis Function

You can put the code above in a function so you can run it more easily and change the parameters. The ... argument lets you pass arguments with any name on to another function (here, sim_data()). Return a tidied version of the results.

sim_lmer <- function(...) {
  dat <- sim_data(...)

  mod <- lmer(rt ~ cond.a * vers.a +
                (1 | sub) + 
                (1 | stim),
              data = dat)

  broom.mixed::tidy(mod)
}

Run the function with the default values.

sim_lmer()

Try changing some variables to simulate different patterns of fixed effects.

sim_lmer(cond_eff = 0,
         vers_eff = 75, 
         cond_vers_eff = 50)

Or simulate different patterns of random effects.

sim_lmer(sub_i_sd = 50,
         stim_i_sd = 10,
         error_sd = 250)

Power analysis

First, update the analysis function to add the argument rep (the replication number) and to return a data table of just the fixed effects.

sim_lmer <- function(rep, ...) {
  dat <- sim_data(...)

  mod <- lmer(rt ~ cond.a * vers.a +
                (1 | sub) + 
                (1 | stim),
              data = dat)

  broom.mixed::tidy(mod, "fixed") %>%
    mutate(rep = rep)
}

You can use purrr's map_df() function to create a data table of results from multiple replications of this function. We're only running 10 replications here in the interests of time, but you'll want to run 100 or more for a proper power calculation.

my_sims <- map_df(1:10, sim_lmer)

head(my_sims)

You can then plot the distribution of estimates across your simulations.

ggplot(my_sims, aes(estimate, color = term)) +
  geom_density() +
  facet_wrap(~term, scales = "free")

You can also just calculate power as the proportion of p-values less than your alpha.

my_sims %>%
  group_by(term) %>%
  summarise(power = mean(p.value < 0.05),
            .groups = "drop")

Random slopes

In the example so far we've ignored random variation among subjects or stimuli in the size of the fixed effects (i.e., random slopes).

# subject ranefs
sub_sd <- 100 # SD for the subjects' random intercept
sub_version_sd <- 20 # SD for the subjects' random slope for stim_version
sub_i_version_cor <- -0.2 # correlation between intercept and slope

# stimulus ranefs
stim_sd <- 50 # SD for the stimuli's random intercept
stim_version_sd <- 10 # SD for the stimuli's random slope for stim_version
stim_cond_sd <- 30 # SD for the stimuli's random slope for sub_cond
stim_cond_version_sd <- 15 # SD for the stimuli's random slope for sub_cond:stim_version

# specify correlations for rnorm_multi (one of several methods)
stim_i_cor <- -0.4 # correlations between intercept and slopes
stim_s_cor <- +0.2 # correlations among slopes
stim_cors <- c(stim_i_cor, stim_i_cor, stim_i_cor,
                           stim_s_cor, stim_s_cor,
                                       stim_s_cor)

# residual ranef
error_sd         <- 25  # residual (error) SD

Subjects

In addition to generating a random intercept for each subject, now we will also generate a random slope for any within-subject factors. The only within-subject factor in this design is stim_version. The main effect of stim_version is set to 50 above, but different subjects will show variation in the size of this effect. That's what the random slope captures. We'll set sub_version_sd below to the SD of this variation and use this to calculate the random slope (sub_version_slope) for each subject.

Also, it's likely that the variation between subjects in the size of the effect of version is related in some way to between-subject variation in the intercept. So we want the random intercept and slope to be correlated. Here, we'll simulate a case where subjects who have slower (larger) reaction times across the board show a smaller effect of version, so we set sub_i_version_cor below to a negative number (-0.2).

The code below creates two variables (sub_i, sub_version_slope) that are correlated with r = -0.2, means of 0, and SDs equal to what we set sub_sd above and sub_version_sd below.

sub_version_sd <- 20
sub_i_version_cor <- -0.2

sub <- faux::rnorm_multi(
  n = sub_n, 
  r = sub_i_version_cor,
  mu = 0, # means of random intercepts and slopes are always 0
  sd = c(sub_sd, sub_version_sd),
  varnames = c("sub_i", "sub_version_slope")
) %>%
  mutate(
    sub_id = 1:sub_n,
    sub_cond = rep(c("easy","hard"), each = sub_n/2) # between-subjects factor
  )

Plot to double-check it looks sensible.

ggplot(sub, aes(sub_i, sub_version_slope, color = sub_cond)) +
  geom_point() +
  geom_smooth(method = lm, formula = y~x)

Stimuli

In addition to generating a random intercept for each stimulus, we will also generate a random slope for any within-stimulus factors. Both stim_version and sub_condition are within-stimulus factors (i.e., all stimuli are seen in both congruent and incongruent versions and both easy and hard conditions). So the main effects of version and condition (and their interaction) will vary depending on the stimulus.

They will also be correlated, but in a more complex way than above. You need to set the correlations for all pairs of slopes and intercept. Let's set the correlation between the random intercept and each of the slopes to -0.4 and the slopes all correlate with each other +0.2 (You could set each of the six correlations separately if you want, though).

stim_version_sd <- 10 # SD for the stimuli's random slope for stim_version
stim_cond_sd <- 30 # SD for the stimuli's random slope for sub_cond
stim_cond_version_sd <- 15 # SD for the stimuli's random slope for sub_cond:stim_version
stim_i_cor <- -0.4 # correlations between intercept and slopes
stim_s_cor <- +0.2 # correlations among slopes

# specify correlations for rnorm_multi (one of several methods)
stim_cors <- c(stim_i_cor, stim_i_cor, stim_i_cor,
                           stim_s_cor, stim_s_cor,
                                       stim_s_cor)
stim <- rnorm_multi(
  n = stim_n, 
  r = stim_cors, 
  mu = 0, # means of random intercepts and slopes are always 0
  sd = c(stim_sd, stim_version_sd, stim_cond_sd, stim_cond_version_sd),
  varnames = c("stim_i", "stim_version_slope", "stim_cond_slope", "stim_cond_version_slope")
) %>%
  mutate(
    stim_id = 1:stim_n
  )

Check your simulated data using faux's check_sim_stats() function. Here, we're simulating different SDs for different effects, so our summary table should reflect this.

select(stim, -stim_id) %>% # remove id
  check_sim_stats() %>%    # calculates means, SDs and correlations
  rename(i = 3, v = 4, c = 5, cv = 6) # rename columns to fit width

Trials

Now we put the subjects and stimuli together in the same way as before.

trials <- crossing(
  sub_id = sub$sub_id, # get subject IDs from the sub data table
  stim_id = stim$stim_id, # get stimulus IDs from the stim data table
  stim_version = c("congruent", "incongruent") # all subjects see both congruent and incongruent versions of all stimuli
) %>%
  left_join(sub, by = "sub_id") %>% # includes the intercept, slope, and condition for each subject
  left_join(stim, by = "stim_id")   # includes the intercept and slopes for each stimulus

Calculate DV

Now we can calculate the DV by adding together an overall intercept (mean RT for all trials), the subject-specific intercept, the stimulus-specific intercept, the effect of subject condition, the stimulus-specific slope for condition, the effect of stimulus version, the stimulus-specific slope for version, the subject-specific slope for condition, the interaction between condition and version (set to 0 for this example), the stimulus-specific slope for the interaction between condition and version, and an error term.
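
The calculation below uses fixed effects named sub_cond_eff, stim_version_eff and cond_version_ixn, which we haven't set yet in this section. As a working assumption, set them to the same pattern as the first half of the tutorial (50ms main effects, no interaction):

# fixed effects (assumed values, mirroring the first half of the tutorial)
grand_i          <- 400 # overall mean DV
sub_cond_eff     <-  50 # mean difference between conditions: hard - easy
stim_version_eff <-  50 # mean difference between versions: incongruent - congruent
cond_version_ixn <-   0 # interaction between version and condition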

conditions <- c("easy" = -0.5, "hard" = +0.5)
versions <- c("congruent" = -0.5, "incongruent" = +0.5)

dat <- trials %>%
  mutate(
    # effect-code subject condition and stimulus version
    sub_cond.e = recode(sub_cond, !!!conditions),
    stim_version.e = recode(stim_version, !!!versions),
    # calculate trial-specific effects by adding overall effects and slopes
    cond_eff = sub_cond_eff + stim_cond_slope,
    version_eff = stim_version_eff + stim_version_slope + sub_version_slope,
    cond_version_eff = cond_version_ixn + stim_cond_version_slope,
    # calculate error term (normally distributed residual with SD set above)
    err = rnorm(nrow(.), 0, error_sd),
    # calculate DV from intercepts, effects, and error
    dv = grand_i + sub_i + stim_i + err +
         (sub_cond.e * cond_eff) + 
         (stim_version.e * version_eff) + 
         (sub_cond.e * stim_version.e * cond_version_eff)
  )

As always, graph to make sure you've simulated the general pattern you expected.

ggplot(dat, aes(sub_cond, dv, color = stim_version)) +
  geom_hline(yintercept = grand_i) +
  geom_violin(alpha = 0.5) +
  geom_boxplot(width = 0.2, position = position_dodge(width = 0.9))

Analysis

Now we'll run a linear mixed effects model with lmer and look at the summary. You specify random slopes by adding the within-level effects to the random intercept specifications. Since the only within-subject factor is version, the random effects specification for subjects is (1 + stim_version.e | sub_id). Since both condition and version are within-stimuli factors, the random effects specification for stimuli is (1 + stim_version.e*sub_cond.e | stim_id).

This model will take a lot longer to run than one without random slopes specified. This might be a good time for a coffee break.

mod <- lmer(dv ~ sub_cond.e * stim_version.e +
              (1 + stim_version.e | sub_id) + 
              (1 + stim_version.e*sub_cond.e | stim_id),
            data = dat)

mod.sum <- summary(mod)

mod.sum

Sense checks

First, check that your groups make sense.

mod.sum$ngrps

Next, look at the SDs for the random effects.

The correlations are a bit more difficult to parse. The first column under Corr shows the correlation between the random slope for that row and the random intercept. So for stim_version.e under sub_id, the correlation should be close to sub_i_version_cor. For all three random slopes under stim_id, the correlation with the random intercept should be near stim_i_cor and their correlations with each other should be near stim_s_cor.

mod.sum$varcor
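
If the Corr columns are hard to read, one way to check them (a sketch using broom.mixed, assuming its default cor__ term naming) is to pull the estimated correlations into a tidy table and compare them to sub_i_version_cor, stim_i_cor and stim_s_cor:

broom.mixed::tidy(mod, effects = "ran_pars") %>%
  filter(grepl("^cor", term)) # keep only the correlation terms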

Finally, look at the fixed effects.

mod.sum$coefficients

Random effects

Compare the subject intercepts and slopes from our code above (sub$sub_i) against the subject intercepts and slopes calculated by lmer (ranef(mod)$sub_id).

ranef(mod)$sub_id %>%
  as_tibble(rownames = "sub_id") %>%
  rename(mod_i = `(Intercept)`,
         mod_version_slope = stim_version.e) %>%
  mutate(sub_id = as.integer(sub_id)) %>%
  left_join(sub, by = "sub_id") %>%
  select(mod_i, sub_i, 
         mod_version_slope,  sub_version_slope) %>%
  get_params() %>%    # calculates means, SDs and correlations
  rename(mod_v = 5, sub_v = 6) %>% # rename columns to fit width
  select(n, var, mod_i, mod_v, sub_i, sub_v, mean, sd)

Compare the stimulus intercepts and slopes from our code above (stim$stim_i) against the stimulus intercepts and slopes calculated by lmer (ranef(mod)$stim_id).

x <- ranef(mod)$stim_id %>%
  as_tibble(rownames = "stim_id") %>%
  rename(mod_i = `(Intercept)`,
         mod_version_slope = stim_version.e,
         mod_cond_slope = sub_cond.e,
         mod_cond_version_slope = `stim_version.e:sub_cond.e`) %>%
  mutate(stim_id = as.integer(stim_id)) %>%
  left_join(stim, by = "stim_id") %>%
  select(mod_i, stim_i, 
         mod_version_slope, stim_version_slope, 
         mod_cond_slope, stim_cond_slope, 
         mod_cond_version_slope, stim_cond_version_slope) %>%
  get_params()    # calculates means, SDs and correlations

nm <- names(x)[3:10]
x$var <- factor(x$var, levels = nm)
x %>%
  arrange(var) %>%
  rename(mod_v = 5, stim_v = 6, 
         mod_c = 7, stim_c = 8, 
         mod_cv = 9, stim_cv = 10) # rename columns to fit width

Function

You can put the code above in a function so you can run it more easily and change the parameters. I removed the plot and set the random effect argument defaults to the same as the example above (the fixed effects default to 0), but you can set them to other patterns.

sim_lmer_slope <- function( sub_n = 200,
                            sub_sd = 100,
                            sub_version_sd = 20, 
                            sub_i_version_cor = -0.2,
                            stim_n = 50,
                            stim_sd = 50,
                            stim_version_sd = 10,
                            stim_cond_sd = 30,
                            stim_cond_version_sd = 15,
                            stim_i_cor = -0.4,
                            stim_s_cor = +0.2,
                            grand_i = 400,
                            sub_cond_eff = 0,
                            stim_version_eff = 0, 
                            cond_version_ixn = 0,
                            error_sd = 25) {
  sub <- rnorm_multi(
    n = sub_n, 
    mu = 0, # means of random intercepts and slopes are always 0
    sd = c(sub_sd, sub_version_sd),
    r = sub_i_version_cor,
    varnames = c("sub_i", "sub_version_slope")
  ) %>%
    mutate(
      sub_id = 1:sub_n,
      sub_cond = rep(c("easy","hard"), each = sub_n/2) # between-subjects factor
    )

  stim_cors <- c(stim_i_cor, stim_i_cor, stim_i_cor,
                             stim_s_cor, stim_s_cor,
                                         stim_s_cor)
  stim <- rnorm_multi(
    n = stim_n,
    mu = 0, # means of random intercepts and slopes are always 0
    sd = c(stim_sd, stim_version_sd, stim_cond_sd, stim_cond_version_sd),
    r = stim_cors, 
    varnames = c("stim_i", "stim_version_slope", "stim_cond_slope", "stim_cond_version_slope")
  ) %>%
    mutate(
      stim_id = 1:stim_n
    )

  trials <- crossing(
    sub_id = sub$sub_id, # get subject IDs from the sub data table
    stim_id = stim$stim_id, # get stimulus IDs from the stim data table
    stim_version = c("congruent", "incongruent") # all subjects see both congruent and incongruent versions of all stimuli
  ) %>%
    left_join(sub, by = "sub_id") %>% # includes the intercept, slope, and condition for each subject
    left_join(stim, by = "stim_id")   # includes the intercept and slopes for each stimulus

  conditions <- c("easy" = -0.5, "hard" = +0.5)
  versions   <- c("congruent" = -0.5, "incongruent" = +0.5)

  dat <- trials %>%
    mutate(
      # effect-code subject condition and stimulus version
      sub_cond.e = recode(sub_cond, !!!conditions),
      stim_version.e = recode(stim_version, !!!versions),
      # calculate trial-specific effects by adding overall effects and slopes
      cond_eff = sub_cond_eff + stim_cond_slope,
      version_eff = stim_version_eff + stim_version_slope + sub_version_slope,
      cond_version_eff = cond_version_ixn + stim_cond_version_slope,
      # calculate error term (normally distributed residual with SD set above)
      err = rnorm(nrow(.), 0, error_sd),
      # calculate DV from intercepts, effects, and error
      dv = grand_i + sub_i + stim_i + err +
           (sub_cond.e * cond_eff) + 
           (stim_version.e * version_eff) + 
           (sub_cond.e * stim_version.e * cond_version_eff)
    )

  mod <- lmer(dv ~ sub_cond.e * stim_version.e +
                (1 + stim_version.e | sub_id) + 
                (1 + stim_version.e*sub_cond.e | stim_id),
              data = dat)

  return(mod)
}

Run the function with the default values (null fixed effects).

sim_lmer_slope() %>% summary()

Try changing some variables to simulate fixed effects.

sim_lmer_slope(sub_cond_eff = 50,
               stim_version_eff = 50, 
               cond_version_ixn = 0)

Exercises

  1. Calculate power for the parameters in the last example using the sim_lmer_slope() function.
sim_lmer_slope_pwr <- function(rep) {
  s <- sim_lmer_slope(sub_cond_eff = 50,
                      stim_version_eff = 50, 
                      cond_version_ixn = 0)

  # put just the fixed effects into a data table
  broom.mixed::tidy(s, "fixed") %>%
    mutate(rep = rep) # add a column for which rep
}

# run it only twice to test first in the interests of time
my_power_s <- map_df(1:2, sim_lmer_slope_pwr)
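
Then calculate power the same way as before, as the proportion of p-values below your alpha (with only 2 reps this just checks that the code runs; use 100+ reps for a real estimate):

my_power_s %>%
  group_by(term) %>%
  summarise(power = mean(p.value < 0.05),
            .groups = "drop")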
  2. Simulate data for the following design:

     - 100 raters rate 50 faces from group A and 50 faces from group B
     - The DV is a rating on a 0-100 scale with a mean value of 50
     - Rater intercepts have an SD of 5
     - Face intercepts have an SD of 10
     - The residual error has an SD of 8
grand_i <- 50
sub_n <- 100 # number of raters
stim_n <- 50 # in each group
sub_sd <- 5
stim_sd <- 10
err_sd <- 8
grp_effect <- 5

sub <- tibble(sub_id = 1:sub_n, 
              sub_i = rnorm(sub_n, 0, sub_sd))

stim <- tibble(stim_id = 1:(stim_n*2),
               stim_grp = rep(c("A", "B"), each = stim_n),
               stim_i = rnorm(stim_n*2, 0, stim_sd))

grps <- c("A" = -0.5, "B" = +0.5)

trials <- crossing(sub_id = sub$sub_id, 
                  stim_id = stim$stim_id) %>%
  left_join(sub, by = "sub_id") %>%
  left_join(stim, by = "stim_id") %>%
  mutate(grp.e = recode(stim_grp, !!!grps),
         error = rnorm(nrow(.), 0, err_sd),
         dv = grand_i + sub_i + stim_i + (grp.e * grp_effect) + error,
         # round and make sure all values are 0-100
         dv = norm2trunc(dv, 0, 100) %>% round())

ggplot(trials, aes(dv, color = stim_grp)) + geom_density()
  3. For the design from exercise 2, write a function that simulates data and runs a mixed effects analysis on it.
# Lisa write this soon
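In the meantime, here is one possible sketch (the function name sim_rating and the choice to return tidied lmer results are my own; it just wraps the exercise 2 code):

sim_rating <- function(sub_n = 100, stim_n = 50, grand_i = 50,
                       sub_sd = 5, stim_sd = 10, err_sd = 8,
                       grp_effect = 5) {
  # simulate raters and faces (as in exercise 2)
  sub <- tibble(sub_id = 1:sub_n,
                sub_i = rnorm(sub_n, 0, sub_sd))
  stim <- tibble(stim_id = 1:(stim_n*2),
                 stim_grp = rep(c("A", "B"), each = stim_n),
                 stim_i = rnorm(stim_n*2, 0, stim_sd))
  grps <- c("A" = -0.5, "B" = +0.5)

  dat <- crossing(sub_id = sub$sub_id,
                  stim_id = stim$stim_id) %>%
    left_join(sub, by = "sub_id") %>%
    left_join(stim, by = "stim_id") %>%
    mutate(grp.e = recode(stim_grp, !!!grps),
           error = rnorm(nrow(.), 0, err_sd),
           dv = grand_i + sub_i + stim_i + (grp.e * grp_effect) + error,
           dv = norm2trunc(dv, 0, 100) %>% round())

  # analyse with random intercepts for raters and faces
  mod <- lmer(dv ~ grp.e + (1 | sub_id) + (1 | stim_id), data = dat)
  broom.mixed::tidy(mod)
}

sim_rating() # run once with the default parameters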
  4. The package faux has a built-in dataset called fr4. Type ?faux::fr4 into the console to view the help for this dataset. Run a mixed effects model on this dataset looking at the effect of face_sex on ratings. Remember to include a random slope for the effect of face sex.
sex_code <- c("male" = 0.5, "female" = -0.5)

fr_coded <- faux::fr4 %>%
  select(rater_id, face_id, face_sex, rating) %>%
  mutate(face_sex.e = recode(face_sex, !!!sex_code))

mod <- lmer(rating ~ face_sex.e + 
              (1 + face_sex.e | rater_id) + 
              (1 | face_id), data = fr_coded)

summary(mod)
  5. Use the parameters from this analysis to simulate a new dataset with 50 male and 50 female faces, and 100 raters.
# get data into a table
stats <- broom.mixed::tidy(mod)

grand_i <- filter(stats, term == "(Intercept)") %>% pull(estimate)
sub_n <- 100
stim_n <- 50 # in each group
sub_sd <- filter(stats, group == "rater_id", term == "sd__(Intercept)") %>% pull(estimate)
sub_grp_sd <- filter(stats, group == "rater_id", term == "sd__face_sex.e") %>% pull(estimate)
stim_sd <- filter(stats, group == "face_id", term == "sd__(Intercept)") %>% pull(estimate)
err_sd <- filter(stats, group == "Residual") %>% pull(estimate)
grp_effect <- filter(stats, term == "face_sex.e") %>% pull(estimate)
sex_code <- c("male" = 0.5, "female" = -0.5)


rater <- tibble(rater_id = 1:sub_n, 
                rater_i = rnorm(sub_n, 0, sub_sd))

face <- tibble(face_id = 1:(stim_n*2),
               face_sex = rep(names(sex_code), each = stim_n),
               face_i = rnorm(stim_n*2, 0, stim_sd))


trials <- crossing(rater_id = rater$rater_id, 
                   face_id = face$face_id) %>%
  left_join(rater, by = "rater_id") %>%
  left_join(face, by = "face_id") %>%
  mutate(face_sex.e = recode(face_sex, !!!sex_code),
         error = rnorm(nrow(.), 0, err_sd),
         dv = grand_i + rater_i + face_i + (face_sex.e * grp_effect) + error,
         # round and make sure all values are 1-7
         dv = round(dv) %>% pmax(1) %>% pmin(7))

ggplot(trials, aes(dv, fill = face_sex)) + 
  geom_histogram(binwidth = 1, color = "black") +
  facet_wrap(~face_sex)

