In debruine/data-sim-workshops: Data Simulation Workshops

knitr::opts_chunk$set(
  collapse = TRUE,
  out.width = "100%",
  fig.width = 5,
  fig.height = 3,
  dpi = 144
)
set.seed(8675309) # Jenny, I've got your number

library(tidyverse)
library(faux)
library(broom)
library(afex)

In this tutorial, we'll learn how to simulate data for factorial designs using {faux}. There are more extensive examples at https://debruine.github.io/faux/.

Multivariate normal

You can create sets of correlated normally distributed values using rnorm_multi().

dat3 <- rnorm_multi(
  n = 50,
  vars = 3,
  mu = c(1, 2, 3),
  sd = c(0.5, 1, 1.5),
  r = c(0, .25, .5),
  varnames = c("A", "B", "C")
)

The function get_params() gives you a quick way to see the means, SDs and correlations in the simulated data set to make sure you set the parameters correctly.

get_params(dat3)

If you set empirical to TRUE, the values you set will be the sample parameters, not the population parameters. This isn't usually what you want for a simulation, but can be useful to check you set the simulation parameters correctly.

dat3 <- rnorm_multi(
  n = 50,
  vars = 3,
  mu = c(1, 2, 3),
  sd = c(0.5, 1, 1.5),
  r = c(0, .25, .5),
  varnames = c("A", "B", "C"),
  empirical = TRUE
)

get_params(dat3)

Shortcuts

There are a few shortcuts you can use. Run the following and see if you can guess how they work.

guess1 <- rnorm_multi(50, mu = c(x = 1, y = 2, z = 3), empirical = TRUE)

get_params(guess1)

guess2 <- rnorm_multi(50, vars = 4, r = 0.5, empirical = TRUE)

get_params(guess2)

iris_r <- cor(iris[, 1:4])
iris_mu <- summarise_all(iris[, 1:4], mean) %>% t()
iris_sd <- summarise_all(iris[, 1:4], sd) %>% t()

guess3 <- rnorm_multi(50, 
                      mu = iris_mu, 
                      sd = iris_sd, 
                      r = iris_r)

get_params(guess3)

You can set the r for correlations is a few different ways.

# all correlations the same value
rho_same <- rnorm_multi(50, 4, r = .5, empirical = TRUE)
get_params(rho_same)

# upper right triangle
rho_urt <- rnorm_multi(50, 4, 
                      #      X2   X3   X4
                       r = c(0.5, 0.4, 0.3, # X1
                                  0.2, 0.1, # X2
                                       0.0), # X3
                       empirical = TRUE)
get_params(rho_urt)

# full correlation matrix
rho_cormat <- rnorm_multi(50, 4, 
                        #      X1   X2   X3   X4
                          r = c(1.0, 0.5, 0.4, 0.3,  # X1
                                0.5, 1.0, 0.2, 0.1,  # X2
                                0.4, 0.2, 1.0, 0.0,  # X3
                                0.3, 0.1, 0.0, 1.0), # X4
                       empirical = TRUE)
get_params(rho_cormat)

rnorm_multi(10, 3, r = c(.9, .9, -.9))

Factorial Designs

You can just use rnorm_multi() to simulate data for each between-subjects cell of a factorial design and manually combine the tables, but faux has a function that better maps onto how we usually think and teach about factorial designs.

The default design is 100 observations of one variable (named y) with a mean of 0 and SD of 1. Unless you set plot = FALSE or run faux_options(plot = FALSE), this function will show you a plot of your design so you can check that it looks like you expect.

simdat1 <- sim_design()

Factors

Use lists to set the names and levels of within- and between-subject factors.

pettime <- sim_design(
  within = list(time = c("pre", "post"),
                condition = c("A", "B")),
  between = list(pet = c("cat", "dog", "ferret"))
)

You can set mu and sd with unnamed vectors, but getting the order right can be tricky.

pettime <- sim_design(
  within = list(time = c("pre", "post")),
  between = list(pet = c("cat", "dog", "ferret")),
  mu = 1:6
)

You can set values with a named vector for a single type of factor. The values do not have to be in the right order if they're named.

pettime <- sim_design(
  within = list(time = c("pre", "post")),
  between = list(pet = c("cat", "dog", "ferret")),
  mu = c(cat = 1, ferret = 5, dog = 3),
  sd = c(pre = 1, post = 2)
)

Or use a data frame for within- and between-subject factors.

mu <- data.frame(
    pre_A = c(1, 3, 5),
    post_A = c(2, 4, 6),
    pre_B = c(10, 30, 50),
    post_B = c(20, 40, 60),
    row.names = c("cat", "dog", "ferret")
  )

pettime <- sim_design(
  within = list(time = c("pre", "post"),
                condition = c("A","B")),
  between = list(pet = c("cat", "dog", "ferret")),
  mu = mu
)

If you have within-subject factors, set the correlations for each between-subject cell like this. You need to tell get_params() if you have any between-subject columns.

pettime <- sim_design(
  within = list(time = c("pre", "post")),
  between = list(pet = c("cat", "dog", "ferret")),
  r = list(cat = 0.5,
           dog = 0.25,
           ferret = 0),
  empirical = TRUE,
  plot = FALSE
)

get_params(pettime, between = "pet")

You can also change the name of the dv and id columns and output the data in long format. If you do this, you also need to tell get_params() what columns contain the between- and within-subject factors, the dv, and the id.

dat_long <- sim_design(
  within = list(time = c("pre", "post")),
  between = list(pet = c("cat", "dog", "ferret")),
  id = "subj_id",
  dv = "score",
  long = TRUE,
  plot = FALSE
)

get_params(dat_long, 
           between = "pet", 
           within = "time",
           id = "subj_id",
           dv = "score",
           digits = 3)

Anonymous Factors

If you need to make a quick demo, you can set factors anonymously with integer vectors.

dat_anon <- sim_design(
  n = 50,
  between = list(pet = c(dog = "Doggies", cat = "Kittens")),
  dv = c(score = "Happiness Score")
)

x <- attr(dat_anon, "design")

Faux has a quick plotting function for visualising data made with sim_design.

plot(dat_anon)

You can change the order of plotting and the types of geoms plotted. This takes a little trial and error, so this function will probably be refined in later versions.

plot(dat_anon, "B", "A", "C", geoms = c("jitter"))

Replications

You often want to simulate data repeatedly to do things like calculate power. The sim_design() function has a lot of overhead for checking if a design makes sense and if the correlation matrix is possible, so you can speed up the creation of multiple datasets with the same design using the rep argument. This will give you a nested data from with each dataset in the data column.

dat_rep <- sim_design(
  within = 2,
  n = 20,
  mu = c(0, 0.25),
  rep = 10,
  plot = FALSE
)

You can run analyses on the nested data like this:

map_df(dat_rep$data, ~{
 t.test(.x$A1, .x$A2, paired = TRUE) %>% broom::tidy()
})

Exercises

Multivariate normal

Sample 40 values of three variables named J, K and L from a population with means of 10, 20 and 30, and SDs of 5. J and K are correlated 0.5, J and L are correlated 0.25, and K and L are not correlated.

ex1 <- rnorm_multi(n = 40, mu = c(J = 10, K = 20, L = 30),
                   sd = 5, r = c(0.5, 0.25, 0))

get_params(ex1)

From existing data

Using the data from the built-in dataset attitude, simulate a new set of 20 observations drawn from a population with the same means, SDs and correlations for each column as the original data.

dat_r <- cor(attitude)
dat_mu <- summarise_all(attitude, mean) %>% t()
dat_sd <- summarise_all(attitude, sd) %>% t()

ex2 <- rnorm_multi(20, mu = dat_mu, sd = dat_sd,r = dat_r)

get_params(ex2)

2b

Create a dataset with a between-subject factor of "pet" having two levels, "cat", and "dog". The DV is "happiness" score. There are 20 cat-owners with a mean happiness score of 10 (SD = 3) and there are 30 dog-owners with a mean happiness score of 11 (SD = 3).

dat2b <- sim_design(
  between = list(pet = c("cat", "dog")),
  dv = "happiness",
  n = list(cat = 20, dog = 30),
  mu = list(cat = 10, dog = 11),
  sd = 3
)

get_params(dat2b, between = "pet")

3w

Create a dataset of 20 observations with 1 within-subject variable ("condition") having 3 levels ("A", "B", "C") with means of 10, 20 and 30 and SD of 5. The correlations between each level have r = 0.4. The dataset should look like this:

| id | condition | score | |:---|:----------|------:| |S01 | A | 9.17 | |... | ... | ... | |S20 | A | 11.57 | |S01 | B | 18.44 | |... | ... | ... | |S20 | B | 20.04 | |S01 | C | 35.11 | |... | ... | ... | |S20 | C | 29.16 |

dat3w <- sim_design(
  within = list(condition = c("A", "B", "C")),
  n = 20,
  mu = c(10, 20, 30),
  sd = 5,
  r = .4,
  dv = "score",
  long = TRUE
)

get_params(dat3w)

2w*2w

Create a dataset with 50 observations of 2 within-subject variables ("A" and "B") each having 2 levels. The mean for all cells is 10 and the SD is 2. The dataset should have 20 subjects. The correlations look like this:

| | A1_B1 | A1_B2 | A2_B1 | A2_B2 | |:------|------:|------:|------:|------:| | A1_B1 | 1.0 | 0.5 | 0.5 | 0.2 | | A1_B2 | 0.5 | 1.0 | 0.2 | 0.5 | | A2_B1 | 0.5 | 0.2 | 1.0 | 0.5 | | A2_B2 | 0.2 | 0.5 | 0.5 | 1.0 |

dat2w2w <- sim_design(
  within = c(2,2),
  n = 50, 
  mu = 10,
  sd = 2,
  r = c(.5, .5, .2, 
            .2, .5, 
                .5)
)

get_params(dat2w2w)

2w*3b

Create a dataset with a between-subject factor of "pet" having 3 levels ("cat", "dog", and "ferret") and a within-subject factor of "time" having 2 levels ("pre" and "post"). The N in each group should be 10. Means are:

cats: pre = 10, post = 12
dogs: pre = 14, post = 16
ferrets: pre = 18, post = 20

SDs are all 5 and within-cell correlations are all 0.25.

mu <- data.frame(
  cat = c(10, 12),
  dog = c(14, 16),
  ferret = c(18, 20)
)

dat2w3b <- sim_design(
  within = list(time = c("pre", "post")),
  between = list(pet = c("cat", "dog", "ferret")),
  n = 10,
  mu = mu,
  sd = 5,
  r = 0.25
)

get_params(dat2w3b)

Replications

Create 5 datasets with a 2b*2b design, 30 participants in each cell. Each cell's mean should be 0, except A1_B1, which should be 0.5. The SD should be 1. Make the resulting data in long format.

dat2b2b <- sim_design(
  between = c(2,2),
  n = 30,
  mu = c(0.5, 0, 0, 0),
  rep = 5,
  long = TRUE
)

Power

Simulate 100 datasets like the one above and use lm() or afex::aov_ez() to look at the interaction between A and B. What is the power of this design?

dat2b2b_100 <- sim_design(
  between = c(2, 2),
  n = 30,
  mu = c(0.5, 0, 0, 0),
  rep = 100,
  long = TRUE
)

ana_lm <- map_df(dat2b2b_100$data, ~{
  lm(y ~ A*B, data = .x) %>% broom::tidy()
})

afex::set_sum_contrasts() # avoids annoying afex message
ana_aov <- map_df(dat2b2b_100$data, ~{
 afex::aov_ez(id = "id",
              dv = "y",
              between = c("A", "B"),
              data = .x,
              return = "aov") %>% broom::tidy()
})

ana_aov %>%
  group_by(term) %>%
  summarise(power = mean(p.value < .05),
            .groups = "drop")

debruine/data-sim-workshops documentation built on April 27, 2024, 8:13 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

debruine/data-sim-workshops
Data Simulation Workshops

In debruine/data-sim-workshops: Data Simulation Workshops

Multivariate normal

Shortcuts

Factorial Designs

Factors

Anonymous Factors

Replications

Exercises

Multivariate normal

From existing data

2b

3w

2w*2w

2w*3b

Replications

Power

R Package Documentation

Browse R Packages

We want your feedback!

debruine/data-sim-workshops Data Simulation Workshops

In debruine/data-sim-workshops: Data Simulation Workshops

Multivariate normal

Shortcuts

Factorial Designs

Factors

Anonymous Factors

Replications

Exercises

Multivariate normal

From existing data

2b

3w

2w*2w

2w*3b

Replications

Power

R Package Documentation

Browse R Packages

We want your feedback!

debruine/data-sim-workshops
Data Simulation Workshops