knitr::opts_chunk$set( collapse = TRUE, out.width = "100%", fig.width = 5, fig.height = 3, dpi = 144 )
library(tidyverse)
library(faux)
library(afex)        # for anova and lmer
library(broom)
library(broom.mixed) # to make tidy tables of lmer output

theme_set(theme_minimal(base_size = 14))
The functions below are commonly used when you're setting up a simulated dataset.
The function rep() lets you repeat the first argument a number of times. Use rep() to create a vector of alternating "A" and "B" values of length 24.
rep(c("A", "B"), times = 12)
If the second argument is a vector that is the same length as the first argument, each element in the first vector is repeated the number of times given by the corresponding element in the second vector. Use rep() to create a vector of 11 "A" values followed by 3 "B" values.
rep(c("A", "B"), c(11, 3))
You can repeat each element of the vector a specified number of times using the each argument. Use rep() to create a vector of 12 "A" values followed by 12 "B" values.
rep(c("A", "B"), each = 12)
What do you think will happen if you set times to 3 and each to 2?
rep(c("A", "B"), times = 3, each = 2)
The function seq() is useful for generating a sequence of numbers with some pattern. Use seq() to create a vector of the integers 0 to 10.
seq(0, 10)
You can set the by argument to count by numbers other than 1 (the default). Use seq() to create a vector of the numbers 0 to 100 by 10s.
seq(0, 100, by = 10)
The argument length.out is useful if you know how many steps you want to divide something into. Use seq() to create a vector that starts with 0, ends with 100, and has 12 equally spaced steps (hint: how many numbers would be in a vector with 2 steps?).
seq(0, 100, length.out = 13)
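If you want to check the hint, you can confirm that 12 equally spaced steps means 13 numbers, each 100/12 apart.

# 12 steps means 13 numbers
length(seq(0, 100, length.out = 13))

# the step size is 100/12
diff(seq(0, 100, length.out = 13))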
The uniform distribution is the simplest distribution. All numbers in the range have an equal probability of being sampled. Use runif() to sample from a continuous uniform distribution.
runif(n = 10, min = 0, max = 1)
Pipe the result to hist() to make a quick histogram of your simulated data.
runif(100000, min = 0, max = 1) %>% hist()
You can use sample() to simulate events like rolling dice or choosing from a deck of cards. The code below simulates rolling a 6-sided die 10000 times. We set replace to TRUE so that each event is independent. See what happens if you set replace to FALSE.
rolls <- sample(1:6, 10000, replace = TRUE)

# plot the results
as.factor(rolls) %>% plot()
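If you do set replace to FALSE here, sample() will throw an error, because you can't draw 10000 unique values from a population of only 6. Without replacement, each value can appear at most once; a minimal sketch:

# this would error: cannot take a sample larger than the population
# sample(1:6, 10000, replace = FALSE)

# without replacement, each value can appear at most once
sample(1:6, 6, replace = FALSE)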
You can also use sample() to sample from a list of named outcomes.
pet_types <- c("cat", "dog", "ferret", "bird", "fish")

sample(pet_types, 10, replace = TRUE)
Ferrets, while the best pet, are much less common than cats and dogs, so our sample isn't very realistic. You can set the probability of each item in the list with the prob argument.
pet_types <- c("cat", "dog", "ferret", "bird", "fish")
pet_prob  <- c(0.3, 0.4, 0.1, 0.1, 0.1)
pet_data  <- sample(pet_types, 100, replace = TRUE, prob = pet_prob)

as.factor(pet_data) %>% plot()
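To see how close the sampled proportions come to the probabilities you set, you can tabulate the sample (the counts will vary from run to run).

# observed proportions vs the specified probabilities
table(pet_data) / length(pet_data)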
The rbinom() function will generate a random binomial distribution. Its arguments are:

- n = number of observations
- size = number of trials
- prob = probability of success on each trial

Coin flips are a typical example of a binomial distribution, where we can assign heads to 1 and tails to 0.
# 20 individual coin flips of a fair coin
rbinom(20, 1, 0.5)
# 20 individual coin flips of a biased (0.75) coin
rbinom(20, 1, 0.75)
You can generate the total number of heads in 1 set of 20 coin flips by setting size to 20 and n to 1.
# 1 set of 20 fair coin flips
rbinom(1, 20, 0.5)
You can generate more sets of 20 coin flips by increasing the n.
# 10 sets of 20 fair coin flips
rbinom(10, 20, 0.5)
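Across many sets, the average number of heads should be close to size * prob (here 20 * 0.5 = 10); a quick check:

# the mean count of heads approaches size * prob = 10
mean(rbinom(10000, 20, 0.5))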
We can simulate a normal distribution of size n if we know the mean and standard deviation (sd).
# 10 samples from a normal distribution with a mean of 0 and SD of 1
rnorm(10, 0, 1)
A density plot is usually the best way to visualise this type of data.
# 100 samples from a normal distribution with a mean of 10 and SD of 2
dv <- rnorm(100, 10, 2)

# use sample to get a random colour
fill_colour <- sample(colours(), 1)

ggplot() + 
  geom_density(aes(dv), fill = fill_colour) +
  scale_x_continuous(limits = c(0, 20), breaks = seq(0, 20))
Run the simulation above several times, noting how the density plot changes. Try changing the values of n, mean, and sd.
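One useful check while you experiment: with a large n, the sample mean and SD should be very close to the parameters you set.

# with a large n, the sample statistics converge on the parameters
mean(rnorm(1e5, 10, 2)) # close to 10
sd(rnorm(1e5, 10, 2))   # close to 2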
Now we're ready to start simulating some data. Let's start with a simple independent-samples design where the variables are from a normal distribution. Each subject produces one score (in condition A or B). What we need to know about these scores is:

- how many subjects are in each condition
- what the mean score is in each condition
- what the standard deviation of the scores is in each condition
First, set parameters for these values. This way, you can use these variables wherever you need them in the rest of the code and you can easily change them.
A_sub_n <- 50
B_sub_n <- 50
A_mean  <- 10
B_mean  <- 11
A_sd    <- 2.5
B_sd    <- 2.5
We can then generate the scores using the rnorm() function.
A_scores <- rnorm(A_sub_n, A_mean, A_sd)
B_scores <- rnorm(B_sub_n, B_mean, B_sd)
You can stop here and just analyse your simulated data with t.test(A_scores, B_scores), but usually you want to get your simulated data into a data table that looks like what you might eventually import from a CSV file with your actual experimental data.
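That quick check would look like this (t.test() with two vectors runs Welch's t-test by default):

# quick check before building the data table
t.test(A_scores, B_scores)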
dat <- tibble(
  sub_condition = rep( c("A", "B"), c(A_sub_n, B_sub_n) ),
  score = c(A_scores, B_scores)
)
If you're simulating data for a script where you will eventually import data from a CSV file, you can save the simulated data to a CSV file and then re-read it in. When you get your real data, all you need to do is comment out the simulation steps.
# make a data directory if there isn't one already
if (!dir.exists("data")) dir.create("data")

# save your simulated data
write_csv(dat, "data/sim-data-ind-samples.csv")

# start your analysis here
dat <- read_csv("data/sim-data-ind-samples.csv")
Always examine your simulated data after you generate it to make sure it looks like you want.
dat %>%
  group_by(sub_condition) %>%
  summarise(n = n(),
            mean = mean(score),
            sd = sd(score),
            .groups = "drop")
t.test(score ~ sub_condition, dat)
You can wrap all this in a function so you can run it many times to do a power calculation. Put all your parameters as arguments to the function.
ind_sim <- function(A_sub_n, B_sub_n, A_mean, B_mean, A_sd, B_sd) {
  # simulate data for groups A and B
  A_scores <- rnorm(A_sub_n, A_mean, A_sd)
  B_scores <- rnorm(B_sub_n, B_mean, B_sd)
  
  # put the data into a table
  dat <- tibble(
    sub_condition = rep( c("A", "B"), c(A_sub_n, B_sub_n) ),
    score = c(A_scores, B_scores)
  )
  
  # analyse the data
  t <- t.test(score ~ sub_condition, dat)
  
  # return a list of the values you care about
  # the double brackets ([[]]) get rid of the name of named numbers
  list(
    t = t$statistic[[1]],
    ci_lower = t$conf.int[[1]],
    ci_upper = t$conf.int[[2]],
    p = t$p.value[[1]],
    estimate = t$estimate[[1]] - t$estimate[[2]]
  )
}
Now run your new function with the values you used above.
# str() prints the resulting list in a shorter format
ind_sim(50, 50, 10, 11, 2.5, 2.5) %>% str()
Now you can use this function to run many simulations. The function map_df() from the purrr package (loaded with tidyverse) is one of many ways to run a function many times and organise the results into a table.
mysim <- map_df(1:1000, ~ind_sim(50, 50, 10, 11, 2.5, 2.5))
Now you can graph the data from your simulations.
# set boundary = 0 when plotting p-values
ggplot(mysim, aes(p)) + 
  geom_histogram(binwidth = 0.05, boundary = 0, 
                 fill = "white", colour = "black")
mysim %>%
  gather(stat, value, t:estimate) %>%
  ggplot() + 
  geom_density(aes(value, color = stat), show.legend = FALSE) +
  facet_wrap(~stat, scales = "free")
You can calculate power as the proportion of simulations on which the p-value was less than your alpha.
alpha <- 0.05
power <- mean(mysim$p < alpha)
power
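This power value is itself an estimate based on 1000 simulations, so it has some simulation error. If you want a sense of that uncertainty, one option (an extra step, not required) is a binomial confidence interval on the number of significant results.

# approximate 95% CI for the estimated power
binom.test(sum(mysim$p < alpha), nrow(mysim))$conf.int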
Now let's try a paired-samples design where the variables are from a normal distribution. Each subject produces two scores (in conditions A and B). What we need to know about these two scores is:

- how many subjects there are
- what the mean score is in each condition
- what the standard deviation of the scores is in each condition
- what the correlation is between the A and B scores
sub_n  <- 100
A_mean <- 10
B_mean <- 11
A_sd   <- 2.5
B_sd   <- 2.5
AB_r   <- 0.5
You can then use rnorm_multi() to generate a data table with simulated values for correlated scores:
dat <- faux::rnorm_multi(
  n = sub_n, 
  vars = 2, 
  r = AB_r,
  mu = c(A_mean, B_mean), 
  sd = c(A_sd, B_sd),
  varnames = c("A", "B")
)
You can also do this using the MASS::mvrnorm function, but faux::rnorm_multi is easier when you have more variables to simulate.
# make the correlation matrix
cormat <- matrix(c(   1, AB_r,
                   AB_r,    1), 
                 nrow = 2, byrow = TRUE)

# make a corresponding matrix of the variance
# (multiply the SDs for each cell)
varmat <- matrix(c(A_sd * A_sd, A_sd * B_sd,
                   A_sd * B_sd, B_sd * B_sd), 
                 nrow = 2, byrow = TRUE)

# create correlated variables with the specified parameters
S <- MASS::mvrnorm(n = sub_n, 
                   mu = c(A_mean, B_mean), 
                   Sigma = cormat * varmat)

dat <- data.frame(
  A = S[, 1],
  B = S[, 2]
)
Now check your data; faux has a function get_params() that gives you the correlation table, means, and SDs for each numeric column in a data table.
faux::get_params(dat)
# paired-samples t-test
t.test(dat$A, dat$B, paired = TRUE)
paired_sim <- function(sub_n, A_mean, B_mean, A_sd, B_sd, AB_r) {
  dat <- faux::rnorm_multi(
    n = sub_n, 
    vars = 2, 
    r = AB_r,
    mu = c(A_mean, B_mean), 
    sd = c(A_sd, B_sd),
    varnames = c("A", "B")
  )
  
  t <- t.test(dat$A, dat$B, paired = TRUE)
  
  # return just the values you care about
  list(
    t = t$statistic[[1]],
    ci_lower = t$conf.int[[1]],
    ci_upper = t$conf.int[[2]],
    p = t$p.value[[1]],
    estimate = t$estimate[[1]]
  )
}
Run 1000 simulations and graph the results.
mysim_p <- map_df(1:1000, ~paired_sim(100, 10, 11, 2.5, 2.5, .5))
mysim_p %>%
  gather(stat, value, t:estimate) %>%
  ggplot() + 
  geom_density(aes(value, color = stat), show.legend = FALSE) +
  facet_wrap(~stat, scales = "free")
alpha <- 0.05
power <- mean(mysim_p$p < alpha)
power
Now I'm going to show you a different way to simulate the same design. This might seem excessively complicated, but you will need this pattern when you start simulating data for mixed effects models.
Remember, we used the following parameters to set up our simulation above:
sub_n  <- 100
A_mean <- 10
B_mean <- 11
A_sd   <- 2.5
B_sd   <- 2.5
AB_r   <- 0.5
From these, we can calculate the grand intercept (the overall mean regardless of condition), and the effect of condition (the mean of B minus A).
grand_i   <- (A_mean + B_mean)/2
AB_effect <- B_mean - A_mean
We also need to think about variance a little differently. First, calculate the pooled variance as the mean of the variances for A and B (remember, variance is SD squared).
pooled_var <- (A_sd^2 + B_sd^2)/2
The variance of the subject intercepts is r times this pooled variance and the error variance is what is left over. We take the square root (sqrt()) to set the subject intercept and error SDs for simulation later.
sub_sd   <- sqrt(pooled_var * AB_r)
error_sd <- sqrt(pooled_var * (1 - AB_r))
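As a quick sanity check (an optional step), the two variance components should add back up to the pooled variance.

# subject-intercept variance plus error variance equals the pooled variance
all.equal(sub_sd^2 + error_sd^2, pooled_var)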
Now we use these variables to create a data table for our subjects. Each subject gets an ID and a random intercept (sub_i). The intercept is simulated from a random normal distribution with a mean of 0 and an SD of sub_sd. This represents how much higher or lower than the average score each subject tends to be (regardless of condition).
sub <- tibble(
  sub_id = 1:sub_n,
  sub_i  = rnorm(sub_n, 0, sub_sd)
)
Next, set up a table where each row represents one observation. We'll use one of my favourite functions for simulation: crossing(). This creates every possible combination of the listed factors (it works the same as expand.grid(), but the results are in a more intuitive order). Here, we're using it to create a row for each subject in each condition, since this is a fully within-subjects design.
obs <- crossing(
  sub_id = 1:sub_n,
  condition = c("A", "B")
)
Next, we join the subject table so each row has the information about the subject's random intercept and then calculate the score. I've done it in a few steps below for clarity. The score is just the sum of:
- the grand intercept (grand_i)
- the subject's random intercept (sub_i)
- the effect of condition (effect): the numeric code for condition (condition.e) multiplied by the effect of condition (AB_effect)
- the error, simulated from a normal distribution with a mean of 0 and an SD of error_sd

dat <- obs %>%
  left_join(sub, by = "sub_id") %>%
  mutate(
    condition.e = recode(condition, "A" = -0.5, "B" = 0.5),
    effect = AB_effect * condition.e,
    error = rnorm(nrow(.), 0, error_sd),
    score = grand_i + sub_i + effect + error
  )
Use get_params() to check the data. With data in long format, you need to specify the columns that contain the id, dv, and within-id variables.
# check the data
faux::get_params(dat, id = "sub_id", dv = "score", within = "condition")
You can use the following code to put the data table into a more familiar "wide" format.
dat_wide <- dat %>%
  select(sub_id, condition, score) %>%
  spread(condition, score)
You can analyse the data with a paired-samples t-test from the wide format:
# paired-samples t-test from dat_wide
t.test(dat_wide$A, dat_wide$B, paired = TRUE)
Or in the long format:
# paired-samples t-test from dat (long)
t.test(score ~ condition, dat, paired = TRUE)
You can analyse the data with ANOVA using the aov_4() function from afex. (Notice how the F-value is the square of the t-value above.)
# anova using afex::aov_4
aov <- afex::aov_4(score ~ (condition | sub_id), data = dat)

aov$anova_table
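To convince yourself of the F = t-squared relationship noted above, you can compare the square root of the F-value to the absolute t-value. This sketch assumes the aov object and dat_wide from the chunks above are still in your workspace, and that the ANOVA table stores the F-value in a column named F (as afex output does).

# sqrt(F) should match the absolute value of t (up to rounding)
sqrt(aov$anova_table$F)
abs(t.test(dat_wide$A, dat_wide$B, paired = TRUE)$statistic)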
You can even analyse the data with a mixed effects model using the lmer function (the afex version gives you p-values, but the lme4 version does not).
# mixed effect model using afex::lmer
lmem <- afex::lmer(score ~ condition.e + (1 | sub_id), data = dat)

# displays a tidy table of the fixed effects
broom.mixed::tidy(lmem, effects = "fixed")
Simulate a dataset from the parameters of an analysis. We'll use the built-in dataset mtcars to predict miles per gallon (mpg) from transmission type (am) and engine type (vs).
model <- lm(mpg ~ am * vs, data = mtcars)

broom::tidy(model)
We can now simulate a dataset with 50 observations from each combination of transmission type (am) and engine type (vs), then use the model parameters to generate predicted values for mpg.
err_sd <- sigma(model)    # SD of the error term from the model
fx <- coefficients(model) # fixed effect coefficients

sim_mtcars <- tibble(
  am = rep(c(0, 0, 1, 1), each = 50),
  vs = rep(c(0, 1, 0, 1), each = 50)
) %>%
  mutate(err = rnorm(200, 0, err_sd),
         mpg = fx[1] + fx["am"]*am + fx["vs"]*vs + fx["am:vs"]*am*vs + err)
Analyse the simulated data with lm() and output the results as a table using broom::tidy().
sim_model <- lm(mpg ~ am * vs, data = sim_mtcars)

broom::tidy(sim_model)
As before, wrap the simulation and analysis in a function so you can vary the important parameters.

carsim <- function(n, b0, b_am, b_vs, b_am_vs, err_sd) {
  sim_mtcars <- tibble(
    am = rep(c(0, 0, 1, 1), each = n),
    vs = rep(c(0, 1, 0, 1), each = n)
  ) %>%
    mutate(err = rnorm(n*4, 0, err_sd),
           mpg = b0 + b_am*am + b_vs*vs + b_am_vs*am*vs + err)
  
  sim_model <- lm(mpg ~ am * vs, data = sim_mtcars)
  
  broom::tidy(sim_model)
}
Run the function with the values from the original model, but cut the fixed effect sizes in half.
err_sd <- sigma(model)
fx2 <- coefficients(model)/2

carsim(50, fx2[1], fx2[2], fx2[3], fx2[4], err_sd)
Repeat this 100 times and calculate power for each effect.
simstats <- map_df(1:100, ~carsim(50, fx2[1], fx2[2], fx2[3], fx2[4], err_sd))

simstats %>%
  group_by(term) %>%
  summarise(power = mean(p.value < .05), .groups = "drop")
Using the dataset below, predict moral disgust from the interaction between pathogen and sexual disgust using lm().
disgust <- read_csv("https://psyteachr.github.io/msc-data-skills/data/disgust_scores.csv")
Simulate a new dataset of 100 people with a similar pathogen and sexual disgust distribution to the original dataset. Remember that these are likely to be correlated and that scores can only range from 0 to 6. (Hint: look at the help for norm2trunc.)
Write a function to simulate data, analyse it, and return a table of results. Make sure you can vary the important parameters using arguments.
Calculate power for the same fixed effects as in the original analysis. Adjust the N until the design has around .80 power to detect a main effect of pathogen disgust.