Simulate Correlated Variables

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "100%"
)
ggplot2::theme_set(ggplot2::theme_bw())
set.seed(8675309)
library(ggplot2)
library(dplyr)
library(tidyr)
library(faux)

The rnorm_multi() function makes multiple normally distributed vectors with specified parameters and relationships.

Quick example

For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.

dat <- rnorm_multi(n = 100, 
                  mu = c(0, 20, 20),
                  sd = c(1, 5, 5),
                  r = c(0.5, 0.5, 0.25), 
                  varnames = c("A", "B", "C"),
                  empirical = FALSE)

get_params(dat) %>% knitr::kable()

Table: Sample stats
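The means, SDs and correlations in the quick example jointly define a population covariance matrix. The following is a minimal base-R sketch of that relationship, assuming the r vector is read as the A-B, A-C and B-C correlations described above. The object names (sds, R, Sigma) are illustrative and not part of faux.

```r
# Population SDs from the quick example
sds <- c(A = 1, B = 5, C = 5)

# Correlation matrix implied by r = c(0.5, 0.5, 0.25)
R <- matrix(c(1,   0.5,  0.5,
              0.5, 1,    0.25,
              0.5, 0.25, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(names(sds), names(sds)))

# Each covariance is SD_i * r_ij * SD_j
Sigma <- diag(sds) %*% R %*% diag(sds)
dimnames(Sigma) <- dimnames(R)
Sigma
```

The diagonal of Sigma holds the variances (the squared SDs), and each off-diagonal cell rescales the correlation by the two SDs involved.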

Specify correlations {#spec_r}

You can specify the correlations in one of four ways:

One Number

If you want all the pairs to have the same correlation, just specify a single number.

bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])

get_params(bvn) %>% knitr::kable()

Table: Sample stats from a single rho
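A single number stands for a correlation matrix in which every off-diagonal cell has the same value. As a rough base-R illustration of the matrix that r = .3 with five variables corresponds to (cmat_single is a hypothetical name, and this is not necessarily how faux builds it internally):

```r
vars <- 5
rho  <- 0.3

# Fill every cell with rho, then put 1s on the diagonal
cmat_single <- matrix(rho, nrow = vars, ncol = vars,
                      dimnames = list(letters[1:vars], letters[1:vars]))
diag(cmat_single) <- 1
cmat_single
```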

Matrix

If you already have a correlation matrix, such as the output of cor(), you can pass it directly as the r argument.

cmat <- cor(iris[,1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = colnames(cmat))

get_params(bvn) %>% knitr::kable()

Table: Sample stats from a correlation matrix

Vector (vars*vars)

You can specify the correlation matrix by hand as a vector of length vars*vars, including the 1s on the diagonal.

cmat <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat, 
                  varnames = c("first", "second", "third"))

get_params(bvn) %>% knitr::kable()

Table: Sample stats from a vars*vars vector
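Before passing a hand-typed vars*vars vector, it can help to reshape it and confirm that it really is a correlation matrix. A small base-R sketch, where cmat_vec and cmat_m are illustrative names:

```r
cmat_vec <- c(1, .3, .5,
              .3, 1, 0,
              .5, 0, 1)

# The vector length must be a perfect square: vars * vars
vars <- sqrt(length(cmat_vec))
stopifnot(vars == round(vars))

# Reshape row by row, matching the layout typed above
cmat_m <- matrix(cmat_vec, nrow = vars, byrow = TRUE)

# A correlation matrix is symmetric with 1s on the diagonal
stopifnot(isSymmetric(cmat_m), all(diag(cmat_m) == 1))
cmat_m
```

A typo that breaks the symmetry, such as entering .3 in one triangle and .03 in the other, makes the isSymmetric() check fail immediately.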

Vector (vars*(vars-1)/2)

You can specify the correlation matrix by hand as a vector of length vars*(vars-1)/2 that lists only the upper-right triangle row by row, skipping the diagonal and the duplicate lower-left values.

rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = letters[1:4])

get_params(bvn) %>% knitr::kable()

Table: Sample stats from a (vars*(vars-1)/2) vector
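To see which cell each value of the triangle vector lands in, the full matrix can be rebuilt in base R. This is only a sketch of the mapping, assuming the row-by-row order suggested by the rho1_2 ... rho3_4 names above. The names tri and m are illustrative.

```r
tri  <- c(.3, .5, .5, .2, 0, -.3)  # rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4
vars <- 4
stopifnot(length(tri) == vars * (vars - 1) / 2)

m <- matrix(0, vars, vars,
            dimnames = list(letters[1:vars], letters[1:vars]))

# R fills lower.tri() column by column, which is the same order as
# reading the upper triangle row by row
m[lower.tri(m)] <- tri

# Mirror into the upper triangle and set the diagonal
m <- m + t(m)
diag(m) <- 1
m
```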

empirical

If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.

bvn <- rnorm_multi(100, 5, 0, 1, .3, 
                  varnames = letters[1:5], 
                  empirical = TRUE)

get_params(bvn) %>% knitr::kable()

Table: Sample stats with empirical = TRUE
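The difference between population and sample parameters can also be seen with MASS::mvrnorm(), which has its own empirical argument. This is a comparison sketch using MASS (a recommended package that ships with R), not a description of how rnorm_multi() is implemented. The function is called with the MASS:: prefix so that loading MASS does not mask dplyr::select().

```r
set.seed(1)

# Target: 5 variables, all pairwise correlations .3
R <- matrix(.3, nrow = 5, ncol = 5)
diag(R) <- 1

# Parameters describe the population; the sample will deviate from them
pop <- MASS::mvrnorm(n = 100, mu = rep(0, 5), Sigma = R, empirical = FALSE)

# Parameters describe the sample itself
emp <- MASS::mvrnorm(n = 100, mu = rep(0, 5), Sigma = R, empirical = TRUE)

round(cor(pop), 2)  # close to .3, with sampling error
round(cor(emp), 2)  # .3 in every off-diagonal cell
```

Exact sample statistics are convenient for demonstrations, but for power simulations the sampling error from empirical = FALSE is usually what you want to keep.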

Pre-existing variables

Use rnorm_pre() to create a vector with a specified correlation to one or more pre-existing variables. The following code creates a new column called B with a mean of 10, an SD of 2, and a correlation of r = 0.5 with the A column.

dat <- rnorm_multi(varnames = "A") %>%
  mutate(B = rnorm_pre(A, mu = 10, sd = 2, r = 0.5))
get_params(dat) %>% knitr::kable(digits = 3)

Set empirical = TRUE to return a vector with the exact specified parameters.

dat$C <- rnorm_pre(dat$A, mu = 10, sd = 2, r = 0.5, empirical = TRUE)
get_params(dat) %>% knitr::kable(digits = 3)
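One way to see why an exact correlation with an existing vector is achievable is to build the new vector from the standardised original plus a component that is exactly uncorrelated with it. The base-R sketch below illustrates that idea under that assumption. It is not taken from the faux source, and all object names are hypothetical.

```r
set.seed(1)
x    <- rnorm(50)   # the pre-existing variable
r    <- 0.5         # target correlation
mu   <- 10          # target mean
sd_y <- 2           # target SD

# Random noise, with everything linearly related to x removed
e   <- rnorm(length(x))
res <- residuals(lm(e ~ x))

# Standardise both parts (sample mean 0, sample SD 1)
zx <- as.vector(scale(x))
zr <- as.vector(scale(res))

# Mix them so the result has SD 1 and correlation r with x, then rescale
y <- mu + sd_y * (r * zx + sqrt(1 - r^2) * zr)

c(mean = mean(y), sd = sd(y), r = cor(x, y))
```

Because the residuals are exactly orthogonal to x, the sample mean, SD and correlation of y match the targets up to floating-point error.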

You can also specify correlations to more than one vector by setting the first argument to a data frame containing only the continuous columns and r to the correlation with each column.

dat$D <- rnorm_pre(dat, r = c(.1, .2, .3), empirical = TRUE)
get_params(dat) %>% knitr::kable(digits = 3)

Not all correlation patterns are possible, so you'll get an error message if the correlations you ask for are impossible.

dat$E <- rnorm_pre(dat, r = .9)
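A set of correlations is only possible if the full correlation matrix, existing variables plus the new one, is positive definite. The sketch below checks this with eigen() for a simplified, hypothetical case: two uncorrelated existing variables and a new variable that should correlate equally with both. The helpers is_valid_cor() and target() are illustrative names, and faux may apply a different check or tolerance.

```r
# TRUE if all eigenvalues are positive (matrix is positive definite)
is_valid_cor <- function(m) {
  ev <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  all(ev > 0)
}

# Two uncorrelated variables plus a new one correlated r_new with each
target <- function(r_new) {
  matrix(c(1,     0,     r_new,
           0,     1,     r_new,
           r_new, r_new, 1),
         nrow = 3, byrow = TRUE)
}

is_valid_cor(target(.5))  # TRUE: this pattern is possible
is_valid_cor(target(.9))  # FALSE: no variable can correlate .9 with both
```

In this case the smallest eigenvalue is 1 - r_new * sqrt(2), so the pattern stops being possible once r_new exceeds about .707.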

faux documentation built on Sept. 14, 2021, 1:08 a.m.