knitr::opts_chunk$set(echo = TRUE, fig.align = "center")
knitr::opts_chunk$set(fig.width = 6, fig.height = 4)
knitr::opts_chunk$set(comment = "#>")
options(width = 100)

In this vignette, we show how to use the three main functions of the datasim package: sim_model, model_frame and model_response. This is an introductory tutorial, so only the simulation of linear Gaussian models is presented.

Required packages

library(datasim)

Linear Model

First, we need to define a list of formulas specifying the types of effects that are included in the linear predictor of each parameter. For example, this list can be defined as follows.

f <- list(
  mean ~ I(5) + I(0.5 * x1) + fa(sex, beta = c(0, 1)),
  sd ~ I(1)
  )

In this list of formulas, an intercept, a linear effect of x1 and a factor effect of sex are included in the mean parameter, while the standard deviation sd is constant. The dataset can be simulated with the function sim_model, which carries out the simulation in two parts: model_frame simulates the predictors, and model_response then simulates the response variable.

The names of these two functions, model_frame and model_response, were chosen by analogy with the functions model.frame and model.response, which return the predictors and the response variable for a given formula and data.frame.
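For readers unfamiliar with the base R functions behind this naming, the following minimal sketch shows what model.frame and model.response do. The data.frame d and its columns are made up for illustration and are not part of datasim.

```r
# Toy data: the column `unused` does not appear in the formula.
d <- data.frame(
  y = c(1.2, 2.3, 3.1),
  x = c(0, 1, 2),
  unused = c("a", "b", "c")
)

# model.frame keeps only the variables that the formula refers to.
mf <- model.frame(y ~ x, data = d)
names(mf)

# model.response extracts the response (left-hand side) from the model frame.
model.response(mf)
```

In datasim the direction is reversed: instead of extracting variables from existing data, model_frame and model_response generate them.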

Simulate the dataset

The data for our model can be simulated with the function sim_model. When working with linear Gaussian models, its two main arguments are the formula and the sample size n. To obtain a reproducible dataset, a seed must be set, either with the function set.seed or through the seed argument of sim_model.

data_model <- sim_model(formula = f, n = 100, seed = 1)

The first 10 rows of the generated dataset look as follows:

knitr::kable(head(data_model, 10))

It contains a unique id for each individual, all the predictors included in the formula (i.e. x1 and sex), the parameters (mean and sd), and the simulated response variable.
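As a sketch in our own notation (the symbols are not taken from the package), the columns mean, sd and response relate to the predictors as follows for the formula list f. Here i indexes individuals, x_{1i} is the value of x1, and beta = (0, 1) holds the effects for the first and second level of sex.

```latex
\begin{aligned}
\mu_i    &= 5 + 0.5\, x_{1i} + \beta_{\mathrm{sex}_i},
           \qquad \beta = (0,\ 1), \\
\sigma_i &= 1, \\
y_i      &\sim \mathcal{N}\!\left(\mu_i,\ \sigma_i^{2}\right).
\end{aligned}
```

The column mean stores \mu_i, the column sd stores \sigma_i, and the response column stores a draw y_i.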

The effects can be customized. For instance, labels for the levels of a factor can be supplied through the levels option of the function fa.

f <- list(
  mean ~ I(5) + I(0.5 * x1) + fa(sex, beta = c(0, 1), levels = c("male", "female")),
  sd ~ I(1)
  )
data_model <- sim_model(formula = f, n = 100, seed = 1)

The modified formula generates the same dataset, but with labeled levels for the factor sex.

knitr::kable(head(data_model, 10))
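To make the link between beta and levels concrete, here is a small base R sketch. It assumes that fa pairs the elements of beta with the levels in the order given, which is consistent with the female effect of 1 recovered at the end of this vignette; it is an illustration, not the package's internal code.

```r
# A labeled factor with the same level order as in the formula above.
sex <- factor(c("female", "male", "female"), levels = c("male", "female"))

# Assumed pairing: beta[1] belongs to "male", beta[2] to "female".
beta <- c(0, 1)

# Indexing beta by the integer codes of the factor gives each
# individual's contribution to the mean.
effect <- beta[as.integer(sex)]
data.frame(sex, effect)
```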

Simulate predictors only

sim_model simulates the entire dataset. If we want to simulate only the predictors, to have more control over the process, the function model_frame can be used. Its two main arguments are the formula and the sample size n.

data_frame <- model_frame(formula = f, n = 100, seed = 1)

The first 10 rows of the generated dataset look as follows. As expected, only the covariates are simulated.

knitr::kable(head(data_frame, 10))

Simulate the response variable only

If the covariates are already available, we can simulate the response variable using the function model_response.

data_frame <- model_response(data_frame, formula = f)

The first 10 rows of the resulting dataset look as follows. As expected, only the response variable and its associated parameters are simulated and added to the existing covariates.

knitr::kable(head(data_frame, 10))

Simulate dataset step by step

Notice that the same results can be obtained using either sim_model alone or model_frame and model_response together.

library(magrittr)  # provides the %>% pipe, in case datasim does not re-export it
data_model <- model_frame(f, n = 100, seed = 1) %>%
  model_response()
knitr::kable(head(data_model, 10))
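The two-stage structure can also be mimicked by hand in base R, which may help to see what each stage contributes. This is only a sketch under stated assumptions: the helper names frame_by_hand and response_by_hand are hypothetical, and the covariate distributions (standard normal x1, equally likely levels of sex) are our own choice, since the vignette does not say how datasim draws them. The output will therefore not match the datasim output.

```r
# Stage 1: simulate predictors only (a stand-in for model_frame).
# Assumption: x1 ~ N(0, 1) and sex uniform over its two levels.
frame_by_hand <- function(n) {
  data.frame(
    id = seq_len(n),
    x1 = rnorm(n),
    sex = factor(sample(c("male", "female"), n, replace = TRUE),
                 levels = c("male", "female"))
  )
}

# Stage 2: add the parameters and the response (a stand-in for model_response).
response_by_hand <- function(df) {
  beta_sex <- c(0, 1)  # effects for "male" and "female", in level order
  df$mean <- 5 + 0.5 * df$x1 + beta_sex[as.integer(df$sex)]
  df$sd <- 1
  df$response <- rnorm(nrow(df), mean = df$mean, sd = df$sd)
  df
}

set.seed(1)
by_hand <- response_by_hand(frame_by_hand(100))
head(by_hand, 3)
```

Keeping the stages separate makes it possible to reuse one set of predictors with several response models.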

Fitting a model to compare the parameters

Once the data are simulated, we can use them to compare models, effects, etc. For example, we can fit a linear model to our simulated data with lm.

lm_data <- lm(response ~ x1 + sex, data = data_frame)
lm_sum <- summary(lm_data)
knitr::kable(lm_sum$coefficients)

The estimated effects are as expected: close to 0.5 for the effect of x1, close to 1 for the effect of females, and an intercept close to 5.
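One way to make "close to" more precise is to place the estimates next to the true values together with confidence intervals. The sketch below does this on data simulated by hand in base R, so that it runs without datasim. The covariate distributions and the larger sample size n = 1000 are our own assumptions, not the package's defaults.

```r
set.seed(1)
n <- 1000

# Assumption: x1 ~ N(0, 1) and sex uniform over its two levels.
x1 <- rnorm(n)
sex <- factor(sample(c("male", "female"), n, replace = TRUE),
              levels = c("male", "female"))

# Same linear Gaussian model as in the vignette:
# mean = 5 + 0.5 * x1 + 1 for females, sd = 1.
response <- rnorm(n, mean = 5 + 0.5 * x1 + (sex == "female"), sd = 1)

fit <- lm(response ~ x1 + sex)

# True values, named like the lm coefficients ("male" is the reference level).
truth <- c("(Intercept)" = 5, x1 = 0.5, sexfemale = 1)

# Truth, estimate and 95% confidence limits side by side.
cbind(truth, estimate = coef(fit), confint(fit, level = 0.95))
```

The same comparison can be applied to lm_data from above by replacing fit with lm_data.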



ErickChacon/datasim documentation built on March 25, 2020, 7:53 p.m.