library(knitr) library(dplyr) library(simglm) knit_print.data.frame = function(x, ...) { res = paste(c('', '', kable(x, output = FALSE)), collapse = '\n') asis_output(res) }
The tidy simulation framework uses simulation arguments as the basis for specifying the models to be simulated. The goal of this vignette is to document more thoroughly the various arguments that are possible within each function. It is recommended that new users start with the "Tidy Simulation with simglm
" vignette prior to working through this vignette.
Arguments associated with the fixed portion of the model are needed for each fixed variable that needs to be simulated. The fixed variables specified come from the formula argument. Interactions or the intercept are not included in the fixed arguments as these are generated automatically. Let's start with an example.
library(simglm) set.seed(321) # To-DO: Add knot variable and debug sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), sample_size = list(level1 = 10, level2 = 20) ) fixed_data <- simulate_fixed(data = NULL, sim_arguments) head(fixed_data, n = 20)
The following example shows the five types of variables that can be generated. These types are specified with the var_type
argument and can be one of the following five types:
rnorm
distribution function for generation.sample
function.Each variable type will be explored in more detail.
Finally, another common argument for all fixed variable types is the argument var_level
. This defaults to var_level = 1
which would be a variable defined at the level 1 of the model (i.e. unique value for each row in the data). These can be changed to reflect the data level that is desired. For example, var_level = 2
would repeat values for each level 2 cluster found in the data and var_level = 3
would do the same for each level 3 cluster. Therefore for variables that are at level 2 or level 3, there will be fewer unique values in the data compared to level 1 variables.
For time variables used in longitudinal or repeated measures designs, the default metric is 0 to level1 sample size minus 1. This can be seen above in the output. To change the time variable metric the time_levels
argument can be used. A vector of values to specify for the time variable can be given directly. For example, the following could be passed to alter the metric of the time variable: time_levels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 4, 5, 6)
where now the time variable would increment in 0.5 units for the first 7 measurements and 1 unit increments for the last three measurements. This could represent a case when the measurements are collected every 6 months for the first 7 measurements and yearly after that.
Below is the output including the manual time variable specification. The only requirement is that the length of the time_levels
argument must match the level1 sample size.
set.seed(321) # To-DO: Add knot variable and debug sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time', time_levels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 4, 5, 6)), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), sample_size = list(level1 = 10, level2 = 20) ) fixed_data <- simulate_fixed(data = NULL, sim_arguments) head(fixed_data, n = 20)
Continuous variables are generating using distribution functions (i.e. rnorm
). Any distribution function found within R, or user written distribution functions can be used, however the default used in rnorm
. To change the distribution function used, the argument dist
can be specified. For example, if the Gamma distribution is desired for the weight variable, the following code would achieve this:
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time', time_levels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 4, 5, 6)), weight = list(var_type = 'continuous', dist = 'rgamma', shape = 3), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), sample_size = list(level1 = 10, level2 = 20) ) fixed_data <- simulate_fixed(data = NULL, sim_arguments) head(fixed_data, n = 20)
This would be the resulting distribution of the weight variable generated.
library(ggplot2) ggplot(fixed_data, aes(x = weight)) + geom_density() + theme_bw()
Ordinal variable specification uses the sample
function within R to generate ordinal variables that are whole integers. The required argument for these types of variables is levels
. The levels
argument takes a range or vector of integer values to be passed to the sample
function within R. For example, these three specifications for the levels
argument are valid: 3:60
, seq(4, 60, 2)
, c(3, 10, 18, 24, 54, 60)
.
An additional optional argument is replace
. The replace
argument specified whether the sampling is done with or without replacement. The default behavior is to do sampling with replacement. If sampling without replacement is desired set replace = FALSE
. See sample
for more details on this argument.
Finally, the probability of selecting each value specified to the levels
argument is also able to be specified through the prob
argument. If prob
is specified, it takes a vector of probabilities that must be the same length as the levels argument. The default behavior is for each value specified with levels
to be equally likely to be sampled.
Factor variables are generated similarly to ordinal variables with the sample
function, however factor variables allow the generation of text or categorical variables in addition to numeric grouping variables. Therefore the only needed argument for factor variables is a vector of numeric or text strings representing the different groups to be generated. For example, both of these specifications are equivalent: c(1, 2, 3, 4)
and c('Freshman', 'Sophomore', 'Junior', 'Senior')
. Both of these specifications would generate data for these four groups, the difference is that the text labels will be used when text strings are specified.
An additional optional argument is replace
. The replace
argument specified whether the sampling is done with or without replacement. The default behavior is to do sampling with replacement. If sampling without replacement is desired set replace = FALSE
. See sample
for more details on this argument.
Finally, the probability of selecting each value specified to the levels
argument is also able to be specified through the prob
argument. If prob
is specified, it takes a vector of probabilities that must be the same length as the levels argument. The default behavior is for each value specified with levels
to be equally likely to be sampled.
Knot variables are defined as those that are a prominent point based on a specific spot of another variable. A common example of these types of variables would occur in interrupted time series data, where there is a place in time where the treatment is given after a series of baselines. The place where the treatment is given is the knot location.
In simglm, knots are specified within the formula, it is made particularly explicit below with the name: age_knot
. Adding _knot
at the end of a knot variable is not needed, but is used here to be more illustrative. When specifying knot attributes, the specific behavior of these attributes are controlled through the knot
named element of the simulation arguments. See the example below with the simulation arguments for the variable, age_knot
.
sim_args <- list( formula = y ~ 1 + age + age_knot, fixed = list(age = list(var_type = 'ordinal', levels = 30:60)), knot = list(age_knot = list(variable = 'age', knot_locations = 50)), sample_size = 500, error = list(variance = 10), reg_weights = c(2, .5, 1.5) )
The required elements to simulate a knot variable is the variable to base the knot location on. This is specified with, variable = 'age'
above. The second required element is the argument, knot_locations
which specifies the location where the knot is placed. The knot variable itself is a variable that takes on the number of knot locations plus one, therefore in this example, the age_knot
variable takes on two values. This would be similar to an indicator or dummy variable used in a linear regression model.
Similar to above, the simulate_fixed()
function does the simulation of this fixed effect term.
simulate_fixed(data = NULL, sim_args = sim_args) %>% head()
As specified, this term would represent a change in the intercept for those that are larger than the knot location, above this was specified as 50 years. More complicated structures can be obtained. One example shown here is that an interaction between the knot variable and the age variable can be made. This term would then represent a change in slope for those that are older than 50. The only change needed is to add this term to the specification within the formula and also add another term to the reg_weights
argument to specify on average the magnitude of the slope change.
sim_args <- list( formula = y ~ 1 + age + age_knot + age:age_knot, fixed = list(age = list(var_type = 'ordinal', levels = 30:60)), knot = list(age_knot = list(variable = 'age', knot_locations = 50)), sample_size = 500, error = list(variance = 1000), reg_weights = c(2, .5, 1.5, 10) ) simulate_fixed(data = NULL, sim_args = sim_args) %>% head()
By default, the random error is generated as random normal with a mean of 0 and standard deviation of 1. If this is the desired behavior, no additional simulation arguments need to be specified in the simulation arguments. For example, the code below generates random error using the fixed arguments already shown above.
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), sample_size = list(level1 = 10, level2 = 20) ) error_data <- simulate_error(data = NULL, sim_arguments) head(error_data, n = 20)
The two main optional arguments for the random error component include a dist
and variance
that represent the error generating function and variance of the random error respectively. For example, if a t-distribution is desired for simulation of random error, dist = 'rt'
can be used to specify the t-distribution as the generating distribution. When distributions other than random normal are specified, it is common that additional arguments will need to be specified for the generating distribution function (i.e. rt
). In the rt
example, the df
argument is needed to specify the degrees of freedom for the t-distribution. Below is an example using a t-distribution.
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), error = list(dist = 'rt', df = 4), sample_size = list(level1 = 10, level2 = 20) ) error_data <- simulate_error(data = NULL, sim_arguments) head(error_data, n = 20)
Finally, the variance of the random error can also be specified to be something other than the default 1. For example, if the variance is desired to be 10, then the argument variance = 10
will set this variance. This variance works with any distribution function. For example, the following will generate data as a t-distribution with a variance of 10.
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), error = list(dist = 'rt', df = 4, variance = 10), sample_size = list(level1 = 10, level2 = 20) ) error_data <- simulate_error(data = NULL, sim_arguments) var(error_data$error)
Note that the variance does not actually equal 10 here. This is due to the generating distribution specified (i.e. t-distribution with 4 degrees of freedom) has a theoretical variance of 2. In cases when the random error has a variance other than 1, the variable needs to be standardized prior to converting to the desired variance value. The standardization can be done in two ways. One, an empirical mean and standard deviation of the generating distribution can be obtained by setting the argument ther_sim = TRUE
. For example:
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), error = list(dist = 'rt', df = 4, variance = 10, ther_sim = TRUE), sample_size = list(level1 = 10, level2 = 20) ) error_data <- simulate_error(data = NULL, sim_arguments) var(error_data$error)
This simulation takes longer to run as the generated empirical mean and standard deviation for the standardization draws a large number of random values from the specified distribution. To speed up this process, the theoretical values can be specified directly within the ther_val
argument as a vector with the mean specified first followed by the standard deviation. This is shown below:
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), error = list(dist = 'rt', df = 4, variance = 10, ther_val = c(0, sqrt(2))), sample_size = list(level1 = 10, level2 = 20) ) error_data <- simulate_error(data = NULL, sim_arguments) var(error_data$error)
It is also possible to generate heterogeneity of variance for the random error.
simulation_arguments <- list( formula = y ~ 1 + group, fixed = list(group = list(var_type = 'factor', levels = c('male', 'female'))), sample_size = 500, error = list(variance = 1), heterogeneity = list(variable = 'group', variance = c(1, 8)), reg_weights = c(0, .15) ) hetero_data <- simulate_fixed(data = NULL, simulation_arguments) %>% simulate_error(simulation_arguments) %>% simulate_heterogeneity(simulation_arguments)
hetero_data %>% group_by(group) %>% summarise(var_error = var(error), var_o_error = var(orig_error))
The random effect arguments are passed through a named element of the list to the simulation arguments called 'randomeffect'. Within this element, each random effect specified in the formula must be specified as a named list. The named list allows the user to specify the names that will be used in the data simulation for the random effects. The default behavior of these random effects is to be simulated from a random normal distribution with mean 0 and variance of 1.
Many of the same arguments that have been discussed with the error simulation are possible for the random effect simulation arguments. These include: dist
, variance
, ther_sim
, ther_val
and other arguments that are needed to pass on the data generating function. In addition, the argument var_level
first introduced with the fixed effects are present here indicating which level of the data structure the random effect should be generated at. These arguments will not be discussed in more detail here, but readers can see examples of these arguments within the fixed effects and random error sections above.
Below is an example that simulated two random effects in a two-level nested model and specifies the variance of 8 and 3 for the two random effects respectively. Finally, you will notice in the final output, the names of the random effect columns are specified by the names included in the list associated with the "randomeffect" portion of the simulation arguments.
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), randomeffect = list(int_id = list(variance = 8, var_level = 2), time_id = list(variance = 3, var_level = 2)), sample_size = list(level1 = 10, level2 = 20) ) random_data <- simulate_randomeffect(data = NULL, sim_arguments) head(random_data, n = 20)
If cross-classified random effects are desired, these can be specified directly within the formula syntax as you would with lme4. For example, y ~ 1 + time + weight + age + treat + (1 + time| id) + (1 | neighborhood_id)
. When documenting the simulation arguments for the additional cross-classified random effect (i.e. (1 | neighborhood_id)
), specifying cross_class = TRUE
with that random effect to identify that this is indeed a cross-classified random effect. Secondly, you can also specify directly the number of clusters that are associated with this cross-classified factor, this can be different than those specified in the sample_size
argument. This can be done with the num_ids
argument. Below is an example with a single cross-classified random effect representing neighborhoods individuals belong in
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id) + (1 | neighborhood_id), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), randomeffect = list(int_id = list(variance = 8, var_level = 2), time_id = list(variance = 3, var_level = 2), int_nid = list(variance = 5, var_level = 2, cross_class = TRUE, num_ids = 12)), sample_size = list(level1 = 10, level2 = 20) ) random_data <- simulate_randomeffect(data = NULL, sim_arguments) head(random_data, n = 20)
Data attributes are often correlated, therefore, in order to really mimic real-world data, simulated data that are correlated would be desired. This can be done with simglm
by using the function correlated_variables()
. This function processes a new set of simulation arguments specified with the named correlate
. In what follows, discussion of correlating fixed effects will be explored first, followed by correlating random effects next.
sim_args <- list(formula = y ~ 1 + act + gpa + commute_time, fixed = list(act = list(var_type = 'ordinal', levels = 15:36), gpa = list(var_type = 'continuous', mean = 2, sd = .5), commute_time = list(var_type = 'continuous', mean = 15, sd = 6)), correlate = list(fixed = data.frame(x = c('act', 'act', 'gpa'), y = c('gpa', 'commute_time', 'commute_time'), corr = c(0.5, .6, .2))), sample_size = 1000) correlate_attribute <- simulate_fixed(data = NULL, sim_args) %>% correlate_variables(sim_args) head(correlate_attribute)
select(correlate_attribute, -X.Intercept., -level1_id) %>% cor(.)
sim_args <- list(formula = y ~ 1 + act + gpa + sat + (1 + act | id), fixed = list(act = list(var_type = 'continuous', mean = 20, sd = 4), gpa = list(var_type = 'continuous', mean = 2, sd = .5), sat = list(var_type = 'continuous', mean = 500, sd = 100)), randomeffect = list(int_id = list(variance = 8, var_level = 2), act_id = list(variance = 3, var_level = 2)), sample_size = list(level1 = 100, level2 = 5000), correlate = list(random = data.frame(x = 'int_id', y = 'act_id', corr = .3)) ) random_correlate <- simulate_randomeffect(data = NULL, sim_args) %>% correlate_variables(sim_args) head(random_correlate)
select(random_correlate, -level1_id, -id) %>% cor(.)
Missing data is the standard rather than the exception in real world data. Therefore, generating data that include missing is important for generating data that are representative of real world data. In the power analysis context, most power analyses do not include missing data in the power computations (particularly if these are closed form solutions), but power can be negatively affected by missing data.
Fortunatly, the simglm package contains many useful frameworks for generating missing data. To generate missing data, the generate_missing
function can be used for this. This function is called after the complete data are generated. This means that the complete data and missing data will be generated. Currently three types of missing data are supported, dropout, random, and mar (missing at random). The specification of missing data arguments is done using a named element to the simulation argument list called "missing_data".
First, lets explore random missing data. This structure is equivalent to the missing completely at random framework if you are familiar with Rubin's missing data classifications. To generate random missing data the type = 'random'
is specified. Two additional arguments are needed to generate the missing data, first the missing proportion needs to be specified with miss_prop
and the name of the new response variable with missing data needs to be specified with new_outcome
. Below is an example generating data with random missing data.
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), reg_weights = c(4, 0.5, 0.75, 0, 0.33), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), randomeffect = list(int_id = list(variance = 8, var_level = 2), time_id = list(variance = 3, var_level = 2)), missing_data = list(miss_prop = .25, new_outcome = 'y_missing', type = 'random'), sample_size = list(level1 = 10, level2 = 20) ) data_w_missing <- sim_arguments %>% simulate_fixed(data = NULL, .) %>% simulate_randomeffect(sim_arguments) %>% simulate_error(sim_arguments) %>% generate_response(sim_arguments) %>% generate_missing(sim_arguments) head(data_w_missing, n = 10)
We can look at the amount of missing data:
prop.table(table(is.na(data_w_missing$y_missing)))
Data that are missing at random (MAR) can be generated as well.
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), reg_weights = c(4, 0.5, 0.75, 0, 0.33), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30, var_level = 1), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), randomeffect = list(int_id = list(variance = 8, var_level = 2), time_id = list(variance = 3, var_level = 2)), missing_data = list(new_outcome = 'y_missing', miss_cov = 'weight', mar_prop = seq(from = 0, to = .9, length.out = 200), type = 'mar'), sample_size = list(level1 = 10, level2 = 20) ) data_w_missing <- sim_arguments %>% simulate_fixed(data = NULL, .) %>% simulate_randomeffect(sim_arguments) %>% simulate_error(sim_arguments) %>% generate_response(sim_arguments) %>% generate_missing(sim_arguments) head(data_w_missing, n = 10)
We can look at the amount of missing data:
prop.table(table(is.na(data_w_missing$y_missing)))
Dropout missing data is possible when simulating data from a longitudinal design. When requesting dropout missing data, observations are removed after a specific point in the time cycle. For example, perhaps an individual stops being a part of the study after the third measurement. To request this type of missing data, the type = 'dropout'
can be specified. In addition, one new argument is needed, clust_var
. This argument represents the id associated with the clusters with longitudinal data. Below is an example:
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), reg_weights = c(4, 0.5, 0.75, 0, 0.33), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), randomeffect = list(int_id = list(variance = 8, var_level = 2), time_id = list(variance = 3, var_level = 2)), missing_data = list(miss_prop = .25, new_outcome = 'y_missing', clust_var = 'id', type = 'dropout'), sample_size = list(level1 = 10, level2 = 20) ) data_w_missing <- sim_arguments %>% simulate_fixed(data = NULL, .) %>% simulate_randomeffect(sim_arguments) %>% simulate_error(sim_arguments) %>% generate_response(sim_arguments) %>% generate_missing(sim_arguments) head(data_w_missing, n = 10)
We can look at the amount of missing data:
prop.table(table(is.na(data_w_missing$y_missing)))
We can also look now at the amount of missing by the time variable.
prop.table(table(is.na(data_w_missing$y_missing), data_w_missing$time))
If additional control is desired, specifying the location of the dropout is possible. In this situation, a vector is passed to dropout_location
that specifies for each individual (more generally each level 2 unit) where the dropout occurs. Below is an example of this.
set.seed(321) sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), reg_weights = c(4, 0.5, 0.75, 0, 0.33), fixed = list(time = list(var_type = 'time'), weight = list(var_type = 'continuous', mean = 180, sd = 30), age = list(var_type = 'ordinal', levels = 30:60, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), randomeffect = list(int_id = list(variance = 8, var_level = 2), time_id = list(variance = 3, var_level = 2)), missing_data = list(new_outcome = 'y_missing', dropout_location = c(3, 9, 1, 6, 7, 8, 6, 9, 2, 4, 6, 5, 8, 9, 4, 5, 6, 7, 2, 9), clust_var = 'id', type = 'dropout'), sample_size = list(level1 = 10, level2 = 20) ) data_w_missing <- sim_arguments %>% simulate_fixed(data = NULL, .) %>% simulate_randomeffect(sim_arguments) %>% simulate_error(sim_arguments) %>% generate_response(sim_arguments) %>% generate_missing(sim_arguments) head(data_w_missing, n = 10)
We can look at the amount of missing data:
prop.table(table(is.na(data_w_missing$y_missing)))
We can also look now at the amount of missing by the time variable.
prop.table(table(is.na(data_w_missing$y_missing), data_w_missing$time))
The default behavior when fitting models to the generated data is to fit a model with the same formula as the generated data (i.e. formula
simulation arguments). Secondly, the default model functions are lm
/glm
for single level simulation and lmer
/glmer
for multilevel simulation. Finally, the default regression weights specified in reg_weights
are to compute statistics for power, type I error rates, and precision.
These defaults can be overriden by specifying simulation arguments through the named list called model_fit
. Some examples were given in the Tidy Simulation vignette and the options that are able to be specified will depend on the model fitting function specified. A few examples are given for various model types, however, the user is directed to the documentation of the modeling functions specified through the model_function
argument.
For GLM models, users may wish to directly specify the family argument or change the link function within a specific family. For example, when generating data with a binary dependent variable, the logistic or probit links could be used when specifying the binomial family. Below is an example of using the binomial family with a logit link (the default link function).
set.seed(321) sim_arguments <- list( formula = y ~ 1 + weight + age + sex, fixed = list(weight = list(var_type = 'continuous', mean = 0, sd = 30), age = list(var_type = 'ordinal', levels = 0:30), sex = list(var_type = 'factor', levels = c('male', 'female'))), error = list(variance = 25), sample_size = 10, reg_weights = c(2, 0.3, -0.1, 0.5), outcome_type = 'binary', model_fit = list( model_function = 'glm', family = binomial ) ) simulate_fixed(data = NULL, sim_arguments) %>% simulate_error(sim_arguments) %>% generate_response(sim_arguments) %>% model_fit(sim_arguments) %>% .$family
The link function could then be changed to a probit link function with the following code:
set.seed(321) sim_arguments <- list( formula = y ~ 1 + weight + age + sex, fixed = list(weight = list(var_type = 'continuous', mean = 0, sd = 30), age = list(var_type = 'ordinal', levels = 0:30), sex = list(var_type = 'factor', levels = c('male', 'female'))), error = list(variance = 25), sample_size = 10, reg_weights = c(2, 0.3, -0.1, 0.5), outcome_type = 'binary', model_fit = list( model_function = 'glm', family = binomial(link = 'probit') ) ) simulate_fixed(data = NULL, sim_arguments) %>% simulate_error(sim_arguments) %>% generate_response(sim_arguments) %>% model_fit(sim_arguments) %>% .$family
Serial correlation can be added to longitudinal designs and fitted using the nlme R package. Below is an example of implementing this functionality in the data generation and model fitting.
To come...
It is also possible to change the default model fitting functions. For example, an alternative model approach to mixed models is generalized estimating equations (GEE) or marginal models. One implementation of these models in R is the geepack package. Below is the code to fit a marginal model in a longitudinal context. The code is not evaluated, so no coefficients are shown, but if you have the geepack
package installed, this code should run as written.
set.seed(321) # To-DO: Add knot variable and debug sim_arguments <- list( formula = y ~ 1 + time + weight + age + treat + (1 + time| id), fixed = list(time = list(var_type = 'time', time_levels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 4, 5, 6)), weight = list(var_type = 'continuous', mean = 0, sd = 30), age = list(var_type = 'ordinal', levels = 0:30, var_level = 2), treat = list(var_type = 'factor', levels = c('Treatment', 'Control'), var_level = 2)), reg_weights = c(0.4, 0.2, -0.5, 1, -0.6), randomeffect = list(int_id = list(variance = 8, var_level = 2), time_id = list(variance = 3, var_level = 2)), error = list(variance = 5), outcome_type = 'binary', sample_size = list(level1 = 10, level2 = 20), model_fit = list( model_function = geepack::geeglm, formula = y ~ 1 + time + weight + age + treat, id = 'level1_id', family = binomial, corstr = 'ar1' ) ) simulate_fixed(data = NULL, sim_arguments) %>% simulate_error(sim_arguments) %>% simulate_randomeffect(sim_arguments) %>% generate_response(sim_arguments) %>% model_fit(sim_arguments) %>% extract_coefficients()
The replicate_simulation
function first shown in the "tidy_simulation" vignette, can directly allow for the varying of simulation arguments. In order to vary simulation arguments, an additional named list called vary_arguments
can be added to the simulation arguments passed to the functions. The vary_arguments
elements of the simulation arguments is a nested named list. Within the named list, the arguments that are wished to be varied are specified by name. For example, a common argument varied is sample size. The following example varies the sample size to return multiple simulated data sets with different sample sizes.
Note: A user would likely want to change plan(sequential)
to something like plan(mutisession)
or plan(multicore)
to run these in parallel. plan(sequential)
is used here for vignette processing.
library(future) plan(sequential) sim_arguments <- list( formula = resp_var ~ 1 + time + factor(trt) + time:factor(trt) + (1 + time | individual), reg_weights = c(4, 0.5, 0.75, 0), fixed = list(time = list(var_type = 'time'), trt = list(var_type = 'factor', levels = c('Drug', 'Placebo'), var_level = 2) ), randomeffect = list(int_clust = list(variance = .2, var_level = 2), time_clust = list(variance = .3, var_level = 2)), replications = 3, error = list(variance = 1), vary_arguments = list( sample_size = list(list(level1 = 5, level2 = 50), list(level1 = 5, level2 = 60)) ) ) replicate_simulation(sim_arguments)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.