questionnaire_gen: Generation of ordinal and continuous variables
In tmatta/lsasim: Functions to Facilitate the Simulation of Large Scale Assessment Data

questionnaire_gen

R Documentation

Description

Creates a data frame of discrete and continuous variables based on several arguments.

Usage

questionnaire_gen(
  n_obs,
  cat_prop = NULL,
  n_vars = NULL,
  n_X = NULL,
  n_W = NULL,
  cor_matrix = NULL,
  cov_matrix = NULL,
  c_mean = NULL,
  c_sd = NULL,
  theta = FALSE,
  family = NULL,
  full_output = FALSE,
  verbose = TRUE
)

Arguments

`n_obs`	number of observations to generate.
`cat_prop`	list of cumulative proportions for each item. If `theta = TRUE`, the first element of `cat_prop` must be a scalar 1, which corresponds to `theta`.
`n_vars`	total number of variables in the questionnaire, including the continuous and the discrete covariates (`X` and `W`, respectively), as well as the latent trait (`Y`, which is equivalent to θ).
`n_X`	number of continuous background variables. If not provided, a random number of continuous variables will be generated.
`n_W`	either a scalar corresponding to the number of categorical background variables or a list of scalars representing the number of categories for each categorical variable. If not provided, a random number of categorical variables will be generated.
`cor_matrix`	latent correlation matrix. The first row/column corresponds to the latent trait (`Y`). The other rows/columns correspond to the continuous (`X` or `Z`) or the discrete (`W`) background variables, in the same order as `cat_prop`.
`cov_matrix`	latent covariance matrix, formatted as `cor_matrix`.
`c_mean`	vector of population means for each continuous variable (`Y` and `X`). Defaults to 0.
`c_sd`	vector of population standard deviations for each continuous variable (`Y` and `X`). Defaults to 1.
`theta`	if `TRUE`, the first continuous variable will be labeled 'theta'. Otherwise, it will be labeled 'q1'.
`family`	distribution of the background variables. Can be NULL (default) or 'gaussian'.
`full_output`	if `TRUE`, output will be a list containing the questionnaire data as well as several objects that might be of interest for further analysis of the data.
`verbose`	if `FALSE`, output messages will be suppressed (useful for simulations). Defaults to `TRUE`.

Details

In essence, this function begins by checking the validity of the arguments provided and randomly generating those that are not. It then calls one of two internal functions, questionnaire_gen_polychoric or questionnaire_gen_family. The former reproduces the exact functionality of questionnaire_gen in lsasim 1.0.1, where polychoric correlations are used to generate the background questionnaire data. If family is not NULL, however, questionnaire_gen_family is called to generate data based on a joint probability distribution. Additionally, if full_output == TRUE, the external function beta_gen is called to generate the regression coefficients based on the true covariance matrix. Setting full_output = TRUE also changes the class of the output of this function.

What follows are some notes on the input parameters.

cat_prop is a list where length(cat_prop) is the number of items to be generated. Each element of the list is a vector containing the marginal cumulative proportions for each category, so its values are non-decreasing and the last one is 1. For continuous items, the associated element in the list should be 1.
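As a small illustration of the cumulative format, the sketch below builds one `cat_prop` element from plain category probabilities and converts it back. It uses base R only; the probabilities and the helper `is_valid_cum_prop` are hypothetical and not part of lsasim.

```r
# Hypothetical category probabilities for one 3-category item
p <- c(0.25, 0.35, 0.40)

# cat_prop expects cumulative proportions, so the last value must be 1
cum_p <- cumsum(p)

# One continuous item (scalar 1) followed by the 3-category item
cat_prop <- list(1, cum_p)

# Recover the per-category probabilities from the cumulative ones
p_back <- diff(c(0, cum_p))

# Hypothetical helper: non-decreasing values that end at 1
is_valid_cum_prop <- function(cp) {
  !is.unsorted(cp) && isTRUE(all.equal(cp[length(cp)], 1))
}
all_valid <- all(vapply(cat_prop, is_valid_cum_prop, logical(1)))
```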

cor_matrix and cov_matrix are the latent correlation and covariance matrices, respectively. Each is a square matrix with length(cat_prop) rows and columns. The correlations describe the relationships between variables on the latent scale.
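The two matrices are linked by the usual identity: covariance = D R D, where R is the correlation matrix and D is the diagonal matrix of standard deviations. The following base R sketch shows the conversion in both directions; the standard deviations are made-up values chosen for illustration.

```r
# Correlation matrix for two variables (same values as in the Examples)
cor_matrix <- matrix(c(1, .6, .6, 1), nrow = 2)

# Hypothetical standard deviations, one per variable
sds <- c(1.5, 1)

# Covariance = D %*% R %*% D, with D the diagonal matrix of SDs
cov_matrix <- diag(sds) %*% cor_matrix %*% diag(sds)

# stats::cov2cor() goes the other way
cor_back <- cov2cor(cov_matrix)
```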

c_mean and c_sd are each vectors whose length is equal to the number of continuous variables as specified by cat_prop. The default is to keep the continuous variables with mean zero and standard deviation of one.
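One way to check the length requirement before calling the function is to count the scalar entries of `cat_prop`. This is a base R sketch with hypothetical values; it does not reproduce the package's own validation.

```r
# Hypothetical input: two continuous items followed by two categorical ones
cat_prop <- list(1, 1, c(.25, .6, 1), c(.5, 1))

# Continuous items are the elements of cat_prop that are a scalar
is_continuous <- vapply(cat_prop, function(p) length(p) == 1, logical(1))
n_continuous <- sum(is_continuous)

# One mean and one standard deviation per continuous variable
c_mean <- c(10, 50)
c_sd <- c(2, 5)

lengths_match <- length(c_mean) == n_continuous &&
  length(c_sd) == n_continuous
```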

theta is a logical indicator that determines if the first continuous item should be labeled theta. If theta == TRUE but there are no continuous variables generated, a random number of background variables will be generated.

If cat_prop is a named list, those names will be used as variable names for the returned data.frame. Generic names will be provided to the variables if cat_prop is not named.

As an alternative to providing cat_prop, the user can call this function by specifying the total number of variables using n_vars or the specific number of continuous and categorical variables through n_X and n_W. All three arguments should be provided as scalars; n_W may also be provided as a list, where each element contains the number of categories for one background variable. Alternatively, n_W may be provided as a one-element list, in which case it will be interpreted as all the categorical variables having the same number of categories.
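A call along these lines could look as follows. It is a sketch that assumes the lsasim package is installed (the call is skipped otherwise) and uses only arguments listed in the Usage section; the numbers of variables and categories are arbitrary.

```r
bg <- NULL
if (requireNamespace("lsasim", quietly = TRUE)) {
  set.seed(1234)
  # Two continuous variables and two categorical ones
  # with 2 and 3 categories, respectively
  bg <- lsasim::questionnaire_gen(
    n_obs = 5,
    n_X = 2,
    n_W = list(2, 3),
    verbose = FALSE
  )
  str(bg)
}
```

Because `cat_prop` and the correlation matrix are not supplied here, they are generated randomly, so the output changes with the seed.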

If family == "gaussian", the questionnaire will be generated assuming that all the variables are jointly distributed as a multivariate normal. The default behavior is family == NULL, where the data are generated using the polychoric correlation matrix, with no distributional assumptions.

When data is generated using the Gaussian distribution, the matrices provided correspond to the relations between the latent variable θ, the continuous covariates X and the continuous variables Z ~ N(0, 1) that will later be discretized into the categorical covariates W. That is why the labels and dimensions of cov_matrix differ from those of vcov_YXW. For more information, check the references cited later in this document.
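The discretization idea can be illustrated in base R: draw correlated standard normal variables, then cut Z at the normal quantiles of the cumulative proportions. This is a sketch of the general technique under assumed values, not the package's internal code.

```r
set.seed(2024)
n_obs <- 10000

# Hypothetical latent covariance between theta and one latent variable Z
latent_cov <- matrix(c(1, .5, .5, 1), nrow = 2)

# Draw from N(0, latent_cov): chol() returns R with t(R) %*% R = latent_cov
draws <- matrix(rnorm(n_obs * 2), ncol = 2) %*% chol(latent_cov)
theta <- draws[, 1]
Z <- draws[, 2]

# Cumulative proportions of a 3-category W, as in cat_prop
cum_prop <- c(.2, .8, 1)

# Thresholds on the latent scale; qnorm(1) is Inf
thresholds <- c(-Inf, qnorm(cum_prop))

# Discretize Z into integer categories 1, 2, 3
W <- cut(Z, breaks = thresholds, labels = FALSE)

# Observed proportions should be close to 0.2, 0.6 and 0.2
observed_prop <- as.vector(prop.table(table(W)))
```

The correlation between `theta` and `Z` is set on the latent scale, so the observed association between `theta` and the discretized `W` will generally be weaker than 0.5.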

Value

By default, the function returns a data.frame object where the first column ("subject") is a 1, ..., n ordered list of the n observations and the other columns correspond to the questionnaire answers. If theta = TRUE, the first column after "subject" will be the latent variable θ; in any case, the continuous variables always come before the categorical ones.

If full_output = TRUE, the output will be a list containing the following objects:

`bg`	a data frame containing the background questionnaire answers (i.e., the same object as described above).
`c_mean`	identical to the input argument of the same name. Read the Details section for more information.
`c_sd`	identical to the input argument of the same name. Read the Details section for more information.
`cat_prop`	identical to the input argument of the same name. Read the Details section for more information.
`cat_prop_W_p`	a list containing the probabilities for each category of the categorical variables (`cat_prop_W` contains the cumulative probabilities).
`cor_matrix`	identical to the input argument of the same name. Read the Details section for more information.
`cov_matrix`	identical to the input argument of the same name. Read the Details section for more information.
`family`	identical to the input argument of the same name.
`n_obs`	identical to the input argument of the same name.
`n_tot`	named vector containing the total number of variables, the number of continuous background variables (i.e., all continuous variables except θ) and the number of categorical variables.
`n_W`	vector containing the number of categorical variables.
`n_X`	vector containing the number of continuous variables (except θ).
`sd_YXW`	vector with the standard deviations of all the variables.
`sd_YXZ`	vector containing the standard deviations of θ, the background continuous variables (`X`) and the Normally-distributed variables `Z` which will generate the background categorical variables (`W`).
`theta`	identical to the input argument of the same name.
`var_W`	list containing the variances of the categorical variables.
`var_YX`	list containing the variances of the continuous variables (including θ).
`linear_regression`	This list is printed only if `theta = TRUE`, `family = "gaussian"` and `full_output = TRUE`. It contains one vector named `betas` and one table named `cov_YXW`. The former displays the true linear regression coefficients of `theta` on the background questionnaire answers; the latter contains the covariance matrix between all these variables.
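For intuition on how true regression coefficients follow from a covariance matrix, the sketch below handles the simple case of continuous covariates only, using the standard result beta = Sigma_XX^{-1} Sigma_XY. The matrix, means and variable names are hypothetical, and the calculation in beta_gen (which also has to deal with categorical covariates) may differ.

```r
# Hypothetical covariance matrix of theta and two continuous covariates
vars <- c("theta", "x1", "x2")
cov_YX <- matrix(c( 1, .5, .5,
                   .5,  1, .8,
                   .5, .8,  1),
                 nrow = 3, dimnames = list(vars, vars))
means <- c(theta = 0, x1 = 0, x2 = 0)

# Partition into covariate block and covariate-outcome block
S_XX <- cov_YX[-1, -1]
S_XY <- cov_YX[-1, 1]

# Population slopes: solve S_XX %*% beta = S_XY
betas <- solve(S_XX, S_XY)

# Population intercept: E[theta] - sum(beta * E[X])
intercept <- means[1] - sum(betas * means[-1])
```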

Note

If family == NULL, the number of levels for each categorical variable will be determined by the number of categories observed in the generated data. This means it might be smaller than the number of categories determined by cat_prop, which is more likely to happen with small values of n_obs. If family == "gaussian", however, the number of levels for the categorical variables will always be equivalent to the number of possible categories, even if they are not observed in the data.
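The difference between the two behaviors mirrors how factors work in base R. In this small sketch, with made-up data, the third category is possible but never observed.

```r
# Hypothetical answers to a 3-category item; category 3 is not observed
w <- c(1, 1, 2, 1, 2)

# Levels inferred from the data: the unobserved category is lost
w_observed <- factor(w)

# Levels fixed in advance: the unobserved category is kept with count 0
w_all <- factor(w, levels = 1:3)

n_levels_observed <- nlevels(w_observed)
n_levels_all <- nlevels(w_all)
counts_all <- as.vector(table(w_all))
```

If downstream code needs all the possible categories when family == NULL, re-creating the factor with explicit levels, as done for `w_all`, is one option.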

It is important to note that all arguments directly related to variable parameters (e.g. cat_prop, cov_matrix, cor_matrix, c_mean, c_sd) have the following order: Y, X, W (missing variables are skipped). This must be kept in mind when using real-life data as input to questionnaire_gen, as the input might need to be reordered to fit the expectations of the function.
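Reordering real-life inputs can be done by indexing with the desired variable order. The sketch below uses hypothetical variable names and values: a covariance matrix and a `cat_prop` list that arrive in arbitrary order are rearranged into Y, X, W order.

```r
# Hypothetical inputs, not yet in Y, X, W order
vars <- c("gender", "theta", "income", "ses")
S <- matrix(.3, nrow = 4, ncol = 4, dimnames = list(vars, vars))
diag(S) <- 1
S["theta", "income"] <- S["income", "theta"] <- .6

cat_prop <- list(gender = c(.5, 1), theta = 1,
                 income = 1, ses = c(.2, .7, 1))

# Desired order: latent trait, continuous covariates, categorical covariates
ord <- c("theta", "income", "gender", "ses")

# Reorder rows and columns together so the matrix stays symmetric
S_ordered <- S[ord, ord]
cat_prop_ordered <- cat_prop[ord]
```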

By definition, the expected order of the variables is theta, followed by X and then W. The reference category of the categorical variables W is always the first one.

For very small means/sigmas (e.g. 0.005) and multiple levels, estimates may have differing levels of accuracy (e.g. school-level estimates will not be as accurate as student-level ones). In general, one should expect less accurate estimation at the higher levels of a hierarchical setup.

References

Matta, T. H., Rutkowski, L., Rutkowski, D., & Liaw, Y. L. (2018). lsasim: an R package for simulating large-scale assessment data. Large-scale Assessments in Education, 6(1), 15.

Examples

# Using polychoric correlations
props <- list(c(1), c(.25, .6, 1))  # one continuous, one with 3 categories
questionnaire_gen(n_obs = 10, cat_prop = props,
                  cor_matrix = matrix(c(1, .6, .6, 1), nrow = 2),
                  c_mean = 2, c_sd = 1.5, theta = TRUE)

# Using the multivariate normal distribution (family = "gaussian")
# two categorical variables W: one has 2 categories, the other has 3
props <- list(1, c(.25, 1), c(.2, .8, 1))
yw_cov <- matrix(c(1, .5, .5, .5, 1, .8, .5, .8, 1), nrow = 3)
questionnaire_gen(n_obs = 10, cat_prop = props, cov_matrix = yw_cov,
                  family = "gaussian")

# Not providing covariance matrix
questionnaire_gen(n_obs = 10,
                  cat_prop = list(c(.25, 1), c(.6, 1), c(.2, 1)),
                  family = "gaussian")

tmatta/lsasim documentation built on Jan. 24, 2025, 1:39 p.m.