Ex. 4 - Generating cluster samples
In lsasim: Functions to Facilitate the Simulation of Large Scale Assessment Data

library(knitr)
library(formatR)
options(width = 90, tidy = TRUE, warning = FALSE, message = FALSE)
opts_chunk$set(
  comment = "", warning = FALSE, message = FALSE,
  echo = TRUE, tidy = TRUE
)

library(lsasim)

packageVersion("lsasim")

Generating background questionnaire data

cluster_gen(n,
  N = 1, cluster_labels = NULL, resp_labels = NULL,
  cat_prop = NULL, n_X = NULL, n_W = NULL, c_mean = NULL,
  sigma = NULL, cor_matrix = NULL, separate_questionnaires = TRUE,
  collapse = "none", sum_pop = sapply(N, sum), calc_weights = TRUE,
  sampling_method = "mixed", rho = NULL, theta = FALSE,
  verbose = TRUE, print_pop_structure = verbose
)

As its single mandatory argument, cluster_gen requires a numeric list or vector containing the hierarchical structure of the data. As a general rule, as far as this first argument (n) as well as the second argument (N, representing the population structure) are concerned, vectors can be used to represent symmetric structures and lists can be used for asymmetric structures. What follows are some examples.
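As a rough illustration of the vector/list distinction, the base R sketch below counts how many students each kind of structure implies. It does not call lsasim, and the object names (n_sym, n_asym) are hypothetical.

```r
# Symmetric structure: every school has the same number of students,
# so a plain (optionally named) vector is enough.
n_sym <- c(school = 3, student = 5)
total_sym <- prod(n_sym) # 3 schools x 5 students each

# Asymmetric structure: schools differ in size, so a list is used,
# with one student count per school.
n_asym <- list(school = 3, student = c(20, 15, 25))
stopifnot(length(n_asym$student) == n_asym$school)
total_asym <- sum(n_asym$student)

c(symmetric = total_sym, asymmetric = total_asym)
```

The list form carries one entry per school at the student level, which is what makes unequal school sizes expressible.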

The function cluster_gen generates clustered samples that resemble the composition of participants in international large-scale assessments. Its only required argument is n; all other arguments are optional. The arguments are:

n: a numeric vector with the number of sampled observations (clusters or subjects) on each level.
N: a list of numeric vector(s) with the population size of each sampled cluster element on each level.
cluster_labels: a character vector with the names of each cluster level.
resp_labels: a character vector with the names of the questionnaire respondents on each level.
cat_prop: a list of vectors where each vector contains the cumulative proportions for each category of a given item. If theta = TRUE, the first element of cat_prop must be a scalar 1, which corresponds to theta.
n_X: the number of continuous (X) variables per cluster level.
n_W: the number of ordinal (W) variables per cluster level.
cor_matrix: a correlation matrix between all variables (except weights).
c_mean: the vector of means for the continuous variables or list of vectors for the continuous variables for each level.
sigma: the vector of standard deviations for the continuous variables or list of vectors for the continuous variables for each level.
separate_questionnaires: if TRUE, each level will have its own questionnaire. Otherwise, it will be labeled 'q1'.
theta: if TRUE, the latent trait will be generated as the first continuous variable and labeled 'theta'.
collapse: if TRUE, the function output contains only one data frame with all answers. The default, as shown in the signature above, is "none".
sum_pop: the total population at each level (sampled or not).
calc_weights: if TRUE, sampling weights are calculated.
sampling_method: can be "SRS" for Simple Random Sampling or "PPS" for Probabilities Proportional to Size. The default, as shown in the signature above, is "mixed".
rho: specifies the estimated intraclass correlation.
verbose: if TRUE, messages are printed in the output.
print_pop_structure: if TRUE, the population hierarchical structure is printed out (as long as it differs from the sample structure).
...: additional parameters to be passed to questionnaire_gen().
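The cat_prop description above asks for cumulative rather than raw category proportions. A minimal base R sketch of that conversion follows; the item proportions are made up for illustration, and the resulting list only mirrors the shape described above (a leading scalar 1 when theta = TRUE).

```r
# Hypothetical raw proportions for one ordinal item with three categories
item_props <- c(0.2, 0.3, 0.5)
stopifnot(isTRUE(all.equal(sum(item_props), 1)))

# cat_prop expects cumulative proportions, so the last value is always 1
item_cum <- cumsum(item_props)

# With theta = TRUE, the first element must be a scalar 1 (the theta slot),
# followed by one cumulative vector per categorical item
cat_prop <- list(1, item_cum)
str(cat_prop)
```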

Example 1

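We can specify a simple structure of 3 schools with 5 students in each school. In the code below, the vector names n and N serve as labels for the two levels (which is why the schools are accessed as cg$n); they are not the n and N arguments of cluster_gen.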

set.seed(4388)
cg <- cluster_gen(c(n = 3, N = 5))

cg$n[[1]]
cg$n[[2]]
cg$n[[3]]

Example 2

We can specify a more complex structure of 3 schools (sampled from a population of 5) with different numbers of students, sampling weights, and custom numbers of questions.

set.seed(4388)
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))
cg <- cluster_gen(n, N, n_X = 5, n_W = 2)

str(cg$school[[1]])
str(cg$school[[2]])
str(cg$school[[3]])
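To give a feel for what sampling weights mean in this structure, the base R sketch below computes textbook design weights (population size divided by sample size at each stage) from the same n and N. This is only an illustration under simple random sampling at both stages; it assumes the three sampled schools are the first three listed in N, and it is not necessarily how cluster_gen computes weights under its default sampling_method = "mixed".

```r
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))

# Assumption: the sampled schools are the first three schools in N
sampled <- seq_len(n[[1]])

# Stage 1: each sampled school represents N_schools / n_schools schools
school_weight <- N[[1]] / n[[1]]

# Stage 2: each sampled student represents N_j / n_j students in school j
student_weight <- N[[2]][sampled] / n[[2]]

# Final student weight is the product of the two stage weights
final_weight <- school_weight * student_weight

data.frame(school = sampled, school_weight, student_weight, final_weight)
```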

Example 3

We can also control the intraclass correlation (rho) and the grand mean (c_mean).

set.seed(4388)
cg <- cluster_gen(c(5, 1000), rho = .9, n_X = 2, n_W = 0, c_mean = 10)
sapply(1:5, function(s) mean(cg$school[[s]]$q1)) # means per school != 10
mean(sapply(1:5, function(s) mean(cg$school[[s]]$q1))) # closer to c_mean

str(cg)
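The reason individual school means can sit far from c_mean when rho is high follows from the usual definition of the intraclass correlation, rho = tau^2 / (tau^2 + sigma^2), where tau^2 is the between-school variance and sigma^2 the within-school variance. The base R sketch below works through that arithmetic; the within-school variance of 1 is an assumption, and whether cluster_gen parameterizes the variances this way internally is not confirmed here.

```r
rho <- 0.9
sigma2_within <- 1 # assumed within-school variance

# Solve rho = tau2 / (tau2 + sigma2) for the between-school variance
tau2_between <- rho / (1 - rho) * sigma2_within

# Plugging back in recovers rho
rho_check <- tau2_between / (tau2_between + sigma2_within)

# Five hypothetical school means drawn around a grand mean of 10:
# with a large between-school variance they scatter widely
set.seed(4388)
school_means <- rnorm(5, mean = 10, sd = sqrt(tau2_between))
round(school_means, 2)
mean(school_means)
```

With rho = 0.9 the between-school variance is nine times the within-school variance, so only the average over many schools is expected to approach the grand mean.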

Example 4

We can make the intraclass variance explode by forcing "incompatible" rho and c_mean.

x <- cluster_gen(c(5, 1000), rho = .5, n_X = 2, n_W = 0, c_mean = 1:5)

anova(x)

Other specifications of cluster_gen.

Example 5

The named vector below represents a sampling structure of 1 country, 2 schools, 5 students per school. The naming of the vector is optional.

set.seed(4388)
n <- c(cnt = 1, sch = 2, stu = 5)
cg <- cluster_gen(n = n)

cg

Example 6

The named vector below represents a sampling structure of 1 country, 2 schools, and 5 students per school. In this example, the number of continuous variables has been specified as n_X = 10, but only 5 means have been provided: c_mean = c(0.3, 0.4, 0.5, 0.6, 0.7). The function will still run by recycling the means over the remaining five variables. In this case, a warning message that reads "Warning: c_mean recycled to fit all continuous variables" will be reported.

set.seed(4388)
n <- c(cnt = 1, sch = 2, stu = 5)
cg <- cluster_gen(n = n, n_X = 10, c_mean = c(0.3, 0.4, 0.5, 0.6, 0.7))

cg
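The recycling described in this example can be pictured with base R's rep_len(). This is only a sketch of the idea; it assumes the means are reused in order, and c_mean_full is a hypothetical name rather than an object returned by cluster_gen.

```r
n_X <- 10
c_mean <- c(0.3, 0.4, 0.5, 0.6, 0.7)

# Reuse the five means in order until all ten variables have one
c_mean_full <- rep_len(c_mean, n_X)
c_mean_full
```

Supplying a c_mean of length 10 avoids the recycling (and the warning) altogether.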

Example 7

The named vector below represents a sampling structure of 3 schools, 2 classes per school, and 5 students per class. Again, naming the vector is optional. Here, n_X is a vector and sigma is a list, each with one element per level (i.e., schools and classes). For example, n_X = c(1, 2) and sigma = list(.1, c(1, 2)) describe the school and class levels, respectively. Note that sigma = list(.1, c(1, 2)) means that at level 1 (school) the standard deviation of the single continuous variable is .1, whereas at level 2 (class) the standard deviations of the two continuous variables are 1 and 2.

set.seed(4388)
n <- c(school = 3, class = 2, student = 5)
cg <- cluster_gen(n, n_X = c(1, 2), sigma = list(.1, c(1, 2)))

summary(cg)
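When arguments are given per level like this, it helps to confirm that each element of sigma has as many entries as n_X declares for that level. A small base R check along those lines follows; the level names and the check data frame are hypothetical helpers, not part of lsasim.

```r
n_X <- c(1, 2)
sigma <- list(0.1, c(1, 2))
level_names <- c("school", "class")

# One row per level: declared number of continuous variables versus
# the number of standard deviations supplied for that level
check <- data.frame(
  level = level_names,
  n_X = n_X,
  n_sigma = lengths(sigma),
  matches = lengths(sigma) == n_X
)
check
```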

Example 8

The named vector below represents a sampling structure of 3 schools, 2 classes per school, and 5 students per class. Again, naming the vector is optional. Here, c_mean is expressed as a list with one element per level (i.e., schools and classes). Note that c_mean = list(.1, c(0.55, 0.32)) means that at level 1 (school) the mean of the single continuous variable is .1, whereas at level 2 (class) the means of the two continuous variables are 0.55 and 0.32.

set.seed(4388)
n <- c(school = 3, class = 2, student = 5)
cg <- cluster_gen(n, n_X = c(1, 2), n_W = c(0, 1), c_mean = list(.1, c(0.55, 0.32)))

cg

lsasim documentation built on Aug. 22, 2023, 5:09 p.m.
