Ex. 4 - Generating cluster samples

library(knitr)
library(formatR)
options(width = 90, tidy = TRUE, warning = FALSE, message = FALSE)
opts_chunk$set(
  comment = "", warning = FALSE, message = FALSE,
  echo = TRUE, tidy = TRUE
)
library(lsasim)
packageVersion("lsasim")

Generating background questionnaire data

cluster_gen(n,
  N = 1, cluster_labels = NULL, resp_labels = NULL,
  cat_prop = NULL, n_X = NULL, n_W = NULL, c_mean = NULL,
  sigma = NULL, cor_matrix = NULL, separate_questionnaires = TRUE,
  collapse = "none", sum_pop = sapply(N, sum), calc_weights = TRUE,
  sampling_method = "mixed", rho = NULL, theta = FALSE,
  verbose = TRUE, print_pop_structure = verbose
)

As its single mandatory argument, cluster_gen requires a numeric list or vector containing the hierarchical structure of the data. As a general rule, as far as this first argument (n) as well as the second argument (N, representing the population structure) are concerned, vectors can be used to represent symmetric structures and lists can be used for asymmetric structures. What follows are some examples.
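To make the vector/list convention concrete, here is a minimal base-R sketch that expands a symmetric vector into the list form it stands for. The helper expand_structure is hypothetical and written only for illustration; it is not part of lsasim and may differ from how cluster_gen interprets its input internally.

```r
# Hypothetical helper (not an lsasim function): expand a symmetric
# structure vector into an equivalent list with one element per level.
expand_structure <- function(n) {
  out <- vector("list", length(n))
  names(out) <- names(n)
  n_parents <- 1 # number of units at the level above
  for (i in seq_along(n)) {
    # every parent unit gets the same number of children
    out[[i]] <- rep(n[[i]], n_parents)
    n_parents <- sum(out[[i]])
  }
  out
}

# 3 schools, 2 classes per school, 5 students per class
expand_structure(c(school = 3, class = 2, student = 5))
```

An asymmetric structure cannot be written as a vector, because at least one level then has differing counts; that is the case where the list form is needed.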

The function cluster_gen generates clustered samples that resemble the composition of participants in international large-scale assessments. The only required argument is n; all other arguments shown in the signature above are optional.


Example 1

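We can specify a simple structure of 3 schools with 5 students in each school. Here the elements of the structure vector are named n = 3 and N = 5; these names only label the two levels (which is why the output is accessed as cg$n) and are distinct from the n and N arguments of cluster_gen.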

set.seed(4388)
cg <- cluster_gen(c(n = 3, N = 5))
cg$n[[1]]
cg$n[[2]]
cg$n[[3]]

Example 2

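We can specify a more complex structure of 3 schools with different numbers of students (20, 15, and 25), sampled from a population of 5 schools, together with sampling weights and custom numbers of questions (n_X = 5 continuous and n_W = 2 categorical).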

set.seed(4388)
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))
cg <- cluster_gen(n, N, n_X = 5, n_W = 2)
str(cg$school[[1]])
str(cg$school[[2]])
str(cg$school[[3]])
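As a rough guide to what sampling weights represent, the sketch below computes textbook design weights for the n and N used in this example. It assumes simple random sampling at both stages and assumes that the three sampled schools are the first three population schools. Both are assumptions made for illustration; cluster_gen defaults to sampling_method = "mixed", so the weights it reports may differ from these.

```r
# Sample and population structures from Example 2
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))

# Stage 1: each sampled school represents N_schools / n_schools schools
school_weight <- N[[1]] / n[[1]]

# Assumption: the sampled schools are the first three population schools
sampled <- seq_len(n[[1]])

# Stage 2: each sampled student represents N_j / n_j students in school j
student_weight <- N[[2]][sampled] / n[[2]]

# Final weight is the product of the stage weights
final_weight <- school_weight * student_weight
round(final_weight, 2)
```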

Example 3

We can also control the intraclass correlation (rho) and the grand mean (c_mean).

set.seed(4388)
cg <- cluster_gen(c(5, 1000), rho = .9, n_X = 2, n_W = 0, c_mean = 10)
sapply(1:5, function(s) mean(cg$school[[s]]$q1)) # means per school != 10
mean(sapply(1:5, function(s) mean(cg$school[[s]]$q1))) # closer to c_mean
str(cg)
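For readers unfamiliar with the intraclass correlation, the base-R sketch below simulates a two-level sample directly and recovers rho with the usual one-way ANOVA estimator. It uses the textbook definition rho = tau^2 / (tau^2 + sigma^2) with the total variance fixed at 1. This is an assumption for illustration and not necessarily how cluster_gen parameterizes the variances; the numbers of schools and students (J, m) are also chosen freely here so that the estimate is stable.

```r
set.seed(4388)
rho <- 0.9 # target intraclass correlation
J <- 200   # number of schools (illustrative)
m <- 50    # students per school (illustrative)

tau2 <- rho       # between-school variance
sigma2 <- 1 - rho # within-school variance

# School means scatter around the grand mean of 10
school_mean <- rnorm(J, mean = 10, sd = sqrt(tau2))
school <- rep(seq_len(J), each = m)
y <- rnorm(J * m, mean = school_mean[school], sd = sqrt(sigma2))

# One-way ANOVA estimator of the intraclass correlation (balanced design)
msb <- m * var(tapply(y, school, mean)) # between-school mean square
msw <- mean(tapply(y, school, var))     # within-school mean square
icc_hat <- (msb - msw) / (msb + (m - 1) * msw)
icc_hat
```

With a high rho, individual school means can sit far from the grand mean even though their average stays close to it, which is the pattern the sapply calls above inspect.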

Example 4

We can make the intraclass variance explode by forcing "incompatible" rho and c_mean.

x <- cluster_gen(c(5, 1000), rho = .5, n_X = 2, n_W = 0, c_mean = 1:5)
anova(x)

Example 5

The named vector below represents a sampling structure of 1 country, 2 schools, 5 students per school. The naming of the vector is optional.

set.seed(4388)
n <- c(cnt = 1, sch = 2, stu = 5)
cg <- cluster_gen(n = n)
cg

Example 6

The named vector below represents a sampling structure of 1 country, 2 schools, and 5 students per school. In this example, the number of continuous variables has been specified as n_X = 10, but only 5 means have been supplied: c_mean = c(0.3, 0.4, 0.5, 0.6, 0.7). The function will still run by recycling these means over the remaining five variables. In this case, a warning message that reads "Warning: c_mean recycled to fit all continuous variables" will be reported.

set.seed(4388)
n <- c(cnt = 1, sch = 2, stu = 5)
cg <- cluster_gen(n = n, n_X = 10, c_mean = c(0.3, 0.4, 0.5, 0.6, 0.7))
cg
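The recycling described above follows the usual R convention, which can be sketched in base R as follows. This only illustrates what recycling 5 means over 10 variables yields; it is an assumption about the result, not a copy of the code inside cluster_gen. The q1, q2, ... labels mirror the variable names used in Example 3.

```r
c_mean <- c(0.3, 0.4, 0.5, 0.6, 0.7)
n_X <- 10

# Repeat the supplied means until there is one per continuous variable
recycled <- rep_len(c_mean, n_X)
names(recycled) <- paste0("q", seq_len(n_X))
recycled
```

To avoid the warning, supply one mean per continuous variable, for example by passing a length-10 vector to c_mean.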

Example 7

The named vector below represents a sampling structure of 3 schools, 2 classes per school, and 5 students per class. Again, naming the vector is optional. Here, n_X and sigma are specified per level (i.e., school and class): n_X = c(1, 2) and sigma = list(.1, c(1, 2)). Note that sigma = list(.1, c(1, 2)) means that at level 1 (school) the standard deviation of the single continuous variable is .1, whereas at level 2 (class) the standard deviations of the two continuous variables are 1 and 2.

set.seed(4388)
n <- c(school = 3, class = 2, student = 5)
cg <- cluster_gen(n, n_X = c(1, 2), sigma = list(.1, c(1, 2)))
summary(cg)
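To see how the per-level arguments line up, the base-R sketch below lays out n_X and sigma from this example as one row per level and variable. The table and the level labels are purely illustrative and are not produced by lsasim.

```r
level_names <- c("school", "class") # levels that receive questionnaires
n_X <- c(1, 2)
sigma <- list(.1, c(1, 2))

# One row per (level, continuous variable) pair
spec <- data.frame(
  level = rep(level_names, times = n_X),
  variable = unlist(lapply(n_X, seq_len)),
  sigma = unlist(Map(rep_len, sigma, n_X))
)
spec
```

Each element of the sigma list belongs to one level, and its length matches that level's n_X.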

Example 8

The named vector below represents a sampling structure of 3 schools, 2 classes per school, and 5 students per class. Again, naming the vector is optional. Here, c_mean is also expressed as a list whose elements correspond to the different levels (i.e., school and class): c_mean = list(.1, c(0.55, 0.32)). This means that at level 1 (school) the mean of the single continuous variable is .1, whereas at level 2 (class) the means of the two continuous variables are 0.55 and 0.32.

set.seed(4388)
n <- c(school = 3, class = 2, student = 5)
cg <- cluster_gen(n, n_X = c(1, 2), n_W = c(0, 1), c_mean = list(.1, c(0.55, 0.32)))
cg


lsasim documentation built on Aug. 22, 2023, 5:09 p.m.