cluster_gen | R Documentation |
Generate cluster sample
cluster_gen(
n,
N = 1,
cluster_labels = NULL,
resp_labels = NULL,
cat_prop = NULL,
n_X = NULL,
n_W = NULL,
c_mean = NULL,
sigma = NULL,
cor_matrix = NULL,
separate_questionnaires = TRUE,
collapse = "none",
sum_pop = sapply(N, sum),
calc_weights = TRUE,
sampling_method = "mixed",
rho = NULL,
theta = FALSE,
verbose = TRUE,
print_pop_structure = verbose,
...
)
n |
numeric vector or list with the number of sampled observations (clusters or subjects) on each level |
N |
population size of each sampled cluster element on each level. Either a numeric vector or a list of numeric vectors. If |
cluster_labels |
character vector with the names of each cluster level |
resp_labels |
character vector with the names of the questionnaire respondents on each level |
cat_prop |
list of cumulative proportions for each item. If |
n_X |
list of |
n_W |
list of |
c_mean |
vector of means for the continuous variables or list of vectors for the continuous variables for each level. Defaults to 0, but may change if |
sigma |
vector of standard deviations for the continuous variables or list of vectors for the continuous variables for each level. Defaults to 1, but may change if |
cor_matrix |
Correlation matrix between all variables (except weights). By default, correlations are randomly generated. |
separate_questionnaires |
if |
collapse |
if |
sum_pop |
total population at each level (sampled or not) |
calc_weights |
if |
sampling_method |
can be "SRS" for Simple Random Sampling, "PPS" for Probabilities Proportional to Size, "mixed" to use PPS for schools and SRS otherwise, or a vector with the sampling method for each level |
rho |
intraclass correlation (scalar, vector or list, as appropriate) |
theta |
if |
verbose |
if |
print_pop_structure |
if |
... |
Additional parameters to be passed to |
This function relies heavily on two sub-functions, cluster_gen_separate and cluster_gen_together, which can be called independently. This does not make cluster_gen a simple wrapper function, as it performs several operations before calling its sub-functions, such as randomly generating n_X and n_W if the user does not specify them.
n can have unitary length, in which case all clusters will have the same size.
N is not the population size across all elements of a level, but the population size for each element of one level.
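As a small base-R illustration of this distinction, the sketch below reuses the N from the examples on this page and applies the default shown in the usage, sum_pop = sapply(N, sum). It only reproduces that default expression; it does not call cluster_gen itself.

```r
# N lists the population size of each element on each level:
# 5 schools in the population, followed by the number of students in each school.
N <- list(5, c(200, 500, 400, 100, 100))

# The default for sum_pop aggregates each level into a single total.
sum_pop <- sapply(N, sum)
print(sum_pop)  # 5 schools, 1300 students overall
```

Here N keeps one entry per school on the second level, whereas sum_pop collapses that level into one number.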
The additional parameters to be passed to questionnaire_gen() can be given either in the same format that questionnaire_gen() uses or as more complex objects that contain information for each cluster level.
list with background questionnaire data, grouped by level or not
For the purpose of this function, levels are counted starting from the top nesting/clustering level. This means that, for example, schools are the first cluster level, classes are the second, and students are the third and final level. This behavior can be customized by naming the n argument or providing custom labels in cluster_labels and resp_labels.
Manually setting both c_mean and rho, while possible, may yield unexpected results due to how those parameters work together. A high intraclass correlation (rho) theoretically means that each group will end up with different means so they can be better separated. If c_mean is left untouched (i.e., at the default value of zero), then c_mean will freely change between clusters in order to result in the expected intraclass correlation. For large samples, c_mean will in practice correspond to the grand mean across that level, as the means of each element will be different no matter the sample size.
Moreover, if c_mean, sigma and rho are all passed to the function, the means will be recalculated as a function of the other two parameters. The three are interdependent, so they cannot all be honored simultaneously.
If, in addition to rho, the user also specifies different means for each level, the only way the math can check out is if the variance in each group becomes very high. For examples of this scenario and the one described in the previous paragraph, see the final examples on this page.
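The interplay between the group means, the within-group variance and rho can be sketched in base R. This is a minimal simulation under the textbook definition of the intraclass correlation, rho = tau^2 / (tau^2 + sigma^2), where tau^2 is the variance of the group means and sigma^2 the within-group variance. It is not the internal method of cluster_gen, and all object names below are made up for the illustration.

```r
set.seed(1)

n_groups <- 200   # number of clusters (e.g. schools)
n_per    <- 50    # observations per cluster
rho      <- 0.5   # target intraclass correlation
sigma2   <- 1     # within-group variance
grand    <- 10    # grand mean across clusters

# Between-group variance implied by rho and sigma2.
tau2 <- rho * sigma2 / (1 - rho)

# Each cluster gets its own mean, scattered around the grand mean.
group_means <- rnorm(n_groups, mean = grand, sd = sqrt(tau2))
g <- rep(seq_len(n_groups), each = n_per)
y <- rnorm(n_groups * n_per, mean = group_means[g], sd = sqrt(sigma2))

# One-way ANOVA estimator of the intraclass correlation.
ms_within  <- mean(tapply(y, g, var))
ms_between <- n_per * var(tapply(y, g, mean))
tau2_hat   <- (ms_between - ms_within) / n_per
rho_hat    <- tau2_hat / (tau2_hat + ms_within)

print(rho_hat)   # estimate of the target rho
print(mean(y))   # estimate of the grand mean
```

In this sketch the individual cluster means differ from the grand mean by design, which is why fixing rho, the means and the within-group variance all at once leaves no free quantity to adjust.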
The ranges() function should always be put inside a list(), as putting it inside a vector (c()) will cancel its effect. For more details, please read the documentation of the ranges() function.
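One plausible reason for this behavior can be shown in base R: c() drops attributes such as a class, while list() keeps each element intact. The sketch below assumes that a helper like ranges() marks its result with a class attribute. The function make_range and the class name "demo_range" are hypothetical stand-ins, not part of the package.

```r
# Hypothetical stand-in for a helper that tags its result with a class.
make_range <- function(lo, hi) structure(c(lo, hi), class = "demo_range")

in_list <- list(3, make_range(10, 20))  # element kept as a classed object
in_vec  <- c(3, make_range(10, 20))     # flattened into a plain numeric vector

print(inherits(in_list[[2]], "demo_range"))  # TRUE: class survives in list()
print(inherits(in_vec, "demo_range"))        # FALSE: c() dropped the class
print(in_vec)                                # 3 10 20
```

Once the class is gone, a function receiving in_vec can no longer tell a range apart from three ordinary sizes.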
The only arguments that can be used to label each level are n, N, cluster_labels and resp_labels. Labeling other arguments such as c_mean and cat_prop has no effect on the final results, but it is a recommended way for users to keep track of which value corresponds to which element in a complex hierarchical structure.
One of the extra arguments that can be passed by this function is family. If family == "gaussian", the questionnaire will be generated assuming that all the variables are jointly distributed as a multivariate normal. The default behavior is family == NULL, where the data is generated using the polychoric correlation matrix, with no distributional assumptions.
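A minimal sketch of passing family through cluster_gen follows. It assumes the lsasim package is installed and skips the call otherwise. The sample sizes are arbitrary, and verbose = FALSE is used only to keep the output short.

```r
res <- NULL
if (requireNamespace("lsasim", quietly = TRUE)) {
  set.seed(42)
  # family is forwarded to questionnaire_gen() through the ... argument.
  res <- lsasim::cluster_gen(c(3, 5), family = "gaussian", verbose = FALSE)
  str(res, max.level = 2)
}
```

Leaving family out gives the default behavior described above.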
cluster_gen_separate()
cluster_gen_together()
questionnaire_gen()
# Simple structure of 3 schools with 5 students each
cluster_gen(c(3, 5))
# Complex structure of 3 schools with different numbers of students,
# sampling weights and custom number of questions
n <- list(3, c(20, 15, 25))
N <- list(5, c(200, 500, 400, 100, 100))
cluster_gen(n, N, n_X = 5, n_W = 2)
# Condensing the output
set.seed(0); cluster_gen(c(2, 4))
set.seed(0); cluster_gen(c(2, 4), collapse=TRUE) # same, but in one dataset
# Condensing the output: 3 levels
str(cluster_gen(c(2, 2, 1), collapse="none"))
str(cluster_gen(c(2, 2, 1), collapse="partial"))
str(cluster_gen(c(2, 2, 1), collapse="full"))
# Controlling the intra-class correlation and the grand mean
x <- cluster_gen(c(5, 1000), rho = .9, n_X = 2, n_W = 0, c_mean = 10)
sapply(1:5, function(s) mean(x$school[[s]]$q1)) # means per school != 10
mean(sapply(1:5, function(s) mean(x$school[[s]]$q1))) # closer to c_mean
# Making the intraclass variance explode by forcing "incompatible" rho and c_mean
x <- cluster_gen(c(5, 1000), rho = .5, n_X = 2, n_W = 0, c_mean = 1:5)
anova(x)