Clustered Data
In simstudy: Simulation of Study Data

data.table::setDTthreads(2)

library(simstudy)
library(ggplot2)
library(scales)
library(grid)
library(gridExtra)
library(survival)
library(gee)
library(data.table)

plotcolors <- c("#B84226", "#1B8445", "#1C5974")

cbbPalette <- c("#B84226","#B88F26", "#A5B435", "#1B8446",
                "#B87326","#B8A526", "#6CA723", "#1C5974") 

ggtheme <- function(panelback = "white") {

  ggplot2::theme(
    panel.background = element_rect(fill = panelback),
    panel.grid = element_blank(),
    axis.ticks =  element_line(colour = "black"),
    panel.spacing =unit(0.25, "lines"),  # requires package grid
    panel.border = element_rect(fill = NA, colour="gray90"), 
    plot.title = element_text(size = 8,vjust=.5,hjust=0),
    axis.text = element_text(size=8),
    axis.title = element_text(size = 8)
  )  

}

The function genCluster generates multilevel or clustered data based on a previously generated data set that is one "level" up from the clustered data. For example, if there is a data set that contains the school level (considered here to be level 2), classrooms (level 1) can be generated. And then, students (now level 1) can be generated within classrooms (now level 2)

In the example here, we do in fact generate school-, class-, and student-level data. There are eight schools, four of which are randomized to receive an intervention. The number of classes per school varies, as does the number of students per class. (It is straightforward to generate fully balanced data by using constant values.) The outcome of interest is a test score, which is influenced by gender and the intervention. In addition, test scores vary by schools, and by classrooms, so the simulation provides random effects at each of these levels.

We start by defining the school level data:

gen.school <- defData(varname="s0", dist = "normal", 
                      formula = 0, variance = 3, id = "idSchool"
)
gen.school <- defData(gen.school, varname = "nClasses", 
                      dist = "noZeroPoisson", formula = 3
)

set.seed(282721)

dtSchool <- genData(8, gen.school)
dtSchool <- trtAssign(dtSchool, n = 2)

dtSchool

The classroom level data are generated with a call to genCluster, and then school level data is added by a call to addColumns:

gen.class <- defDataAdd(varname = "c0", dist = "normal", formula = 0, 
                     variance = 2)
gen.class <- defDataAdd(gen.class, varname = "nStudents", dist = "noZeroPoisson", formula = 20
)

dtClass <- genCluster(dtSchool, "idSchool", numIndsVar = "nClasses",level1ID = "idClass")
dtClass <- addColumns(gen.class, dtClass)

head(dtClass, 10)

Finally, the student level data are added using the same process:

gen.student <- defDataAdd(varname="Male", dist="binary", formula=0.5)
gen.student <- defDataAdd(gen.student, varname="age", dist = "uniform", formula="9.5; 10.5")
gen.student <- defDataAdd(gen.student, varname="test", dist = "normal",
                       formula = "50 - 5*Male + s0 + c0 + 8 * trtGrp",                           variance = 2)
dtStudent <- genCluster(dtClass,cLevelVar="idClass", numIndsVar = "nStudents",                        level1ID = "idChild")

dtStudent <- addColumns(gen.student, dtStudent)

This is what the clustered data look like. Each classroom is represented by a box, and each school is represented by a color. The intervention group is highlighted by dark outlines:

ggplot(data=dtStudent,aes(x=factor(idClass),y=test,group=idClass)) +
  geom_boxplot(aes(color=factor(trtGrp), fill = factor(idSchool)))+
  xlab("Classes")+
  ylab("Test scores") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  scale_fill_manual(values = cbbPalette, guide = FALSE) +
  scale_color_manual(values = c("grey80", "#000000"),
                    labels = c("Ctrl", "Rx"),
                    guide = guide_legend(title = NULL,
                                         override.aes = list(shape = 1,
                                                             keywidth=8
                                                             )
                                         )
                    ) +
  theme(legend.key=element_rect(fill=NA)) +
  ggtheme()

Setting cluster sizes

It could be helpful to relax the assumption of exactly balanced cluster size when estimating statistical power using simulation. There is a "distribution" called clusterSize that facilitates these stochastic cluster sizes. As part of data definitions, you can specify the overall sample size N in the formula argument, and a dispersion parameter (in the variance field) Indicating a dispersion of 0 (the default) implies exact balance, and larger values of dispersion imply more variability in the cluster sizes.

Here are two examples with exact or close to exact balance:

d1 <- defData(varname = "clustSize", formula = 120, dist = "clusterSize")

genData(8, d1, id = "site")
genData(7, d1, id = "site")

This is a second example with variability across sites:

d1 <- defData(varname = "clustSize", formula = 120, 
  variance = .1, dist = "clusterSize")

genData(8, d1, id = "site")

Any scripts or data that you put into this service are public.

simstudy documentation built on Sept. 11, 2024, 6:38 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

simstudy
Simulation of Study Data

Clustered Data
In simstudy: Simulation of Study Data

Setting cluster sizes

Try the simstudy package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

simstudy Simulation of Study Data

Clustered Data In simstudy: Simulation of Study Data

Setting cluster sizes

Try the simstudy package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

simstudy
Simulation of Study Data

Clustered Data
In simstudy: Simulation of Study Data