View source: R/generateLongitudinalDatasets.R
| generateLongitudinalDatasets | R Documentation |
Generate balanced longitudinal datasets with random intercepts and slopes. Subjects are observed at multiple time points with optional treatment groups. Treatment and its interaction with time are coded as contrasts relative to the first level.
generateLongitudinalDatasets(
  numberOfDatasetsToGenerate,
  numberOfSubjects,
  numberOfTimepoints,
  numberOfTreatmentLevels = 1L,
  timeRange = c(0, 1),
  errorGenerator = rnorm,
  randomEffectGenerator = rnorm,
  trueBeta = 0,
  trueSigma = 1,
  trueTheta = c(1, 0, 1),
  contamFun = NULL,
  ...
)
numberOfDatasetsToGenerate |
number of datasets to generate. |
numberOfSubjects |
number of subjects per dataset. |
numberOfTimepoints |
number of observation time points per subject. |
numberOfTreatmentLevels |
number of treatment levels. Default: 1 (no treatment effect, intercept and time only). |
timeRange |
numeric vector of length 2 giving the range of time values (min, max).
Default: c(0, 1). |
errorGenerator |
random number generator used for the errors. Called as
|
randomEffectGenerator |
random number generator used for the spherical
random effects. Called as |
trueBeta |
scalar or vector with the true values of the fixed effects coefficients. Can be of length one in which case it will be replicated to the required length. The order is: intercept, time, treatment contrasts (if any), treatment-by-time interactions (if any). |
trueSigma |
scalar with the true value of the error scale. |
trueTheta |
numeric vector of length 3 with the true values for the
Cholesky factor of the random effects covariance matrix (lme4 convention).
Default: c(1, 0, 1). |
contamFun |
optional contamination function. If provided, it receives the full dataset (a data frame with columns id, time, treatment, y) and an info list, and must return the (possibly modified) data frame. This allows arbitrary contamination including changing group assignments. See Details for the contents of the info list. |
... |
all additional arguments are added to the returned list. |
The generated data follows the model:
y_{ij} = \beta_0 + \beta_1 \cdot \text{time}_{ij} +
\sum_{k=1}^{K-1} \beta_{1+k} \cdot \text{treatment}_{k,i} +
\sum_{k=1}^{K-1} \beta_{K+k} \cdot \text{treatment}_{k,i} \cdot
\text{time}_{ij} + b_{0i} + b_{1i} \cdot \text{time}_{ij} +
\epsilon_{ij}
where K is the number of treatment groups and
b_i = (b_{0i}, b_{1i})^T \sim N(0, \sigma^2 \Lambda \Lambda^T),
with \Lambda being the lower-triangular Cholesky factor
reconstructed from the theta vector.
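As a rough illustration of the model above, the following base R sketch simulates one dataset by hand for K = 2 treatment groups. It is not the package's implementation: the equally spaced time grid, the scaling of the errors by sigma, and all variable names are assumptions made for this example.

```r
set.seed(1)
nSubjects <- 4
nTime <- 3
beta <- c(1, 0.5, -1, 2)   # intercept, time, treatment, treatment-by-time
sigma <- 1
Lambda <- diag(2)          # Cholesky factor corresponding to theta = c(1, 0, 1)

# Long-format layout: one row per subject and time point
id <- rep(seq_len(nSubjects), each = nTime)
time <- rep(seq(0, 1, length.out = nTime), times = nSubjects)  # assumed grid
treat2 <- as.numeric((id - 1) %% 2 + 1 == 2)  # contrast for the second level

# Fixed-effects design matrix in the documented coefficient order
X <- cbind(intercept = 1, time = time, treat2 = treat2,
           treat2Time = treat2 * time)

# Random intercepts and slopes: b_i = sigma * Lambda %*% u_i, u_i spherical
u <- matrix(rnorm(2 * nSubjects), nrow = 2)
b <- sigma * Lambda %*% u
eps <- sigma * rnorm(length(id))  # assumed: errors scaled by sigma

y <- drop(X %*% beta) + b[1, id] + b[2, id] * time + eps
simulated <- data.frame(id = id, time = time,
                        treatment = paste0("T", treat2 + 1), y = y)
```

Each row of `simulated` corresponds to one (subject, time point) pair, matching the columns id, time, treatment and y described for the generated data.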
The theta parameterization follows lme4 conventions:
For a 2x2 random-effects covariance structure (intercept and slope),
theta has 3 elements:
\theta = (\lambda_{11}, \lambda_{21}, \lambda_{22})
The Cholesky factor is:
\Lambda = \begin{pmatrix} \lambda_{11} & 0 \\ \lambda_{21} & \lambda_{22} \end{pmatrix}
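To make the parameterization concrete, this short base R sketch builds \Lambda from a theta vector and derives the implied random-effects covariance. The theta and sigma values are taken from the medsim example on this page; the variable names are illustrative only.

```r
theta <- c(1.310266, -0.07547461, 0.2147735)
sigma <- sqrt(1229.93)

# matrix() fills column by column, so the first column is
# (lambda_11, lambda_21) and the second column is (0, lambda_22)
Lambda <- matrix(c(theta[1], theta[2], 0, theta[3]), nrow = 2)

# Covariance of (b_0i, b_1i): sigma^2 * Lambda %*% t(Lambda)
covB <- sigma^2 * Lambda %*% t(Lambda)

# Implied standard deviations and intercept-slope correlation
sdB <- sqrt(diag(covB))
corB <- covB[1, 2] / (sdB[1] * sdB[2])
```

A negative \lambda_{21} combined with a positive \lambda_{11}, as here, yields a negative correlation between random intercepts and slopes.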
In order to save memory, only the generated random effects and the errors
are stored. The dataset is only created on demand when the method
generateData in the returned list is evaluated.
The random variables are generated so that a simulation can easily be
extended with more datasets. When starting from the same seed, the first
generated datasets are identical to those from a previous call of
generateLongitudinalDatasets with a smaller number of datasets to
generate; see the examples.
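One way to obtain this kind of prefix reproducibility is to draw all random numbers for dataset 1 before drawing anything for dataset 2, and so on. The sketch below shows that idea in base R; `simulateDraws` is a hypothetical helper written for this illustration, and the package may achieve the property differently.

```r
# Hypothetical helper: draws spherical random effects and errors
# dataset by dataset, so earlier datasets do not depend on how many follow.
simulateDraws <- function(nDatasets, nRandomEffects, nErrors) {
  lapply(seq_len(nDatasets), function(i) {
    list(u = rnorm(nRandomEffects), e = rnorm(nErrors))
  })
}

set.seed(1)
small <- simulateDraws(2, nRandomEffects = 4, nErrors = 10)
set.seed(1)
large <- simulateDraws(3, nRandomEffects = 4, nErrors = 10)

# The first two datasets of the larger run match the smaller run
samePrefix <- identical(small, large[1:2])
```

Drawing all random effects for every dataset first and all errors afterwards would not have this property, because the number of datasets would shift the position of the error draws in the random number stream.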
Treatment Assignment:
Subjects are assigned to treatment groups in a balanced, deterministic manner.
Subject i is assigned to treatment (i - 1) mod numberOfTreatmentLevels + 1.
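The assignment rule can be checked with a few lines of base R. The variable names below are illustrative, and the "T1", "T2", ... labels are assumed from the level names used in the examples on this page.

```r
numberOfSubjects <- 6L
numberOfTreatmentLevels <- 3L

subject <- seq_len(numberOfSubjects)
# (i - 1) mod numberOfTreatmentLevels + 1 cycles through the levels in order
treatment <- (subject - 1L) %% numberOfTreatmentLevels + 1L
treatmentLabel <- paste0("T", treatment)  # assumed labelling
```

When numberOfSubjects is a multiple of numberOfTreatmentLevels, every level receives the same number of subjects.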
Contamination Function:
If contamFun is provided, it is called as contamFun(data, info)
after the response y is computed. The info list contains:
datasetIndex: the dataset index
randomEffects: the random effects vector for this dataset
errors: the error vector for this dataset
trueBeta: as passed to generateLongitudinalDatasets
trueSigma: as passed to generateLongitudinalDatasets
trueTheta: as passed to generateLongitudinalDatasets
The function must return a data frame with the same structure (columns id, time, treatment, y). This allows arbitrary modifications including:
Modifying the response y (e.g., adding outliers)
Changing group assignments (e.g., moving subjects between treatments)
Modifying time values
Any combination of the above
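As a sketch of how the info list might be used, the following contamination function modifies only the first dataset and scales the outlier by the true error scale. `contaminateFirstOnly`, the toy data frame and the hand-built info list are hypothetical and only mimic the documented structure, so the function can be tried without generating any datasets.

```r
# Hypothetical contamination function: shifts the largest response of
# dataset 1 by ten error standard deviations, leaves other datasets alone.
contaminateFirstOnly <- function(data, info) {
  if (info$datasetIndex == 1) {
    worst <- which.max(data$y)
    data$y[worst] <- data$y[worst] + 10 * info$trueSigma
  }
  data  # same columns as the input: id, time, treatment, y
}

# Toy stand-ins for the arguments the generator would supply
toy <- data.frame(id = factor(c(1, 1, 2, 2)),
                  time = c(0, 1, 0, 1),
                  treatment = factor("T1"),
                  y = c(0.1, 0.5, -0.2, 0.3))
contaminated <- contaminateFirstOnly(toy, list(datasetIndex = 1, trueSigma = 1))
untouched <- contaminateFirstOnly(toy, list(datasetIndex = 2, trueSigma = 1))
```

Such a function would then be passed as contamFun = contaminateFirstOnly in the call to generateLongitudinalDatasets.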
list with generators and the original arguments
generateData: |
function to generate data taking one argument, the dataset index. |
createXMatrix: |
function to generate X matrix taking one
argument, the result of |
createZMatrix: |
function to generate Z matrix taking one
argument, the result of |
createLambdaMatrix: |
function to generate Lambda matrix taking
one argument, the result of |
randomEffects: |
function to return the generated random effects taking one argument, the dataset index. |
sphericalRandomEffects: |
function to return the generated spherical random effects taking one argument, the dataset index. |
errors: |
function to return the generated errors taking one argument, the dataset index. |
allRandomEffects: |
function without arguments that returns the matrix of all generated random effects. |
allErrors: |
function without arguments that returns the matrix of all generated errors. |
numberOfDatasets: |
|
numberOfSubjects: |
|
numberOfTimepoints: |
|
numberOfTreatmentLevels: |
|
numberOfRows: |
number of rows in the generated dataset |
trueBeta: |
true values used for beta |
trueSigma: |
true value used for sigma |
trueTheta: |
true values used for theta |
formula: |
formula to fit the model using |
...: |
additional arguments passed via |
Manuel Koller
generateAnovaDatasets, generateMixedEffectDatasets
oneGroup <- generateLongitudinalDatasets(2, 10, 5)
head(oneGroup$generateData(1))
head(oneGroup$generateData(2))
oneGroup$formula
twoGroups <- generateLongitudinalDatasets(2, 20, 5, numberOfTreatmentLevels = 2)
head(twoGroups$generateData(1))
twoGroups$formula
## illustration of how to generate more datasets
set.seed(1)
datasets1 <- generateLongitudinalDatasets(2, 10, 5)
set.seed(1)
datasets2 <- generateLongitudinalDatasets(3, 10, 5)
stopifnot(all.equal(datasets1$generateData(1), datasets2$generateData(1)),
          all.equal(datasets1$generateData(2), datasets2$generateData(2)))
## contamination example: add outliers to 10% of observations
set.seed(42)
contam <- generateLongitudinalDatasets(
  numberOfDatasetsToGenerate = 5,
  numberOfSubjects = 20,
  numberOfTimepoints = 5,
  contamFun = function(data, info) {
    n <- nrow(data)
    idx <- sample(n, size = ceiling(0.1 * n))
    data$y[idx] <- data$y[idx] + 10
    data
  }
)
head(contam$generateData(1))
## contamination example: reassign a subject to a different treatment
set.seed(42)
contamGroup <- generateLongitudinalDatasets(
  numberOfDatasetsToGenerate = 5,
  numberOfSubjects = 20,
  numberOfTimepoints = 5,
  numberOfTreatmentLevels = 2,
  contamFun = function(data, info) {
    ## move first subject from T1 to T2
    data$treatment[data$id == 1] <- "T2"
    data
  }
)
head(contamGroup$generateData(1), 10)
## medsim: simulation inspired by the medication dataset from confintROB.
## Two subjects from the treatment group are mislabeled as control, and
## responses are truncated at a measurement floor of 100.
contaminateMedsim <- function(data, info) {
  data$y <- pmax(data$y, 100) # measurement floor
  data$treatment[data$id %in% c("2", "4")] <- "T1"
  data
}
set.seed(2000)
medsim <- generateLongitudinalDatasets(
  numberOfDatasetsToGenerate = 100,
  numberOfSubjects = 60,
  numberOfTimepoints = 7,
  numberOfTreatmentLevels = 2,
  timeRange = c(0, 18),
  trueBeta = c(240, -3.11, -2.42, 4.00),
  trueSigma = sqrt(1229.93),
  trueTheta = c(1.310266, -0.07547461, 0.2147735),
  contamFun = contaminateMedsim
)
head(medsim$generateData(1))