generateLongitudinalDatasets: Generate Longitudinal Datasets

View source: R/generateLongitudinalDatasets.R

generateLongitudinalDatasets    R Documentation

Generate Longitudinal Datasets

Description

Generate balanced longitudinal datasets with random intercepts and slopes. Subjects are observed at multiple time points with optional treatment groups. Treatment and its interaction with time are coded as contrasts relative to the first level.

Usage

generateLongitudinalDatasets(
  numberOfDatasetsToGenerate,
  numberOfSubjects,
  numberOfTimepoints,
  numberOfTreatmentLevels = 1L,
  timeRange = c(0, 1),
  errorGenerator = rnorm,
  randomEffectGenerator = rnorm,
  trueBeta = 0,
  trueSigma = 1,
  trueTheta = c(1, 0, 1),
  contamFun = NULL,
  ...
)

Arguments

numberOfDatasetsToGenerate

number of datasets to generate.

numberOfSubjects

number of subjects per dataset.

numberOfTimepoints

number of observation time points per subject.

numberOfTreatmentLevels

number of treatment levels. Default: 1 (no treatment effect, intercept and time only).

timeRange

numeric vector of length 2, range of time values (min, max). Default: c(0, 1).

errorGenerator

random number generator used for the errors. Called as errorGenerator(n) * trueSigma.

randomEffectGenerator

random number generator used for the spherical random effects. Called as randomEffectGenerator(n) * trueSigma.

trueBeta

scalar or vector with the true values of the fixed-effects coefficients. If of length one, it is replicated to the required length. The order is: intercept, time, treatment contrasts (if any), treatment-by-time interactions (if any).

trueSigma

scalar with the true value of the error scale.

trueTheta

numeric vector of length 3 with the true values for the Cholesky factor of the random effects covariance matrix (lme4 convention). Default: c(1, 0, 1) (independent random intercepts and slopes).

contamFun

optional contamination function. If provided, it receives the full dataset (a data frame with columns id, time, treatment, y) and an info list, and must return the (possibly modified) data frame. This allows arbitrary contamination including changing group assignments. See Details for the contents of the info list.

...

all additional arguments are added to the returned list.

Details

The generated data follows the model:

y_{ij} = \beta_0 + \beta_1 \cdot \text{time}_{ij} + \sum_{k=1}^{K-1} \beta_{1+k} \cdot \text{treatment}_{k,i} + \sum_{k=1}^{K-1} \beta_{K+k} \cdot \text{treatment}_{k,i} \cdot \text{time}_{ij} + b_{0i} + b_{1i} \cdot \text{time}_{ij} + \epsilon_{ij}

where K is the number of treatment groups (numberOfTreatmentLevels), \sigma is trueSigma, and b_i = (b_{0i}, b_{1i})^T \sim N(0, \sigma^2 \Lambda \Lambda^T), with \Lambda the lower-triangular Cholesky factor reconstructed from trueTheta.
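As a rough illustration of how the terms of this model combine, here is a base R sketch for a single treatment group (K = 1). It is not the package's implementation. The equally spaced time grid and all variable names are assumptions made for the example.

```r
set.seed(1)
nSubjects <- 4L
nTimepoints <- 3L
beta <- c(2, -1)    # intercept, time
sigma <- 0.5
Lambda <- diag(2)   # identity factor, as implied by theta = c(1, 0, 1)

## Assumption: equally spaced time points over the range c(0, 1)
time <- seq(0, 1, length.out = nTimepoints)
grid <- data.frame(id = rep(seq_len(nSubjects), each = nTimepoints),
                   time = rep(time, times = nSubjects))

## b_i = Lambda %*% u_i with spherical u_i scaled by sigma; one column per subject
b <- Lambda %*% matrix(sigma * rnorm(2L * nSubjects), nrow = 2L)
eps <- sigma * rnorm(nrow(grid))

## fixed part + random intercept + random slope * time + error
grid$y <- beta[1] + beta[2] * grid$time +
  b[1, grid$id] + b[2, grid$id] * grid$time + eps
head(grid)
```

With more than one treatment level, the contrast and interaction columns would enter the fixed part in the same way.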

The theta parameterization follows lme4 conventions:

  • For a 2x2 random-effects covariance structure (intercept and slope), theta has 3 elements: \theta = (\lambda_{11}, \lambda_{21}, \lambda_{22})

  • The Cholesky factor is: \Lambda = \begin{pmatrix} \lambda_{11} & 0 \\ \lambda_{21} & \lambda_{22} \end{pmatrix}
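The mapping from theta to \Lambda can be sketched in base R as follows. The helper name thetaToLambda is hypothetical and not part of robustlmm. It fills the lower triangle column by column, which reproduces the ordering (\lambda_{11}, \lambda_{21}, \lambda_{22}).

```r
## Hypothetical helper, not part of robustlmm: rebuild the 2x2 lower-triangular
## Cholesky factor from a length-3 theta vector.
thetaToLambda <- function(theta) {
  stopifnot(is.numeric(theta), length(theta) == 3L)
  Lambda <- matrix(0, nrow = 2L, ncol = 2L)
  ## lower.tri() selects (1,1), (2,1), (2,2) in column-major order
  Lambda[lower.tri(Lambda, diag = TRUE)] <- theta
  Lambda
}

theta <- c(1.310266, -0.07547461, 0.2147735)  # values from the medsim example
sigma <- sqrt(1229.93)
Lambda <- thetaToLambda(theta)

## implied covariance matrix of (b_0i, b_1i): sigma^2 * Lambda %*% t(Lambda)
covB <- sigma^2 * Lambda %*% t(Lambda)
covB
```

For the default theta = c(1, 0, 1) the factor is the identity matrix, so random intercepts and slopes are independent with variance \sigma^2 each.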

To save memory, only the generated random effects and errors are stored. A dataset is created on demand when the function generateData in the returned list is called.

The random variables are generated such that more datasets can be simulated easily. When starting from the same seed, the first generated datasets are identical to those from a previous call of generateLongitudinalDatasets with a smaller number of datasets to generate. See the examples.
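One way such a property can arise is by drawing the random numbers dataset by dataset, so that the draws for dataset k depend only on the seed and on k. The base R sketch below illustrates the idea. The function drawSequentially is hypothetical, and the mechanism is an assumption, not a description of the actual implementation.

```r
## Hypothetical generator, not the robustlmm implementation: draws for
## dataset k are made strictly after those for datasets 1, ..., k - 1.
drawSequentially <- function(numberOfDatasets, n) {
  lapply(seq_len(numberOfDatasets), function(k) rnorm(n))
}

set.seed(1)
small <- drawSequentially(2, 5)
set.seed(1)
large <- drawSequentially(3, 5)

## the first two datasets agree; only the third one is new
stopifnot(identical(small, large[1:2]))
```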

Treatment Assignment: Subjects are assigned to treatment groups in a balanced, deterministic manner. Subject i is assigned to treatment (i - 1) mod numberOfTreatmentLevels + 1.
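The assignment rule can be reproduced in base R as below. This is a minimal sketch. The labels "T1", "T2", ... mirror those used in the examples, while storing the result as a factor is an assumption.

```r
numberOfSubjects <- 6L
numberOfTreatmentLevels <- 3L

subject <- seq_len(numberOfSubjects)
## (i - 1) mod K + 1 cycles through 1, 2, ..., K
treatmentIndex <- (subject - 1L) %% numberOfTreatmentLevels + 1L
treatmentIndex
# 1 2 3 1 2 3

treatment <- factor(paste0("T", treatmentIndex),
                    levels = paste0("T", seq_len(numberOfTreatmentLevels)))
table(treatment)
```

Under this rule the group sizes are exactly equal only when numberOfSubjects is a multiple of numberOfTreatmentLevels. Otherwise they differ by at most one.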

Contamination Function: If contamFun is provided, it is called as contamFun(data, info) after the response y is computed. The info list contains:

  • datasetIndex: the dataset index

  • randomEffects: the random effects vector for this dataset

  • errors: the error vector for this dataset

  • trueBeta: as passed to generateLongitudinalDatasets

  • trueSigma: as passed to generateLongitudinalDatasets

  • trueTheta: as passed to generateLongitudinalDatasets

The function must return a data frame with the same structure (columns id, time, treatment, y). This allows arbitrary modifications including:

  • Modifying the response y (e.g., adding outliers)

  • Changing group assignments (e.g., moving subjects between treatments)

  • Modifying time values

  • Any combination of the above
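A contamination function can also use the info list, for instance to contaminate only some of the datasets. The sketch below is tested on a hand-built data frame. The function name contaminateEvenDatasets and the column types of the toy data are assumptions, and only info$datasetIndex is used.

```r
## Shift the responses of subject "1" by 10, but only in even-numbered datasets.
contaminateEvenDatasets <- function(data, info) {
  if (info$datasetIndex %% 2 == 0) {
    isFirst <- data$id == "1"
    data$y[isFirst] <- data$y[isFirst] + 10
  }
  data
}

## Hand-built stand-in for the data frame the function would receive
toy <- data.frame(
  id = factor(rep(1:2, each = 3)),
  time = rep(c(0, 0.5, 1), times = 2),
  treatment = factor(rep(c("T1", "T2"), each = 3)),
  y = c(1, 2, 3, 4, 5, 6)
)

contaminateEvenDatasets(toy, list(datasetIndex = 1))$y
# 1 2 3 4 5 6
contaminateEvenDatasets(toy, list(datasetIndex = 2))$y
# 11 12 13 4 5 6
```

Such a function would be passed as the contamFun argument. The returned data frame keeps the columns id, time, treatment and y.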

Value

list with generators and the original arguments

generateData:

function to generate data taking one argument, the dataset index.

createXMatrix:

function to generate X matrix taking one argument, the result of generateData.

createZMatrix:

function to generate Z matrix taking one argument, the result of generateData.

createLambdaMatrix:

function to generate Lambda matrix taking one argument, the result of generateData.

randomEffects:

function to return the generated random effects taking one argument, the dataset index.

sphericalRandomEffects:

function to return the generated spherical random effects taking one argument, the dataset index.

errors:

function to return the generated errors taking one argument, the dataset index.

allRandomEffects:

function without arguments that returns the matrix of all generated random effects.

allErrors:

function without arguments that returns the matrix of all generated errors.

numberOfDatasets:

numberOfDatasetsToGenerate as supplied

numberOfSubjects:

numberOfSubjects as supplied

numberOfTimepoints:

numberOfTimepoints as supplied

numberOfTreatmentLevels:

numberOfTreatmentLevels as supplied

numberOfRows:

number of rows in the generated dataset

trueBeta:

true values used for beta

trueSigma:

true value used for sigma

trueTheta:

true values used for theta

formula:

formula to fit the model using lmer

...:

additional arguments passed via ...

Author(s)

Manuel Koller

See Also

generateAnovaDatasets, generateMixedEffectDatasets

Examples

  oneGroup <- generateLongitudinalDatasets(2, 10, 5)
  head(oneGroup$generateData(1))
  head(oneGroup$generateData(2))
  oneGroup$formula

  twoGroups <- generateLongitudinalDatasets(2, 20, 5, numberOfTreatmentLevels = 2)
  head(twoGroups$generateData(1))
  twoGroups$formula

  ## illustration of how to generate more datasets
  set.seed(1)
  datasets1 <- generateLongitudinalDatasets(2, 10, 5)
  set.seed(1)
  datasets2 <- generateLongitudinalDatasets(3, 10, 5)
  stopifnot(all.equal(datasets1$generateData(1), datasets2$generateData(1)),
            all.equal(datasets1$generateData(2), datasets2$generateData(2)))

  ## contamination example: add outliers to 10% of observations
  set.seed(42)
  contam <- generateLongitudinalDatasets(
    numberOfDatasetsToGenerate = 5,
    numberOfSubjects = 20,
    numberOfTimepoints = 5,
    contamFun = function(data, info) {
      n <- nrow(data)
      idx <- sample(n, size = ceiling(0.1 * n))
      data$y[idx] <- data$y[idx] + 10
      data
    }
  )
  head(contam$generateData(1))

  ## contamination example: reassign some subjects to different treatment
  set.seed(42)
  contamGroup <- generateLongitudinalDatasets(
    numberOfDatasetsToGenerate = 5,
    numberOfSubjects = 20,
    numberOfTimepoints = 5,
    numberOfTreatmentLevels = 2,
    contamFun = function(data, info) {
      ## move first subject from T1 to T2
      data$treatment[data$id == 1] <- "T2"
      data
    }
  )
  head(contamGroup$generateData(1), 10)

  ## medsim: simulation inspired by the medication dataset from confintROB
  ## Two subjects from the treatment group (T2) are mislabeled as control (T1), and responses
  ## are truncated at a measurement floor of 100.
  contaminateMedsim <- function(data, info) {
    data$y <- pmax(data$y, 100)  # measurement floor
    data$treatment[data$id %in% c("2", "4")] <- "T1"
    data
  }
  set.seed(2000)
  medsim <- generateLongitudinalDatasets(
    numberOfDatasetsToGenerate = 100,
    numberOfSubjects = 60,
    numberOfTimepoints = 7,
    numberOfTreatmentLevels = 2,
    timeRange = c(0, 18),
    trueBeta = c(240, -3.11, -2.42, 4.00),
    trueSigma = sqrt(1229.93),
    trueTheta = c(1.310266, -0.07547461, 0.2147735),
    contamFun = contaminateMedsim
  )
  head(medsim$generateData(1))


robustlmm documentation built on Jan. 29, 2026, 1:10 a.m.