simulation_prediction_conti: Simulate Continuous Longitudinal Data for Prediction

View source: R/simulation_prediction.R

simulation_prediction_conti    R Documentation

Simulate Continuous Longitudinal Data for Prediction

Description

Generates synthetic longitudinal data with continuous outcomes, designed for evaluating prediction models. The function creates a population of subjects with correlated covariates and outcomes, then splits them into training and testing sets. Random effects and residuals can be drawn from non-normal distributions (skewed, mixture, or heavy-tailed t), and the outcome can optionally be a nonlinear function of the covariates.

Usage

simulation_prediction_conti(
  train_prop = 0.7,
  n_subject = 1000,
  n_obs_per_sub = 5,
  seed = NULL,
  nonlinear = FALSE,
  residual = c("normal", "normal_mixture", "skewed_normal", "t3", "t2"),
  randeff = c("MVN", "MVN_mixture", "skewed_MVN", "MVT3", "MVT2")
)

Arguments

train_prop

A numeric value between 0 and 1 indicating the proportion of the population to be used for the training set. Default: 0.7.

n_subject

An integer specifying the total number of subjects in the population. Default: 1000.

n_obs_per_sub

An integer specifying the number of observations per subject. Default: 5.

seed

An optional integer for setting the random seed to ensure reproducibility. Default: NULL.

nonlinear

A logical value. If TRUE, the outcome Y is generated using a complex nonlinear function of the covariates. If FALSE, Y is a linear combination of covariates. Default: FALSE.

residual

A character string specifying the distribution of the residual errors added to the training outcome. Options are:

  • "normal": Standard normal distribution.

  • "normal_mixture": Mixture of two normal distributions.

  • "skewed_normal": Skew-normal distribution.

  • "t3": Student's t-distribution with 3 degrees of freedom.

  • "t2": Student's t-distribution with 2 degrees of freedom.

randeff

A character string specifying the distribution of the random effects. Options are:

  • "MVN": Multivariate Normal distribution.

  • "MVN_mixture": Mixture of Multivariate Normal distributions.

  • "skewed_MVN": Multivariate Skew-normal distribution.

  • "MVT3": Multivariate t-distribution with 3 degrees of freedom.

  • "MVT2": Multivariate t-distribution with 2 degrees of freedom.
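As a rough illustration of what the residual options above could look like, the sketch below draws each family using base R only. The helper name draw_residual, the mixture weights and means, and the skewness parameter alpha are assumptions made for this example; the actual parameters used inside SBMTrees are not documented here and may differ. The use of match.arg() (so that the first element of the choice vector acts as the default) is also an assumption about how the function resolves its residual and randeff arguments.

```r
# Hypothetical helper, not part of SBMTrees: one draw per residual family.
draw_residual <- function(n, type = c("normal", "normal_mixture",
                                      "skewed_normal", "t3", "t2")) {
  type <- match.arg(type)  # first choice ("normal") is used when type is missing
  switch(type,
    normal = rnorm(n),
    normal_mixture = {
      # Assumed: equal-weight mixture of N(-2, 1) and N(2, 1)
      comp <- rbinom(n, size = 1, prob = 0.5)
      rnorm(n, mean = ifelse(comp == 1, 2, -2), sd = 1)
    },
    skewed_normal = {
      # Standard stochastic representation of a skew-normal variate:
      # delta * |Z0| + sqrt(1 - delta^2) * Z1, with Z0, Z1 iid N(0, 1)
      alpha <- 5  # assumed skewness parameter
      delta <- alpha / sqrt(1 + alpha^2)
      delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n)
    },
    t3 = rt(n, df = 3),  # heavy tails, finite variance
    t2 = rt(n, df = 2)   # heavier tails, infinite variance
  )
}

set.seed(1)
e_skew <- draw_residual(1000, "skewed_normal")
e_default <- draw_residual(10)  # falls back to "normal"
```

The t2 case is the most demanding of the five for a prediction model, since a t-distribution with 2 degrees of freedom has no finite variance.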

Details

The function first simulates correlated covariates X from a multivariate normal distribution and adds subject-specific random variation. The outcome Y is then built from X (linearly or nonlinearly, depending on nonlinear) plus the random-effects term Z * Bi, where the subject-level effects Bi are drawn from the distribution specified by randeff.

The data are split into training and testing sets based on train_prop. Residual noise (specified by residual) is added only to Y_train. The Y_test values hold the conditional mean (fixed effects plus random effects) and serve as the ground truth for prediction tasks that aim to recover the denoised signal.
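The steps described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: the number of covariates, the coefficient vector beta, the covariance matrices, the linear form of the signal, and the choice to split by subject are all made up for the example.

```r
# Illustrative only: all dimensions, coefficients, and covariances are assumed.
set.seed(1)
n_subject <- 50
n_obs_per_sub <- 5
p <- 3
n <- n_subject * n_obs_per_sub
subject_id <- rep(seq_len(n_subject), each = n_obs_per_sub)
time <- rep(seq_len(n_obs_per_sub), times = n_subject)

# Correlated covariates: iid normals multiplied by a Cholesky factor of Sigma_x
Sigma_x <- 0.5 ^ abs(outer(1:p, 1:p, "-"))
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma_x)

# Random-effects design (intercept and time) and one effect vector per subject
Z <- cbind(intercept = 1, time = time)
B <- matrix(rnorm(n_subject * 2), n_subject, 2) %*% chol(diag(c(1, 0.25)))
rand_part <- rowSums(Z * B[subject_id, ])  # row-wise Z_ij' b_i

# Conditional mean: fixed-effects signal plus random effects (linear case)
beta <- c(1, -0.5, 0.25)
y_true <- drop(X %*% beta) + rand_part

# Split by subject (assumed); residual noise goes on the training outcome only
train_ids <- sample(n_subject, size = round(0.7 * n_subject))
I <- subject_id %in% train_ids
Y_train <- y_true[I] + rnorm(sum(I))
Y_test <- y_true[!I]  # noise-free ground truth
```

Because Y_test carries no residual error, a test-set error computed against it measures how well a model recovers the conditional mean rather than how well it matches noisy observations.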

Value

A list containing the following components:

subject_id_train

A vector of subject IDs for the training set.

Z_train

A matrix of random-effects predictors (time/intercept) for the training set.

X_train

A matrix of covariates for the training set.

Y_train

A vector of observed outcomes for the training set (Signal + Random Effects + Residual Error).

subject_id_test

A vector of subject IDs for the testing set.

Z_test

A matrix of random-effects predictors (time/intercept) for the testing set.

X_test

A matrix of covariates for the testing set.

Y_test

A vector of "true" outcomes for the testing set (Signal + Random Effects), without residual error.

X_pop

A matrix of covariates for the entire population.

y_pop

A vector of "true" outcomes for the entire population (Signal + Random Effects).

I

A logical vector indicating which observations belong to the training set.

X_src

Duplicate of X_train, provided for convenience.

Y_src

Duplicate of Y_train, provided for convenience.

Examples

sim_data <- simulation_prediction_conti(
  train_prop = 0.7,
  n_subject = 200,
  n_obs_per_sub = 5,
  nonlinear = TRUE,
  residual = "normal",
  randeff = "skewed_MVN",
  seed = 123
)
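
One possible way to use the returned list is to fit a simple baseline on the training components and score it against the noise-free Y_test. The helper eval_lm_baseline below is hypothetical and only relies on the component names documented under Value; it ignores the subject structure entirely, so it is a floor to compare against rather than a recommended model. The call through SBMTrees:: assumes the function is exported and is skipped if the package is not installed.

```r
# Hypothetical helper: pooled linear regression scored by test RMSE.
eval_lm_baseline <- function(sim) {
  train <- as.data.frame(sim$X_train)
  test <- as.data.frame(sim$X_test)
  train$y <- sim$Y_train
  fit <- lm(y ~ ., data = train)
  pred <- predict(fit, newdata = test)
  sqrt(mean((pred - sim$Y_test)^2))  # RMSE against the denoised outcome
}

if (requireNamespace("SBMTrees", quietly = TRUE)) {
  sim_data <- SBMTrees::simulation_prediction_conti(
    n_subject = 200,
    residual = "normal",
    randeff = "skewed_MVN",
    seed = 123
  )
  eval_lm_baseline(sim_data)
}
```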

SBMTrees documentation built on Feb. 6, 2026, 5:08 p.m.