simulation_prediction_binary: Simulate Binary Longitudinal Data for Prediction

View source: R/simulation_prediction.R

simulation_prediction_binary    R Documentation

Description

Generates synthetic longitudinal data with binary outcomes for evaluating classification and prediction models. The function builds a latent continuous variable from covariates and random effects, then converts it into a probability using the CDF (inverse link function) selected by the residual argument. Binary outcomes are then drawn from that probability.

Usage

simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 1000,
  n_obs_per_sub = 5,
  seed = NULL,
  nonlinear = FALSE,
  residual = c("normal", "logistic", "t3", "t2"),
  randeff = c("MVN", "MVN_mixture", "skewed_MVN", "MVT3", "MVT2")
)

Arguments

train_prop

A numeric value between 0 and 1 indicating the proportion of the population to be used for the training set. Default: 0.7.

n_subject

An integer specifying the total number of subjects in the population. Default: 1000.

n_obs_per_sub

An integer specifying the number of observations per subject. Default: 5.

seed

An optional integer for setting the random seed to ensure reproducibility. Default: NULL.

nonlinear

A logical value. If TRUE, the latent variable is generated using a complex nonlinear function of the covariates. If FALSE, it is a linear combination. Default: FALSE.

residual

A character string specifying the CDF (inverse link function) used to convert the latent variable into probabilities. In a Generalized Linear Mixed Model (GLMM) context, this choice plays the role of the error distribution assumed for the latent variable:

  • "normal": Uses the standard normal CDF (Probit link).

  • "logistic": Uses the logistic CDF (Logit link).

  • "t3": Uses the Student's t (df=3) CDF.

  • "t2": Uses the Student's t (df=2) CDF.
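The four options above can be sketched with base R CDFs. This is an illustration only, not the package's internal code: the helper name latent_to_prob and the example values of eta are hypothetical, and any scaling the package applies to the latent variable beforehand is not shown.

```r
# Hypothetical helper (not part of SBMTrees): map a latent value to a
# probability using the CDF that corresponds to each `residual` option.
latent_to_prob <- function(eta, residual = c("normal", "logistic", "t3", "t2")) {
  residual <- match.arg(residual)
  switch(residual,
    normal   = pnorm(eta),        # standard normal CDF (probit)
    logistic = plogis(eta),       # logistic CDF (logit)
    t3       = pt(eta, df = 3),   # Student's t CDF, 3 degrees of freedom
    t2       = pt(eta, df = 2)    # Student's t CDF, 2 degrees of freedom
  )
}

eta <- c(-2, 0, 2)  # illustrative latent values
probs <- sapply(c("normal", "logistic", "t3", "t2"),
                function(r) latent_to_prob(eta, r))
rownames(probs) <- paste0("eta=", eta)
print(round(probs, 3))
```

All four CDFs are symmetric about zero, so each returns 0.5 at eta = 0; they differ in how quickly the probability approaches 0 or 1 as the latent value moves away from zero.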

randeff

A character string specifying the distribution of the random effects added to the latent variable. Options are:

  • "MVN": Multivariate Normal distribution.

  • "MVN_mixture": Mixture of Multivariate Normal distributions.

  • "skewed_MVN": Multivariate Skew-normal distribution.

  • "MVT3": Multivariate t-distribution with 3 degrees of freedom.

  • "MVT2": Multivariate t-distribution with 2 degrees of freedom.
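As a rough illustration of how two of these options differ, the sketch below draws bivariate random effects (intercept and slope) from a multivariate normal and from a multivariate t with 3 degrees of freedom, using base R only. The covariance matrix Sigma and the sample size are assumptions made for the example; the parameters the package actually uses are not documented here, and the mixture and skew-normal options are left out because their parameterization is not stated.

```r
set.seed(1)
n_subject <- 200

# Assumed covariance of (random intercept, random slope); illustrative only.
Sigma <- matrix(c(1.0, 0.3,
                  0.3, 0.5), nrow = 2, byrow = TRUE)
R <- chol(Sigma)  # upper-triangular factor with t(R) %*% R == Sigma

# "MVN"-style draw: independent standard normals times the Cholesky factor.
b_mvn <- matrix(rnorm(n_subject * 2), ncol = 2) %*% R

# "MVT3"-style draw: a normal draw rescaled per subject by sqrt(df / chi-square),
# the standard construction of a multivariate t.
df <- 3
w <- sqrt(df / rchisq(n_subject, df = df))
b_mvt3 <- (matrix(rnorm(n_subject * 2), ncol = 2) %*% R) * w  # row i scaled by w[i]

# Heavier tails show up as more extreme values in the t draws.
print(apply(abs(b_mvn), 2, max))
print(apply(abs(b_mvt3), 2, max))
```

With 2 or 3 degrees of freedom the t-distribution produces occasional very large random effects, which is presumably why these options are offered as a stress test for models that assume normality.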

Details

The function simulates a latent continuous variable Y^* as the sum of fixed effects (a linear or nonlinear function of the covariates X) and random effects (Z * b_i, where b_i is the random-effect vector of subject i). This latent variable is scaled and then transformed into a probability p using the CDF specified by residual.

For the training set, the observed outcome Y_train is sampled from a Bernoulli distribution with probability p. For the testing set, the function returns the probability p itself as Y_test rather than a sampled outcome, so a model's predicted probabilities (propensity scores or risk) can be compared directly with the ground truth.
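The generating process described above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the package's implementation: the number of covariates, the coefficients beta, the random-effect standard deviation, the standardization used for scaling, and the subject-level train/test split are all chosen for illustration.

```r
set.seed(42)
n_subject <- 100; n_obs_per_sub <- 5; train_prop <- 0.7
n <- n_subject * n_obs_per_sub

subject_id <- rep(seq_len(n_subject), each = n_obs_per_sub)
time <- rep(seq_len(n_obs_per_sub), times = n_subject)

X <- matrix(rnorm(n * 3), ncol = 3)  # assumed: three standard normal covariates
Z <- cbind(1, time)                  # random-effects design: intercept and time
beta <- c(1, -0.5, 0.25)             # assumed fixed-effect coefficients
b <- matrix(rnorm(n_subject * 2, sd = 0.5), ncol = 2)  # assumed MVN random effects

# Latent variable: fixed effects plus subject-specific random effects.
y_star <- drop(X %*% beta) + rowSums(Z * b[subject_id, ])
y_star <- as.numeric(scale(y_star))  # assumed scaling: center, unit variance

# Probability from the logistic CDF (the residual = "logistic" case).
p <- plogis(y_star)

# Assumed split: whole subjects are assigned to the training set.
train_subjects <- sample(n_subject, size = round(train_prop * n_subject))
I <- subject_id %in% train_subjects

Y_train <- rbinom(sum(I), size = 1, prob = p[I])  # observed binary outcomes
Y_test <- p[!I]                                   # true probabilities, not sampled
```

Because Y_test holds probabilities rather than 0/1 labels, a fitted model can be scored by, for example, the mean squared error between its predicted probabilities and Y_test.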

Value

A list containing the following components:

subject_id_train

A vector of subject IDs for the training set.

Z_train

A matrix of random-effects predictors (intercept and time) for the training set.

X_train

A matrix of covariates for the training set.

Y_train

A vector of observed binary outcomes (0 or 1) for the training set.

subject_id_test

A vector of subject IDs for the testing set.

Z_test

A matrix of random-effects predictors (intercept and time) for the testing set.

X_test

A matrix of covariates for the testing set.

Y_test

A vector of true probabilities for the testing set. These represent the ground truth propensity scores (0 to 1) used for evaluation.

X_pop

A matrix of covariates for the entire population.

y_pop

A vector of true probabilities for the entire population.

I

A logical vector indicating which observations belong to the training set.

X_src

Duplicate of X_train, provided for convenience.

Y_src

A vector of true probabilities for the training set (unlike Y_train, which is binary).

Examples

# Simulate data with logistic link (Logit) and mixture of normal random effects
sim_bin <- simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 500,
  residual = "logistic",
  randeff = "MVN_mixture",
  seed = 123
)

SBMTrees documentation built on Feb. 6, 2026, 5:08 p.m.