View source: R/simulation_prediction.R
| simulation_prediction_binary | R Documentation |
Generates synthetic longitudinal data with binary outcomes, designed for evaluating
classification and prediction models. The function creates a latent continuous variable based on
covariates and random effects, then converts it into binary outcomes through the link function
selected by the residual argument.
simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 1000,
  n_obs_per_sub = 5,
  seed = NULL,
  nonlinear = FALSE,
  residual = c("normal", "logistic", "t3", "t2"),
  randeff = c("MVN", "MVN_mixture", "skewed_MVN", "MVT3", "MVT2")
)
train_prop |
A numeric value between 0 and 1 indicating the proportion of the population to be used
for the training set. Default: 0.7. |
n_subject |
An integer specifying the total number of subjects in the population. Default: 1000. |
n_obs_per_sub |
An integer specifying the number of observations per subject. Default: 5. |
seed |
An optional integer for setting the random seed to ensure reproducibility. Default: NULL. |
nonlinear |
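A logical value. If TRUE, the fixed effects are built from a nonlinear rather than a linear function of the covariates X (see Details). Default: FALSE. |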
residual |
A character string specifying the link function (CDF) used to generate probabilities from the latent variable. This effectively acts as the error distribution assumption in a Generalized Linear Mixed Model (GLMM) context:
One of "normal", "logistic", "t3", or "t2", as listed in Usage. |
randeff |
A character string specifying the distribution of the random effects added to the latent variable. Options are:
"MVN", "MVN_mixture", "skewed_MVN", "MVT3", or "MVT2", as listed in Usage. |
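To give a feel for how the random-effects families behind randeff differ, here is a minimal base-R sketch. It is not the package's code: draw_randeff() is a hypothetical helper, the covariance Sigma and the mixture means of -1 and +1 are assumptions, and reading "MVT3" as a multivariate t with 3 degrees of freedom is an interpretation of the option name.

```r
# Hypothetical helper, not part of the package: draw n bivariate random
# effects (intercept, slope) from three illustrative families.
draw_randeff <- function(n, type = c("MVN", "MVN_mixture", "MVT3"),
                         Sigma = diag(2)) {
  type <- match.arg(type)
  q <- ncol(Sigma)
  # Standard normal draws times the Cholesky factor give N(0, Sigma) rows.
  mvn <- function(m) matrix(rnorm(m * q), nrow = m) %*% chol(Sigma)
  switch(type,
    MVN = mvn(n),
    MVN_mixture = {
      # Assumed two-component mixture: each subject is shifted by -1 or +1.
      shift <- sample(c(-1, 1), n, replace = TRUE)
      mvn(n) + shift
    },
    MVT3 = {
      # Multivariate t, 3 df: normal rows divided by sqrt(chi-square / df).
      mvn(n) / sqrt(rchisq(n, df = 3) / 3)
    }
  )
}

set.seed(1)
b_mvn <- draw_randeff(1000, "MVN")
b_mix <- draw_randeff(1000, "MVN_mixture")
b_t3  <- draw_randeff(1000, "MVT3")
```

Comparing summaries of b_mvn, b_mix, and b_t3 (for example with apply(b_t3, 2, range)) shows the bimodality and heavier tails that make a normal random-effects assumption in a fitted model incorrect.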
The function simulates a latent continuous variable Y^* based on fixed effects (linear or nonlinear X)
and random effects (Z * Bi). This latent variable is scaled and then transformed into a probability p
using the CDF specified by residual.
For the training set, the observed outcome Y_train is sampled from a Bernoulli distribution
with probability p. For the testing set, the function returns the probability p itself (Y_test),
allowing for precise evaluation of the model's ability to estimate propensity scores or risk.
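The steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's actual implementation: the coefficients beta, the random-effects variance, the use of scale() for the scaling step, and the mapping of the residual names to pnorm, plogis, and pt are all guesses made for the example.

```r
set.seed(42)
n_subject <- 200
n_obs_per_sub <- 5
id <- rep(seq_len(n_subject), each = n_obs_per_sub)
n <- length(id)

# Fixed effects: two covariates with assumed coefficients.
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
beta <- c(1, -0.5)

# Random-effects design (intercept and time) and one row of Bi per subject.
Z <- cbind(1, rep(seq_len(n_obs_per_sub) - 1, times = n_subject))
B <- matrix(rnorm(n_subject * 2, sd = 0.5), ncol = 2)

# Latent variable: fixed part plus the subject-specific part Z * Bi.
y_star <- drop(X %*% beta) + rowSums(Z * B[id, ])

# Assumed scaling step: standardise the latent variable.
y_star <- as.numeric(scale(y_star))

# Assumed mapping from the residual option names to CDFs.
link_cdf <- list(
  normal   = pnorm,
  logistic = plogis,
  t3       = function(q) pt(q, df = 3),
  t2       = function(q) pt(q, df = 2)
)
p <- link_cdf[["logistic"]](y_star)

# Training-style outcome is a Bernoulli draw; test-style outcome is p itself.
y_binary <- rbinom(n, size = 1, prob = p)
```

Swapping "logistic" for "t2" in the lookup keeps the same latent variable but pushes fewer probabilities toward 0 and 1, since the t CDF approaches its limits more slowly.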
A list containing the following components:
A vector of subject IDs for the training set.
A matrix of random predictors (time/intercept) for the training set.
A matrix of covariates for the training set.
A vector of observed binary outcomes (0 or 1) for the training set.
A vector of subject IDs for the testing set.
A matrix of random predictors for the testing set.
A matrix of covariates for the testing set.
A vector of true probabilities for the testing set. These represent the ground truth propensity scores (0 to 1) used for evaluation.
A matrix of covariates for the entire population.
A vector of true probabilities for the entire population.
A logical vector indicating which observations belong to the training set.
Duplicate of X_train, provided for convenience.
Vector of true probabilities for the training set (unlike Y_train which is binary).
# Simulate data with logistic link (Logit) and mixture of normal random effects
sim_bin <- simulation_prediction_binary(
  train_prop = 0.7,
  n_subject = 500,
  residual = "logistic",
  randeff = "MVN_mixture",
  seed = 123
)
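Because the test outcome is a true probability rather than a 0/1 label, predictions can be scored directly against it. The sketch below shows one way to do that. To stay runnable without the package it builds a stand-in list instead of calling simulation_prediction_binary(): X_train, Y_train, and Y_test are names used in this page, while X_test, the generating model, and the choice of glm() as the fitted model are assumptions for illustration.

```r
set.seed(7)

# Stand-in for the returned list; X_test is an assumed component name.
make_x <- function(n) cbind(x1 = rnorm(n), x2 = rnorm(n))
true_p <- function(X) plogis(drop(X %*% c(1, -0.5)))
X_train <- make_x(700)
X_test <- make_x(300)
sim <- list(
  X_train = X_train,
  Y_train = rbinom(700, size = 1, prob = true_p(X_train)),  # binary labels
  X_test  = X_test,
  Y_test  = true_p(X_test)                                   # true probabilities
)

# Fit a plain logistic regression that ignores the random effects.
train_df <- data.frame(y = sim$Y_train, sim$X_train)
fit <- glm(y ~ x1 + x2, family = binomial(), data = train_df)

# Predicted probabilities for the test rows.
p_hat <- predict(fit, newdata = data.frame(sim$X_test), type = "response")

# Score predictions against the true probabilities.
rmse <- sqrt(mean((p_hat - sim$Y_test)^2))
mae  <- mean(abs(p_hat - sim$Y_test))
```

With real output from the function, the same two error measures would additionally reflect how well a model recovers the subject-level random effects, which this fixed-effects-only stand-in does not include.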