In BayesPIM: Bayesian Prevalence-Incidence Mixture Model

gen.dat

R Documentation

gen.dat: Simulate Screening Data for a Prevalence-Incidence Mixture Model

Description

Generates synthetic data according to the Bayesian prevalence-incidence mixture (PIM) framework of Klausch et al. (2025) with interval-censored screening outcomes. The function simulates continuous baseline covariates (and optionally one binary covariate), event times from one of several parametric families, and irregular screening schedules. The result is a set of interval-censored observations suitable for testing or demonstrating PIM-based or other interval-censored survival methods.

Usage

gen.dat(
  kappa = 0.7,
  n = 1000,
  p = 2,
  p.discrete = 0,
  r = 0,
  s = 1,
  sigma.X = 1/2,
  mu.X = 4,
  beta.X = NULL,
  beta.W = NULL,
  theta = 0.15,
  v.min = 1,
  v.max = 6,
  mean.rc = 40,
  dist.X = "weibull",
  k = 1,
  sel.mod = "probit",
  prob.r = 0
)

Arguments

`kappa`	Numeric. Test sensitivity parameter `\kappa` used when generating misclassification. A value of 1 implies perfect sensitivity.
`n`	Integer. Sample size.
`p`	Integer. Number of continuous baseline covariates to simulate.
`p.discrete`	Integer. If `1`, adds one binary covariate `Z_{\mathrm{discrete}}` drawn from `\mathrm{Bernoulli}(0.5)`; otherwise no discrete covariate is added.
`r`	Numeric. Correlation coefficient(s) used to build the covariance matrix of continuous covariates. If `p > 1`, off-diagonal entries of the correlation matrix are set to `r`.
`s`	Numeric. Standard deviation(s) of the continuous covariates. If `p > 1`, all continuous covariates share the same `s`.
`sigma.X`	Numeric. Scale parameter `\sigma_X` in the AFT model for `\log(x_i)`.
`mu.X`	Numeric. Intercept `\beta_{x0}` in the AFT model `\log(x_i) = \beta_{x0} + \beta_{x}^\top z_{xi} + \sigma_X \epsilon_i`. In practice, `mu.X` is prepended to `beta.X` to form the full coefficient vector.
`beta.X`	Numeric vector. The coefficients `\beta_{x}` for the AFT model. Combined with `mu.X`, the log-scale model is `cbind(1, Z_i) %*% c(mu.X, beta.X)`.
`beta.W`	Numeric vector. The coefficients `\beta_{w}` for the prevalence model. The intercept `\beta_{w0}` is derived from `theta`.
`theta`	Numeric. Baseline prevalence parameter on the probability scale, used to derive the intercept `\beta_{w0}`. Under `sel.mod = "probit"`, `\beta_{w0} = \mathrm{qnorm}(\theta)`; under `sel.mod = "logit"`, `\beta_{w0} = \log(\theta / (1 - \theta))`.
`v.min`	Numeric. Minimum spacing for irregular screening intervals.
`v.max`	Numeric. Maximum spacing for irregular screening intervals.
`mean.rc`	Numeric. Mean of the exponential distribution controlling a random right-censoring time `t_{\mathrm{rc}}` after the first screening.
`dist.X`	Character. Distribution for survival times `x_i`: `"weibull"`, `"lognormal"`, `"loglog"` (log-logistic), or `"gengamma"` (generalized gamma).
`k`	Numeric. Shape parameter, used only when `dist.X = "gengamma"`.
`sel.mod`	Character. Either `"probit"` or `"logit"`, specifying the link function for the prevalence model.
`prob.r`	Numeric. Probability that a baseline test is performed (`r_i = 1`). If `prob.r = 0`, no baseline tests are done.

Details

The data-generating process includes:

Covariates Z: Continuous covariates are simulated using a correlation structure specified by r and a common standard deviation s. If p.discrete = 1, a single discrete covariate is added, drawn from \mathrm{Bernoulli}(0.5).
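The covariate step can be illustrated with a short base-R sketch. This is not the package's internal code: it assumes the continuous covariates are multivariate normal with mean zero, which the text above does not state, and all object names are illustrative.

```r
# Sketch only: assumes mean-zero multivariate normal covariates.
set.seed(1)
n <- 500; p <- 3; r <- 0.3; s <- 2

# Correlation matrix with r off the diagonal, scaled to a covariance by s^2
R <- matrix(r, nrow = p, ncol = p)
diag(R) <- 1
Sigma <- s^2 * R

# Correlated normal draws via the (upper-triangular) Cholesky factor
Z <- matrix(rnorm(n * p), nrow = n) %*% chol(Sigma)

# Optional Bernoulli(0.5) covariate, appended as the last column
Z <- cbind(Z, rbinom(n, size = 1, prob = 0.5))
colnames(Z) <- c(paste0("Z", seq_len(p)), "Z.discrete")

round(cov(Z[, seq_len(p)]), 2)  # sample covariance, to compare with Sigma
```

The sample covariance of the continuous columns should be close to Sigma, with s^2 on the diagonal and r * s^2 elsewhere.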
Event Times X: An Accelerated Failure Time (AFT) model is used:

\log(x_i) = \beta_{x0} + \beta_{x}^\top z_{xi} + \sigma_X \,\epsilon_i,

where \beta_{x0} is the intercept (set by mu.X) and \beta_{x} are the other regression coefficients (provided via beta.X). The error term \epsilon_i is drawn from the distribution chosen by dist.X: "weibull", "lognormal", "loglog" (log-logistic), or "gengamma" (generalized gamma). For "gengamma", the shape parameter k is additionally used.
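As a rough illustration of the AFT equation, the sketch below draws event times in base R for three of the four families. The helper draw_eps is hypothetical, the standardized error distributions are the textbook ones for each family (they are assumed, not taken from the package source), and "gengamma" is left out because its parameterization is not given here.

```r
# Sketch only: not the package's internal code.
set.seed(2)
n <- 1000
mu.X <- 4; sigma.X <- 0.5; beta.X <- c(0.2, -0.1)
Z <- matrix(rnorm(n * 2), nrow = n)

# Hypothetical helper: standardized AFT error for three families
draw_eps <- function(n, dist.X) {
  switch(dist.X,
    weibull   = log(rexp(n)),  # standard minimum extreme-value error
    lognormal = rnorm(n),      # standard normal error
    loglog    = rlogis(n),     # standard logistic error
    stop("family not covered in this sketch"))
}

# log(x_i) = cbind(1, Z_i) %*% c(mu.X, beta.X) + sigma.X * eps_i
lin.pred <- drop(cbind(1, Z) %*% c(mu.X, beta.X))
x <- exp(lin.pred + sigma.X * draw_eps(n, "weibull"))
summary(x)
```

With the Weibull choice, x_i follows a Weibull distribution with shape 1 / sigma.X and scale exp(lin.pred).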
Irregular Screening Schedules V_i: Each individual has a sequence of screening times whose successive spacings are drawn at random between v.min and v.max. The sequence ends at right censoring or at the time of detection. These screening times (including a 0 for baseline and Inf for censoring) are returned in Vobs.
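One way to picture such a schedule is sketched below. Several details are assumptions made for illustration only: gaps between screens are uniform on [v.min, v.max], the censoring time is the first screening time plus an exponential draw with mean mean.rc, and the test is perfectly sensitive. The helper sim_schedule and its max.visits cap are hypothetical.

```r
# Sketch only: uniform gaps and the censoring rule are assumptions.
set.seed(3)
v.min <- 1; v.max <- 6; mean.rc <- 40

# Hypothetical helper: one person's observed schedule, perfect sensitivity
sim_schedule <- function(x, max.visits = 200) {
  v <- cumsum(runif(max.visits, v.min, v.max))  # candidate screening times
  t.rc <- v[1] + rexp(1, rate = 1 / mean.rc)    # censoring after first screen
  v <- v[v <= t.rc]                             # screens before censoring
  hit <- which(v >= x)                          # screens at or after the event
  if (length(hit) > 0) {
    c(0, v[seq_len(hit[1])])                    # stop at the detecting screen
  } else {
    c(0, v, Inf)                                # never detected: censored
  }
}

sim_schedule(x = 10)   # event at time 10
sim_schedule(x = 1e6)  # event far in the future, so the schedule ends in Inf
```

The cap of 200 candidate visits is only a convenience; with these settings it is far beyond any plausible censoring time.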
Prevalence Indicator g_i: Baseline prevalence follows a latent-variable model with either a probit or logit link:

w_i = \beta_{w0} + \beta_{w}^\top z_{wi} + \psi_i,

where \beta_{w0} is determined by theta, and \beta_{w} by beta.W. Specifically:
- If sel.mod = "probit", then \beta_{w0} = \mathrm{qnorm}(\theta).
- If sel.mod = "logit", then \beta_{w0} = \log(\theta / (1-\theta)).
We set g_i = 1 if w_i > 0, and g_i = 0 otherwise.
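The intercept rule and the thresholding step can be checked with a few lines of base R. The sketch assumes the latent error \psi_i is standard normal under the probit link (and would be standard logistic under the logit link), which is what those links imply; covariates and coefficient values are made up.

```r
# Sketch only: illustrative values, probit link shown.
set.seed(4)
n <- 1e5; theta <- 0.15; beta.W <- c(0.5, -0.2)
Z <- matrix(rnorm(n * 2), nrow = n)

# Intercepts implied by theta under each link; qlogis(theta) = log(theta / (1 - theta))
beta.w0 <- c(probit = qnorm(theta), logit = qlogis(theta))
beta.w0

# Latent variable w_i and prevalence indicator g_i under the probit link
eta <- beta.w0[["probit"]] + drop(Z %*% beta.W)
w <- eta + rnorm(n)
g <- as.integer(w > 0)

# Individual prevalence probabilities P(g_i = 1) given the covariates
p.W <- pnorm(eta)
c(observed = mean(g), expected = mean(p.W))
```

Note that theta is the prevalence for an individual whose covariates are all zero; the marginal prevalence mean(g) generally differs from it once beta.W is non-zero.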
Baseline Test Missingness r_i: A baseline test indicator r_i \in \{0,1\} is generated via \mathrm{Bernoulli}(\text{prob.r}), so r_i = 1 means the baseline test is performed and r_i = 0 means it is missing.
Test Sensitivity \kappa: A misclassification parameter \kappa (test sensitivity) can be specified via kappa. If \kappa < 1, a test on a truly positive case is positive only with probability \kappa, so some truly positive cases are missed.
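The effect of imperfect sensitivity can be sketched in isolation. The example assumes that repeated tests on the same truly positive individual are independent, each positive with probability kappa; that independence is an assumption of the sketch, not something stated above.

```r
# Sketch only: independent tests with sensitivity kappa are assumed.
set.seed(5)
kappa <- 0.7
n.pos <- 1e5  # truly positive cases

# A single test per case: detected with probability kappa
detected <- rbinom(n.pos, size = 1, prob = kappa)
mean(detected)       # close to kappa
mean(detected == 0)  # close to 1 - kappa, the share of missed positives

# Repeated testing: number of false-negative tests before the first detection
misses <- rgeom(n.pos, prob = kappa)
mean(misses)         # close to (1 - kappa) / kappa
```

Setting kappa to 1 makes every test on a positive case succeed, so no detections are delayed.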

Value

A list with the following elements:

Vobs: A list of length n, each entry containing one individual's screening times. The first element is 0 (baseline), and Inf marks right censoring.
X.true: Numeric vector of length n giving the true (latent) event times x_i.
Z: Numeric matrix of dimension n \times p (plus an extra column if p.discrete = 1) containing the covariates.
C: Binary vector of length n, indicating whether an individual is truly positive at baseline (g_i = 1).
r: Binary vector of length n, indicating whether the baseline test was performed (r_i = 1) or missing (r_i = 0).
p.W: Numeric vector of length n giving the true prevalence probabilities, P(g_i = 1).
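A returned list of this shape can be summarized per individual as sketched below. To keep the example self-contained it uses a small mock list with made-up values in place of actual gen.dat output, and only the Vobs and r elements are mimicked.

```r
# Mock object mimicking the documented structure (values are made up)
dat <- list(
  Vobs = list(c(0, 3.2, 7.9, Inf), c(0, 2.1, 6.0), c(0, 4.5, Inf)),
  r    = c(1, 0, 1)
)

# Number of entries in each individual's schedule
n.times <- lengths(dat$Vobs)

# Right-censored if the schedule ends in Inf
censored <- vapply(dat$Vobs, function(v) is.infinite(v[length(v)]), logical(1))

# Last finite screening time
last.obs <- vapply(dat$Vobs, function(v) max(v[is.finite(v)]), numeric(1))

data.frame(n.times, censored, last.obs, baseline.tested = dat$r == 1)
```

The same summaries should apply to real output as long as Vobs and r have the structure described above.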

References

T. Klausch, B. I. Lissenberg-Witte, and V. M. Coupé, “A Bayesian prevalence-incidence mixture model for screening outcomes with misclassification,” arXiv:2412.16065.

Examples

# Generate a small dataset for testing: one continuous and one binary covariate,
# so beta.X and beta.W each have two entries
set.seed(2025)
sim_data <- gen.dat(n = 100, p = 1, p.discrete = 1,
                    sigma.X = 0.5, mu.X = 2,
                    beta.X = c(0.2, 0.2), beta.W = c(0.5, -0.2),
                    theta = 0.2,
                    dist.X = "weibull", sel.mod = "probit")
str(sim_data)

BayesPIM documentation built on April 12, 2025, 1:59 a.m.