gen.dat | R Documentation |
Generates synthetic data according to the Bayesian prevalence-incidence mixture (PIM) framework of Klausch et al. (2025) with interval-censored screening outcomes. The function simulates continuous or discrete baseline covariates, event times from one of several parametric families, and irregular screening schedules, yielding interval-censored observations suitable for testing or demonstrating PIM-based or other interval-censored survival methods.
gen.dat(
kappa = 0.7,
n = 1000,
p = 2,
p.discrete = 0,
r = 0,
s = 1,
sigma.X = 1/2,
mu.X = 4,
beta.X = NULL,
beta.W = NULL,
theta = 0.15,
v.min = 1,
v.max = 6,
mean.rc = 40,
dist.X = "weibull",
k = 1,
sel.mod = "probit",
prob.r = 0
)
kappa |
Numeric. Test sensitivity parameter \kappa. |
n |
Integer. Sample size. |
p |
Integer. Number of continuous baseline covariates to simulate. |
p.discrete |
Integer. If 1, a single discrete covariate drawn from \mathrm{Bernoulli}(0.5) is added. |
r |
Numeric. Correlation coefficient(s) used to build the covariance matrix of continuous covariates. If |
s |
Numeric. Standard deviation(s) of the continuous covariates. If |
sigma.X |
Numeric. Scale parameter \sigma_X of the AFT model for the event times. |
mu.X |
Numeric. Intercept \beta_{x0} of the AFT model for the event times. |
beta.X |
Numeric vector. The coefficients \beta_{x} of the AFT model for the event times. |
beta.W |
Numeric vector. The coefficients \beta_{w} of the prevalence model. |
theta |
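Numeric. Baseline prevalence parameter on the probability scale. Under sel.mod = "probit", the prevalence-model intercept is \beta_{w0} = \mathrm{qnorm}(\theta); under sel.mod = "logit", it is \beta_{w0} = \log(\theta / (1-\theta)). |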
v.min |
Numeric. Minimum spacing for irregular screening intervals. |
v.max |
Numeric. Maximum spacing for irregular screening intervals. |
mean.rc |
Numeric. Mean of the exponential distribution controlling a random right-censoring time |
dist.X |
Character. Distribution for survival times: "weibull", "lognormal", "loglog" (log-logistic), or "gengamma" (generalized gamma). |
k |
Numeric. Shape parameter for dist.X = "gengamma". |
sel.mod |
Character. Either "probit" or "logit"; the link used in the prevalence model. |
prob.r |
Numeric. Probability that a baseline test is performed (r_i = 1). |
The data-generating process includes:
Covariates Z:
Continuous covariates are simulated using a correlation structure specified by r and a common standard deviation s.
If p.discrete = 1, a single discrete covariate is added, drawn from \mathrm{Bernoulli}(0.5).
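The covariate step can be sketched in base R as follows. This is a minimal illustration, not the package's code: the helper name sim_covariates is hypothetical, and the mean-zero multivariate normal distribution and the exchangeable (equal-correlation) structure are assumptions, since the text above only states that r and s define the covariance.

```r
# Hypothetical helper, not part of the package: simulate a covariate matrix Z.
sim_covariates <- function(n, p, p.discrete = 0, r = 0, s = 1) {
  # Exchangeable correlation matrix: 1 on the diagonal, r elsewhere.
  R <- matrix(r, nrow = p, ncol = p)
  diag(R) <- 1
  # Covariance matrix with a common standard deviation s.
  Sigma <- s^2 * R
  # Assumption: mean-zero multivariate normal, drawn via the Cholesky factor.
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% chol(Sigma)
  if (p.discrete == 1) {
    # One extra Bernoulli(0.5) column, as described above.
    Z <- cbind(Z, rbinom(n, size = 1, prob = 0.5))
  }
  Z
}

set.seed(1)
Z <- sim_covariates(n = 500, p = 2, p.discrete = 1, r = 0.3, s = 1)
dim(Z)
```

Using chol() keeps the sketch free of package dependencies; MASS::mvrnorm would be an equivalent choice.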
Event Times X:
An accelerated failure time (AFT) model is used:
\log(x_i) = \beta_{x0} + \beta_{x}^\top z_{xi} + \sigma_X \,\epsilon_i,
where \beta_{x0} is the intercept (set by mu.X) and \beta_{x} are the other regression coefficients (provided via beta.X).
The error term \epsilon_i is drawn from the distribution chosen by dist.X: "weibull", "lognormal", "loglog" (log-logistic), or "gengamma" (generalized gamma).
For "gengamma", the shape parameter k is additionally used.
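A minimal sketch of the AFT step for three of the four families is given below. The helper sim_event_times is hypothetical, the treatment of beta.X = NULL as "no covariate effect" is an assumption made for this illustration, and the generalized gamma case is omitted because its parameterization is not spelled out here.

```r
# Hypothetical helper, not part of the package: draw event times from the AFT model.
sim_event_times <- function(Z, mu.X = 4, beta.X = NULL, sigma.X = 0.5,
                            dist.X = c("weibull", "lognormal", "loglog")) {
  dist.X <- match.arg(dist.X)
  n <- nrow(Z)
  # Assumption for this sketch: NULL coefficients mean no covariate effect.
  if (is.null(beta.X)) beta.X <- rep(0, ncol(Z))
  eps <- switch(dist.X,
    weibull   = log(rexp(n)),  # standard minimum extreme-value errors
    lognormal = rnorm(n),      # standard normal errors
    loglog    = rlogis(n)      # standard logistic errors
  )
  # log(x_i) = beta_x0 + beta_x' z_i + sigma_X * eps_i
  log_x <- mu.X + drop(Z %*% beta.X) + sigma.X * eps
  exp(log_x)
}

set.seed(2)
Z <- matrix(rnorm(200), nrow = 100, ncol = 2)
x <- sim_event_times(Z, mu.X = 2, beta.X = c(0.2, 0.2), dist.X = "weibull")
summary(x)
```

With extreme-value errors the resulting times are Weibull with shape 1 / sigma.X and scale exp(mu.X + beta.X' z), which is the usual link between the AFT and Weibull parameterizations.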
Irregular Screening Schedules V_i:
Each individual has multiple screening times, with the spacing between successive screenings generated randomly between v.min and v.max, ending in right censoring or the time of detection.
These screening times (including a 0 for baseline and Inf for censoring) are returned in Vobs.
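The schedule for a single individual might be generated along the following lines. The helper sim_visits is hypothetical; uniform spacings on [v.min, v.max] are an assumption, and detection of the event is ignored here so that the schedule always ends in right censoring.

```r
# Hypothetical helper, not part of the package: screening times for one
# individual, ignoring detection.
sim_visits <- function(v.min = 1, v.max = 6, mean.rc = 40) {
  # Random right-censoring time from an exponential with mean mean.rc.
  cens <- rexp(1, rate = 1 / mean.rc)
  visits <- 0  # baseline visit at time 0
  repeat {
    # Assumption: spacings are uniform on [v.min, v.max].
    nxt <- visits[length(visits)] + runif(1, v.min, v.max)
    if (nxt > cens) break
    visits <- c(visits, nxt)
  }
  c(visits, Inf)  # Inf marks right censoring
}

set.seed(3)
v <- sim_visits()
v
```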
Prevalence Indicator g_i:
Baseline prevalence is modeled via either a probit or logit link, consistent with:
w_i = \beta_{w0} + \beta_{w}^\top z_{wi} + \psi_i,
where \beta_{w0} is determined by theta, and \beta_{w} by beta.W.
Specifically:
If sel.mod = "probit", then \beta_{w0} = \mathrm{qnorm}(\theta).
If sel.mod = "logit", then \beta_{w0} = \log(\theta / (1-\theta)).
We set g_i = 1 if w_i > 0, and g_i = 0 otherwise.
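The latent-variable construction can be sketched as follows. The helper sim_prevalence is hypothetical, and it assumes that \psi_i is standard normal under the probit link and standard logistic under the logit link, which makes P(g_i = 1) equal to theta when all covariates are zero.

```r
# Hypothetical helper, not part of the package: latent-variable prevalence model.
sim_prevalence <- function(Z, theta = 0.15, beta.W = NULL,
                           sel.mod = c("probit", "logit")) {
  sel.mod <- match.arg(sel.mod)
  n <- nrow(Z)
  # Assumption for this sketch: NULL coefficients mean no covariate effect.
  if (is.null(beta.W)) beta.W <- rep(0, ncol(Z))
  # Intercept implied by theta under the chosen link.
  beta.w0 <- if (sel.mod == "probit") qnorm(theta) else log(theta / (1 - theta))
  eta <- beta.w0 + drop(Z %*% beta.W)
  # Assumption: standard normal (probit) or standard logistic (logit) errors.
  psi <- if (sel.mod == "probit") rnorm(n) else rlogis(n)
  w <- eta + psi
  list(
    g   = as.integer(w > 0),                                   # prevalence indicator
    p.W = if (sel.mod == "probit") pnorm(eta) else plogis(eta) # P(g_i = 1)
  )
}

set.seed(4)
out <- sim_prevalence(matrix(0, nrow = 1000, ncol = 1), theta = 0.2)
mean(out$g)  # should be close to theta = 0.2
```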
Baseline Test Missingness r_i:
A baseline test indicator r_i \in \{0,1\} is generated via \mathrm{Bernoulli}(\text{prob.r}), so r_i = 1 means the baseline test is performed and r_i = 0 means it is missing.
Test Sensitivity \kappa:
A misclassification parameter \kappa (test sensitivity) can be specified via kappa.
If \kappa < 1, some truly positive cases are missed.
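To illustrate how imperfect sensitivity changes an observed schedule, the sketch below applies a test with sensitivity kappa to a fixed set of visit times. The helper detect_event is hypothetical, and perfect specificity (no false positives before the event) is an assumption of this illustration.

```r
# Hypothetical helper, not part of the package: apply imperfect sensitivity
# to one screening schedule.
detect_event <- function(visits, x.true, kappa = 0.7) {
  for (j in seq_along(visits)) {
    # A test can only be positive once the event has occurred, and then
    # only with probability kappa.
    if (visits[j] >= x.true && rbinom(1, 1, kappa) == 1) {
      return(visits[seq_len(j)])  # schedule ends at the detecting visit
    }
  }
  c(visits, Inf)  # never detected: right-censored
}

visits <- c(0, 2, 5, 9)
detect_event(visits, x.true = 4, kappa = 1)  # 0 2 5
detect_event(visits, x.true = 4, kappa = 0)  # 0 2 5 9 Inf
```

With kappa strictly between 0 and 1, the event at time 4 may be missed at the visit at time 5 and found only at time 9, or not at all.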
A list with the following elements:
Vobs
A list of length n, each entry containing screening times. The first element is 0 (baseline), and Inf may indicate right censoring.
X.true
Numeric vector of length n giving the true (latent) event times x_i.
Z
Numeric matrix of dimension n \times p (plus an extra column if p.discrete = 1) containing the covariates.
C
Binary vector of length n, indicating whether an individual is truly positive at baseline (g_i = 1).
r
Binary vector of length n, indicating whether the baseline test was performed (r_i = 1) or missing (r_i = 0).
p.W
Numeric vector of length n giving the true prevalence probabilities, P(g_i = 1).
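A short sketch of how the Vobs element might be summarized is shown below. The list used here is a hand-made stand-in with the structure described above (leading 0, optional trailing Inf), not actual output of gen.dat.

```r
# Hand-made stand-in for the Vobs element of the returned list (not real output).
Vobs <- list(c(0, 3.1, 7.4), c(0, 2.5, 6.0, Inf), c(0, Inf))

# An individual is right-censored if the last entry is Inf.
censored <- vapply(Vobs, function(v) is.infinite(v[length(v)]), logical(1))
# Number of finite screening times, including the baseline at 0.
n_visits <- vapply(Vobs, function(v) sum(is.finite(v)), integer(1))
# Last finite screening time per individual.
last_visit <- vapply(Vobs, function(v) max(v[is.finite(v)]), numeric(1))

data.frame(n_visits, last_visit, censored)
```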
T. Klausch, B. I. Lissenberg-Witte, and V. M. Coupé, “A Bayesian prevalence-incidence mixture model for screening outcomes with misclassification,” arXiv:2412.16065.
# Generate a small dataset for testing
set.seed(2025)
sim_data <- gen.dat(n = 100, p = 1, p.discrete = 1,
                    sigma.X = 0.5, mu.X = 2,
                    beta.X = c(0.2, 0.2), beta.W = c(0.5, -0.2),
                    theta = 0.2,
                    dist.X = "weibull", sel.mod = "probit")
str(sim_data)