View source: R/simulate_booami_data.R
| simulate_booami_data | R Documentation |
Generates a dataset with p predictors, of which the first p_inf
are informative. Predictors are drawn from a multivariate normal with a chosen
correlation structure, and the outcome can be continuous (type = "gaussian")
or binary (type = "logistic"). Missing values are introduced in the
predictors via MAR or MCAR; the outcome y is always fully observed (no NAs).
simulate_booami_data(
n = 300,
p = 25,
p_inf = 5,
rho = 0.3,
type = c("gaussian", "logistic"),
beta_range = c(1, 2),
intercept = 1,
corr_structure = c("all_ar1", "informative_cs", "blockdiag", "none"),
rho_noise = NULL,
noise_sd = 1,
miss = c("MAR", "MCAR"),
miss_prop = 0.25,
mar_drivers = c(1, 2, 3),
gamma_vec = NULL,
calibrate_mar = FALSE,
mar_scale = TRUE,
keep_observed = integer(0),
jitter_sd = 0.25,
keep_mar_drivers = TRUE
)
n |
Number of observations (default 300). |
p |
Total number of predictors (default 25). |
p_inf |
Number of informative predictors (default 5). |
rho |
Correlation parameter (interpretation depends on corr_structure; default 0.3). |
type |
Either "gaussian" or "logistic". |
beta_range |
Length-2 numeric; coefficients for the first |
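intercept |
Intercept added to the linear predictor (default 1). |
corr_structure |
One of "all_ar1", "informative_cs", "blockdiag", or "none". |
rho_noise |
Optional correlation for the noise block when corr_structure = "blockdiag" (defaults to rho). |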
noise_sd |
Std. dev. of Gaussian noise added to y when type = "gaussian" (default 1). |
miss |
Missingness mechanism: "MAR" or "MCAR". |
miss_prop |
Target marginal missingness proportion (default 0.25). |
mar_drivers |
Indices of predictors that drive MAR (default c(1, 2, 3)). |
gamma_vec |
Coefficients for MAR drivers; length must equal the number of MAR drivers actually used
(i.e., |
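calibrate_mar |
If TRUE, the MAR intercept is calibrated to target miss_prop; otherwise it is set to qlogis(miss_prop) (default FALSE). |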
mar_scale |
If |
keep_observed |
Indices of predictors kept fully observed (values outside |
jitter_sd |
Standard deviation of the per-row jitter added to the MAR logit to induce heterogeneity
(default 0.25). |
keep_mar_drivers |
Logical; if TRUE, the MAR drivers are kept fully observed (default TRUE). |
Correlation structures:
"all_ar1": AR(1) correlation with parameter rho across all p predictors.
"informative_cs": compound symmetry (exchangeable) within the first p_inf
predictors with parameter rho; others independent.
"blockdiag": block-diagonal AR(1): the informative block (size p_inf) has AR(1) with rho;
the noise block (size p - p_inf) has AR(1) with rho_noise (defaults to rho).
"none": independent predictors.
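The four correlation structures can be illustrated with a short base-R sketch. This is not the package's internal code: the helper names ar1_corr, cs_corr, and block_diag2 are hypothetical, and the Cholesky-based draw is just one common way to sample from a multivariate normal.

```r
# Hypothetical helpers (not part of booami), base R only.
ar1_corr <- function(k, rho) {
  # AR(1): entry (i, j) equals rho^|i - j|
  rho^abs(outer(seq_len(k), seq_len(k), "-"))
}

cs_corr <- function(k, rho) {
  # Compound symmetry: 1 on the diagonal, rho everywhere else
  m <- matrix(rho, nrow = k, ncol = k)
  diag(m) <- 1
  m
}

block_diag2 <- function(a, b) {
  # Put two square matrices on the diagonal of a zero matrix
  ka <- nrow(a)
  kb <- nrow(b)
  out <- matrix(0, ka + kb, ka + kb)
  out[seq_len(ka), seq_len(ka)] <- a
  out[ka + seq_len(kb), ka + seq_len(kb)] <- b
  out
}

p <- 6; p_inf <- 3; rho <- 0.3; rho_noise <- 0.1

sigma_all_ar1   <- ar1_corr(p, rho)
sigma_inf_cs    <- block_diag2(cs_corr(p_inf, rho), diag(p - p_inf))
sigma_blockdiag <- block_diag2(ar1_corr(p_inf, rho),
                               ar1_corr(p - p_inf, rho_noise))
sigma_none      <- diag(p)

# Draw correlated predictors: chol() returns upper-triangular R with
# t(R) %*% R = Sigma, so Z %*% R has covariance Sigma for standard normal Z.
set.seed(1)
n <- 500
X <- matrix(rnorm(n * p), n, p) %*% chol(sigma_blockdiag)
```

Printing round(cor(X), 2) afterwards gives a quick visual check that the empirical correlations follow the intended block pattern.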
Missingness (predictors only):
"MAR": for each row, a logit missingness score is computed from the
selected MAR drivers (see mar_drivers, gamma_vec, mar_scale).
The intercept is calibrated to hit the target proportion miss_prop when calibrate_mar = TRUE
and is otherwise set to qlogis(miss_prop);
per-row normal jitter with mean 0 and standard deviation jitter_sd adds heterogeneity. The resulting probability
is used to mask predictors (except those in keep_observed and, if keep_mar_drivers = TRUE, the drivers themselves).
The outcome y is not masked.
"MCAR": each predictor (except those in keep_observed) is masked independently with probability miss_prop.
The outcome y is not masked.
Note: In the simulation, missingness probabilities are computed using the
fully observed latent covariates before masking. From an analyst’s perspective after
masking, allowing the MAR drivers themselves to be missing makes missingness depend on
unobserved values—i.e., effectively non-ignorable (MNAR). Setting
keep_mar_drivers = TRUE keeps those drivers observed and yields a MAR mechanism.
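The MAR and MCAR mechanisms described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the gamma_vec values are made up, scale() stands in for whatever mar_scale = TRUE does internally, calibration is done here with uniroot(), and each maskable cell is masked independently given the row's probability.

```r
set.seed(2)
n <- 1000; p <- 6
X <- matrix(rnorm(n * p), n, p)      # fully observed latent predictors

miss_prop   <- 0.25
mar_drivers <- c(1, 2)
gamma_vec   <- c(0.5, -0.5)          # illustrative values (assumption)
jitter_sd   <- 0.25

# Driver contribution to the row-wise logit plus per-row normal jitter.
# scale() standardizes the driver columns (assumed meaning of mar_scale).
lin <- drop(scale(X[, mar_drivers, drop = FALSE]) %*% gamma_vec) +
  rnorm(n, mean = 0, sd = jitter_sd)

# Uncalibrated intercept versus an intercept found by root-finding so that
# the average missingness probability matches miss_prop.
a_plain <- qlogis(miss_prop)
a_cal <- uniroot(function(a) mean(plogis(a + lin)) - miss_prop,
                 interval = c(-20, 20))$root
prob <- plogis(a_cal + lin)

# MAR: mask every non-driver column, leaving the drivers observed
# (mirrors keep_mar_drivers = TRUE).
maskable <- setdiff(seq_len(p), mar_drivers)
X_mar <- X
for (j in maskable) {
  X_mar[runif(n) < prob, j] <- NA
}

# MCAR: every cell is masked independently with probability miss_prop.
X_mcar <- X
X_mcar[matrix(runif(n * p) < miss_prop, n, p)] <- NA
```

Comparing mean(plogis(a_plain + lin)) with mean(prob) shows why calibration can matter: once the drivers and jitter enter the logit, an intercept of qlogis(miss_prop) no longer guarantees an average probability of exactly miss_prop.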
A list with elements:
data: data.frame with columns X1..Xp and y.
Missing values are introduced in the predictors X1..Xp; y
is fully observed.
beta: numeric length-p vector of true coefficients (non-zeros in the first p_inf positions).
informative: integer vector 1:p_inf.
type: character, outcome type ("gaussian" or "logistic").
intercept: numeric intercept used.
The data element additionally carries attributes:
"true_beta", "informative",
"type", "corr_structure", "rho", "rho_noise" (if set),
"intercept", "noise_sd" (Gaussian; NA otherwise), "mar_scale",
and "keep_mar_drivers".
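To make the roles of beta, intercept, and type concrete, here is a minimal base-R sketch of how an outcome of each type could be generated from a coefficient vector of this shape. It assumes a standard linear model and a standard logistic model, and it assumes the non-zero coefficients are drawn uniformly from beta_range; the function's actual sampling scheme is not stated here and may differ.

```r
set.seed(3)
n <- 200; p <- 10; p_inf <- 3
beta_range <- c(1, 2); intercept <- 1; noise_sd <- 1

X <- matrix(rnorm(n * p), n, p)      # independent predictors for simplicity

# Non-zero coefficients in the first p_inf positions, zeros elsewhere.
# Uniform draws within beta_range are an assumption made for illustration.
beta <- c(runif(p_inf, beta_range[1], beta_range[2]), rep(0, p - p_inf))

eta <- intercept + drop(X %*% beta)  # linear predictor

y_gaussian <- eta + rnorm(n, mean = 0, sd = noise_sd)        # continuous
y_logistic <- rbinom(n, size = 1, prob = plogis(eta))        # binary 0/1
```

Only the first p_inf columns of X influence either outcome, which is what the informative element of the returned list records.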
The following code generates booami_sim:
set.seed(123)
sim <- simulate_booami_data(
  n = 300, p = 25, p_inf = 5, rho = 0.3,
  type = "gaussian",
  beta_range = c(1, 2), intercept = 1,
  corr_structure = "all_ar1",
  rho_noise = NULL, noise_sd = 1,
  miss = "MAR", miss_prop = 0.25,
  mar_drivers = c(1, 2, 3), gamma_vec = NULL,
  calibrate_mar = FALSE, mar_scale = TRUE,
  keep_observed = integer(0), jitter_sd = 0.25,
  keep_mar_drivers = TRUE
)
booami_sim <- sim$data
See also: booami_sim, cv_boost_raw, cv_boost_imputed, impu_boost
set.seed(42)
sim <- simulate_booami_data(
n = 200, p = 15, p_inf = 4, rho = 0.25,
type = "gaussian", miss = "MAR", miss_prop = 0.20
)
d <- sim$data
dim(d)
mean(colSums(is.na(d)) > 0) # fraction of columns with any NAs
sum(is.na(d$y)) # should be 0
head(attr(d, "true_beta"))
attr(d, "informative")
# Example with block-diagonal correlation and protected MAR drivers
sim2 <- simulate_booami_data(
n = 150, p = 12, p_inf = 3, rho = 0.40, rho_noise = 0.10,
corr_structure = "blockdiag", miss = "MAR", miss_prop = 0.30,
mar_drivers = c(1, 2), keep_mar_drivers = TRUE
)
colSums(is.na(sim2$data))[1:4]
# Binary outcome example
sim3 <- simulate_booami_data(
n = 100, p = 10, p_inf = 2, rho = 0.2,
type = "logistic", miss = "MCAR", miss_prop = 0.15
)
table(sim3$data$y, useNA = "ifany")
sum(is.na(sim3$data$y)) # should be 0
utils::data(booami_sim)
dim(booami_sim)
head(attr(booami_sim, "true_beta"))
attr(booami_sim, "informative")