simulate_booami_data: Simulate a Booami Example Dataset with Missing Values

simulate_booami_data    R Documentation

Description

Generates a dataset with p predictors, of which the first p_inf are informative. Predictors are drawn from a multivariate normal with a chosen correlation structure, and the outcome can be continuous (type = "gaussian") or binary (type = "logistic"). Missing values are introduced in the predictors via MAR or MCAR; the outcome y is always fully observed (no NAs).
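
As a rough illustration of the outcome model described above, the following base-R sketch builds a linear predictor from sparse coefficients and derives a continuous and a binary outcome from it. It is not the package's source code: the Bernoulli draw through plogis() is an assumption about how type = "logistic" is realized, independent predictors are used for brevity, and all object names are illustrative.

```r
set.seed(3)
n <- 200; p <- 8; p_inf <- 3

# Stand-in for the complete predictors (independent columns for brevity)
X <- matrix(rnorm(n * p), nrow = n)

# Coefficients: Uniform(1, 2) for the first p_inf predictors, zero elsewhere
beta <- c(runif(p_inf, min = 1, max = 2), rep(0, p - p_inf))

# Linear predictor with intercept = 1
eta <- 1 + as.vector(X %*% beta)

# Continuous outcome: linear predictor plus Gaussian noise (noise_sd = 1)
y_gaussian <- eta + rnorm(n, mean = 0, sd = 1)

# Binary outcome (assumption): Bernoulli draw with success probability plogis(eta)
y_logistic <- rbinom(n, size = 1, prob = plogis(eta))
```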

Usage

simulate_booami_data(
  n = 300,
  p = 25,
  p_inf = 5,
  rho = 0.3,
  type = c("gaussian", "logistic"),
  beta_range = c(1, 2),
  intercept = 1,
  corr_structure = c("all_ar1", "informative_cs", "blockdiag", "none"),
  rho_noise = NULL,
  noise_sd = 1,
  miss = c("MAR", "MCAR"),
  miss_prop = 0.25,
  mar_drivers = c(1, 2, 3),
  gamma_vec = NULL,
  calibrate_mar = FALSE,
  mar_scale = TRUE,
  keep_observed = integer(0),
  jitter_sd = 0.25,
  keep_mar_drivers = TRUE
)

Arguments

n

Number of observations (default 300).

p

Total number of predictors (default 25).

p_inf

Number of informative predictors (default 5); must satisfy p_inf <= p.

rho

Correlation parameter (default 0.3); its interpretation depends on corr_structure (see Details).

type

Either "gaussian" or "logistic" (default "gaussian").

beta_range

Length-2 numeric; coefficients for the first p_inf informative predictors are drawn i.i.d. Uniform(beta_range[1], beta_range[2]).

intercept

Intercept added to the linear predictor (default 1).

corr_structure

One of "all_ar1", "informative_cs", "blockdiag", "none" (default "all_ar1"); see Details.

rho_noise

Optional correlation for the noise block when corr_structure = "blockdiag" (defaults to rho).

noise_sd

Std. dev. of Gaussian noise added to y when type = "gaussian" (default 1); ignored for type = "logistic".

miss

Missingness mechanism: "MAR" or "MCAR" (default "MAR").

miss_prop

Target marginal missingness proportion (default 0.25).

mar_drivers

Indices of predictors that drive MAR (default c(1, 2, 3)). Indices outside 1..p are ignored; at least one index within 1..p must remain (an empty set is not allowed).

gamma_vec

Coefficients for MAR drivers; length must equal the number of MAR drivers actually used (i.e., length(mar_drivers) after restricting to 1..p). If NULL, heuristic defaults are used (starting from c(0.5, -0.35, 0.15) as available).

calibrate_mar

If TRUE, calibrates the MAR intercept by root-finding so that the average missingness matches miss_prop. If FALSE (default), the intercept is set to qlogis(miss_prop) without calibration.

mar_scale

If TRUE (default), standardize MAR drivers before applying gamma_vec.

keep_observed

Indices of predictors kept fully observed (values outside 1..p are ignored).

jitter_sd

Standard deviation of the per-row jitter added to the MAR logit to induce heterogeneity (default 0.25).

keep_mar_drivers

Logical; if TRUE (default), predictors in mar_drivers are kept fully observed under MAR so that missingness depends only on observed covariates (MAR). If FALSE, those drivers may be masked as well, making the mechanism effectively non-ignorable (MNAR) for variables whose missingness depends on them.

Details

Correlation structures:

  • "all_ar1": AR(1) correlation with parameter rho across all p predictors.

  • "informative_cs": compound symmetry (exchangeable) within the first p_inf predictors with parameter rho; others independent.

  • "blockdiag": block-diagonal AR(1): the informative block (size p_inf) has AR(1) with rho; the noise block (size p - p_inf) has AR(1) with rho_noise (defaults to rho).

  • "none": independent predictors.
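
The four structures can be written down as correlation matrices in a few lines of base R. The sketch below is illustrative only: the helpers ar1_corr(), cs_corr(), and build_corr() are hypothetical names and not part of booami, and the Cholesky-based draw is one standard way to sample correlated normals, not necessarily the one the package uses.

```r
# Hypothetical helpers for illustration; not part of the booami API.
ar1_corr <- function(p, rho) {
  # AR(1): entry (i, j) equals rho^|i - j|
  rho^abs(outer(seq_len(p), seq_len(p), "-"))
}

cs_corr <- function(p, rho) {
  # Compound symmetry: 1 on the diagonal, rho everywhere else
  m <- matrix(rho, nrow = p, ncol = p)
  diag(m) <- 1
  m
}

build_corr <- function(p, p_inf, rho, structure, rho_noise = rho) {
  sigma <- diag(p)                 # "none": identity matrix
  inf <- seq_len(p_inf)
  if (structure == "all_ar1") {
    sigma <- ar1_corr(p, rho)
  } else if (structure == "informative_cs") {
    sigma[inf, inf] <- cs_corr(p_inf, rho)
  } else if (structure == "blockdiag") {
    sigma[inf, inf] <- ar1_corr(p_inf, rho)
    if (p > p_inf) {
      noise <- (p_inf + 1):p
      sigma[noise, noise] <- ar1_corr(p - p_inf, rho_noise)
    }
  }
  sigma
}

set.seed(1)
sigma <- build_corr(p = 6, p_inf = 3, rho = 0.4,
                    structure = "blockdiag", rho_noise = 0.1)

# Draw correlated standard normals: for upper-triangular R = chol(sigma),
# rows of Z %*% R have covariance t(R) %*% R = sigma
X <- matrix(rnorm(200 * 6), nrow = 200) %*% chol(sigma)
```

With "blockdiag", entries linking the informative block to the noise block are zero, so the two groups of predictors are independent of each other.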

Missingness (predictors only):

  • "MAR": for each row, a logit missingness score is computed from the selected MAR drivers (see mar_drivers, gamma_vec, mar_scale). The intercept is qlogis(miss_prop) or, if calibrate_mar = TRUE, is calibrated by root-finding so that the average missingness matches miss_prop. Per-row Gaussian jitter with mean 0 and standard deviation jitter_sd is added to the logit to induce heterogeneity. The resulting probability is used to mask predictors, except those in keep_observed and, if keep_mar_drivers = TRUE, the drivers themselves. The outcome y is not masked.

  • "MCAR": each predictor (except those in keep_observed) is masked independently with probability miss_prop. The outcome y is not masked.
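
The two mechanisms can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes that each maskable cell is masked independently using its row's probability, it uses uniroot() as the root-finder for the calibrated intercept, and all object names are illustrative.

```r
set.seed(2)
n <- 500; p <- 6
X <- matrix(rnorm(n * p), nrow = n)   # stand-in for the complete predictors
miss_prop   <- 0.25
mar_drivers <- c(1, 2, 3)
gamma_vec   <- c(0.5, -0.35, 0.15)
jitter_sd   <- 0.25

# Score from standardized drivers (mar_scale = TRUE) plus per-row jitter
Z <- scale(X[, mar_drivers, drop = FALSE])
score <- as.vector(Z %*% gamma_vec) + rnorm(n, mean = 0, sd = jitter_sd)

# Uncalibrated intercept (calibrate_mar = FALSE)
b0_plain <- qlogis(miss_prop)

# Calibrated intercept (calibrate_mar = TRUE):
# solve mean(plogis(b0 + score)) = miss_prop for b0
b0_cal <- uniroot(function(b0) mean(plogis(b0 + score)) - miss_prop,
                  interval = c(-20, 20), tol = 1e-8)$root
prob <- plogis(b0_cal + score)

# MAR masking with the drivers kept observed (keep_mar_drivers = TRUE).
# Assumption: each cell is masked independently with its row's probability.
maskable <- setdiff(seq_len(p), mar_drivers)
X_mar <- X
for (j in maskable) X_mar[runif(n) < prob, j] <- NA

# MCAR masking: every cell is masked independently with probability miss_prop
X_mcar <- X
X_mcar[matrix(runif(n * p) < miss_prop, nrow = n)] <- NA
```

Without calibration, the average of plogis(b0_plain + score) generally differs somewhat from miss_prop because the logistic function is nonlinear; the calibrated intercept removes that discrepancy for the simulated sample.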

Note: In the simulation, missingness probabilities are computed using the fully observed latent covariates before masking. From an analyst’s perspective after masking, allowing the MAR drivers themselves to be missing makes missingness depend on unobserved values—i.e., effectively non-ignorable (MNAR). Setting keep_mar_drivers = TRUE keeps those drivers observed and yields a MAR mechanism.

Value

A list with elements:

  • data: data.frame with columns X1..Xp and y. Missing values are introduced in the predictors X1..Xp; y is fully observed.

  • beta: numeric length-p vector of true coefficients (non-zeros in the first p_inf positions).

  • informative: integer vector 1:p_inf.

  • type: character, outcome type ("gaussian" or "logistic").

  • intercept: numeric intercept used.

The data element additionally carries attributes: "true_beta", "informative", "type", "corr_structure", "rho", "rho_noise" (if set), "intercept", "noise_sd" (Gaussian; NA otherwise), "mar_scale", and "keep_mar_drivers".

Reproducing the shipped dataset booami_sim

set.seed(123)
sim <- simulate_booami_data(
  n = 300, p = 25, p_inf = 5, rho = 0.3,
  type = "gaussian", beta_range = c(1, 2), intercept = 1,
  corr_structure = "all_ar1", rho_noise = NULL, noise_sd = 1,
  miss = "MAR", miss_prop = 0.25,
  mar_drivers = c(1, 2, 3), gamma_vec = NULL,
  calibrate_mar = FALSE, mar_scale = TRUE,
  keep_observed = integer(0), jitter_sd = 0.25,
  keep_mar_drivers = TRUE
)
booami_sim <- sim$data

See Also

booami_sim, cv_boost_raw, cv_boost_imputed, impu_boost

Examples

set.seed(42)
sim <- simulate_booami_data(
  n = 200, p = 15, p_inf = 4, rho = 0.25,
  type = "gaussian", miss = "MAR", miss_prop = 0.20
)
d <- sim$data
dim(d)
mean(colSums(is.na(d)) > 0)    # fraction of columns with any NAs
sum(is.na(d$y))                # should be 0
head(attr(d, "true_beta"))
attr(d, "informative")

# Example with block-diagonal correlation and protected MAR drivers
sim2 <- simulate_booami_data(
  n = 150, p = 12, p_inf = 3, rho = 0.40, rho_noise = 0.10,
  corr_structure = "blockdiag", miss = "MAR", miss_prop = 0.30,
  mar_drivers = c(1, 2), keep_mar_drivers = TRUE
)
colSums(is.na(sim2$data))[1:4]

# Binary outcome example
sim3 <- simulate_booami_data(
  n = 100, p = 10, p_inf = 2, rho = 0.2,
  type = "logistic", miss = "MCAR", miss_prop = 0.15
)
table(sim3$data$y, useNA = "ifany")
sum(is.na(sim3$data$y))        # should be 0


# Load and inspect the shipped example dataset
utils::data(booami_sim)
dim(booami_sim)
head(attr(booami_sim, "true_beta"))
attr(booami_sim, "informative")



booami documentation built on Feb. 19, 2026, 5:07 p.m.