generateData: Generate synthetic data with missing values for missoNet


generateData    R Documentation

Description

Generates synthetic data from a conditional Gaussian graphical model with user-specified missing data mechanisms. This function is designed for simulation studies and testing of the missoNet package, supporting three types of missingness: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).

Usage

generateData(
  n,
  p,
  q,
  rho,
  missing.type = "MCAR",
  X = NULL,
  Beta = NULL,
  E = NULL,
  Theta = NULL,
  Sigma.X = NULL,
  Beta.row.sparsity = 0.2,
  Beta.elm.sparsity = 0.2,
  seed = NULL
)

Arguments

n

Integer. Sample size (number of observations). Must be at least 2.

p

Integer. Number of predictor variables. Must be at least 1.

q

Integer. Number of response variables. Must be at least 2.

rho

Numeric scalar or vector of length q. Proportion of missing values for each response variable. Values must be in [0, 1). If scalar, the same missing rate is applied to all responses.

missing.type

Character string specifying the missing data mechanism. One of:

  • "MCAR" (default): Missing Completely At Random

  • "MAR": Missing At Random (depends on predictors)

  • "MNAR": Missing Not At Random (depends on response values)

X

Optional n x p matrix. User-supplied predictor matrix. If NULL (default), predictors are simulated from a multivariate normal distribution with mean zero and covariance Sigma.X.

Beta

Optional p x q matrix. Regression coefficient matrix. If NULL (default), a sparse coefficient matrix is generated with sparsity controlled by Beta.row.sparsity and Beta.elm.sparsity.

E

Optional n x q matrix. Error/noise matrix. If NULL (default), errors are simulated from a multivariate normal distribution with mean zero and precision matrix Theta.

Theta

Optional q x q positive definite matrix. Precision matrix (inverse covariance) for the response variables. If NULL (default), a block-structured precision matrix is generated with four types of graph structures. Only used when E = NULL.

Sigma.X

Optional p x p positive definite matrix. Covariance matrix for the predictors. If NULL (default), an AR(1) covariance structure with correlation 0.7 is used. Only used when X = NULL.

Beta.row.sparsity

Numeric in [0, 1]. Proportion of rows in Beta that contain at least one non-zero element. Default is 0.2. Only used when Beta = NULL.

Beta.elm.sparsity

Numeric in [0, 1]. Proportion of non-zero elements within active rows of Beta. Default is 0.2. Only used when Beta = NULL.

seed

Optional integer. Random seed for reproducibility.
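
The defaults described for Sigma.X and Beta can be illustrated in base R. This is only a sketch under stated assumptions, not the package's internal code: it assumes the AR(1) covariance has the common form 0.7^|i - j|, that the sparsity proportions are rounded to whole counts, and that non-zero coefficients are standard normal draws.

```r
set.seed(1)
n <- 200; p <- 50; q <- 20

# AR(1) covariance, assumed form Sigma[i, j] = 0.7^|i - j|.
Sigma.X <- 0.7^abs(outer(seq_len(p), seq_len(p), "-"))
# chol() returns the upper factor R with t(R) %*% R = Sigma.X,
# so each row of X has covariance Sigma.X.
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma.X)

# Sparse Beta: a share of active rows, then a share of non-zero
# entries within each active row.
Beta.row.sparsity <- 0.2
Beta.elm.sparsity <- 0.2
n.rows <- max(1, round(Beta.row.sparsity * p))  # number of active rows
n.elms <- max(1, round(Beta.elm.sparsity * q))  # non-zeros per active row
Beta <- matrix(0, p, q)
active.rows <- sample(p, n.rows)
for (i in active.rows) {
  # N(0, 1) coefficient values are an assumption made for illustration.
  Beta[i, sample(q, n.elms)] <- rnorm(n.elms)
}

mean(rowSums(Beta != 0) > 0)  # share of rows with at least one non-zero entry
```

Matrices built this way can be passed through the X, Beta, and Sigma.X arguments when a different structure is wanted.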

Details

The function generates data through the following model:

Y = XB + E

where:

  • X \in \mathbb{R}^{n \times p} is the predictor matrix

  • B \in \mathbb{R}^{p \times q} is the coefficient matrix

  • E \in \mathbb{R}^{n \times q} is the error matrix, whose rows are drawn independently from \mathcal{MVN}(0, \Theta^{-1})

  • Y \in \mathbb{R}^{n \times q} is the complete response matrix
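
The model above can be reproduced by hand in a few lines of base R. The sketch below is illustrative only: the dimensions, the coefficient values, and the choice of Theta are assumptions made for the example, and the package may draw the errors differently.

```r
set.seed(1)
n <- 100; p <- 4; q <- 3

X <- matrix(rnorm(n * p), n, p)          # predictors
B <- matrix(c(1,  0, 0,   0,             # column 1 of B
              0, -1, 0,   0,             # column 2 of B
              0,  0, 0.5, 0), p, q)      # column 3 of B (filled column-wise)

Theta <- diag(q) + 0.1                   # illustrative positive definite precision matrix
Sigma <- solve(Theta)                    # error covariance is Theta^{-1}

# Rows of E are independent draws from MVN(0, Sigma).
E <- matrix(rnorm(n * q), n, q) %*% chol(Sigma)
Y <- X %*% B + E                         # complete response matrix
```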

Missing values are then introduced to create Z (the observed response matrix with NAs) according to the specified mechanism:

MCAR: Each element of column j has probability rho[j] of being missing, independent of all observed and unobserved variables.
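
A minimal sketch of MCAR masking, assuming independent Bernoulli draws per entry (the package may instead remove a fixed number of entries per column):

```r
set.seed(1)
n <- 1000; q <- 3
Y <- matrix(rnorm(n * q), n, q)
rho <- c(0.1, 0.2, 0.3)

# One logical column per response: TRUE marks an entry to be removed.
miss <- sapply(rho, function(r) runif(n) < r)

Z <- Y
Z[miss] <- NA
colMeans(is.na(Z))  # close to rho, but not exactly equal
```

Because the mask ignores both X and Y, the observed entries of each column remain a random subsample of that column.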

MAR: Missingness depends on the predictors through a logistic model:

P(Z_{ij} = \mathrm{NA}) = c_j \times \mathrm{logit}^{-1}\big((XB)_{ij}\big)

where c_j is calibrated to achieve the target missing rate.
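
One way to picture the calibration is to scale each column of probabilities so that its mean equals the target rate. The sketch below makes that assumption explicit; the package's actual calibration of c_j may differ, and the unit-variance noise used to build Y is also an assumption.

```r
set.seed(1)
n <- 2000; p <- 5; q <- 3
X <- matrix(rnorm(n * p), n, p)
B <- matrix(rnorm(p * q), p, q)
rho <- c(0.05, 0.10, 0.15)

prob.raw <- plogis(X %*% B)        # logit^{-1}(XB), an n x q matrix
c.j <- rho / colMeans(prob.raw)    # assumed calibration: column mean equals rho[j]
# Scale column j by c.j[j]; cap at 1 in case a scaled value exceeds it.
prob <- pmin(sweep(prob.raw, 2, c.j, "*"), 1)

miss <- matrix(runif(n * q), n, q) < prob
Y <- X %*% B + matrix(rnorm(n * q), n, q)
Z <- Y
Z[miss] <- NA
colMeans(is.na(Z))                 # roughly rho
```

Rows with a large linear predictor (XB)_{ij} are more likely to lose their response, which is what distinguishes this mechanism from MCAR.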

MNAR: In each column j, the lowest rho[j] proportion of values is set to missing, so missingness depends on the response values themselves.
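
The MNAR rule can be sketched by ranking each column and removing its smallest entries. Rounding the count to a whole number is an assumption made for this example.

```r
set.seed(1)
n <- 200; q <- 3
Y <- matrix(rnorm(n * q), n, q)
rho <- c(0.1, 0.2, 0.3)

Z <- Y
for (j in seq_len(q)) {
  n.miss <- round(rho[j] * n)                 # entries to remove in column j
  if (n.miss > 0) {
    lowest <- order(Y[, j])[seq_len(n.miss)]  # row indices of the smallest values
    Z[lowest, j] <- NA
  }
}
colSums(is.na(Z))  # number of missing entries per column
```

Since only small values are removed, the observed column means are biased upward relative to the complete data.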

Value

A list containing:

X

n x p matrix. Predictor matrix (either user-supplied or simulated).

Y

n x q matrix. Complete response matrix without missing values.

Z

n x q matrix. Response matrix with missing values (coded as NA).

Beta

p x q matrix. Regression coefficient matrix used in generation.

Theta

q x q matrix or NULL. Precision matrix (if used in generation).

rho

Numeric vector of length q. Missing rates for each response.

missing.type

Character string. The missing mechanism used.

Author(s)

Yixiao Zeng <yixiao.zeng@mail.mcgill.ca>, Celia M. T. Greenwood

See Also

missoNet for fitting models to data with missing values; cv.missoNet for cross-validation.

Examples

# Example 1: Basic usage with default settings
sim.dat <- generateData(n = 300, p = 50, q = 20, rho = 0.1, seed = 857)

# Check dimensions and missing rate
dim(sim.dat$X)      # 300 x 50
dim(sim.dat$Z)      # 300 x 20
mean(is.na(sim.dat$Z))  # approximately 0.1

# Example 2: Variable missing rates with MAR mechanism
rho.vec <- seq(0.05, 0.25, length.out = 20)
sim.dat <- generateData(n = 300, p = 50, q = 20, 
                       rho = rho.vec, 
                       missing.type = "MAR")

# Example 3: High sparsity in coefficient matrix
sim.dat <- generateData(n = 500, p = 100, q = 30,
                       rho = 0.15,
                       Beta.row.sparsity = 0.1,  # 10% active predictors
                       Beta.elm.sparsity = 0.3)  # 30% active in each row

# Example 4: User-supplied matrices
n <- 300; p <- 50; q <- 20
X <- matrix(rnorm(n*p), n, p)
Beta <- matrix(rnorm(p*q) * rbinom(p*q, 1, 0.1), p, q)  # about 10% non-zero
Theta <- diag(q) + 0.1  # Simple precision structure

sim.dat <- generateData(X = X, Beta = Beta, Theta = Theta,
                       n = n, p = p, q = q,
                       rho = 0.2, missing.type = "MNAR")


# Example 5: Use generated data with missoNet
library(missoNet)
sim.dat <- generateData(n = 400, p = 50, q = 10, rho = 0.15)

# Split into training and test sets
train.idx <- 1:300
test.idx <- 301:400

# Fit missoNet model
fit <- missoNet(X = sim.dat$X[train.idx, ], 
               Y = sim.dat$Z[train.idx, ],
               lambda.beta = 0.1, 
               lambda.theta = 0.1)

# Predict responses for the test set
pred <- predict(fit, newx = sim.dat$X[test.idx, ])



missoNet documentation built on Sept. 9, 2025, 5:55 p.m.