generateData | R Documentation |
Generates synthetic data from a conditional Gaussian graphical model with user-specified missing data mechanisms. This function is designed for simulation studies and testing of the missoNet package, supporting three types of missingness: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).
generateData(
  n,
  p,
  q,
  rho,
  missing.type = "MCAR",
  X = NULL,
  Beta = NULL,
  E = NULL,
  Theta = NULL,
  Sigma.X = NULL,
  Beta.row.sparsity = 0.2,
  Beta.elm.sparsity = 0.2,
  seed = NULL
)
n |
Integer. Sample size (number of observations). Must be at least 2. |
p |
Integer. Number of predictor variables. Must be at least 1. |
q |
Integer. Number of response variables. Must be at least 2. |
rho |
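Numeric scalar or vector of length q. Target proportion of missing values in each response variable; a scalar applies the same rate to every response. |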
missing.type |
Character string specifying the missing data mechanism. One of "MCAR" (the default), "MAR", or "MNAR". |
X |
Optional |
Beta |
Optional |
E |
Optional |
Theta |
Optional |
Sigma.X |
Optional |
Beta.row.sparsity |
Numeric in [0, 1]. Proportion of rows in Beta that contain at least one non-zero element. Default is 0.2. Only used when Beta = NULL. |
Beta.elm.sparsity |
Numeric in [0, 1]. Proportion of non-zero elements within active rows of Beta. Default is 0.2. Only used when Beta = NULL. |
seed |
Optional integer. Random seed for reproducibility. |
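To make the two sparsity arguments concrete, here is a minimal base R sketch of one way a coefficient matrix with this row-wise and element-wise sparsity could be built. The helper `make_sparse_beta` is hypothetical and is not part of missoNet; drawing the non-zero values from a standard normal and forcing at least one non-zero entry per active row are assumptions made for illustration, and the package's internal generator may differ.

```r
# Hypothetical helper (not part of missoNet): build a p x q coefficient matrix
# in which a fraction `row.sparsity` of rows is active, and a fraction
# `elm.sparsity` of the entries inside each active row is non-zero.
make_sparse_beta <- function(p, q, row.sparsity = 0.2, elm.sparsity = 0.2) {
  Beta <- matrix(0, p, q)
  n.active <- max(1, round(p * row.sparsity))   # number of active rows
  k <- max(1, round(q * elm.sparsity))          # non-zeros per active row
  active <- sort(sample.int(p, n.active))       # distinct row indices
  for (i in active) {
    cols <- sample.int(q, k)                    # distinct column indices
    Beta[i, cols] <- rnorm(k)                   # assumption: N(0, 1) effects
  }
  Beta
}

set.seed(3)
Beta <- make_sparse_beta(p = 50, q = 20)
sum(rowSums(Beta != 0) > 0)   # number of active rows
sum(Beta != 0)                # total number of non-zero coefficients
```

With the defaults and p = 50, q = 20, this gives 10 active rows with 4 non-zero entries each.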
The function generates data through the following model:
Y = XB + E
where:
X \in \mathbb{R}^{n \times p} is the predictor matrix
B \in \mathbb{R}^{p \times q} is the coefficient matrix
E \sim \mathcal{MVN}(0, \Theta^{-1}) is the error matrix
Y \in \mathbb{R}^{n \times q} is the complete response matrix
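The model above can be sketched in base R as follows. The helper `simulate_response` is hypothetical and only illustrates the algebra; it assumes the rows of E are drawn independently from a multivariate normal with covariance \Theta^{-1}, and it is not the package's implementation.

```r
# Hypothetical helper (not part of missoNet): simulate Y = X B + E, where each
# row of E is drawn independently from MVN(0, solve(Theta)).
simulate_response <- function(X, B, Theta) {
  n <- nrow(X)
  q <- ncol(B)
  Sigma <- solve(Theta)                     # covariance = inverse of precision
  R <- chol(Sigma)                          # upper triangular, t(R) %*% R = Sigma
  E <- matrix(rnorm(n * q), n, q) %*% R     # rows have covariance Sigma
  list(Y = X %*% B + E, E = E)
}

set.seed(1)
n <- 200; p <- 5; q <- 3
X <- matrix(rnorm(n * p), n, p)
B <- matrix(rnorm(p * q), p, q)
Theta <- diag(q) + 0.1                      # simple positive definite precision
sim <- simulate_response(X, B, Theta)
dim(sim$Y)                                  # n x q
```

Multiplying standard normal draws by the Cholesky factor of \Theta^{-1} is one common way to obtain correlated errors without extra packages.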
Missing values are then introduced to create Z (the observed response matrix with NAs) according to the specified mechanism:
MCAR: Each element of column j has probability rho[j] of being missing, independent of all variables.
MAR: Missingness depends on the predictors through a logistic model: P(Z_{ij} = NA) = \mathrm{logit}^{-1}(XB)_{ij} \times c_j, where c_j is calibrated to achieve the target missing rate.
MNAR: The lowest rho[j] proportion of values in each column is set as missing.
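The three mechanisms can be sketched in base R as below. All helper names (`mask_mcar`, `mask_mar`, `mask_mnar`, `apply_mask`) are hypothetical and not part of missoNet. In particular, the MAR calibration shown here (dividing rho[j] by the column mean of the inverse-logit probabilities, then capping at 1) is only one simple choice and is an assumption, since the exact calibration used by the package is not described here.

```r
# Hypothetical helpers (not part of missoNet). Each returns a logical n x q
# matrix in which TRUE marks an entry to be set to NA.

# MCAR: entry (i, j) is missing with probability rho[j], independent of the data.
mask_mcar <- function(Y, rho) {
  rho <- rep_len(rho, ncol(Y))
  U <- matrix(runif(length(Y)), nrow(Y), ncol(Y))
  sweep(U, 2, rho, "<")
}

# MAR: missingness probability is logit^{-1}(XB)_{ij} * c_j.
# Assumption: c_j = rho[j] / mean_i logit^{-1}(XB)_{ij}, capped so P <= 1.
mask_mar <- function(X, B, rho) {
  P <- plogis(X %*% B)
  rho <- rep_len(rho, ncol(P))
  c_j <- rho / colMeans(P)
  P <- pmin(sweep(P, 2, c_j, "*"), 1)
  matrix(runif(length(P)), nrow(P), ncol(P)) < P
}

# MNAR: the lowest rho[j] proportion of values in column j is missing.
mask_mnar <- function(Y, rho) {
  rho <- rep_len(rho, ncol(Y))
  M <- matrix(FALSE, nrow(Y), ncol(Y))
  for (j in seq_len(ncol(Y))) {
    k <- round(rho[j] * nrow(Y))            # number of entries to drop
    if (k > 0) M[order(Y[, j])[seq_len(k)], j] <- TRUE
  }
  M
}

# Turn a mask into an observed matrix Z with NAs.
apply_mask <- function(Y, M) { Z <- Y; Z[M] <- NA; Z }

set.seed(2)
n <- 500; p <- 4; q <- 3
X <- matrix(rnorm(n * p), n, p)
B <- matrix(rnorm(p * q), p, q)
Y <- X %*% B + matrix(rnorm(n * q), n, q)
rho <- c(0.1, 0.2, 0.3)

M_mcar <- mask_mcar(Y, rho)
M_mar  <- mask_mar(X, B, rho)
Z_mnar <- apply_mask(Y, mask_mnar(Y, rho))
colMeans(is.na(Z_mnar))                     # per-column missing rates
```

Under MCAR and MAR the realized missing rates fluctuate around rho[j], while this MNAR sketch removes a fixed number of the smallest values per column, so the observed column means of Z are biased upward relative to Y.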
A list containing:
X |
|
Y |
|
Z |
|
Beta |
|
Theta |
|
rho |
Numeric vector of length q. |
missing.type |
Character string. The missing mechanism used. |
Yixiao Zeng yixiao.zeng@mail.mcgill.ca, Celia M. T. Greenwood
missoNet for fitting models to data with missing values, cv.missoNet for cross-validation
# Example 1: Basic usage with default settings
sim.dat <- generateData(n = 300, p = 50, q = 20, rho = 0.1, seed = 857)
# Check dimensions and missing rate
dim(sim.dat$X) # 300 x 50
dim(sim.dat$Z) # 300 x 20
mean(is.na(sim.dat$Z)) # approximately 0.1
# Example 2: Variable missing rates with MAR mechanism
rho.vec <- seq(0.05, 0.25, length.out = 20)
sim.dat <- generateData(n = 300, p = 50, q = 20,
                        rho = rho.vec,
                        missing.type = "MAR")
# Example 3: High sparsity in coefficient matrix
sim.dat <- generateData(n = 500, p = 100, q = 30,
                        rho = 0.15,
                        Beta.row.sparsity = 0.1,  # 10% active predictors
                        Beta.elm.sparsity = 0.3)  # 30% active in each row
# Example 4: User-supplied matrices
n <- 300; p <- 50; q <- 20
X <- matrix(rnorm(n*p), n, p)
Beta <- matrix(rnorm(p*q) * rbinom(p*q, 1, 0.1), p, q) # 10% non-zero
Theta <- diag(q) + 0.1 # Simple precision structure
sim.dat <- generateData(X = X, Beta = Beta, Theta = Theta,
                        n = n, p = p, q = q,
                        rho = 0.2, missing.type = "MNAR")
# Example 5: Use generated data with missoNet
library(missoNet)
sim.dat <- generateData(n = 400, p = 50, q = 10, rho = 0.15)
# Split into training and test sets
train.idx <- 1:300
test.idx <- 301:400
# Fit missoNet model
fit <- missoNet(X = sim.dat$X[train.idx, ],
                Y = sim.dat$Z[train.idx, ],
                lambda.beta = 0.1,
                lambda.theta = 0.1)
# Evaluate on test set
pred <- predict(fit, newx = sim.dat$X[test.idx, ])