simdat: Data generation function for various underlying models

Description

Generates simulated training, labeled, and unlabeled datasets, together with predicted outcomes, under one of several underlying models.

Usage

simdat(
  n = c(300, 300, 300),
  effect = 1,
  sigma_Y = 1,
  model = "ols",
  shift = 0,
  scale = 1
)

Arguments

n

Integer vector of size 3 indicating the sample sizes of the training, labeled, and unlabeled datasets, respectively.

effect

Regression coefficient for the first variable of interest for inference. Default is 1.

sigma_Y

Residual variance for the generated outcome. Default is 1.

model

The type of model to be generated. Must be one of "mean", "quantile", "ols", "logistic", or "poisson". Default is "ols".

shift

Scalar shift of the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 0.

scale

Scaling factor for the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 1.

Details

The simdat function generates three datasets consisting of independent realizations of Y (for model = "mean" or "quantile"), or \{Y, \boldsymbol{X}\} (for model = "ols", "logistic", or "poisson"): a training dataset of size n_t, a labeled dataset of size n_l, and an unlabeled dataset of size n_u. These sizes are specified by the argument n.

NOTE: In the unlabeled data subset, outcome data are still generated to facilitate a benchmark for comparison with an "oracle" model that uses the true Y^{\mathcal{U}} values for estimation and inference.
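
For example, the following call generates the three subsets in one stacked data.frame and tallies them via the set_label column described under Value below:

# Generate 100 training, 50 labeled, and 200 unlabeled observations
dat <- simdat(n = c(100, 50, 200), effect = 1, sigma_Y = 1, model = "ols")

# Confirm the sizes of the three subsets
table(dat$set_label)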

Generating Data

For "mean" and "quantile", we simulate a continuous outcome, Y \in \mathbb{R}, with mean given by the effect argument and error variance given by the sigma_y argument.

For "ols", "logistic", or "poisson" models, predictor data, \boldsymbol{X} \in \mathbb{R}^4 are simulated such that the ith observation follows a standard multivariate normal distribution with a zero mean vector and identity covariance matrix:

\boldsymbol{X_i} = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) \sim \mathcal{N}_4(\boldsymbol{0}, \boldsymbol{I}).

For "ols", a continuous outcome Y \in \mathbb{R} is simulated to depend on X_1 through a linear term with the effect size specified by the effect argument, while the other predictors, \boldsymbol{X} \setminus X_1, have nonlinear effects:

Y_i = effect \times X_{i1} + \frac{1}{2} X_{i2}^2 + \frac{1}{3} X_{i3}^3 + \frac{1}{4} X_{i4}^2 + \varepsilon_y,

and \varepsilon_y \sim \mathcal{N}(0, sigma_Y), where the sigma_Y argument specifies the error variance.
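
As a minimal base-R sketch of this data-generating step (the object names here are illustrative, not the internals of simdat):

# Illustrative version of the "ols" outcome model
n_total <- 300
effect  <- 1
sigma_Y <- 1

X <- matrix(rnorm(n_total * 4), ncol = 4)        # rows are X_i ~ N_4(0, I)
Y <- effect * X[, 1] + X[, 2]^2 / 2 + X[, 3]^3 / 3 + X[, 4]^2 / 4 +
  rnorm(n_total, sd = sqrt(sigma_Y))             # error variance sigma_Y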

For "logistic", we simulate:

\Pr(Y_i = 1 \mid \boldsymbol{X}_i) = \operatorname{logit}^{-1}\left(effect \times X_{i1} + \frac{1}{2} X_{i2}^2 + \frac{1}{3} X_{i3}^3 + \frac{1}{4} X_{i4}^2 + \varepsilon_y\right)

and generate:

Y_i \sim \text{Bernoulli}\left[\Pr(Y_i = 1 \mid \boldsymbol{X}_i)\right],

where \varepsilon_y \sim \mathcal{N}(0, sigma_Y).
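
Continuing the base-R sketch above (illustrative only):

# Illustrative version of the "logistic" outcome model
lin_pred <- effect * X[, 1] + X[, 2]^2 / 2 + X[, 3]^3 / 3 + X[, 4]^2 / 4 +
  rnorm(n_total, sd = sqrt(sigma_Y))
p <- plogis(lin_pred)                            # inverse-logit
Y <- rbinom(n_total, size = 1, prob = p)         # Bernoulli draws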

For "poisson", we simulate:

\lambda_i = \exp\left(effect \times X_{i1} + \frac{1}{2} X_{i2}^2 + \frac{1}{3} X_{i3}^3 + \frac{1}{4} X_{i4}^2 + \varepsilon_y\right)

and generate:

Y_i \sim \text{Poisson}(\lambda_i).
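
Again as an illustrative base-R sketch:

# Illustrative version of the "poisson" outcome model
lambda <- exp(effect * X[, 1] + X[, 2]^2 / 2 + X[, 3]^3 / 3 + X[, 4]^2 / 4 +
  rnorm(n_total, sd = sqrt(sigma_Y)))
Y <- rpois(n_total, lambda = lambda)             # Poisson counts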

Generating Predictions

To generate predicted outcomes for "mean" and "quantile", we simulate a continuous variable with mean given by the empirical mean of the training data and error variance given by the sigma_Y argument.
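
Sketched in base R (Y_train here is a hypothetical vector standing in for the training outcomes):

# Illustrative predicted outcomes for "mean" and "quantile"
Y_train <- rnorm(300, mean = effect, sd = sqrt(sigma_Y))  # stand-in training outcomes
n_pred  <- 200                                   # labeled + unlabeled observations
f <- rnorm(n_pred, mean = mean(Y_train), sd = sqrt(sigma_Y))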

For "ols", we fit a generalized additive model (GAM) on the simulated training dataset and calculate predictions for the labeled and unlabeled datasets as deterministic functions of \boldsymbol{X}. Specifically, we fit the following GAM:

Y^{\mathcal{T}} = s_0 + s_1(X_1^{\mathcal{T}}) + s_2(X_2^{\mathcal{T}}) + s_3(X_3^{\mathcal{T}}) + s_4(X_4^{\mathcal{T}}) + \varepsilon_p,

where \mathcal{T} denotes the training dataset, s_0 is an intercept term, and s_1(\cdot), s_2(\cdot), s_3(\cdot), and s_4(\cdot) are smoothing spline functions for X_1, X_2, X_3, and X_4, respectively, with three target equivalent degrees of freedom. Residual error is modeled as \varepsilon_p.

Predictions for labeled and unlabeled datasets are calculated as:

f(\boldsymbol{X}^{\mathcal{L}\cup\mathcal{U}}) = \hat{s}_0 + \hat{s}_1(X_1^{\mathcal{L}\cup\mathcal{U}}) + \hat{s}_2(X_2^{\mathcal{L}\cup\mathcal{U}}) + \hat{s}_3(X_3^{\mathcal{L}\cup\mathcal{U}}) + \hat{s}_4(X_4^{\mathcal{L}\cup\mathcal{U}}),

where \hat{s}_0, \hat{s}_1, \hat{s}_2, \hat{s}_3, and \hat{s}_4 are estimates of s_0, s_1, s_2, s_3, and s_4, respectively.

NOTE: For continuous outcomes, we provide optional arguments shift and scale to further apply a location shift and scaling factor, respectively, to the predicted outcomes. These default to shift = 0 and scale = 1, i.e., no location shift or scaling.
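
A sketch of this fitting and prediction step, here using the gam package (the exact call inside simdat may differ; s(x, df = 3) requests three target equivalent degrees of freedom, and the order in which shift and scale are applied below is an assumption):

library(gam)

# Fit the GAM on the training subset; column names follow the Value section
train <- subset(dat, set_label == "training")
fit   <- gam(Y ~ s(X1, df = 3) + s(X2, df = 3) + s(X3, df = 3) + s(X4, df = 3),
             data = train)

# Deterministic predictions for the labeled and unlabeled subsets
new_dat <- subset(dat, set_label != "training")
f_hat   <- predict(fit, newdata = new_dat)

# Optional location shift and scaling of the predictions (order assumed)
shift <- 0
scale <- 1
f_hat <- scale * (f_hat + shift)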

For "logistic", we train k-nearest neighbors (k-NN) classifiers on the simulated training dataset for values of k ranging from 1 to 10. The optimal k is chosen via cross-validation, minimizing the misclassification error on the validation folds. Predictions for the labeled and unlabeled datasets are obtained by applying the k-NN classifier with the optimal k to \boldsymbol{X}.

Specifically, for each observation in the labeled and unlabeled datasets:

\hat{Y} = \operatorname{argmax}_c \sum_{i \in \mathcal{N}_k} I(Y_i = c),

where \mathcal{N}_k represents the set of k nearest neighbors in the training dataset, c indexes the possible classes (0 or 1), and I(\cdot) is an indicator function.
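
Sketched with the class package, using leave-one-out cross-validation as a simple stand-in for the fold-based selection described above (continuing the earlier sketch, and assuming dat, train, and new_dat were regenerated with model = "logistic" so that Y is binary):

library(class)

# Candidate neighborhood sizes and their cross-validated error rates
X_train <- as.matrix(train[, c("X1", "X2", "X3", "X4")])
cl      <- factor(train$Y)
cv_err  <- sapply(1:10, function(k) mean(knn.cv(X_train, cl, k = k) != cl))
k_opt   <- which.min(cv_err)

# Predicted classes for the labeled and unlabeled subsets
X_new <- as.matrix(new_dat[, c("X1", "X2", "X3", "X4")])
f_hat <- knn(X_train, X_new, cl, k = k_opt)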

For "poisson", we fit a generalized linear model (GLM) with a log link function to the simulated training dataset. The model is of the form:

\log(\mu^{\mathcal{T}}) = \gamma_0 + \gamma_1 X_1^{\mathcal{T}} + \gamma_2 X_2^{\mathcal{T}} + \gamma_3 X_3^{\mathcal{T}} + \gamma_4 X_4^{\mathcal{T}},

where \mu^{\mathcal{T}} is the expected count for the response variable in the training dataset, \gamma_0 is the intercept, and \gamma_1, \gamma_2, \gamma_3, and \gamma_4 are the regression coefficients for the predictors X_1, X_2, X_3, and X_4, respectively.

Predictions for the labeled and unlabeled datasets are calculated as:

\hat{\mu}^{\mathcal{L} \cup \mathcal{U}} = \exp(\hat{\gamma}_0 + \hat{\gamma}_1 X_1^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_2 X_2^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_3 X_3^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_4 X_4^{\mathcal{L} \cup \mathcal{U}}),

where \hat{\gamma}_0, \hat{\gamma}_1, \hat{\gamma}_2, \hat{\gamma}_3, and \hat{\gamma}_4 are the estimated coefficients.
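
As a sketch with stats::glm (continuing the earlier sketch, and assuming dat, train, and new_dat were regenerated with model = "poisson" so that Y is a count):

# Fit the Poisson GLM with a log link on the training subset
fit_pois <- glm(Y ~ X1 + X2 + X3 + X4, family = poisson(link = "log"),
                data = train)

# Predicted means for the labeled and unlabeled subsets
f_hat <- predict(fit_pois, newdata = new_dat, type = "response")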

Value

A data.frame containing sum(n) rows and columns corresponding to the labeled outcome (Y), the predicted outcome (f), a character variable (set_label) indicating which data set the observation belongs to (training, labeled, or unlabeled), and four independent, normally distributed predictors (X1, X2, X3, and X4), where applicable.

Examples


#-- Mean
dat_mean <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1,
  model = "mean")
head(dat_mean)

#-- Linear Regression
dat_ols <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1,
  model = "ols")
head(dat_ols)

