simdat | R Documentation |
Data generation function for various underlying models
simdat(
n = c(300, 300, 300),
effect = 1,
sigma_Y = 1,
model = "ols",
shift = 0,
scale = 1
)
n |
Integer vector of size 3 indicating the sample sizes in the training, labeled, and unlabeled data sets, respectively |
effect |
Regression coefficient for the first variable of interest for inference. Defaults is 1. |
sigma_Y |
Residual variance for the generated outcome. Defaults is 1. |
model |
The type of model to be generated. Must be one of
|
shift |
Scalar shift of the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 0. |
scale |
Scaling factor for the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 1. |
The simdat
function generates three datasets consisting of independent
realizations of Y
(for model
= "mean"
or
"quantile"
), or \{Y, \boldsymbol{X}\}
(for model
=
"ols"
, "logistic"
, or "poisson"
): a training
dataset of size n_t
, a labeled dataset of size n_l
, and
an unlabeled dataset of size n_u
. These sizes are specified by
the argument n
.
NOTE: In the unlabeled data subset, outcome data are still generated
to facilitate a benchmark for comparison with an "oracle" model that uses
the true Y^{\mathcal{U}}
values for estimation and inference.
Generating Data
For "mean"
and "quantile"
, we simulate a continuous outcome,
Y \in \mathbb{R}
, with mean given by the effect
argument and
error variance given by the sigma_y
argument.
For "ols"
, "logistic"
, or "poisson"
models, predictor
data, \boldsymbol{X} \in \mathbb{R}^4
are simulated such that the
i
th observation follows a standard multivariate normal distribution
with a zero mean vector and identity covariance matrix:
\boldsymbol{X_i} = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) \sim \mathcal{N}_4(\boldsymbol{0}, \boldsymbol{I}).
For "ols"
, a continuous outcome Y \in \mathbb{R}
is simulated
to depend on X_1
through a linear term with the effect size specified
by the effect
argument, while the other predictors,
\boldsymbol{X} \setminus X_1
, have nonlinear effects:
Y_i = effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y,
and \varepsilon_y \sim \mathcal{N}(0, sigma_y)
, where the
sigma_y
argument specifies the error variance.
For "logistic"
, we simulate:
\Pr(Y_i = 1 \mid \boldsymbol{X}) = logit^{-1}(effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y)
and generate:
Y_i \sim Bern[1, \Pr(Y_i = 1 \mid \boldsymbol{X})]
where \varepsilon_y \sim \mathcal{N}(0, sigma\_y)
.
For "poisson"
, we simulate:
\lambda_Y = exp(effect \times Z_{i1} + \frac{1}{2} Z_{i2}^2 + \frac{1}{3} Z_{i3}^3 + \frac{1}{4} Z_{i4}^2 + \varepsilon_y)
and generate:
Y_i \sim Poisson(\lambda_Y)
Generating Predictions
To generate predicted outcomes for "mean"
and "quantile"
, we
simulate a continuous variable with mean given by the empirical mean of the
training data and error variance given by the sigma_y
argument.
For "ols"
, we fit a generalized additive model (GAM) on the
simulated training dataset and calculate predictions for the
labeled and unlabeled datasets as deterministic functions of
\boldsymbol{X}
. Specifically, we fit the following GAM:
Y^{\mathcal{T}} = s_0 + s_1(X_1^{\mathcal{T}}) + s_2(X_2^{\mathcal{T}}) +
s_3(X_3^{\mathcal{T}}) + s_4(X_4^{\mathcal{T}}) + \varepsilon_p,
where \mathcal{T}
denotes the training dataset, s_0
is an
intercept term, and s_1(\cdot)
, s_2(\cdot)
, s_3(\cdot)
,
and s_4(\cdot)
are smoothing spline functions for X_1
, X_2
,
X_3
, and X_4
, respectively, with three target equivalent degrees
of freedom. Residual error is modeled as \varepsilon_p
.
Predictions for labeled and unlabeled datasets are calculated as:
f(\boldsymbol{X}^{\mathcal{L}\cup\mathcal{U}}) = \hat{s}_0 + \hat{s}_1(X_1^{\mathcal{L}\cup\mathcal{U}}) +
\hat{s}_2(X_2^{\mathcal{L}\cup\mathcal{U}}) + \hat{s}_3(X_3^{\mathcal{L}\cup\mathcal{U}}) +
\hat{s}_4(X_4^{\mathcal{L}\cup\mathcal{U}}),
where \hat{s}_0, \hat{s}_1, \hat{s}_2, \hat{s}_3
, and \hat{s}_4
are estimates of s_0, s_1, s_2, s_3
, and s_4
, respectively.
NOTE: For continuous outcomes, we provide optional arguments shift
and
scale
to further apply a location shift and scaling factor,
respectively, to the predicted outcomes. These default to shift = 0
and scale = 1
, i.e., no location shift or scaling.
For "logistic"
, we train k-nearest neighbors (k-NN) classifiers on
the simulated training dataset for values of k
ranging from 1
to 10. The optimal k
is chosen via cross-validation, minimizing the
misclassification error on the validation folds. Predictions for the
labeled and unlabeled datasets are obtained by applying the
k-NN classifier with the optimal k
to \boldsymbol{X}
.
Specifically, for each observation in the labeled and unlabeled datasets:
\hat{Y} = \operatorname{argmax}_c \sum_{i \in \mathcal{N}_k} I(Y_i = c),
where \mathcal{N}_k
represents the set of k
nearest neighbors in
the training dataset, c
indexes the possible classes (0 or 1), and
I(\cdot)
is an indicator function.
For "poisson"
, we fit a generalized linear model (GLM) with a log link
function to the simulated training dataset. The model is of the form:
\log(\mu^{\mathcal{T}}) = \gamma_0 + \gamma_1 X_1^{\mathcal{T}} + \gamma_2 X_2^{\mathcal{T}} +
\gamma_3 X_3^{\mathcal{T}} + \gamma_4 X_4^{\mathcal{T}},
where \mu^{\mathcal{T}}
is the expected count for the response variable
in the training dataset, \gamma_0
is the intercept, and
\gamma_1
, \gamma_2
, \gamma_3
, and \gamma_4
are the
regression coefficients for the predictors X_1
, X_2
, X_3
,
and X_4
, respectively.
Predictions for the labeled and unlabeled datasets are calculated as:
\hat{\mu}^{\mathcal{L} \cup \mathcal{U}} = \exp(\hat{\gamma}_0 + \hat{\gamma}_1 X_1^{\mathcal{L} \cup \mathcal{U}} +
\hat{\gamma}_2 X_2^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_3 X_3^{\mathcal{L} \cup \mathcal{U}} +
\hat{\gamma}_4 X_4^{\mathcal{L} \cup \mathcal{U}}),
where \hat{\gamma}_0
, \hat{\gamma}_1
, \hat{\gamma}_2
, \hat{\gamma}_3
,
and \hat{\gamma}_4
are the estimated coefficients.
A data.frame containing n rows and columns corresponding to the labeled outcome (Y), the predicted outcome (f), a character variable (set_label) indicating which data set the observation belongs to (training, labeled, or unlabeled), and four independent, normally distributed predictors (X1, X2, X3, and X4), where applicable.
#-- Mean
dat_mean <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1,
model = "mean")
head(dat_mean)
#-- Linear Regression
dat_ols <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1,
model = "ols")
head(dat_ols)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.