In Yu-Group/dgpoix: Generate synthetic data that is as fresh as the real thing

correlated_logistic_gaussian_dgp

R Documentation

Generate correlated Gaussian covariates and (binary) logistic response data.

Description

Generate normally-distributed covariates that are potentially correlated and (binary) logistic response data.

Usage

correlated_logistic_gaussian_dgp(
  n,
  p_uncorr,
  p_corr,
  s_uncorr = p_uncorr,
  s_corr = p_corr,
  corr,
  betas_uncorr = NULL,
  betas_corr = NULL,
  betas_uncorr_sd = 1,
  betas_corr_sd = 1,
  intercept = 0,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

`n`	Number of samples.
`p_uncorr`	Number of uncorrelated features.
`p_corr`	Number of features in correlated group.
`s_uncorr`	Sparsity level of features in uncorrelated group. Coefficients corresponding to features after the `s_uncorr` position (i.e., positions i = `s_uncorr` + 1, ..., `p_uncorr`) are set to 0.
`s_corr`	Sparsity level of features in correlated group. Coefficients corresponding to features after the `s_corr` position (i.e., positions i = `s_corr` + 1, ..., `p_corr`) are set to 0.
`corr`	Correlation between features in correlated group.
`betas_uncorr`	Coefficient vector for uncorrelated features. If a scalar is provided, the coefficient vector is constant. If `NULL` (default), entries in the coefficient vector are drawn iid from N(0, `betas_uncorr_sd`^2). Can also be a function that generates the coefficient vector; see `generate_coef()`.
`betas_corr`	Coefficient vector for correlated features. If a scalar is provided, the coefficient vector is constant. If `NULL` (default), entries in the coefficient vector are drawn iid from N(0, `betas_corr_sd`^2). Can also be a function that generates the coefficient vector; see `generate_coef()`.
`betas_uncorr_sd`	(Optional) SD of the normal distribution from which to draw `betas_uncorr`. Only used if `betas_uncorr` is `NULL` or a function; in the latter case, `betas_uncorr_sd` is optionally passed to the function as `sd`. See `generate_coef()`.
`betas_corr_sd`	(Optional) SD of the normal distribution from which to draw `betas_corr`. Only used if `betas_corr` is `NULL` or a function; in the latter case, `betas_corr_sd` is optionally passed to the function as `sd`. See `generate_coef()`.
`intercept`	Scalar intercept term.
`data_split`	Logical; if `TRUE`, splits data into training and test sets according to `train_prop`.
`train_prop`	Proportion of data in training set if `data_split = TRUE`.
`return_values`	Character vector indicating which objects to return in the list. Each element must be one of "X", "y", "support".
`...`	Not used.
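
The sparsity and coefficient arguments above can be illustrated with a small base-R sketch. The helper `make_betas()` is hypothetical and not part of dgpoix; it covers only the `NULL` and scalar cases described above, and assumes the first `s` entries are the nonzero ones. The package's own logic lives in `generate_coef()` and may differ.

```r
# Hypothetical helper (not part of dgpoix): build a length-p coefficient
# vector whose first s entries may be nonzero and whose remaining
# p - s entries are set to 0.
make_betas <- function(p, s = p, betas = NULL, sd = 1) {
  stopifnot(s >= 0, s <= p)
  if (is.null(betas)) {
    active <- rnorm(s, mean = 0, sd = sd)  # iid N(0, sd^2) draws
  } else if (length(betas) == 1) {
    active <- rep(betas, s)                # scalar -> constant coefficients
  } else {
    stop("this sketch only handles NULL or scalar `betas`")
  }
  c(active, rep(0, p - s))
}

set.seed(1)
b_const <- make_betas(p = 5, s = 2, betas = 3)  # constant on the support
b_rand <- make_betas(p = 5, s = 2, sd = 0.5)    # random on the support
```

In both calls only positions 1 and 2 can be nonzero, mirroring how `s_uncorr` and `s_corr` zero out coefficients after the sparsity position.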

Details

Data is generated via:

log(p / (1 - p)) = intercept + betas_uncorr %*% X_uncorr + betas_corr %*% X_corr,

where p = P(y = 1 | X), X_uncorr is an (uncorrelated) standard Gaussian random matrix, and X_corr is a correlated Gaussian random matrix with variance 1 and Cor(X_corr_i, X_corr_j) = corr for all i ≠ j. The true underlying support of this data is the first s_uncorr features in X_uncorr and the first s_corr features in X_corr.
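
The model above can be simulated directly in base R. The following is a minimal sketch under stated assumptions, not the package's implementation: the correlated block is drawn using a Cholesky factor of the equicorrelation matrix, and all variable names and parameter values are chosen for illustration only.

```r
set.seed(123)
n <- 200; p_uncorr <- 3; p_corr <- 4; corr <- 0.7; intercept <- 0

# Equicorrelated covariance matrix: 1 on the diagonal, corr elsewhere.
Sigma <- matrix(corr, nrow = p_corr, ncol = p_corr)
diag(Sigma) <- 1

# Uncorrelated block: iid standard normal entries.
X_uncorr <- matrix(rnorm(n * p_uncorr), nrow = n)

# Correlated block: rows ~ N(0, Sigma). chol() returns the upper-triangular
# factor R with t(R) %*% R = Sigma, so Z %*% R has covariance Sigma.
Z <- matrix(rnorm(n * p_corr), nrow = n)
X_corr <- Z %*% chol(Sigma)

# Coefficients drawn iid from N(0, 1), matching the documented defaults.
betas_uncorr <- rnorm(p_uncorr)
betas_corr <- rnorm(p_corr)

# Log-odds, success probabilities, and Bernoulli responses.
eta <- intercept + drop(X_uncorr %*% betas_uncorr) + drop(X_corr %*% betas_corr)
prob <- plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))
y <- rbinom(n, size = 1, prob = prob)
```

With a moderately large `n`, `cor(X_corr)` should have off-diagonal entries near `corr`, which is a quick way to sanity-check the correlated block.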

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X: A data.frame.
y: A response vector of length nrow(X).
support: A vector of feature indices indicating all features used in the true support of the DGP.

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
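
To make the shape of the split output concrete, here is a base-R sketch of a list with the "X", "y", "Xtest", and "ytest" slots. It assumes rows are assigned to the training set uniformly at random and that the training size is `round(train_prop * n)`; both are illustrative assumptions, and the package's actual splitting procedure may differ.

```r
set.seed(42)
n <- 10; train_prop <- 0.5

# Toy data standing in for the generated covariates and responses.
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, size = 1, prob = 0.5)

# Assumption: training rows are sampled uniformly at random without replacement.
train_idx <- sample(seq_len(n), size = round(train_prop * n))

out <- list(
  X = X[train_idx, , drop = FALSE],       # training covariates
  y = y[train_idx],                       # training responses
  Xtest = X[-train_idx, , drop = FALSE],  # held-out covariates
  ytest = y[-train_idx]                   # held-out responses
)
```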

Examples

# generate data from: log(p / (1 - p)) = betas_corr_1 * x_corr_1 + betas_corr_2 * x_corr_2,
# where betas_corr_1, betas_corr_2 ~ N(0, 1),
# Var(X_corr_i) = 1, Cor(X_corr_i, X_corr_j) = 0.7 for all i != j in 1, ..., 10
sim_data <- correlated_logistic_gaussian_dgp(n = 100, p_uncorr = 0, p_corr = 10,
                                             s_corr = 2, corr = 0.7)

# generate data from: log(p / (1 - p)) = betas_uncorr %*% X_uncorr - X_corr_1,
# where betas_uncorr ~ N(0, 1), betas_corr = [-1, 0], X_uncorr ~ N(0, I_10),
# X_corr ~ N(0, Sigma), Sigma has 1s on diagonals and 0.7 elsewhere.
sim_data <- correlated_logistic_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 2,
                                             corr = 0.7, betas_uncorr_sd = 1,
                                             betas_corr = c(-1, 0))

Yu-Group/dgpoix documentation built on June 3, 2022, 1:40 a.m.