correlated_lss_gaussian_dgp: Generate correlated Gaussian covariates and LSS response...
In Yu-Group/dgpoix: Generate synthetic data that is as fresh as the real thing

correlated_lss_gaussian_dgp

R Documentation

Generate correlated Gaussian covariates and LSS response data.

Description

Generate normally-distributed covariates that are potentially correlated and LSS (locally spiky sparse) response data with a specified error distribution.

Usage

correlated_lss_gaussian_dgp(
  n,
  p_uncorr,
  p_corr,
  s_uncorr = p_uncorr,
  s_corr = p_corr,
  corr,
  k,
  thresholds = 0,
  signs = 1,
  betas = 1,
  intercept = 0,
  overlap = FALSE,
  mixed_int = FALSE,
  err = NULL,
  data_split = FALSE,
  train_prop = 0.5,
  return_values = c("X", "y", "support"),
  ...
)

Arguments

`n`	Number of samples.
`p_uncorr`	Number of uncorrelated features.
`p_corr`	Number of features in correlated group.
`s_uncorr`	Number of interactions from features in uncorrelated group.
`s_corr`	Number of interactions from features in correlated group.
`corr`	Correlation between features in correlated group.
`k`	Order of the interactions.
`thresholds`	A scalar or an s x k matrix of the thresholds for each term in the LSS model.
`signs`	A scalar or an s x k matrix of the sign of each interaction (1 means > while -1 means <).
`betas`	Scalar, vector, or function to generate coefficients corresponding to interaction terms. See `generate_coef()`.
`intercept`	Scalar intercept term.
`overlap`	If `TRUE`, simulate support indices with replacement; if `FALSE`, simulate support indices without replacement (so no overlap).
`mixed_int`	If `TRUE`, correlated and uncorrelated variables are mixed together when constructing an order-k interaction. If `FALSE`, each order-k interaction is composed of only correlated variables or only uncorrelated variables.
`err`	Function from which to generate simulated error vector. Default is `NULL` which adds no error to the DGP.
`data_split`	Logical; if `TRUE`, splits data into training and test sets according to `train_prop`.
`train_prop`	Proportion of data in training set if `data_split = TRUE`.
`return_values`	Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support", "int_support".
`...`	Other arguments to pass to err() to generate the error vector.
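
As a rough illustration of how `err` and `...` fit together, the sketch below defines a custom error generator and forwards an extra argument to it. The helper `t_err` is hypothetical (not part of dgpoix), and the sketch assumes the error function is called as `err(n, ...)`, which is what the `err = rnorm` usage suggests. The package call is guarded so the snippet still runs when dgpoix is not installed.

```r
# Hypothetical heavy-tailed error generator (illustration only, not part of dgpoix).
# Assumption: the DGP calls err(n, ...), so the first argument is the sample size.
t_err <- function(n, df = 3) rt(n, df = df)

if (requireNamespace("dgpoix", quietly = TRUE)) {
  sim_data <- dgpoix::correlated_lss_gaussian_dgp(
    n = 100, p_uncorr = 10, p_corr = 10,
    s_uncorr = 1, s_corr = 1, corr = 0.7, k = 2,
    err = t_err, df = 5  # df is assumed to be forwarded to t_err() via ...
  )
}
```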

Details

Data is generated via:

y = intercept + \sum_{i = 1}^{s} \beta_i \prod_{j = 1}^{k} 1(X_{S_j} \lessgtr thresholds_{ij}) + err(...),

where X = [X_uncorr, X_corr], X_uncorr is an (uncorrelated) standard Gaussian random matrix, and X_corr is a correlated Gaussian random matrix with variance 1 and Cor(X_corr_i, X_corr_j) = corr for all i, j. If overlap = TRUE, then the true interaction support is randomly chosen from the (p_uncorr + p_corr) features in X. If overlap = FALSE, then the true interaction support is sequentially taken from the first s_uncorr*k features in X_uncorr and the first s_corr*k features in X_corr.
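
The construction described above can be sketched in base R. This is only an illustration under stated assumptions, not the dgpoix implementation: the shared-factor trick used for the correlated block assumes 0 <= corr <= 1, the support follows the `overlap = FALSE` rule (first features of each block), and all variable names are chosen for this example.

```r
set.seed(1)
n <- 2000; p_uncorr <- 4; p_corr <- 4; corr <- 0.7; k <- 2

# Uncorrelated block: i.i.d. standard Gaussian columns.
X_uncorr <- matrix(rnorm(n * p_uncorr), nrow = n)

# Correlated block via a shared factor (assumes 0 <= corr <= 1):
# column j is sqrt(corr) * z0 + sqrt(1 - corr) * z_j, which gives
# unit variance and pairwise correlation equal to corr.
z0 <- rnorm(n)
X_corr <- sqrt(corr) * z0 + sqrt(1 - corr) * matrix(rnorm(n * p_corr), nrow = n)
X <- cbind(X_uncorr, X_corr)

# One interaction per block, taken sequentially (the overlap = FALSE rule).
support <- list(c(1, 2), p_uncorr + c(1, 2))
betas <- c(3, -1)
signs <- matrix(1, nrow = length(support), ncol = k)       # 1 means >, -1 means <
thresholds <- matrix(0, nrow = length(support), ncol = k)
intercept <- 0

# Term i is beta_i times the indicator that every feature in the
# interaction lies on the signed side of its threshold.
terms <- sapply(seq_along(support), function(i) {
  Xi <- X[, support[[i]], drop = FALSE]
  centered <- sweep(Xi, 2, thresholds[i, ])                 # x - threshold
  active <- sweep(centered, 2, signs[i, ], `*`) > 0         # sign * (x - threshold) > 0
  betas[i] * apply(active, 1, all)
})
y <- intercept + rowSums(terms) + rnorm(n)                  # err = rnorm
```

Checking `cor(X[, 5], X[, 6])` on the simulated matrix should give a value near `corr`, while columns from the uncorrelated block should show correlations near zero.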

For more details on the LSS model, see Behr, Merle, et al. "Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests." arXiv preprint arXiv:2102.11800 (2021).

Value

A list of the named objects that were requested in return_values. See brief descriptions below.

X: A data.frame.
y: A response vector of length nrow(X).
support: A vector of feature indices indicating all features used in the true support of the DGP.
int_support: A vector of signed feature indices in the true (interaction) support of the DGP. For example, "1+_2-" means that the interaction between high values of feature 1 and low values of feature 2 appears in the underlying DGP.

Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
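
To make the signed-index notation of int_support concrete, here is a small base-R sketch that splits a string such as "1+_2-" into feature indices and signs. The helper `parse_int_support` is hypothetical and written only for this illustration; it assumes components are separated by "_" and each ends in "+" or "-", as in the example above.

```r
# Hypothetical helper (not part of dgpoix): parse one signed interaction string.
parse_int_support <- function(int) {
  parts <- strsplit(int, "_", fixed = TRUE)[[1]]   # e.g. "1+" and "2-"
  data.frame(
    feature = as.integer(sub("[+-]$", "", parts)), # strip the trailing sign
    sign = ifelse(grepl("\\+$", parts), 1, -1)     # "+" is high, "-" is low
  )
}

parsed <- parse_int_support("1+_2-")
```

Here `parsed` holds features 1 and 2 with signs 1 and -1, matching the reading "high values of feature 1 and low values of feature 2".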

Examples

# generate data from: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0), where
# X is a 100 x 10 correlated Gaussian random matrix with
# Var(X_i) = 1 for all i and Cor(X_i, X_j) = 0.7 for all i != j
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 0, p_corr = 10,
                                        k = 2, s_corr = 2, corr = 0.7,
                                        thresholds = 0, signs = 1, betas = 1)

# generate data from: y = 3 * 1(X_1 > 0, X_2 > 0) - 1(X_11 > 0, X_12 > 0) + N(0, 1),
# where X = [Z, U], Z is a 100 x 10 standard Gaussian random matrix,
# U is a 100 x 10 Gaussian random matrix with Var(U_i) = 1 and Cor(U_i, U_j) = 0.7
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 10,
                                        s_uncorr = 1, s_corr = 1, corr = 0.7,
                                        k = 2, betas = c(3, -1), err = rnorm)

# generate data from: y = \sum_{i = 1}^{4} \prod_{j = 1}^{2} 1(X_{s_j} > 0),
# where s_j \in {1:4, 11:14} are randomly selected indices, X = [Z, U],
# Z is a 100 x 10 standard Gaussian random matrix, U is a 100 x 10 Gaussian
# random matrix with Var(U_i) = 1 and Cor(U_i, U_j) = 0.7
# i.e., interactions may consist of both correlated and uncorrelated features
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 10,
                                        s_uncorr = 2, s_corr = 2, k = 2,
                                        corr = 0.7, mixed_int = TRUE)

Yu-Group/dgpoix documentation built on June 3, 2022, 1:40 a.m.