generate_y_lss: Generate locally spiky sparse (LSS) response data.
In Yu-Group/dgpoix: Generate synthetic data that is as fresh as the real thing

generate_y_lss

R Documentation

Generate locally spiky sparse (LSS) response data.

Description

Generate LSS response data with a specified error distribution given the observed data matrices.

Usage

generate_y_lss(
  X,
  k,
  s,
  thresholds = 1,
  signs = 1,
  betas = 1,
  intercept = 0,
  overlap = FALSE,
  err = NULL,
  return_support = FALSE,
  ...
)

Arguments

`X`	Data matrix or data frame.
`k`	Order of the interactions.
`s`	Number of interactions in the LSS model, or a matrix of support indices with one interaction per row and ncol = k.
`thresholds`	A scalar or an s x k matrix of thresholds, one for each feature in each interaction term of the LSS model.
`signs`	A scalar or an s x k matrix of signs, one for each feature in each interaction (1 means > and -1 means <).
`betas`	Scalar, vector, or function to generate coefficients corresponding to interaction terms. See `generate_coef()`.
`intercept`	Scalar intercept term.
`overlap`	If TRUE, simulate support indices with replacement; if FALSE, simulate support indices without replacement (so no feature appears in more than one place in the support).
`err`	Function from which to generate simulated error vector. Default is `NULL` which adds no error to the DGP.
`return_support`	Logical specifying whether or not to return a vector of the support column names. If `X` has no column names, then the indices of the support are used.
`...`	Other arguments to pass to err() to generate the error vector.

Details

Here, data is generated from the following LSS model:

$$E(Y \mid X) = \text{intercept} + \sum_{i=1}^{s} \beta_i \prod_{j=1}^{k} \mathbb{1}\left(X_{S_{ij}} \lessgtr \text{thresholds}_{ij}\right)$$

For more details on the LSS model, see Behr, Merle, et al. "Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests." arXiv preprint arXiv:2102.11800 (2021).
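
To make the model above concrete, here is a minimal base R sketch that evaluates the mean function E(Y|X) for a given support matrix. The helper name `lss_mean` is hypothetical and is not part of dgpoix; it only illustrates the formula and makes no claim about how `generate_y_lss()` is implemented internally. It assumes `support`, `thresholds`, and `signs` are scalars or s x k matrices, and it defaults `thresholds` to 0 for convenience, which differs from the default of `generate_y_lss()`.

```r
# Hypothetical helper (not part of dgpoix): evaluate the LSS mean function
# E(Y | X) = intercept + sum_i beta_i * prod_j 1(X_{S_ij} <> thresholds_ij).
lss_mean <- function(X, support, thresholds = 0, signs = 1, betas = 1,
                     intercept = 0) {
  X <- as.matrix(X)
  s <- nrow(support)
  k <- ncol(support)
  # Recycle scalar inputs to the full s x k (or length-s) shape.
  thresholds <- matrix(thresholds, nrow = s, ncol = k)
  signs <- matrix(signs, nrow = s, ncol = k)
  betas <- rep_len(betas, s)

  mu <- rep(intercept, nrow(X))
  for (i in seq_len(s)) {
    # An interaction is active only if all k of its conditions hold.
    active <- rep(TRUE, nrow(X))
    for (j in seq_len(k)) {
      x <- X[, support[i, j]]
      # sign 1 means ">" and sign -1 means "<"
      hit <- if (signs[i, j] > 0) x > thresholds[i, j] else x < thresholds[i, j]
      active <- active & hit
    }
    mu <- mu + betas[i] * active
  }
  mu
}

set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)

# Mean function for: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0)
mu <- lss_mean(X, support = matrix(1:4, nrow = 2, byrow = TRUE))
```

Adding a noise draw such as `mu + rnorm(nrow(X))` would then give a noisy response comparable in spirit to passing `err = rnorm`.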

Value

If return_support = TRUE, returns a list of three:

y: A response vector of length nrow(X).
support: A vector of feature indices indicating all features used in the true support of the DGP.
int_support: A vector of signed feature indices in the true (interaction) support of the DGP. For example, "1+_2-" means that the interaction between high values of feature 1 and low values of feature 2 appears in the underlying DGP.

If return_support = FALSE, returns only the response vector y.
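
As an illustration of the signed-index notation used for int_support, the following base R sketch builds labels such as "1+_2-" from a support matrix and a matching matrix of signs. The helper name `signed_labels` is hypothetical and not part of dgpoix, and the label format is inferred only from the "1+_2-" example above; the strings the package returns (for instance when `X` has column names) may differ.

```r
# Hypothetical helper (not part of dgpoix): build signed interaction labels
# from a support matrix (one interaction per row) and a scalar or matrix of signs.
signed_labels <- function(support, signs = 1) {
  signs <- matrix(signs, nrow = nrow(support), ncol = ncol(support))
  vapply(seq_len(nrow(support)), function(i) {
    # "+" marks high values (sign 1), "-" marks low values (sign -1)
    paste0(support[i, ], ifelse(signs[i, ] > 0, "+", "-"), collapse = "_")
  }, character(1))
}

# Interaction between high values of feature 1 and low values of feature 2
labels <- signed_labels(matrix(1:2, nrow = 1), signs = matrix(c(1, -1), nrow = 1))
print(labels)  # "1+_2-"
```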

Examples

X <- generate_X_gaussian(.n = 100, .p = 10)

# generate data from: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0)
y <- generate_y_lss(X = X, k = 2, s = matrix(1:4, nrow = 2, byrow = TRUE),
                    thresholds = 0, signs = 1, betas = 1)

# generate data from: y = 3 * 1(X_1 < 0) - 1(X_2 > 1) + N(0, 1)
y <- generate_y_lss(X = X, k = 1, 
                    s = matrix(1:2, nrow = 2),
                    thresholds = matrix(0:1, nrow = 2), 
                    signs = matrix(c(-1, 1), nrow = 2),
                    betas = c(3, -1),
                    err = rnorm)

Yu-Group/dgpoix documentation built on June 3, 2022, 1:40 a.m.