generate_y_lss: Generate locally spiky sparse (LSS) response data.

View source: R/dgp-lib-y.R

generate_y_lss    R Documentation

Description

Generate LSS response data, with a specified error distribution, from an observed data matrix.

Usage

generate_y_lss(
  X,
  k,
  s,
  thresholds = 1,
  signs = 1,
  betas = 1,
  intercept = 0,
  overlap = FALSE,
  err = NULL,
  return_support = FALSE,
  ...
)

Arguments

X

Data matrix or data frame.

k

Order of the interactions.

s

Number of interactions in the LSS model, or a matrix of support indices in which each row holds the feature indices of one interaction (so ncol = k).

thresholds

A scalar or an s x k matrix of the thresholds for each term in the LSS model.

signs

A scalar or an s x k matrix giving the sign of each term in each interaction (1 means >, while -1 means <).

betas

Scalar, vector, or function to generate coefficients corresponding to interaction terms. See generate_coef().

intercept

Scalar intercept term.

overlap

If TRUE, simulate support indices with replacement; if FALSE, simulate support indices without replacement (so no feature appears more than once in the support).

err

Function used to generate the simulated error vector. Default is NULL, which adds no error to the DGP.

return_support

Logical specifying whether or not to return a vector of the support column names. If X has no column names, then the indices of the support are used.

...

Other arguments to pass to err() to generate the error vector.
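
The effect of overlap when s is a scalar can be illustrated with a minimal base R sketch. The helper sample_support() below is hypothetical and is not part of the package; it assumes a literal reading of the overlap argument (one draw of s * k feature indices, reshaped into an s x k matrix), which may differ from what the package does internally.

```r
# Hypothetical helper (not part of the package): draw an s x k matrix of
# support indices from p features, with or without replacement.
sample_support <- function(p, s, k, overlap = FALSE) {
  # Without replacement, this requires s * k <= p.
  idx <- sample(seq_len(p), size = s * k, replace = overlap)
  # Each row holds the feature indices of one order-k interaction.
  matrix(idx, nrow = s, ncol = k, byrow = TRUE)
}

set.seed(1)
S <- sample_support(p = 10, s = 2, k = 2, overlap = FALSE)
```

With overlap = FALSE, every entry of S is distinct; with overlap = TRUE, the same feature may appear more than once.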

Details

Here, data is generated from the following LSS model:

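E(Y | X) = intercept + \sum_{i = 1}^{s} \beta_i \prod_{j = 1}^{k} 1(X_{S_{ij}} \lessgtr thresholds_{ij})

where S_{ij} is the index of the j-th feature in the i-th interaction, and the direction of each inequality (> or <) is given by signs_{ij}.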

For more details on the LSS model, see Behr, Merle, et al. "Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests." arXiv preprint arXiv:2102.11800 (2021).
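
As a minimal sketch of the model above, the noiseless mean E(Y|X) can be computed in base R as follows. The helper lss_mean() is hypothetical and is not the package's implementation; it assumes the support is given as an s x k matrix of column indices and that scalar thresholds and signs are recycled across all terms.

```r
# Hypothetical helper (not part of the package): noiseless LSS mean function.
lss_mean <- function(X, support, thresholds, signs, betas, intercept = 0) {
  X <- as.matrix(X)
  s <- nrow(support)
  k <- ncol(support)
  # Recycle scalars into s x k matrices and a length-s coefficient vector.
  thresholds <- matrix(thresholds, nrow = s, ncol = k)
  signs <- matrix(signs, nrow = s, ncol = k)
  betas <- rep_len(betas, s)

  mu <- rep(intercept, nrow(X))
  for (i in seq_len(s)) {
    active <- rep(TRUE, nrow(X))
    for (j in seq_len(k)) {
      x <- X[, support[i, j]]
      # Sign 1 means x > threshold; sign -1 means x < threshold.
      hit <- if (signs[i, j] == 1) x > thresholds[i, j] else x < thresholds[i, j]
      active <- active & hit
    }
    # The interaction contributes beta_i only where all k conditions hold.
    mu <- mu + betas[i] * active
  }
  mu
}

set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)
# Mean of: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0)
mu <- lss_mean(X, support = matrix(1:4, nrow = 2, byrow = TRUE),
               thresholds = 0, signs = 1, betas = 1)
```

Adding a draw from the error function to mu (for example, mu + rnorm(nrow(X))) would give a noisy response under this sketch.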

Value

If return_support = TRUE, returns a list with three elements:

y

A response vector of length nrow(X).

support

A vector of feature indices indicating all features used in the true support of the DGP.

int_support

A vector of signed feature indices in the true (interaction) support of the DGP. For example, "1+_2-" means that the interaction between high values of feature 1 and low values of feature 2 appears in the underlying DGP.

If return_support = FALSE, returns only the response vector y.
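
To make the int_support label format concrete, the sketch below builds strings in the "1+_2-" style from a support matrix and a sign matrix. The helper label_interactions() is hypothetical, written only to illustrate the format described above; the package may order or construct its labels differently.

```r
# Hypothetical helper (not part of the package): build signed interaction
# labels such as "1+_2-" from s x k support and sign matrices.
label_interactions <- function(support, signs) {
  s <- nrow(support)
  k <- ncol(support)
  signs <- matrix(signs, nrow = s, ncol = k)  # recycle a scalar sign
  vapply(seq_len(s), function(i) {
    # "+" marks high values (x > threshold), "-" marks low values.
    paste0(support[i, ], ifelse(signs[i, ] == 1, "+", "-"), collapse = "_")
  }, character(1))
}

labels <- label_interactions(
  support = matrix(1:4, nrow = 2, byrow = TRUE),
  signs = matrix(c(1, 1, -1, 1), nrow = 2)
)
# labels is c("1+_2-", "3+_4+")
```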

Examples

X <- generate_X_gaussian(.n = 100, .p = 10)

# generate data from: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0)
y <- generate_y_lss(X = X, k = 2, s = matrix(1:4, nrow = 2, byrow = TRUE),
                    thresholds = 0, signs = 1, betas = 1)

# generate data from: y = 3 * 1(X_1 < 0) - 1(X_2 > 1) + N(0, 1)
y <- generate_y_lss(X = X, k = 1, 
                    s = matrix(1:2, nrow = 2),
                    thresholds = matrix(0:1, nrow = 2), 
                    signs = matrix(c(-1, 1), nrow = 2),
                    betas = c(3, -1),
                    err = rnorm)


Yu-Group/dgpoix documentation built on June 3, 2022, 1:40 a.m.