correlated_lss_gaussian_dgp | R Documentation |
Generate normally-distributed covariates that are potentially correlated and LSS response data with a specified error distribution.
correlated_lss_gaussian_dgp( n, p_uncorr, p_corr, s_uncorr = p_uncorr, s_corr = p_corr, corr, k, thresholds = 0, signs = 1, betas = 1, intercept = 0, overlap = FALSE, mixed_int = FALSE, err = NULL, data_split = FALSE, train_prop = 0.5, return_values = c("X", "y", "support"), ... )
n |
Number of samples. |
p_uncorr |
Number of uncorrelated features. |
p_corr |
Number of features in correlated group. |
s_uncorr |
Number of interactions from features in uncorrelated group. |
s_corr |
Number of interactions from features in correlated group. |
corr |
Correlation between features in correlated group. |
k |
Order of the interactions. |
thresholds |
A scalar or a s x k matrix of the thresholds for each term in the LSS model. |
signs |
A scalar or a s x k matrix of the sign of each interaction (1 means > while -1 means <). |
betas |
Scalar, vector, or function to generate coefficients corresponding to interaction terms. See generate_coef(). |
intercept |
Scalar intercept term. |
overlap |
If TRUE, simulate support indices with replacement; if FALSE, simulate support indices without replacement (so no overlap). |
mixed_int |
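If TRUE, interactions may consist of both correlated and uncorrelated features; if FALSE, each interaction consists of only correlated features or only uncorrelated features. |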
err |
Function from which to generate simulated error vector. Default is NULL, which adds no error to the DGP. |
data_split |
Logical; if TRUE, splits the data into training and test sets according to train_prop. |
train_prop |
Proportion of data in training set if data_split = TRUE. |
return_values |
Character vector indicating what objects to return in list. Elements in vector must be one of "X", "y", "support", "int_support". |
... |
Other arguments to pass to err() to generate the error vector. |
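To make the data_split and train_prop arguments concrete, here is a minimal base R sketch of a random train/test split. The toy data, the variable names X_all, y_all, and train_ids, and the use of a simple random split are all assumptions for illustration; the function's internal splitting logic may differ.

```r
set.seed(1)
n <- 100
train_prop <- 0.5

# Hypothetical toy data standing in for the simulated X and y.
X_all <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y_all <- as.numeric(X_all$x1 > 0 & X_all$x2 > 0)

# Assumption: rows are assigned to the training set uniformly at random.
train_ids <- sample(n, size = round(train_prop * n))

# Training data go in "X" and "y"; held-out data go in "Xtest" and "ytest".
out <- list(
  X     = X_all[train_ids, , drop = FALSE],
  y     = y_all[train_ids],
  Xtest = X_all[-train_ids, , drop = FALSE],
  ytest = y_all[-train_ids]
)
```

With train_prop = 0.5 and n = 100, each of the two sets in this sketch holds 50 rows.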
Data is generated via:
y = intercept + \sum_{i = 1}^{s} \beta_i \prod_{j = 1}^{k} 1(X_{S_j} \lessgtr thresholds_{ij}) + err(...),
where
X = [X_uncorr, X_corr], X_uncorr is an (uncorrelated) standard Gaussian random matrix, and X_corr is a correlated Gaussian random matrix with variance 1 and Cor(X_corr_i, X_corr_j) = corr for all i != j. If overlap = TRUE, then the true interaction support is randomly chosen from the (p_uncorr + p_corr) features in X. If overlap = FALSE, then the true interaction support is sequentially taken from the first s_uncorr*k features in X_uncorr and the first s_corr*k features in X_corr.
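The generating model above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the function's implementation: the correlated block uses a one-factor construction (valid for 0 <= corr <= 1), the support follows the overlap = FALSE rule with unmixed interactions, and names such as shared, S, and signal are hypothetical.

```r
set.seed(1)
n <- 100; p_uncorr <- 10; p_corr <- 10
s_uncorr <- 1; s_corr <- 1; k <- 2
corr <- 0.7; intercept <- 0; betas <- c(3, -1); thresholds <- 0

# Uncorrelated block: i.i.d. standard Gaussian entries.
X_uncorr <- matrix(rnorm(n * p_uncorr), nrow = n)

# Correlated block (assumed one-factor construction): a shared factor plus
# independent noise gives variance 1 and pairwise correlation `corr`.
shared <- rnorm(n)
X_corr <- sqrt(corr) * shared +
  sqrt(1 - corr) * matrix(rnorm(n * p_corr), nrow = n)

X <- cbind(X_uncorr, X_corr)

# overlap = FALSE: take the first s_uncorr * k columns of X_uncorr and the
# first s_corr * k columns of X_corr; each row of S is one order-k interaction.
support <- c(seq_len(s_uncorr * k), p_uncorr + seq_len(s_corr * k))
S <- matrix(support, ncol = k, byrow = TRUE)

# Each LSS term is 1 when all k features in the interaction exceed the threshold.
lss_terms <- sapply(seq_len(nrow(S)), function(i) {
  as.numeric(apply(X[, S[i, ], drop = FALSE] > thresholds, 1, all))
})

signal <- intercept + as.vector(lss_terms %*% betas)
y <- signal + rnorm(n)  # err = rnorm
```

Here S has rows (1, 2) and (11, 12), so the noiseless signal is 3 * 1(X_1 > 0, X_2 > 0) - 1(X_11 > 0, X_12 > 0).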
For more details on the LSS model, see Behr, Merle, et al. "Provable Boolean Interaction Recovery from Tree Ensemble obtained via Random Forests." arXiv preprint arXiv:2102.11800 (2021).
A list of the named objects that were requested in return_values. See brief descriptions below.
X |
A data.frame. |
y |
A response vector of length nrow(X). |
support |
A vector of feature indices indicating all features used in the true support of the DGP. |
int_support |
A vector of signed feature indices in the true (interaction) support of the DGP. For example, "1+_2-" means that the interaction between high values of feature 1 and low values of feature 2 appears in the underlying DGP. |
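As a rough illustration of how signs, thresholds, and the signed support labels relate, the base R sketch below builds the label "1+_2-" for one interaction and evaluates it on toy data. The data and the names S, label, active, and interaction_on are hypothetical, and the labelling code is an assumption about the format described above rather than the function's own code.

```r
# Hypothetical toy data: 4 samples, 2 features.
X <- matrix(c( 0.5, -1.2,  2.0, 0.1,
              -0.3,  0.4, -1.5, 0.9), ncol = 2)

# One interaction (s = 1) of order k = 2 on features 1 and 2.
S          <- matrix(c(1, 2),   nrow = 1)
thresholds <- matrix(c(0, 0.5), nrow = 1)
signs      <- matrix(c(1, -1),  nrow = 1)  # 1 means >, -1 means <

# Signed label: "+" for high values (>), "-" for low values (<).
label <- paste0(S[1, ], ifelse(signs[1, ] == 1, "+", "-"), collapse = "_")

# Evaluate each component of the interaction, then require all to hold.
active <- sapply(seq_len(ncol(S)), function(j) {
  x <- X[, S[1, j]]
  if (signs[1, j] == 1) x > thresholds[1, j] else x < thresholds[1, j]
})
interaction_on <- apply(active, 1, all)
```

In this sketch label is "1+_2-", and the interaction is active for samples 1 and 3, where feature 1 is above 0 and feature 2 is below 0.5.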
Note that if data_split = TRUE and "X", "y" are in return_values, then the returned list also contains slots for "Xtest" and "ytest".
# generate data from: y = 1(X_1 > 0, X_2 > 0) + 1(X_3 > 0, X_4 > 0), where
# X is a 100 x 10 correlated Gaussian random matrix with
# Var(X_i) = 1 for all i and Cor(X_i, X_j) = 0.7 for all i != j
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 0, p_corr = 10,
                                        k = 2, s_corr = 2, corr = 0.7,
                                        thresholds = 0, signs = 1, betas = 1)

# generate data from: y = 3 * 1(X_1 > 0, X_2 > 0) - 1(X_11 > 0, X_12 > 0) + N(0, 1),
# where X = [Z, U], Z is a 100 x 10 standard Gaussian random matrix,
# U is a 100 x 10 Gaussian random matrix with Var(U_i) = 1 and Cor(U_i, U_j) = 0.7
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 10,
                                        s_uncorr = 1, s_corr = 1, corr = 0.7,
                                        k = 2, betas = c(3, -1), err = rnorm)

# generate data from: y = \sum_{i = 1}^{4} \prod_{j = 1}^{2} 1(X_{s_j} > 0),
# where s_j \in {1:4, 11:14} are randomly selected indices, X = [Z, U],
# Z is a 100 x 10 standard Gaussian random matrix, U is a 100 x 10 Gaussian
# random matrix with Var(U_i) = 1 and Cor(U_i, U_j) = 0.7
# i.e., interactions may consist of both correlated and uncorrelated features
sim_data <- correlated_lss_gaussian_dgp(n = 100, p_uncorr = 10, p_corr = 10,
                                        s_uncorr = 2, s_corr = 2, k = 2,
                                        corr = 0.7, mixed_int = TRUE)