genLatentData | R Documentation |
Generates a data set consisting of latent features Z; observed features X (which may include noisy or noiseless observations of the latent features in Z); an observed response y, which is a linear model of features from Z and X plus independent mean-zero noise; and mu (the response y without the added noise). Data are generated in the same way as in the simulations from Faletto and Bien (2022).
genLatentData(
  n,
  p,
  k_unclustered,
  cluster_size,
  n_clusters = 1,
  sig_clusters = 1,
  rho = 0.9,
  var = 1,
  beta_latent = 1.5,
  beta_unclustered = 1,
  snr = as.numeric(NA),
  sigma_eps_sq = as.numeric(NA)
)
n |
Integer or numeric; the number of observations to generate. (The generated X and Z will have n rows, and the generated y and mu will have length n.) |
p |
Integer or numeric; the number of features to generate. The generated X will have p columns. |
k_unclustered |
Integer or numeric; the number of features in X that will have nonzero coefficients in the true model for y among those features not generated from the n_clusters latent variables (called "weak signal" features in the simulations from Faletto and Bien 2022). The coefficients on these features will be determined by beta_unclustered. |
cluster_size |
Integer or numeric; for each of the n_clusters latent variables, X will contain cluster_size noisy proxies that are correlated with the latent variable. |
n_clusters |
Integer or numeric; the number of latent variables to generate, each of which will be associated with an observed cluster in X. Must be at least 1. Default is 1. |
sig_clusters |
Integer or numeric; the number of generated latent features that will have nonzero coefficients in the true model for y (all of them will have coefficient beta_latent). Must be less than or equal to n_clusters. Default is 1. |
rho |
Integer or numeric; the covariance of the proxies in each cluster with the latent variable (and each other). Note that the correlation between the features in the cluster will be rho/var. Default is 0.9. |
var |
Integer or numeric; the variance of all of the observed features in X (both the proxies for the latent variables and the k_unclustered other features). Default is 1. |
beta_latent |
Integer or numeric; the coefficient used for all sig_clusters latent variables that have nonzero coefficients in the true model for y. Default is 1.5. |
beta_unclustered |
Integer or numeric; the maximum coefficient in the model for y among the k_unclustered features in X not generated from the latent variables. The coefficients of the features will be beta_unclustered/sqrt(1:k_unclustered). Default is 1. |
snr |
Integer or numeric; the signal-to-noise ratio of the response y. If snr is specified, the variance of the noise in y will be calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Exactly one of snr and sigma_eps_sq must be specified. Default is NA. |
sigma_eps_sq |
Integer or numeric; the variance of the noise added to y. Exactly one of snr and sigma_eps_sq must be specified. Default is NA. |
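The formulas quoted in the argument descriptions above can be checked by hand in base R. The following is a minimal sketch that does not call genLatentData itself; the values of k_unclustered, snr, and the mu vector are made up for illustration, and var_x is a stand-in name for the var argument.

```r
# Illustrative values; rho, var_x, and beta_unclustered match the documented
# defaults, while k_unclustered, snr, and mu are assumptions for this sketch.
rho <- 0.9
var_x <- 1            # stands in for the `var` argument
beta_unclustered <- 1
k_unclustered <- 4

# Correlation between the proxies in a cluster, as stated for rho: rho / var.
proxy_cor <- rho / var_x

# Coefficients on the "weak signal" features:
# beta_unclustered / sqrt(1:k_unclustered), so the first is the largest.
weak_coefs <- beta_unclustered / sqrt(seq_len(k_unclustered))

# Noise variance implied by a signal-to-noise ratio:
# sigma_eps_sq = sum(mu^2) / (n * snr), using a made-up mu vector.
mu <- c(1, -2, 0.5, 3, -1.5)
n <- length(mu)
snr <- 2
sigma_eps_sq <- sum(mu^2) / (n * snr)
```

With these inputs the weak-signal coefficients decay from 1 to 0.5, and a larger snr yields a proportionally smaller noise variance.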
A list of the following elements.
X |
An n x p matrix of n observations from a p-dimensional multivariate normal distribution generated using the specified parameters. The first n_clusters times cluster_size features will be the clusters of features correlated with the n_clusters latent variables. The next k_unclustered features will be the "weak signal" features, and the remaining p - n_clusters*cluster_size - k_unclustered features will be the unclustered noise features. |
y |
The response generated from X, the latent features from Z, and the coefficient vector. |
Z |
The latent features; either a numeric vector (if n_clusters = 1) or a numeric matrix (if n_clusters > 1). |
mu |
The expected response given X, Z, and the true coefficient vector (equal to y minus the added noise). |
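The column layout of X described above can be sketched as index ranges in base R. The sizes below are hypothetical, chosen only for illustration. The final block calls genLatentData only if the function is already available in the session (for example, after loading the package that provides it), so the sketch runs either way.

```r
# Hypothetical sizes chosen for illustration.
n <- 50
p <- 20
n_clusters <- 2
cluster_size <- 3
k_unclustered <- 4

# Column ranges of X, following the layout described for the X element:
# cluster proxies first, then "weak signal" features, then noise features.
n_proxy <- n_clusters * cluster_size
proxy_cols <- seq_len(n_proxy)
weak_cols <- n_proxy + seq_len(k_unclustered)
noise_cols <- setdiff(seq_len(p), c(proxy_cols, weak_cols))

# Runs only if genLatentData is already loaded in the current session.
if (exists("genLatentData", mode = "function")) {
  set.seed(1)
  dat <- genLatentData(n = n, p = p, k_unclustered = k_unclustered,
                       cluster_size = cluster_size, n_clusters = n_clusters,
                       sig_clusters = 1, snr = 3)
  # Dimensions documented for the returned list.
  stopifnot(nrow(dat$X) == n, ncol(dat$X) == p, length(dat$y) == n)
}
```

Here the proxies occupy the first n_clusters*cluster_size columns, the weak-signal features the next k_unclustered columns, and the remaining columns are unclustered noise features.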
Gregory Faletto, Jacob Bien
Faletto, G., & Bien, J. (2022). Cluster Stability Selection. arXiv preprint arXiv:2201.00494. https://arxiv.org/abs/2201.00494.