knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The cheapknockoff package contains the implementations of a procedure for performing feature selection when features have costs. The basic premise is to construct multiple knockoffs for each feature, forcing the more expensive feature to compete with more knockoffs. This makes the more expensive feature harder to be selected unless it shows significant association with the response given its cost. The details of our proposal can be found in Yu, Witten, and Bien (2019) Controlling Costs: Feature Selection on a Budget.
We consider the following linear model, $$ y = \sum_{j = 1}^p X_j \beta^\ast_j + \varepsilon, $$ where $X \sim N(\mathbf{0}, \mathbf{I}_p)$, and $\varepsilon \sim N(0, 1)$. The following code simulates $n = 100$ observation from the model above with $p = 30$.
library(cheapknockoff) set.seed(123) n <- 100 p <- 30 x <- matrix(data = rnorm(n * p), nrow = n, ncol = p) y <- x[, 1] - 2 * x[, 2] + rnorm(n)
Additionally, we consider the setting where these features have costs. For example, we set the costs for the two relevant features $X_1$ and $X_2$ as $2$ and $9$, and the costs for other features are randomly selected integer values between $2$ and $9$.
omega <- c(2, 9, sample(seq(2, 9), size = 28, replace = TRUE))
cheapknockoff functionsPerforming the cost-conscious feature selection procedure using cheapknockoff package involves the following three steps:
The function multiple_knockoff_Gaussian constructs multiple knockoffs by assuming that the design matrix follows a multivariate Gaussian distribution with known mean and covariance matrix (which, in this simulation setting, are $\mathbf{0}$ and $\mathbf{I}$ respectively). It also requires the input of feature costs.
X_k <- multiple_knockoff_Gaussian(X = x, mu = rep(0, p), Sigma = diag(1, p), omega = omega)
Constructing multiple knockoffs involves solving a optimization problem. There are three formulations of the involved optimization problem, which can be specified as an input in knockoff_Guassian. For details, see
Gimenez & Zhou (2018) Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization.
With the constructed multiple knockoffs, the next step is to compute the test statistics used to perform feature selection.
In particular, the function stat_glmnet_coef computes the statistics based on the absolute value of the coefficient estimate from glmnet fit.
For further details, see Section 2.2 of Yu, Witten, and Bien (2019) Controlling Costs: Feature Selection on a Budget.
stat <- cheapknockoff::stat_glmnet_coef(X = x, X_k = X_k, y = y, omega = omega)
The output of stat_glmnet_coef is a list of three components.
kappa is the vector of indices of winner for each variable competing with its multiple knockoff counterparts. kappa[j] = 1 indicates that the original variable is beating all of its knockoff counterparts, and kappa[j] not equal to 1 means otherwise.
tau is a vector of scores determining the order for which we consider to include variables into the model.
Finally, score_total is the original glmnet coefficient estimates for each variable and its knockoff counterparts.
Finally, the following function generate_path yields a path of selected variables.
path <- cheapknockoff::generate_path(kappa = stat$kappa, tau = stat$tau)
The output path is a list of selected variable sets. For example,
path[1]
is $X_1$. The more relavant feature $X_2$ (in the sense that $X_2$ has larger coefficient) is later included in
path[2]
It is because $X_2$ is much more expensive than $X_1$. This relects the premise of our proposal on making use of cheap features over expensive features.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.