knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

In [@seqknockoffpaper] we presented sequential knockoffs and performed extensive simulations to evaluate performance of the methodology. In the following we demonstrate code that can be used to reproduce the preprint's results. In particular, we provide functions for generating data sets, simulating knockoffs (MX and sequential), applying the (single and multiple) knockoff filter for variable selection, visualizing selections, and finally evaluating performance in terms of false discovery rate and true positive rate.

Package main functions

library(seqknockoff)

Data generation {#simdata}

Simulation of $X$: The generate_X function simulates the rows of an $n \times p$ data frame $X$ independently from a multivariate Gaussian distribution with mean $0$ and $p \times p$ covariance matrix $$ \Sigma_{ij} = \left { \begin{array}{lr} 1{i = j}/n, & \text{Independent,} \ \rho^{1{i \neq j}}/n, & \text{Equicorrelated,} \ \rho^{|i-j|}/n, & \text{AR1}, \end{array} \right. $$ where $p_b$ randomly selected columns are then dichotomized with the indicator function $\delta(x)=1(x > 0)$.

# Generate a 2000 x 30 Gaussian data.frame under AR1(rho=0.5) covariance structure, 
# with 10 of the columns  dichotomized
X <- generate_X(n=2000, p=30, p_b=10, cov_type = "cov_ar1", rho=0.5)

The covariance type is specified with the parameter cov_type and the correlation coefficient with rho. Columns are shuffled after simulation. Each column of the resulting data.frame is either of class "numeric" (for the continuous columns) or "factor" (for the binary columns).

Simulation of $y|X$: The generate_y function simulates a response $y$ from the (sparse) regression model $$ y = X \beta + \varepsilon, \textrm{ where } \varepsilon \sim N(0,I_n), $$ where the first p_nn regression coefficients are non-zero, all other are set to zero. The (common) amplitude of the non-zero regression coefficients is specified with a.

# Generate y ~ N(X%*%beta,I_n) where first 10 beta-coefficients are = 3.5, all other = 0.
y <- generate_y(X = X, p_nn=10, a=3.5)

Inside generate_y the binarized columns of X are first scaled by a $\sqrt{n}$-factor to ensure that the marginal (column-wise) variances of the model.matrix of X are all equal to $1/n$.

Knockoff generation

Sequential knockoffs: The function knockoffs_seq is the most important function of the package. It receives as input a data.frame (or tibble) X whose columns are either of class "numeric" (for the continuous columns) or "factor" (for the categorical columns). This is a common format of data involving both continuous and categorical predictors. The output is a data.frame (or tibble) corresponding to the sequential knockoffs of X:

# Simulate sequential knockoff of X:
Xk <- knockoffs_seq(X)

The above function will by default sample sequential knockoffs based on LASSO regression. Behind the scenes, knockoffs_seq calls the function glmnet::cv.glmnet sequentially and the user may optionally input other parameters of the glmnet::cv.glmnet function. For example, the below code will sample sequential knockoffs based on an elastic-net with penalty parameter $\alpha=0.5$

# Simulate sequential knockoff of X based on sequential elastic-net with penalty parameter alpha.
Xk <- knockoffs_seq(X, alpha=0.5)

MX-knockoffs: We also offer a wrapper function knockoffs_mx to simulate Gaussian MX-knockoffs (via call to knockoff::create.second_order). This function only works on data frames with all "numeric" columns.

# Generate a 2000 x 30 Gaussian data.frame under AR1(rho=0.5) covariance structure, 
X <- generate_X(n=2000, p=30, p_b=0, cov_type = "cov_ar1", rho=0.5)
# Simulate second order multivariate Gaussian MX knockoff:
Xk <- knockoffs_mx(X)

Knockoff filter

The function knockoff_filter implements the The knockoff filter procedure, which takes as input X, y, fdr, family and the additional parameters knockoffs and statistic. The parametrization is inspired by the knockoff::knockoff.filter function and allows the user to specify multiple FDR thresholds fdr. The parameter knockoffs represents a user-specified knockoff simulating function. The default is knockoffs=knockoffs_seq for general data frames X with numeric and factor columns, but can also be knockoffs=knockoffs_mx for data frames with only numeric columns. The parameter statistic is a user-specified function to calculate feature statistics to distinguish between the original and knockoff variables. By default statistic=stat_glmnet (importance statistics based on GLM involving the model.matrix of X; which internally calls the knockoff::stat.glmnet_coefdiff function). See ?stat_glmnet for details.

set.seed(123)
X <- generate_X(n=2000, p=30, p_b=10, cov_type = "cov_ar1", rho=0.5)
y <- generate_y(X = X, p_nn=10, a=3.5)
S <- knockoff_filter(X, y, fdr=c(0.05, 0.1, 0.2))
S

The output is a list of selections from the different thresholds. If only a single fdr threshold is provided the output is simply a vector of selected indices.

Multiple knockoffs filter

In order to evaluate robustness of the knockoff filter we may run it several times, each time simulating a new knockoff of X. Since the knockoff_filter may take long to run, in particular with sequential knockoffs (knockoffs=knockoffs_seq), we recommend to use parallel computing. Here we demonstrate how one could use the R-package clustermq for that purpose. Note that the user can specify a cluster.scheduler that is system dependent. On our HPC we utilize the LSF job scheduler, but other options (including local parallelization) are outlined here: User Guide - clustermq

library(clustermq)

The user-friendly Q-function allows us to run the knockoff_filter with fdr=0.2 $50$ times in parallel:

set.seed(123)
X <- generate_X(n=2000, p=30, p_b=0, cov_type = "cov_ar1", rho=0.5)
y <- generate_y(X = X, p_nn=10, a=3.5)
S <- Q(knockoff_filter, fdr = rep(0.2, 50), const = list(X=X, y=y, knockoffs=knockoffs_mx, 
       statistic=stat_glmnet), pkgs="seqknockoff", n_jobs=50, seed=1)

The output S is a list of selected variable indices across the $50$ replicates. The above call to the Q-function is the parallel equivalent to the following inefficient loop (not run):

S <- list()
for (count in 1:50) {
  S[[count]] <- knockoff_filter(X, y, fdr=0.2, knockoffs=knockoffs_mx, statistic=stat_glmnet)
}

To obtain a single final variable selection we use the heuristics in The multiple knockoffs filter procedure, which is implemented in the function multi_select

# Obtain a single set of selected variables from the list of knockoff selections:
multi_select(S, p=30) # p is the total number of variables

Heatmap of multiple variable selections

In order to evaluate the robustness of the knockoff selection procedure we can visualize a heatmap of the selections across the knockoff replicates. The function plot_heatmap facilitates this visualization and takes as input the above list of selection indicies along with actual variable labels. The output is a $p \times B$ binary heatmap, where rows correspond to variables and each column $b$ records which variables were selected with knockoff filter replicate $b=1,\dots,B$.

plot_heatmap(S, labels=paste0("x",1:30))

Note the message "Co-Clustering successfully terminated!". We apply co-clustering of the rows and columns of the heatmap with the blockcluster package, which identifies blocks of similarities in the binary selections. We then order the blocks according to increasing mean number of selections. This helps with aesthetics, but also helps visualize the most important variables that should tend towards the top of the heatmap. The plot_heatmap function will order the heatmap rows/columns according to importance only if the blockcluster and dplyr packages are installed.

Simulation experiment (computationally intensive)

Consider an experiment aimed at comparing variable selection performance between the MX and sequential knockoff filters. We simulate a gaussian design matrix $X$ with $n$ rows and $p$ columns, and AR1($\rho$) correlation structure. We then simulate response $y \sim N(X\beta,I_n)$ where the first $p_{nn}$ of the $\beta$-coefficients are equal to $a$, while all other coefficients are equal to zero. We then apply the knockoff_filter with both MX and sequential knockoffs. Finally, we record False Discovery Proportions (with function eval_fdp) and True Positive Proportions (with function eval_tpp).

We implement a function that performs the above for a single simulated data set $(X,y)$:

sim_experiment <- function(n, p, cov_type="cov_ar1", rho, p_nn, a) {
  # Generate Gaussian design matrix:
  X <- generate_X(n=n, p=p, p_b=0, cov_type="cov_ar1", rho=rho)
  # Simulate y given X:
  y <- generate_y(X=X, p_nn=p_nn, a=a)
  # Apply MX and sequential knockoff filters:
  S.mx <- knockoff_filter(X=X, y=y, fdr=0.2, knockoffs=knockoffs_mx)
  S.seq <- knockoff_filter(X=X, y=y, fdr=0.2, knockoffs=knockoffs_seq)
  # Evaluate fdp and tpp for each method:
  results <- data.frame(method = c("MX Knockoff", "Seq Knockoff"),
                        fdp = c(eval_fdp(S.mx, negatives=(p_nn+1):p), 
                                eval_fdp(S.seq, negatives=(p_nn+1):p)),
                        tpp = c(eval_tpp(S.mx, positives=1:p_nn), 
                                eval_tpp(S.seq, positives=1:p_nn)),
                        a = a)

  return(results)

}

We then utilize the clustermq package to run the above function in parallel for different amplitudes $a \in {1,2,3,4,5}$ and across $n_{sim}=100$ replicates per amplitude. We assume AR1 covariance structure and fix $n=2,000$, $p=200$, $\rho=0.5$, and $p_{nn}=50$.

# Recommended to run only on HPC:
results_list <- Q(sim_experiment, a = rep(1:5, each=100), const = list(n=2000, p=200, 
                  cov_type="cov_ar1", rho=0.5, p_nn=50), pkgs="seqknockoff", n_jobs=500, seed=1)
results <- do.call("rbind", results_list)

We can now compare boxplots of false discovery proportions (fdp) and and true positive proportions (tpp) as a function of amplitude.

library(ggplot2)
ggplot(mapping=aes(x=factor(a), y=fdp), data=results) + 
  geom_boxplot(aes(fill=method)) + 
  stat_summary(aes(group=method, colour=method), fun.y="mean", geom="line") +
  scale_y_continuous(breaks=c(0, 0.2, 0.4, 0.6, 0.8, 1), minor_breaks = NULL) +
  xlab("Amplitude") +
  ggtitle("False Discovery Rate Curves") +
  theme(legend.position="bottom")
ggplot(mapping=aes(x=factor(a), y=tpp), data=results) + 
  geom_boxplot(aes(fill=method)) + 
  stat_summary(aes(group=method, colour=method), fun.y="mean", geom="line") +
  xlab("Amplitude") +
  ggtitle("Power Curves") +
  theme(legend.position="bottom")

The overlaid mean curves are estimates of false discovery rate (fdr) and power.

Appendix {#appendix}

A. The knockoff filter procedure {#knockoff-filter-procedure}

The Knockoff-filter is defined by the following steps:

  1. Simulate a knockoff copy $\tilde{X}$ of the original data $X$.
  2. Compute feature statistics $W_j=w_j((X,\tilde{X}),Y)$ to distinguish between $X_j$ and its knockoff $\tilde{X}_j$. Large, positive statistics indicate association with $Y$.
  3. Use the knockoffs$+$ procedure to select variables $j$ that fulfill $W_j \geq \tau_+$ where $$ \tau_+ = \underset{t>0}{\operatorname{argmin}} \left{\frac{1 + |{j : W_j \leq t}|}{|{j : W_j \leq t}|} \leq q\right}. $$

B. The multiple knockoffs filter procedure {#multiple-knockoffs-filter-procedure}

Let $\tilde{X}_1, \dots, \tilde{X}_B$ denote $B$ independent knockoff copies of $X$. For each knockoff copy $b$ run the knockoff filter and select the set of influential variables, $S_b \subseteq {1,\dots,p}$. We propose the following heuristics to select a final set of variables:

The first step above essentially filters out variables that don't appear more than $(100\cdot r)\%$ of the time, which would seem like a reasonable requirement in practice (e.g. with $r=0.5$). The second step above then filters the $B$ knockoff selections $S_b$ accordingly and searches for the most frequent variable set among those. The third step then establishes the final selection, namely the most liberal variable selection among the sets ${S(r):r \geq 0.5}$.

References



kormama1/seqknockoff documentation built on April 11, 2021, 7:44 a.m.