stabilityGLM: Computes the area under the stability path for all covariates

Description Usage Arguments Details Value References See Also Examples

View source: R/stability.R

Description

To perform model selection, this function scores all covariates in X according to the area under their stability selection paths. Our model selection procedure starts by dynamically defining a grid for the elastic net penalization parameter lambda. To define the grid, we solve the full-dataset elastic net. This yields n_lambda log-scaled values between lambda_max and lambda_min. lambda_max is the maximum value for which the elastic net support is not empty. On the other hand, lambda_min can be derived through lambda_min_ratio, which is the ratio of lambda_min to lambda_max. The next step is identical to the original stability selection procedure. For each value of lambda, we solve n_subsample times the same elastic net, though for a different subsample. The subsample is a random selection of half of the samples of the original dataset. The empirical frequency of each covariate entering the support is then the number of times the covariate is selected in the support as a fraction of n_subsample. We obtain the stability path by associating each value of lambda with the corresponding empirical frequency. The final scores are the areas under the stability path curves. This is a key difference with the original stability selection procedure where the final score is the maximum empirical frequency. On simulations, our scoring technique outperformed maximum empirical frequencies.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
stabilityGLM(
  X,
  Y,
  weights = rep(1, nrow(X)),
  family = "gaussian",
  n_subsample = 20,
  n_lambda = 100,
  short = TRUE,
  lambda_min_ratio = 0.01,
  eps = 1e-05
)

Arguments

X

input design matrix

Y

response vector

weights

nonnegative sample weights

family

response type. Either 'gaussian' or 'binomial'

n_subsample

number of subsamples for stability selection

n_lambda

total number of lambda values

short

whether to compute the aucs only on the first half of the stability path. We observed better performance for thresholded paths

lambda_min_ratio

ratio of lambda_min to lambda_max (see description for a thorough explanation)

eps

elastic net mixing parameter.

Details

For a fixed lambda, the L2 penalization is lambda * eps, while the L1 penalization is lambda * (1-eps). The goal of the L2 penalization is to ensure the uniqueness of the solution. For that reason, we recommend setting eps << 1.

Value

a vector containing the areas under the stability path curves

References

Slim, L., Chatelain, C., Azencott, C.-A., & Vert, J.-P. (2018). Novel Methods for Epistasis Detection in Genome-Wide Association Studies. BioRxiv.

Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417–473.

Haury, A. C., Mordelet, F., Vera-Licona, P., & Vert, J. P. (2012). TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Systems Biology, 6.

See Also

glmnet-package

Other support estimation functions: stabilityBIG()

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
# ---- Continuous data ----
n <- 50
p <- 20
X <- matrix(rnorm(n * p), ncol = p)
Y <- crossprod(t(X), rnorm(p))
aucs_cont <- stabilityGLM(X, Y,
  family = "gaussian", n_subsample = 1,
  short = FALSE
)

# ---- Binary data ----
X <- matrix(rnorm(n * p), ncol = p)
Y <- runif(n, min = 0, max = 1) < 1 / (1 + exp(-X[, c(1, 7, 15)] %*% rnorm(3)))
weights <- runif(n, min = 0.4, max = 0.8)
aucs_binary <- stabilityGLM(X, Y,
  weights = weights,
  n_lambda = 50, lambda_min_ratio = 0.05, n_subsample = 1
)

EpiSlim/epiGWAS documentation built on Nov. 19, 2019, 7:15 p.m.