sven: Selection of variables with embedded screening using Bayesian...

Description Usage Arguments Details Value Author(s) References Examples

View source: R/sven.R

Description

SVEN is an approach to selecting variables with embedded screening using a Bayesian hierarchical model. It is also a variable selection method in the spirit of the stochastic shotgun search algorithm. However, by embedding a unique model based screening and using fast Cholesky updates, SVEN produces a highly scalable algorithm to explore gigantic model spaces and rapidly identify the regions of high posterior probabilities. It outputs the log (unnormalized) posterior probability of a set of best (highest probability) models. For more details, see Li et al. (2020).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
sven(
  X,
  y,
  w = sqrt(nrow(X))/ncol(X),
  lam = nrow(X)/ncol(X)^2,
  Ntemp = 3,
  Tmax = (log(log(ncol(X))) + log(ncol(X))),
  Miter = 50,
  wam.threshold = 0.5,
  log.eps = -16,
  verbose = TRUE
)

Arguments

X

The n\times p covariate matrix without intercept. The following classes are supported: matrix and dgCMatrix. Every care is taken not to make copies of this (typically) giant matrix. No need to center or scale this matrix manually. Scaling is performed implicitly and regression coefficient are returned on the original scale.

y

The response vector of length n. No need to center or scale.

w

The prior inclusion probability of each variable. Default: sqrt(n)/p.

lam

The slab precision parameter. Default: n/p^2 as suggested by the theory of Li et al. (2020).

Ntemp

The number of temperatures. Default: 3.

Tmax

The maximum temperature. Default: \log\log p+\log p.

Miter

The number of iterations per temperature. Default: 50.

wam.threshold

The threshold probability to select the covariates for WAM. A covariate will be included in WAM if its corresponding marginal inclusion probability is greater than the threshold. Default: 0.5.

log.eps

The tolerance to choose the number of top models. See detail. Default: -16.

verbose

If TRUE, the function prints the current temperature SVEN is at; the default is TRUE.

Details

SVEN is developed based on a hierarchical Gaussian linear model with priors placed on the regression coefficients as well as on the model space as follows:

y | X, β_0,β,γ,σ^2,w,λ \sim N(β_01 + X_γβ_γ,σ^2I_n)

β_i|β_0,γ,σ^2,w,λ \stackrel{indep.}{\sim} N(0, γ_iσ^2/λ),~i=1,…,p,

(β_0,σ^2)|γ,w,p \sim p(β_0,σ^2) \propto 1/σ^2

γ_i|w,λ \stackrel{iid}{\sim} Bernoulli(w)

where X_γ is the n \times |γ| submatrix of X consisting of those columns of X for which γ_i=1 and similarly, β_γ is the |γ| subvector of β corresponding to γ. Degenerate spike priors on inactive variables and Gaussian slab priors on active covariates makes the posterior probability (up to a normalizing constant) of a model P(γ|y) available in explicit form (Li et al., 2020).

The variable selection starts from an empty model and updates the model according to the posterior probability of its neighboring models for some pre-specified number of iterations. In each iteration, the models with small probabilities are screened out in order to quickly identify the regions of high posterior probabilities. A temperature schedule is used to facilitate exploration of models separated by valleys in the posterior probability function, thus mitigate posterior multimodality associated with variable selection models. The default maximum temperature is guided by the asymptotic posterior model selection consistency results in Li et al. (2020).

SVEN provides the maximum a posteriori (MAP) model as well as the weighted average model (WAM). WAM is obtained in the following way: (1) keep the best (highest probability) K distinct models γ^{(1)},…,γ^{(K)} with

\log P≤ft(γ^{(1)}|y\right) ≥ \cdots ≥ \log P≤ft(γ^{(K)}|y\right)

where K is chosen so that \log ≤ft\{P≤ft(γ^{(K)}|y\right)/P≤ft(γ^{(1)}|y\right)\right\} > \code{log.eps}; (2) assign the weights

w_i = P(γ^{(i)}|y)/∑_{k=1}^K P(γ^{(k)}|y)

to the model γ^{(i)}; (3) define the approximate marginal inclusion probabilities for the jth variable as

\hatπ_j = ∑_{k=1}^K w_k I(γ^{(k)}_j = 1).

Then, the WAM is defined as the model containing variables j with \hatπ_j > \code{wam.threshold}. SVEN also provides all the top K models which are stored in an p \times K sparse matrix, along with their corresponding log (unnormalized) posterior probabilities.

Value

A list with components

model.map

A vector of indices corresponding to the selected variables in the MAP model.

model.wam

A vector of indices corresponding to the selected variables in the WAM.

model.top

A sparse matrix storing the top models.

beta.map

The ridge estimator of regression coefficients in the MAP model.

beta.wam

The ridge estimator of regression coefficients in the WAM.

mip.map

The marginal inclusion probabilities of the variables in the MAP model.

mip.wam

The marginal inclusion probabilities of the variables in the WAM.

pprob.map

The log (unnormalized) posterior probability corresponding to the MAP model.

pprob.top

A vector of the log (unnormalized) posterior probabilities corresponding to the top models.

stats

Additional statistics.

Author(s)

Dongjin Li and Somak Dutta
Maintainer: Dongjin Li <dongjl@iastate.edu>

References

Li, D., Dutta, S., Roy, V.(2020) Model Based Screening Embedded Bayesian Variable Selection for Ultra-high Dimensional Settings http://arxiv.org/abs/2006.07561

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
n=50; p=100; nonzero = 3
trueidx <- 1:3
nonzero.value <- 5
TrueBeta <- numeric(p)
TrueBeta[trueidx] <- nonzero.value

rho <- 0.6
x1 <- matrix(rnorm(n*p), n, p)
X <- sqrt(1-rho)*x1 + sqrt(rho)*rnorm(n)
y <- 0.5 + X %*% TrueBeta + rnorm(n)
res <- sven(X=X, y=y)
res$model.map # the MAP model
res$model.wam # the WAM
res$mip.map # the marginal inclusion probabilities of the variables in the MAP model
res$mip.wam # the marginal inclusion probabilities of the variables in the WAM
res$pprob.top # the log (unnormalized) posterior probabilities corresponding to the top models.

res$beta.map # the ridge estimator of regression coefficients in the MAP model 
res$beta.wam # the ridge estimator of regression coefficients in the WAM

dongjli/bravo documentation built on Sept. 20, 2021, 3:33 a.m.