spikeslab: Spike and Slab Regression

View source: R/spikeslab.R

spikeslabR Documentation

Spike and Slab Regression

Description

Fits a rescaled spike and slab model using a continuous bimodal prior. A generalized elastic net estimator is used for variable selection and estimation. Can be used for prediction and variable selection in low- and high-dimensional linear regression models.

Usage

spikeslab(formula, data = NULL, x = NULL, y = NULL,
    n.iter1 = 500, n.iter2 = 500, mse = TRUE,
    bigp.smalln = FALSE, bigp.smalln.factor = 1, screen = (bigp.smalln),
    r.effects = NULL, max.var = 500, center = TRUE, intercept = TRUE,
    fast = TRUE, beta.blocks = 5, verbose = FALSE, ntree = 300,
    seed = NULL, ...)

Arguments

formula

A symbolic description of the model to be fit.

data

Data frame containing the data used in the formula.

x

x predictor matrix (can be used in place of formula and data frame call).

y

y response (can be used in place of formula and data frame call).

n.iter1

Number of burn-in Gibbs sampled values (i.e., discarded values).

n.iter2

Number of Gibbs sampled values, following burn-in.

mse

If TRUE, an external estimate for the overall variance is calculated using ridge regression or random forests (the latter is used when the degrees of freedom are low). Otherwise, the variance is included in the prior and estimated using Gibbs sampling.

bigp.smalln

Use if p >> n.

bigp.smalln.factor

Removes all variables except the top ntimes bigp.smalln.factor ones (used in filtering when p >> n).

screen

If TRUE, variables are pre-filtered.

r.effects

List used for grouping variables (see details below).

max.var

Maximum number of variables allowed in the final model.

center

If TRUE, variables are centered by their means. Default is TRUE and should only be adjusted in extreme examples.

intercept

If TRUE, an intercept is included in the model, otherwise no intercept is included. Default is TRUE.

fast

If TRUE, use blocked Gibbs sampling to accelerate the algorithm.

beta.blocks

Update beta using this number of blocks (fast must be TRUE).

verbose

If TRUE, verbose output is sent to the terminal.

ntree

Number of trees used by random forests (applies only when mse is TRUE).

seed

Seed for random number generator. Must be a negative integer.

...

Further arguments passed to or from other methods.

Details

—> General:

The spike and slab method is described in detail in Ishwaran and Rao (2003, 2005a, 2005b and 2009). For high-dimensional problems in which p >> n, where p is the number of variables and n is the sample size, use the option bigp.smalln=TRUE. Doing so implements a three-stage procedure:

(1) Filtering step. This removes all variables except the top n times bigp.smalln.factor ones. Uses spike and slab regression with grouped regularization (complexity) parameters.

(2) Model averaging step. Refit the model using only those predictors from step 1. Returns the posterior mean values from fitting a spike and slab model; referred to as the Bayesian model averaged (bma) estimate.

(3) Variable selection step. Select variables using the generalized elastic net (gnet).

The filtering step is omitted when bigp.smalln=FALSE. Filtering can however be requested by setting screen=TRUE although users should be aware that this may degrade performance and should only be used when p is on the same order of n.

Variables can be grouped using r.effects. Grouping has the effect of forcing variables within a given group to share a common complexity (regularization) parameter. To do so, define a list with each entry in the list made up of the variable names to be grouped. There is no limit to the number of groups. Any variable that does not appear in the list will be assigned to a default group (the default group also has its own group-specific regularization parameter). See Examples 1 and 3 below.

—> Miscellanea:

By default, fast=TRUE when bigp.smalln=TRUE. This invokes an ultra-fast filtering step. Setting fast=FALSE invokes a more thorough filtering method that may slightly improve inferential results, but computational times will become very slow. The trade-off is unlikely to be justified.

The formula and data-frame call should be avoided in high-dimensional problems and instead the x-predictor matrix and y response vector should be passed directly (see Example 3). This avoids the huge overhead in parsing formula in R.

By default, predictors are normalized to have mean 0 and variance 1. Pre-processing also involves centering y unless the user specifically requests that the intercept be excluded from the model. Users can also over-ride centering predictors by setting center=FALSE. Use with extreme care.

The verbose option sends output to the terminal showing the number of Gibbs iterations and the current complexity (regularization) parameter(s).

Depends on the randomForest package for estimating the variance when mse=TRUE. Note that mse is over-ridden and set to FALSE when bigp.smalln=TRUE.

Depends on the lars package for the variable slection step.

Value

An object of class spikeslab with the following components:

summary

Summary object.

verbose

Verbose details (used for printing).

terms

Terms.

sigma.hat

Estimated variance.

y

Original y.

xnew

Centered, rescaled x-matrix.

x

Original x-matrix.

y.center

Centering for original y.

x.center

Centering for original x-matrix.

x.scale

Scaling for original x-matrix.

names

Variable names.

bma

bma coefficients in terms of xnew.

bma.scale

bma coefficients rescaled in terms of original x.

gnet

gnet coefficients in terms of xnew.

gnet.scale

gnet coefficients rescaled in terms of original x.

gnet.path

gnet path scaled in terms of the original x.

gnet.obj

gnet object (a lars-type object).

gnet.obj.vars

Variables (in order) used to calculate the gnet object.

gnet.parms

Generalized ridge regression parameters used to define the gnet.

phat

Estimated model dimension.

complexity

Complexity (regularization) parameter estimates.

ridge

List containing ridge values used to determine the bma.

models

List containing the models sampled.

Author(s)

Hemant Ishwaran (hemant.ishwaran@gmail.com)

J. Sunil Rao (rao.jsunil@gmail.com)

Udaya B. Kogalur (ubk@kogalur.com)

References

Breiman L. (2001). Random forests, Machine Learning, 45:5-32.

Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression (with discussion). Ann. Statist., 32:407-499.

Ishwaran H. and Rao J.S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Stat. Assoc., 98:438-455.

Ishwaran H. and Rao J.S. (2005a). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist., 33:730-773.

Ishwaran H. and Rao J.S. (2005b). Spike and slab gene selection for multigroup microarray data. J. Amer. Stat. Assoc., 100:764-780.

Ishwaran H. and Rao J.S. (2010). Generalized ridge regression: geometry and computational solutions when p is larger than n.

Ishwaran H., Kogalur U.B. and Rao J.S. (2010). spikeslab: prediction and variable selection using spike and slab regression. R Journal, 2(2), 68-73.

Ishwaran H. and Rao J.S. (2011). Mixing generalized ridge regressions.

Zou H. and Hastie T. (2005). Regularization and variable selection via the elastic net. J. Royal Statist. Society B, 67(2):301-320.

See Also

cv.spikeslab, plot.spikeslab, predict.spikeslab, print.spikeslab, sparsePC.spikeslab.

Examples


#------------------------------------------------------------
# Example 1:  diabetes data
#------------------------------------------------------------

# basic call
data(diabetesI, package = "spikeslab")
obj <- spikeslab(Y ~ . , diabetesI, verbose=TRUE)
print(obj)
plot(obj)

# grouping effect
# separate main effects and interactions into two groups
# use a group-specific regularization parameter for each group
xnames <- names(diabetesI[, -1])
r.eff <- vector("list", 2)
r.eff[[1]] <- xnames[c(1:10)]
r.eff[[2]] <- xnames[-c(1:10)]
obj2 <- spikeslab(Y ~ . , diabetesI, verbose=TRUE, r.effects=r.eff)
obj2
# extract the regularization parameters
print(apply(obj2$complexity, 2, summary))

## Not run: 
#------------------------------------------------------------
# Example 2: high-dimensional noise (diabetes data)
#------------------------------------------------------------

# add 2000 noise variables
data(diabetesI, package = "spikeslab")
diabetes.noise <- cbind(diabetesI,
      noise = matrix(rnorm(nrow(diabetesI) * 2000), nrow(diabetesI)))

# example of a big p, small n call
# don't use formula call; make call with x and y arguments
x <- diabetes.noise[, -1]
y <- diabetes.noise[, 1]
obj <- spikeslab(x=x, y=y, verbose=TRUE, bigp.smalln=TRUE, max.var=100)
obj

# same example ... but now group variables 
r.eff <- vector("list", 2)
r.eff[[1]] <- names(x)[c(1:100)]
r.eff[[2]] <- names(x)[-c(1:100)]
obj2 <- spikeslab(x=x, y=y, verbose=TRUE, bigp.smalln=TRUE,
                 r.effects=r.eff, max.var=100)
obj2

#------------------------------------------------------------
# Example 3: housing data with interactions
#------------------------------------------------------------

# another example of a big p, small n call
data(housingI, package = "spikeslab")
obj <- spikeslab(medv ~ ., housingI, verbose = TRUE,
           bigp.smalln = TRUE, max.var = 200)
print(obj)



## End(Not run)

spikeslab documentation built on April 27, 2022, 1:05 a.m.