Description Usage Arguments Details Value References See Also Examples
Selection of influential variables or model components with error control.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | ## generic stability selection funcion
stabsel(x, ...)
## a method to fit models with stability selection
## S3 method for class 'matrix'
stabsel(x, y, fitfun = lars.lasso,
args.fitfun = list(), cutoff, q, PFER,
folds = subsample(rep(1, nrow(x)), B = B),
B = ifelse(sampling.type == "MB", 100, 50),
assumption = c("unimodal", "r-concave", "none"),
sampling.type = c("SS", "MB"),
papply = mclapply, mc.preschedule = FALSE,
verbose = TRUE, FWER, eval = TRUE, ...)
## essentially a wrapper for data.frames (see details)
## S3 method for class 'data.frame'
stabsel(x, y, intercept = FALSE, ...)
|
x |
a |
y |
a vector or matrix containing the outcome. |
intercept |
logical. If |
fitfun |
a function that takes the arguments |
args.fitfun |
a named list containing additional arguments that are
passed to the fitting function; see also argument |
cutoff |
cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used. |
q |
number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample. |
PFER |
upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details. |
folds |
a weight matrix with number of rows equal to the number
of observations, see |
assumption |
Defines the type of assumptions on the
distributions of the selection probabilities and simultaneous
selection probabilities. Only applicable for
|
sampling.type |
use sampling scheme of of Shah & Samworth
(2013), i.e., with complementarty pairs ( |
B |
number of subsampling replicates. Per default, we use 50 complementary pairs for the error bounds of Shah & Samworth (2013) and 100 for the error bound derived in Meinshausen & Buehlmann (2010). As we use B complementray pairs in the former case this leads to 2B subsamples. |
papply |
(parallel) apply function, defaults to
|
mc.preschedule |
preschedule tasks if |
verbose |
logical (default: |
FWER |
deprecated. Only for compatibility with older versions, use PFER instead. |
eval |
logical. Determines whether stability selection is
evaluated ( |
... |
additional arguments to parallel apply methods such as
|
This function implements the stability selection procedure by
Meinshausen and Buehlmann (2010) and the improved error bounds by Shah
and Samworth (2013). For details see also Hofner et al. (2014). The
error bounds are implemented in the function
stabsel_parameters
. Two of the three arguments
cutoff
, q
and PFER
must be specified. The
per-family error rate (PFER), i.e., the expected number of false
positives E(V), where V is the number of false positives,
is bounded by the argument PFER
.
As controlling the PFER is more conservative as controlling the
family-wise error rate (FWER), the procedure also controlls the FWER,
i.e., the probability of selecting at least one non-influential
variable (or model component) is less than PFER
.
Predefined fitfuns
functions exist but more can be
easily implemented. Note that stepwise regression methods are usually
not advised as they tend to be relatively unstable. See example below.
The function stabsel
for data.frame
s is
essentially just a wrapper to the matrix
function with
the same argments. The only difference is that in a pre-processing
step, the data set is converted to a model matrix using the function
model.matrix
. The additional argument intercept
determines if an explicit intercept should be added to the model
matrix. This is often not neccessary but depends on the fitfun
.
An object of class stabsel
with a special print
method.
The object has the following elements:
phat |
selection probabilities. |
selected |
elements with maximal selection probability greater
|
max |
maximum of selection probabilities. |
cutoff |
cutoff used. |
q |
average number of selected variables used. |
PFER |
(realized) upper bound for the per-family error rate. |
specifiedPFER |
specified upper bound for the per-family error rate. |
p |
the number of effects subject to selection. |
B |
the number of subsamples. |
sampling.type |
the sampling type used for stability selection. |
assumption |
the assumptions made on the selection probabilities. |
call |
the call. |
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi: 10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
stabsel_parameters
for the computation of error bounds,
stabsel.stabsel
for the fast re-computation of
parameters of a fitted stabsel
object,
fitfun
for available fitting functions and
plot.stabsel
for available plot functions
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 |
if (require("TH.data")) {
## make data set available
data("bodyfat", package = "TH.data")
} else {
## simulate some data if TH.data not available.
## Note that results are non-sense with this data.
bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10)
}
## set seed
set.seed(1234)
####################################################################
### using stability selection with Lasso methods:
if (require("lars")) {
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.lasso, cutoff = 0.75,
PFER = 1))
(stab.stepwise <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.stepwise, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.lasso, main = "Lasso")
plot(stab.stepwise, main = "Stepwise Selection")
## --> stepwise selection seems to be quite unstable even in this low
## dimensional example!
}
## set seed (again to make results comparable)
set.seed(1234)
if (require("glmnet")) {
(stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.glmnet, main = "Lasso (glmnet)")
if (exists("stab.lasso"))
plot(stab.lasso, main = "Lasso (lars)")
}
## Select variables with maximum coefficients based on lasso estimate
set.seed(1234) # reset seed
if (require("glmnet")) {
## use cross-validated lambda
lambda.min <- cv.glmnet(x = as.matrix(bodyfat[, -2]), y = bodyfat[,2])$lambda.min
(stab.maxCoef <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso_maxCoef,
# specify additional parameters to fitfun
args.fitfun = list(lambda = lambda.min),
cutoff = 0.75, PFER = 1))
## WARNING: Using a fixed penalty (lambda) is usually not permitted and
## not sensible. See ?fitfun for details.
## now compare standard lasso with "maximal parameter estimates" from lasso
par(mfrow = c(2, 1))
plot(stab.maxCoef, main = "Lasso (glmnet; Maximum Coefficients)")
plot(stab.glmnet, main = "Lasso (glmnet)")
## --> very different results.
}
####################################################################
### using stability selection directly on computed boosting models
### from mboost
if (require("mboost")) {
### low-dimensional example
mod <- glmboost(DEXfat ~ ., data = bodyfat)
## compute cutoff ahead of running stabsel to see if it is a sensible
## parameter choice.
## p = ncol(bodyfat) - 1 (= Outcome) + 1 ( = Intercept)
stabsel_parameters(q = 3, PFER = 1, p = ncol(bodyfat) - 1 + 1,
sampling.type = "MB")
## the same:
stabsel(mod, q = 3, PFER = 1, sampling.type = "MB", eval = FALSE)
### Do not test the following code per default on CRAN as it takes some time to run:
## now run stability selection
(sbody <- stabsel(mod, q = 3, PFER = 1, sampling.type = "MB"))
opar <- par(mai = par("mai") * c(1, 1, 1, 2.7))
plot(sbody)
par(opar)
plot(sbody, type = "maxsel", ymargin = 6)
}
|
Loading required package: parallel
Loading required package: TH.data
Loading required package: survival
Loading required package: MASS
Attaching package: 'TH.data'
The following object is masked from 'package:MASS':
geyser
Loading required package: lars
Loaded lars 1.2
Loading required package: glmnet
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-16
Loading required package: mboost
This is mboost 2.9-1. See 'package?mboost' and 'news(package = "mboost")'
for a complete list of changes.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.