cv_varsel  R Documentation 
Run the search part and the evaluation part for a projection predictive
variable selection. The search part determines the predictor ranking (also
known as solution path), i.e., the best submodel for each submodel size
(number of predictor terms). The evaluation part determines the predictive
performance of the submodels along the predictor ranking. In contrast to
varsel(), cv_varsel() performs a cross-validation (CV) by running the
search part with the training data of each CV fold separately (an exception
is explained in section "Note" below) and by running the evaluation part on
the corresponding test set of each CV fold.
cv_varsel(object, ...)
## Default S3 method:
cv_varsel(object, ...)
## S3 method for class 'vsel'
cv_varsel(object, ...)
## S3 method for class 'refmodel'
cv_varsel(
  object,
  method = "forward",
  cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold",
  ndraws = NULL,
  nclusters = 20,
  ndraws_pred = 400,
  nclusters_pred = NULL,
  refit_prj = !inherits(object, "datafit"),
  nterms_max = NULL,
  penalty = NULL,
  verbose = TRUE,
  nloo = NULL,
  K = if (!inherits(object, "datafit")) 5 else 10,
  cvfits = object$cvfits,
  lambda_min_ratio = 1e-05,
  nlambda = 150,
  thresh = 1e-06,
  regul = 1e-04,
  validate_search = TRUE,
  seed = NA,
  search_terms = NULL,
  parallel = getOption("projpred.prll_cv", FALSE),
  ...
)
object 
An object of class 
... 
Arguments passed to 
method 
The method for the search part. Possible options are

cv_method 
The CV method, either 
ndraws 
Number of posterior draws used in the search part. Ignored if

nclusters 
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of 
ndraws_pred 
Only relevant if 
nclusters_pred 
Only relevant if 
refit_prj 
For the evaluation part, should the submodels along the
predictor ranking be fitted again ( 
nterms_max 
Maximum submodel size (number of predictor terms) up to
which the search is continued. If 
penalty 
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of 
verbose 
A single logical value indicating whether to print out additional information during the computations. 
nloo 
Caution: Still experimental. Only relevant if 
K 
Only relevant if 
cvfits 
Only relevant if 
lambda_min_ratio 
Only relevant for L1 search. Ratio between the smallest and largest lambda in the L1-penalized search. This parameter essentially determines how long the search is carried out, i.e., how large the explored submodels get. There is no need to change this unless the program gives a warning about it. 
nlambda 
Only relevant for L1 search. Number of values in the lambda grid for the L1-penalized search. There is no need to change this unless the program gives a warning about it. 
thresh 
Only relevant for L1 search. Convergence threshold when computing the L1 path. Usually, there is no need to change this. 
regul 
A number giving the amount of ridge regularization when projecting onto (i.e., fitting) submodels which are GLMs. Usually there is no need for regularization, but sometimes we need to add some regularization to avoid numerical problems. 
validate_search 
Only relevant if 
seed 
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument 
search_terms 
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( 
parallel 
A single logical value indicating whether to run costly parts
of the CV in parallel ( 
Arguments ndraws, nclusters, nclusters_pred, and ndraws_pred are
automatically truncated at the number of posterior draws in the reference
model (which is 1 for datafits). Using fewer draws or clusters in ndraws,
nclusters, nclusters_pred, or ndraws_pred than there are posterior draws in
the reference model may result in slightly inaccurate projection
performance. Increasing these arguments affects the computation time
linearly.
For argument method, there are some restrictions: For a reference model
with multilevel or additive formula terms or a reference model set up for
the augmented-data projection, only the forward search is available.
Furthermore, argument search_terms requires a forward search to take
effect.
L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.
An L1 search may select an interaction term before all involved lower-order interaction terms (including main-effect terms) have been selected. In projpred versions > 2.6.0, the resulting predictor ranking is automatically modified so that the lower-order interaction terms come before this interaction term, but if this is conceptually undesired, choose the forward search instead.
The elements of the search_terms character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the + symbol. To understand how
these building blocks work, it is important to know how projpred's
forward search works: It starts with an empty vector chosen which will
later contain already selected predictor terms. Then, the search iterates
over model sizes j \in \{0, ..., J\} (with J denoting the maximum submodel
size, not counting the intercept). The candidate models at model size j
are constructed from those elements from search_terms which yield model
size j when combined with the chosen predictor terms. Note that sometimes,
there may be no candidate models for model size j. Also note that
internally, search_terms is expanded to include the intercept ("1"), so the
first step of the search (model size 0) always consists of the
intercept-only model as the only candidate.
As a search_terms example, consider a reference model with formula
y ~ x1 + x2 + x3. Then, to ensure that x1 is always included in the
candidate models, specify
search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
(or, in a simpler way that leads to the same results,
search_terms = c("x1", "x1 + x2", "x1 + x3"), for which helper function
force_search_terms() exists). This search would start with y ~ 1 as the
only candidate at model size 0. At model size 1, y ~ x1 would be the only
candidate. At model size 2, y ~ x1 + x2 and y ~ x1 + x3 would be the
two candidates. At the last model size of 3, y ~ x1 + x2 + x3 would be
the only candidate. As another example, to exclude x1 from the search,
specify search_terms = c("x2", "x3", "x2 + x3") (or, in a simpler way
that leads to the same results, search_terms = c("x2", "x3")).
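The candidate construction described above can be sketched in base R. This is only an illustration of the rule "keep those elements of search_terms which yield model size j when combined with the chosen terms"; it is not projpred's internal code. The helpers split_terms() and candidates_at_size() are hypothetical names introduced here, and the sketch ignores the intercept as well as the actual choice among candidates (which projpred bases on the projection).

```r
# Hypothetical helper (not part of projpred): split a building block such as
# "x1 + x2" into its individual predictor terms c("x1", "x2").
split_terms <- function(block) {
  trimws(strsplit(block, "+", fixed = TRUE)[[1]])
}

# Hypothetical helper (not part of projpred): return the candidate models
# (as sorted character vectors of predictor terms) at model size `j`, given
# the already selected terms in `chosen`.
candidates_at_size <- function(search_terms, chosen, j) {
  # Combine each building block with the already chosen terms.
  combined <- lapply(search_terms, function(block) {
    union(chosen, split_terms(block))
  })
  # Keep only those combinations that have exactly `j` predictor terms.
  keep <- vapply(combined, length, integer(1)) == j
  unique(lapply(combined[keep], sort))
}

search_terms <- c("x1", "x1 + x2", "x1 + x3")

# Model size 1: nothing has been chosen yet.
size1 <- candidates_at_size(search_terms, chosen = character(0), j = 1)
# Model size 2: assume x1 was selected at size 1.
size2 <- candidates_at_size(search_terms, chosen = "x1", j = 2)
# Model size 3: assume x2 was selected at size 2.
size3 <- candidates_at_size(search_terms, chosen = c("x1", "x2"), j = 3)

print(size1)
print(size2)
print(size3)
```

With these inputs, the sketch yields x1 as the only candidate at size 1, the two candidates x1 + x2 and x1 + x3 at size 2, and x1 + x2 + x3 as the only candidate at size 3, in line with the example above.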
An object of class vsel. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the main
vignette and projpred-package).
If validate_search is FALSE, the search is not included in the CV
so that only a single full-data search is run.
For PSIS-LOO CV, projpred calls loo::psis() (or, exceptionally,
loo::sis(), see below) with r_eff = NA. This is only a problem if there
was extreme autocorrelation between the MCMC iterations when the reference
model was built. In those cases, however, the reference model should not
have been used anyway, so we don't expect projpred's r_eff = NA to
be a problem.
PSIS cannot be used if the draws have different (i.e., nonconstant) weights or if the number of draws is too small. In such cases, projpred resorts to standard importance sampling (SIS) and throws a warning about this. Throughout the documentation, the term "PSIS" is used even though in fact, projpred resorts to SIS in these special cases.
With parallel = TRUE, costly parts of projpred's CV are run in
parallel. Costly parts are the fold-wise searches and performance
evaluations in case of validate_search = TRUE. (Note that in case of
K-fold CV, the K reference model refits are not affected by
argument parallel; only projpred's CV is affected.) The
parallelization is powered by the foreach package. Thus, any parallel
(or sequential) backend compatible with foreach can be used, e.g.,
the backends from packages doParallel, doMPI, or
doFuture. For GLMs, this CV parallelization should work reliably, but
for other models (such as GLMMs), it may lead to excessive memory usage
which in turn may crash the R session (on Unix systems, setting an
appropriate memory limit via unix::rlimit_as() may avoid crashing the
whole machine). However, the problem of excessive memory usage is less
pronounced for the CV parallelization than for the projection
parallelization described in projpred-package. In that regard, the CV
parallelization is recommended over the projection parallelization.
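As a minimal sketch of how the default of argument parallel is resolved, the following base R code sets the option projpred.prll_cv (the option name is taken from the usage above) and reads it back the same way the signature does. The backend registration mentioned in the comment is an assumption about a typical setup and is deliberately not executed here.

```r
# Temporarily set the option that the usage section shows as the default for
# argument `parallel`; keep the previous state so that it can be restored.
old_opts <- options(projpred.prll_cv = TRUE)

# This mirrors how the default of `parallel` is resolved in the signature.
parallel_default <- getOption("projpred.prll_cv", FALSE)
print(parallel_default)

# Assumption: before calling cv_varsel(), a foreach-compatible backend would
# be registered at this point, for example with
# doParallel::registerDoParallel(cores = 2).

# Restore the previous state of the option.
options(old_opts)
```

Setting the option once per session avoids passing parallel = TRUE to every cv_varsel() call; passing the argument explicitly always takes precedence over the option.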
Magnusson, Måns, Michael Andersen, Johan Jonasson, and Aki Vehtari. 2019. "Bayesian Leave-One-Out Cross-Validation for Large Data." In Proceedings of the 36th International Conference on Machine Learning, edited by Kamalika Chaudhuri and Ruslan Salakhutdinov, 97:4244–53. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v97/magnusson19a.html.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. "Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC." Statistics and Computing 27 (5): 1413–32. doi:10.1007/s11222-016-9696-4.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2022. "Pareto Smoothed Importance Sampling." arXiv. doi:10.48550/arXiv.1507.02646.
varsel()
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
  y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
  QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876
)
# Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and
# `nclusters_pred`, but only for the sake of speed in this example; this is
# not recommended in general):
cvvs <- cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2,
                  nterms_max = 3, nclusters_pred = 10, seed = 5555)
# Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
# and `?ranking` for possible post-processing functions.