suggest_size | R Documentation |
This function can suggest an appropriate submodel size based on a decision
rule described in section "Details" below. Note that this decision is quite
heuristic and should be interpreted with caution. It is recommended to
examine the results via plot.vsel()
, cv_proportions()
,
plot.cv_proportions()
, and/or summary.vsel()
and to make the final
decision based on what is most appropriate for the problem at hand.
suggest_size(object, ...)
## S3 method for class 'vsel'
suggest_size(
object,
stat = "elpd",
pct = 0,
type = "upper",
thres_elpd = NA,
warnings = TRUE,
...
)
object |
An object of class |
... |
Arguments passed to |
stat |
Performance statistic (i.e., utility or loss) used for the
decision. See argument |
pct |
A number giving the proportion (not percents) of the relative null model utility one is willing to sacrifice. See section "Details" below for more information. |
type |
Either |
thres_elpd |
Only relevant if |
warnings |
Mainly for internal use. A single logical value indicating
whether to throw warnings if automatic suggestion fails. Usually there is
no reason to set this to |
In general (beware of special extensions below), the suggested model
size is the smallest model size j \in \{0, 1, ...,
\texttt{nterms\_max}\}
for which either the
lower or upper bound (depending on argument type
) of the
normal-approximation (or bootstrap; see argument stat
) confidence
interval (with nominal coverage 1 - alpha
; see argument alpha
of
summary.vsel()
) for U_j - U_{\mathrm{base}}
(with
U_j
denoting the j
-th submodel's true utility and
U_{\mathrm{base}}
denoting the baseline model's true utility)
falls above (or is equal to)
\texttt{pct} \cdot (u_0 -
u_{\mathrm{base}})
where u_0
denotes the null
model's estimated utility and u_{\mathrm{base}}
the baseline
model's estimated utility. The baseline model is either the reference model
or the best submodel found (see argument baseline
of summary.vsel()
).
If !is.na(thres_elpd)
and stat = "elpd"
, the decision rule above is
extended: The suggested model size is then the smallest model size j
fulfilling the rule above or u_j - u_{\mathrm{base}} >
\texttt{thres\_elpd}
. Correspondingly, in case
of stat = "mlpd"
(and !is.na(thres_elpd)
), the suggested model size is
the smallest model size j
fulfilling the rule above or u_j -
u_{\mathrm{base}} > \frac{\texttt{thres\_elpd}}{N}
with N
denoting the number of observations.
For example (disregarding the special extensions in case of
!is.na(thres_elpd)
with stat = "elpd"
or stat = "mlpd"
),
alpha = 2 * pnorm(-1)
, pct = 0
, and type = "upper"
means that we
select the smallest model size for which the upper bound of the
1 - 2 * pnorm(-1)
(approximately 68.3%) confidence interval for
U_j - U_{\mathrm{base}}
exceeds (or is equal to) zero,
that is (if stat
is a performance statistic for which the normal
approximation is used, not the bootstrap), for which the submodel's utility
estimate is at most one standard error smaller than the baseline model's
utility estimate (with that standard error referring to the utility
difference).
Apart from the two summary.vsel()
arguments mentioned above (alpha
and
baseline
), resp_oscale
is another important summary.vsel()
argument
that may be passed via ...
.
A single numeric value, giving the suggested submodel size (or NA
if the suggestion failed).
The intercept is not counted by suggest_size()
, so a suggested size of
zero stands for the intercept-only model.
Loss statistics like the root mean squared error (RMSE) and the mean
squared error (MSE) are converted to utilities by multiplying them by -1
,
so a call such as suggest_size(object, stat = "rmse", type = "upper")
finds the smallest model size whose upper confidence interval bound for the
negative RMSE or MSE exceeds the cutoff (or, equivalently, has the lower
confidence interval bound for the RMSE or MSE below the cutoff). This is
done to make the interpretation of argument type
the same regardless of
argument stat
.
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The "stanreg" fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Run varsel() (here without cross-validation, with L1 search, and with small
# values for `nterms_max` and `nclusters_pred`, but only for the sake of
# speed in this example; this is not recommended in general):
vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10,
seed = 5555)
print(suggest_size(vs))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.