summary.vsel: Summary of a 'varsel()' or 'cv_varsel()' run
In stan-dev/projpred: Projection Predictive Feature Selection

summary.vsel

R Documentation

Summary of a `varsel()` or `cv_varsel()` run

Description

This is the summary() method for vsel objects (returned by varsel() or cv_varsel()). Apart from some general information about the varsel() or cv_varsel() run, it shows the full-data predictor ranking, basic information about the (CV) variability in the ranking of the predictors (if available; inferred from cv_proportions()), and estimates for user-specified predictive performance statistics. For a graphical representation, see plot.vsel(). For extracting the predictive performance results printed at the bottom of the output created by this summary() method, see performances().

Usage

## S3 method for class 'vsel'
summary(
  object,
  nterms_max = NULL,
  stats = "elpd",
  type = c("mean", "se", "diff", "diff.se"),
  deltas = FALSE,
  alpha = 2 * pnorm(-1),
  baseline = if (!inherits(object$refmodel, "datafit")) "ref" else "best",
  resp_oscale = TRUE,
  cumulate = FALSE,
  ...
)

Arguments

`object`	An object of class `vsel` (returned by `varsel()` or `cv_varsel()`).
`nterms_max`	Maximum submodel size (number of predictor terms) for which the performance statistics are calculated. Using `NULL` is effectively the same as `length(ranking(object)$fulldata)`. Note that `nterms_max` does not count the intercept, so use `nterms_max = 0` for the intercept-only model. For `plot.vsel()`, `nterms_max` must be at least `1`.
`stats`	One or more character strings determining which performance statistics (i.e., utilities or losses) to estimate based on the observations in the evaluation (or "test") set (in case of cross-validation, these are all observations because they are partitioned into multiple test sets; in case of `varsel()` with `d_test = NULL`, these are again all observations because the test set is the same as the training set). Available statistics are: `"elpd"`: expected log (pointwise) predictive density (for a new dataset) (ELPD). Estimated by the sum of the observation-specific log predictive density values (with each of these predictive density values being a—possibly weighted—average across the parameter draws). For the corresponding confidence interval, a normal approximation is used. `"mlpd"`: mean log predictive density (MLPD), that is, the ELPD divided by the number of observations. For the corresponding confidence interval, a normal approximation is used. `"gmpd"`: geometric mean predictive density (GMPD), that is, `exp()` of the MLPD. The GMPD is especially helpful for discrete response families (because there, the GMPD is bounded by zero and one). For the corresponding standard error, the delta method is used. The corresponding confidence interval type is "exponentiated normal approximation" because the confidence interval bounds are the exponentiated confidence interval bounds of the MLPD. `"mse"`: mean squared error (only available in the situations mentioned in section "Details" below). For the corresponding confidence interval, a log-normal approximation is used if `deltas` is `FALSE` and a normal approximation is used if `deltas` is `TRUE`. `"rmse"`: root mean squared error (only available in the situations mentioned in section "Details" below). For the corresponding standard error, the delta method is used. For the corresponding confidence interval, a log-normal approximation is used if `deltas` is `FALSE` and a normal approximation is used if `deltas` is `TRUE`. `"R2"`: R-squared, i.e., coefficient of determination (only available in the situations mentioned in section "Details" below). For the corresponding standard error, the delta method is used. For the corresponding confidence interval, a normal approximation is used. `"acc"` (or its alias, `"pctcorr"`): classification accuracy (only available in the situations mentioned in section "Details" below). By "classification accuracy", we mean the proportion of correctly classified observations. For this, the response category ("class") with highest probability (the probabilities are model-based) is taken as the prediction ("classification") for an observation. For the corresponding confidence interval, a normal approximation is used. `"auc"`: area under the ROC curve (only available in the situations mentioned in section "Details" below). For the corresponding standard error and lower and upper confidence interval bounds, bootstrapping is used. Not supported in case of subsampled LOO-CV (see argument `nloo` of `cv_varsel()`).
`type`	One or more items from `"mean"`, `"se"`, `"lower"`, `"upper"`, `"diff"`, and `"diff.se"` indicating which of these to compute for each item from `stats` (mean, standard error, lower and upper confidence interval bounds, mean difference to the corresponding statistic of the reference model, and standard error of this difference, respectively; note that for the GMPD, `"diff"`, and `"diff.se"` actually refer to the ratio vs. the reference model, not the difference). The confidence interval bounds belong to confidence intervals with (nominal) coverage `1 - alpha`. Items `"diff"` and `"diff.se"` are only supported if `deltas` is `FALSE`.
`deltas`	If `TRUE`, the submodel statistics are estimated relatively to the baseline model (see argument `baseline`). For the GMPD, the term "relatively" refers to the ratio vs. the baseline model (i.e., the submodel statistic divided by the baseline model statistic). For all other `stats`, "relatively" refers to the difference from the baseline model (i.e., the submodel statistic minus the baseline model statistic).
`alpha`	A number determining the (nominal) coverage `1 - alpha` of the confidence intervals. For example, in case of a normal-approximation confidence interval, `alpha = 2 * pnorm(-1)` corresponds to a confidence interval stretching by one standard error on either side of the point estimate.
`baseline`	For `summary.vsel()`: Only relevant if `deltas` is `TRUE`. For `plot.vsel()`: Always relevant. Either `"ref"` or `"best"`, indicating whether the baseline is the reference model or the best submodel found (in terms of `stats[1]`), respectively. In case of subsampled LOO-CV, `baseline = "best"` is not supported.
`resp_oscale`	Only relevant for the latent projection. A single logical value indicating whether to calculate the performance statistics on the original response scale (`TRUE`) or on latent scale (`FALSE`).
`cumulate`	Passed to argument `cumulate` of `cv_proportions()`. Affects column `cv_proportions_diag` of the summary table.
`...`	Arguments passed to the internal function which is used for bootstrapping (if applicable; see argument `stats`). Currently, relevant arguments are `B` (the number of bootstrap samples, defaulting to `2000`) and `seed` (see `set.seed()`, but defaulting to `NA` so that `set.seed()` is not called within that function at all).

Details

The stats options "mse", "rmse", and "R2" are only available for:

the traditional projection,
the latent projection with resp_oscale = FALSE,
the latent projection with resp_oscale = TRUE in combination with ⁠<refmodel>$family$cats⁠ being NULL.

The stats option "acc" (= "pctcorr") is only available for:

the binomial() family in case of the traditional projection,
all families in case of the augmented-data projection,
the binomial() family (on the original response scale) in case of the latent projection with resp_oscale = TRUE in combination with ⁠<refmodel>$family$cats⁠ being NULL,
all families (on the original response scale) in case of the latent projection with resp_oscale = TRUE in combination with ⁠<refmodel>$family$cats⁠ being not NULL.

The stats option "auc" is only available for:

the binomial() family in case of the traditional projection,
the binomial() family (on the original response scale) in case of the latent projection with resp_oscale = TRUE in combination with ⁠<refmodel>$family$cats⁠ being NULL.

Note that the stats option "auc" is not supported in case of subsampled LOO-CV (see argument nloo of cv_varsel()).

Value

An object of class vselsummary. The elements of this object are not meant to be accessed directly but instead via helper functions (print.vselsummary() and performances.vselsummary()).

Examples


# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

# The `stanreg` fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
  y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
  QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)

# Run varsel() (here without cross-validation, with L1 search, and with small
# values for `nterms_max` and `nclusters_pred`, but only for the sake of
# speed in this example; this is not recommended in general):
vs <- varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10,
             seed = 5555)
print(summary(vs), digits = 1)

stan-dev/projpred documentation built on April 13, 2025, 12:55 a.m.