forward.selection: Cross-validated forward selection

fs R Documentation

Cross-validated forward selection

Description

Run forward selection starting from a baseline model. Because all observations in the input data frame are used for selection, the predictive performance of the selected panel cannot be estimated without bias (use nested.fs() for that purpose).

Usage

fs(
  formula,
  data,
  family,
  choose.from = NULL,
  test = c("t", "wilcoxon"),
  num.inner.folds = 30,
  max.iters = 10,
  min.llk.diff = 2,
  max.pval = 0.5,
  sel.crit = c("paired.test", "total.loglik", "both"),
  num.filter = 0,
  filter.ignore = NULL,
  seed = 50,
  verbose = TRUE
)

forward.selection(x, y, init.model, family, ...)

Arguments

formula

An object of class formula (or one that can be coerced to that class) that describes the baseline model to be fitted.

data

Data frame or matrix containing outcome variable and predictors.

family

Type of model fitted: either gaussian() for linear regression or binomial() for logistic regression. This can be specified also as a function name (gaussian) or as a string ("gaussian").

choose.from

Indices or variable names over which the selection should be performed. If NULL (default), all variables in data that are not in the baseline model are considered (in the legacy interface, all variables in x that are not in init.model).

test

Type of statistical paired test to use (ignored if sel.crit="total.loglik").

num.inner.folds

Number of folds in the inner cross-validation. It must be at least 5 (default: 30).

max.iters

Maximum number of iterations (default: 10).

min.llk.diff

Minimum improvement in log-likelihood required to continue the selection: selection is terminated when the improvement falls below this value (default: 2).

max.pval

Interrupt the selection when the best achievable p-value exceeds this threshold (default: 0.5).

sel.crit

Selection criterion: "paired.test" chooses the variable with smallest p-value using the paired test specified by test (see Details), as long as this is smaller than max.pval; "total.loglik" picks the variable that gives the largest increase in log-likelihood; "both" attempts to combine both previous criteria, choosing the variable that produces the largest increase in log-likelihood only among the best 5 variables ranked according to the paired-test p-value.

num.filter

Number of variables to be retained by the univariate association filter (see Details), which can only be enabled if family=binomial(). Variables listed in init.model are never filtered. If set to 0 (default), the filter is disabled.

filter.ignore

Vector of variable names that should not be pruned by the univariate association filter so that they are always allowed to be selected (ignored if num.filter=0).

seed

Seed of the random number generator for the inner folds.

verbose

Whether the variable chosen at each iteration should be printed out (default: TRUE).

x

Data frame of predictors: this should include all variables in the initial set and the variables that are allowed to enter the selected panel.

y

Outcome variable. If family=binomial, it can only contain two classes of values that can be coerced to 0-1.

init.model

Either a formula or a vector of names of the initial set of variables that define the model from which the forward selection should start.

...

Further arguments to fs.

Details

At each iteration, this function runs cross-validation to choose which variable enters the final panel by fitting the current model augmented by each remaining variable considered one at a time.
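The iteration described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the simulated data, the starting model `y ~ 1`, the 5 folds and the helper `validation.llk` are all hypothetical, and the candidate is picked by total validation log-likelihood only.

```r
# Illustrative sketch only: one forward-selection iteration with K-fold
# cross-validation for a gaussian model, on simulated data.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x2 + rnorm(n)   # only x2 is truly associated with y

current <- "y ~ 1"                    # hypothetical current model
candidates <- c("x1", "x2", "x3")     # variables not yet in the model
K <- 5
fold <- sample(rep(seq_len(K), length.out = n))   # fold label per observation

# Gaussian log-likelihood of the held-out observations given a fitted lm
# (hypothetical helper, not part of any package)
validation.llk <- function(fit, test) {
  mu <- predict(fit, newdata = test)
  sigma <- sqrt(mean(residuals(fit)^2))   # ML estimate from the training part
  sum(dnorm(test$y, mean = mu, sd = sigma, log = TRUE))
}

# Fit the current model augmented by each candidate, one at a time.
# Result: a matrix with one row per fold and one column per candidate.
llk <- sapply(candidates, function(v) {
  form <- as.formula(paste(current, "+", v))
  sapply(seq_len(K), function(k) {
    fit <- lm(form, data = dat[fold != k, ])
    validation.llk(fit, dat[fold == k, ])
  })
})

total <- colSums(llk)              # total validation log-likelihood
best <- names(which.max(total))    # candidate that would enter the model
print(round(total, 1))
print(best)
```

In a full run, the chosen variable would be added to the current model and the procedure repeated on the remaining candidates until a stopping rule is met.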

By default, variables are selected according to the paired.test criterion. At each iteration, the differences in validation log-likelihood between the models with and without each additional variable, computed across all inner cross-validation folds, are tested against the null hypothesis of zero mean (with the alternative hypothesis being that the model with the additional variable is better). The test is paired according to the inner folds. Although the training folds are not independent, the p-value from this test approximates the probability that including the variable will not decrease the validation log-likelihood (approximate false discovery rate).
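A paired, one-sided test of this kind can be reproduced with base R. The sketch below is only an illustration: the per-fold log-likelihoods are simulated, the variable names are hypothetical, and the package's internal computation may differ in detail.

```r
# Illustrative sketch only: paired one-sided tests on per-fold validation
# log-likelihoods of a current model and of the model with one extra variable.
set.seed(2)
K <- 30                                            # number of inner folds
llk.current <- rnorm(K, mean = -50, sd = 3)        # simulated values
llk.augmented <- llk.current + rnorm(K, mean = 1, sd = 1.5)

# Alternative hypothesis: the augmented model has higher validation
# log-likelihood; observations are paired by fold.
t.res <- t.test(llk.augmented, llk.current,
                paired = TRUE, alternative = "greater")
w.res <- wilcox.test(llk.augmented, llk.current,
                     paired = TRUE, alternative = "greater")

# The paired t-test is a one-sample test of zero mean on the differences.
d.res <- t.test(llk.augmented - llk.current, mu = 0, alternative = "greater")

print(c(t = t.res$p.value, wilcoxon = w.res$p.value))
```

The two tests correspond to the "t" and "wilcoxon" options of the test argument; the Wilcoxon signed-rank test makes weaker distributional assumptions about the fold-wise differences.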

In the case of a binary outcome, when a very large number of predictors is available, it may be convenient to apply a univariate association filter. If num.filter is set to a positive value, then all available predictors (excluding those whose name is matched by filter.ignore) are tested for univariate association with the outcome, and only the top-ranked num.filter enter the selection phase, while the others are filtered out. This is done on the training part of all inner folds. Filtering can enhance the performance of forward selection when the number of available variables exceeds about 30-40.
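The filtering step can be sketched as follows. This is a hedged illustration rather than the package's code: it assumes that predictors are ranked by the Wald p-value of a univariate logistic regression (the package may use a different association test), that variables in filter.ignore are kept in addition to the num.filter retained ones, and that the data and the single training split are simulated.

```r
# Illustrative sketch only: univariate association filter for a binary
# outcome, applied to the training part of one hypothetical fold.
set.seed(3)
n <- 300
p <- 50
x <- as.data.frame(matrix(rnorm(n * p), n, p))       # columns V1, ..., V50
y <- rbinom(n, 1, plogis(1.5 * x$V1 - 1.5 * x$V2))   # V1 and V2 are informative
train <- sample(n, size = 240)                       # training observations

num.filter <- 10
filter.ignore <- "V50"     # never pruned by the filter

# Test each predictor on its own (assumption: Wald test from logistic regression)
to.test <- setdiff(names(x), filter.ignore)
pvals <- sapply(to.test, function(v) {
  fit <- glm(y[train] ~ x[train, v], family = binomial())
  coef(summary(fit))[2, "Pr(>|z|)"]
})

# Keep the num.filter most strongly associated predictors, plus the ignored ones
kept <- c(names(sort(pvals))[seq_len(num.filter)], filter.ignore)
print(kept)
```

Because the ranking uses only training observations, the held-out part of each fold does not influence which variables survive the filter.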

forward.selection provides the legacy interface used up to version 0.9.2. It is considered discontinued, and in the future it will be deprecated and eventually removed.

Value

An object of class fs containing the following fields:

fs

A data frame containing the forward selection summary.

init

The set of variables used in the initial model.

panel

Names of variables selected (in order).

init.model

Right-hand side of the formula corresponding to the initial model.

final.model

Right-hand side of the formula corresponding to the final model after forward selection.

family

Type of model fitted.

params

List of parameters used.

iter1

Summary statistics for all variables at the first iteration.

all.iter

Validation log-likelihoods for all inner folds at all iterations.
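As a brief usage sketch, the fields listed above can be read directly from the returned object. The call reuses the settings of the example in this page with verbose=FALSE; it assumes the nestfs package is installed, and the exact contents printed may vary between package versions.

```r
# Usage sketch: inspect the fields of an object of class fs
library(nestfs)
data(diabetes)

fs.res <- fs(Y ~ age + sex, data = diabetes, family = gaussian(),
             choose.from = 1:10, num.inner.folds = 5, max.iters = 3,
             verbose = FALSE)

fs.res$panel         # names of the variables selected, in order
fs.res$init.model    # right-hand side of the initial model formula
fs.res$final.model   # right-hand side of the final model formula
fs.res$fs            # data frame with the forward selection summary
```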

See Also

nested.fs() and summary.fs().

Examples


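library(nestfs)
data(diabetes)
# start from Y ~ age + sex and choose among the first 10 columns;
# few folds and iterations are used to keep the example fast
fs.res <- fs(Y ~ age + sex, data=diabetes, family=gaussian(),
             choose.from=1:10, num.inner.folds=5, max.iters=3)
summary(fs.res)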



mcol/nestfs documentation built on Jan. 4, 2023, 12:38 p.m.