sbfControl: Control Object for Selection By Filtering (SBF)
In caret: Classification and Regression Training

Description

Controls the execution of models with simple filters for feature selection.

Usage

sbfControl(
  functions = NULL,
  method = "boot",
  saveDetails = FALSE,
  number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25),
  repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number),
  verbose = FALSE,
  returnResamp = "final",
  p = 0.75,
  index = NULL,
  indexOut = NULL,
  timingSamps = 0,
  seeds = NA,
  allowParallel = TRUE,
  multivariate = FALSE
)

Arguments

`functions`	a list of functions for model fitting, prediction and variable filtering (see Details below)
`method`	The external resampling method: `boot`, `cv`, `repeatedcv`, `LOOCV` or `LGOCV` (for repeated training/test splits)
`saveDetails`	a logical to save the predictions and variable importances from the selection process
`number`	Either the number of folds or number of resampling iterations
`repeats`	For repeated k-fold cross-validation only: the number of complete sets of folds to compute
`verbose`	a logical to print a log for each external resampling iteration
`returnResamp`	A character string indicating how much of the resampled summary metrics should be saved. Values can be `"final"` or `"none"`
`p`	For leave-group out cross-validation: the training percentage
`index`	a list with elements for each external resampling iteration. Each list element is the sample rows used for training at that iteration.
`indexOut`	a list (the same length as `index`) that dictates which samples are held out for each resample. If `NULL`, then the unique set of samples not contained in `index` is used.
`timingSamps`	the number of training set samples that will be used to measure the time for predicting samples (zero indicates that the prediction time should not be estimated).
`seeds`	an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of `NA` will stop the seed from being set within the worker processes while a value of `NULL` will set the seeds using a random set of integers. Alternatively, a vector of integers can be used. The vector should have `B+1` elements where `B` is the number of resamples. See the Examples section below.
`allowParallel`	if a parallel backend is loaded and available, should the function use it?
`multivariate`	a logical; should all the columns of `x` be exposed to the `score` function at once?
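
As a rough illustration of how `index` and `seeds` fit together, the sketch below builds five cross-validation training sets with `createFolds()` and supplies `B + 1` seeds for `B = 5` resamples. The simulated outcome `y` and the choice of `lmSBF` are assumptions made for this example only.

```r
library(caret)

set.seed(1)
y <- rnorm(100)  # simulated outcome, for illustration only

# Training-set row indices for each of 5 folds (B = 5 resamples)
folds <- createFolds(y, k = 5, returnTrain = TRUE)

ctrl <- sbfControl(
  functions = lmSBF,
  method = "cv",
  number = length(folds),
  index = folds,
  # B + 1 integers, as described for the seeds argument
  seeds = sample.int(100000, length(folds) + 1)
)

length(ctrl$index)
length(ctrl$seeds)
```

Because `indexOut` is left as `NULL` here, the held-out rows for each resample are the rows not listed in the corresponding element of `index`.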

Details

More details on this function can be found at http://topepo.github.io/caret/feature-selection-using-univariate-filters.html.

Simple filter-based feature selection requires functions to be specified for some operations.

The `fit` function builds the model based on the current data set. The arguments for the function must be:

`x`	the current training set of predictor data with the appropriate subset of variables (i.e. after filtering)
`y`	the current outcome data (either a numeric or factor vector)
`...`	optional arguments to pass to the fit function in the call to `sbf`

The function should return a model object that can be used to generate predictions.
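
A minimal sketch of a function with this signature, assuming an ordinary linear model is wanted. The name `fit_lm` and the simulated data are hypothetical and used only for illustration; they are not part of caret.

```r
# Hypothetical fit function: linear regression on the (already filtered) predictors
fit_lm <- function(x, y, ...) {
  dat <- as.data.frame(x)
  dat$.outcome <- y
  lm(.outcome ~ ., data = dat, ...)
}

set.seed(1)
x <- data.frame(a = rnorm(30), b = rnorm(30))  # simulated predictors
y <- 2 * x$a + rnorm(30, sd = 0.1)             # simulated outcome

model <- fit_lm(x, y)
class(model)
```

This sketch does not handle the case where filtering leaves no predictors at all, which a production-quality function would need to consider.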

The `pred` function returns a vector of predictions (numeric or factors) from the current model. The arguments are:

`object`	the model generated by the `fit` function
`x`	the current set of predictors for the held-back samples
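
The following sketch shows one way a prediction function with these two arguments might look for a regression model. `pred_lm` and the simulated data are hypothetical; the `lm()` call simply stands in for whatever the fitting function returns.

```r
# Hypothetical pred function: numeric predictions for the held-back rows
pred_lm <- function(object, x) {
  unname(predict(object, newdata = as.data.frame(x)))
}

set.seed(1)
train <- data.frame(a = rnorm(30))
train$y <- 2 * train$a + rnorm(30, sd = 0.1)
heldout <- data.frame(a = rnorm(5))

model <- lm(y ~ a, data = train)  # stands in for the output of a fit function
preds <- pred_lm(model, heldout)
length(preds)
```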

The `score` function is used to return scores with names for each predictor (such as a p-value). Inputs are:

`x`	the predictors for the training samples. If `sbfControl()$multivariate` is `TRUE`, this will be the full predictor matrix. Otherwise it is a vector for a specific predictor.
`y`	the current training outcomes

When `sbfControl()$multivariate` is `TRUE`, the score function should return a named vector where `length(scores) == ncol(x)`. Otherwise, the function's output should be a single value. Univariate examples are given by `anovaScores` for classification and `gamScores` for regression, and by the example below.
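
To make the two calling conventions concrete, here is a hedged sketch that uses the p-value of a Pearson correlation test as the score. Both function names and the simulated data are assumptions for illustration; caret's own scoring functions use different tests.

```r
# Hypothetical univariate score: p-value from a Pearson correlation test.
# Here x is a single predictor (a numeric vector).
score_cor <- function(x, y) cor.test(x, y)$p.value

# Multivariate counterpart: x holds every predictor, and the result is a
# named vector with one score per column.
score_cor_all <- function(x, y) {
  vapply(as.data.frame(x), score_cor, numeric(1), y = y)
}

set.seed(1)
x <- data.frame(signal = rnorm(50), noise = rnorm(50))  # simulated predictors
y <- x$signal + rnorm(50, sd = 0.5)                     # simulated outcome

single <- score_cor(x$signal, y)  # one value for one predictor
scores <- score_cor_all(x, y)     # one named value per column
```

The first form would be paired with `multivariate = FALSE` and the second with `multivariate = TRUE`.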

The `filter` function is used to return a logical vector with names for each predictor (`TRUE` indicates that the predictor should be retained). Inputs are:

`score`	the output of the `score` function
`x`	the predictors for the training samples
`y`	the current training outcomes

The function should return a named logical vector.
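
A small sketch of a filter that keeps predictors with a p-value of at most 0.05. The name `filter_p`, the threshold, and the made-up scores are assumptions chosen for the example.

```r
# Hypothetical filter: retain predictors whose p-value is at most 0.05
filter_p <- function(score, x, y) score <= 0.05

score <- c(a = 0.001, b = 0.4, c = 0.03)     # made-up p-values
keep <- filter_p(score, x = NULL, y = NULL)  # x and y are unused by this filter
names(keep)[keep]                            # "a" "c"
```

Because the comparison preserves the names of `score`, the result is already a named logical vector.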

Examples of these functions are included in the package: caretSBF, lmSBF, rfSBF, treebagSBF, ldaSBF and nbSBF.
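
One convenient starting point, sketched here under the assumption that only the filtering rule needs to change, is to copy a built-in list and replace a single element. `strictLmSBF` and the 0.01 cutoff are hypothetical choices.

```r
library(caret)

# Start from a built-in list and swap in a stricter (hypothetical) filter
strictLmSBF <- lmSBF
strictLmSBF$filter <- function(score, x, y) score <= 0.01

ctrl <- sbfControl(functions = strictLmSBF, method = "cv", number = 5)
names(ctrl$functions)
```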

The web page http://topepo.github.io/caret/ has more details and examples related to this function.

Value

a list that echoes the specified arguments

Author(s)

Max Kuhn

Examples


## Not run: 
data(BloodBrain)

## Use a GAM as the filter, then fit a random forest model
set.seed(1)
RFwithGAM <- sbf(bbbDescr, logBBB,
                 sbfControl = sbfControl(functions = rfSBF,
                                         verbose = FALSE,
                                         seeds = sample.int(100000, 11),
                                         method = "cv"))
RFwithGAM


## A simple example for multivariate scoring
rfSBF2 <- rfSBF
rfSBF2$score <- function(x, y) apply(x, 2, rfSBF$score, y = y)

set.seed(1)
RFwithGAM2 <- sbf(bbbDescr, logBBB,
                  sbfControl = sbfControl(functions = rfSBF2,
                                          verbose = FALSE,
                                          seeds = sample.int(100000, 11),
                                          method = "cv",
                                          multivariate = TRUE))
RFwithGAM2



## End(Not run)

caret documentation built on April 3, 2025, 7:02 p.m.