step_FCBF: Fast Correlation Based Filter for Feature Selection

step_fcbf R Documentation

Fast Correlation Based Filter for Feature Selection

Description

step_fcbf takes a set of features and applies the fast correlation based filter (FCBF), so that a smaller subset of features is selected. The number of features selected depends on the min_su threshold parameter (a lower threshold selects more features).

Usage

step_fcbf(
  recipe,
  ...,
  min_su = 0.025,
  outcome = NA,
  cutpoint = 0.5,
  features_retained = NA,
  role = NA,
  trained = FALSE,
  removals = NULL,
  skip = FALSE,
  id = rand_id("FCBF")
)

Arguments

recipe

A recipe object. The step will be added to the sequence of operations for this recipe.

...

Selector functions that specify which features should be considered by the FCBF, e.g. all_numeric_predictors() or all_predictors().

min_su

Minimum threshold for symmetrical uncertainty. Lower values allow more features to be selected.

outcome

Outcome variable used for filter selection. If there is only one outcome variable in the recipe, it is detected automatically. If multiple outcome variables exist, the user should specify which one to use.

cutpoint

Quantile value (0-1) describing where to split numeric features into binary nominal features. For example, 0.5 gives a median split.

features_retained

Internal object that gives a record of which features were retained after FCBF. Should not be specified by the user.

role

Not used for this step, since new variables are not created.

trained

A logical to indicate if the quantities for preprocessing have been estimated.

removals

Feature columns that will be removed. Used internally and should not be set by the user.

skip

A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id

A character string that is unique to this step to identify it.

Details

step_fcbf takes a range of features (e.g. the full feature set) and selects a subset of features using the FCBF algorithm as described in Yu, L. and Liu, H. (2003).

FCBF selects features so as to simultaneously minimize correlation among the selected features and maximize correlation between those features and the target. FCBF only works with categorical features, so continuous features must first be discretized. By default this is a median split (cutpoint = 0.5, i.e. splitting continuous variables into 'high' versus 'low'), but the method may be customized in the internal function 'discretize_var'.
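The discretization and selection logic described above can be illustrated with a small base R sketch. This is not the package's implementation (which relies on the Bioconductor FCBF package): the helper names split_at_quantile, entropy, symmetrical_uncertainty and fcbf_sketch are hypothetical, and the choice to label values strictly above the quantile as 'high' is an assumption. The selection rule follows the procedure in Yu and Liu (2003): rank features by symmetrical uncertainty (SU) with the target, then drop any feature whose SU with an already selected feature is at least as large as its SU with the target.

```r
# Hypothetical helper: binarise a numeric vector at a quantile (0.5 = median).
# Assumption: values strictly above the threshold are labelled "high".
split_at_quantile <- function(x, cutpoint = 0.5) {
  threshold <- stats::quantile(x, probs = cutpoint, na.rm = TRUE, names = FALSE)
  factor(ifelse(x > threshold, "high", "low"), levels = c("low", "high"))
}

# Shannon entropy (in bits) of one or more categorical vectors taken jointly.
entropy <- function(...) {
  counts <- as.vector(table(...))
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Symmetrical uncertainty: SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1].
symmetrical_uncertainty <- function(x, y) {
  hx <- entropy(x)
  hy <- entropy(y)
  if (hx + hy == 0) return(0)
  mutual_info <- hx + hy - entropy(x, y)
  2 * mutual_info / (hx + hy)
}

# Sketch of the FCBF selection loop on already discretized features.
fcbf_sketch <- function(features, target, min_su = 0.025) {
  # Relevance: SU between each feature and the target.
  relevance <- vapply(features, symmetrical_uncertainty, numeric(1), y = target)
  candidates <- names(sort(relevance[relevance >= min_su], decreasing = TRUE))
  selected <- character(0)
  while (length(candidates) > 0) {
    best <- candidates[1]
    selected <- c(selected, best)
    rest <- candidates[-1]
    # Redundancy: drop features at least as related to `best` as to the target.
    redundant <- vapply(rest, function(f) {
      symmetrical_uncertainty(features[[f]], features[[best]]) >= relevance[[f]]
    }, logical(1))
    candidates <- rest[!redundant]
  }
  selected
}

# Example: median-split the numeric iris columns, then filter against Species.
binary_iris <- as.data.frame(lapply(iris[1:4], split_at_quantile))
fcbf_sketch(binary_iris, iris$Species, min_su = 0.025)
```

Lowering min_su in the last call admits more features into the candidate list, which mirrors the behaviour described for the min_su argument.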

Code to implement the FCBF algorithm is driven by the Bioconductor package FCBF. step_fcbf provides wrappers that allow it to be used within the tidymodels framework.

Value

Returns the recipe object, with step_fcbf added to the sequence of operations for this recipe.
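A minimal usage sketch follows, assuming the recipes and stepFCBF packages (and their dependencies) are installed. The iris data set and the min_su value of 0.001 are chosen purely for illustration, and the set of retained columns is not shown because it depends on the data and threshold.

```r
# Run only when both packages are available.
if (requireNamespace("recipes", quietly = TRUE) &&
    requireNamespace("stepFCBF", quietly = TRUE)) {
  library(recipes)
  library(stepFCBF)

  # Add the FCBF step to a recipe; Species is the single outcome,
  # so it should be detected automatically.
  rec <- recipe(Species ~ ., data = iris)
  rec <- step_fcbf(rec, all_predictors(), min_su = 0.001)

  # prep() estimates which features to keep; bake() applies the filter.
  prepped <- prep(rec, training = iris)
  baked <- bake(prepped, new_data = NULL)
  names(baked)
}
```

Because the step only removes columns, the baked data keeps the outcome and whichever predictors passed the filter.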

References

Yu, L. and Liu, H. (2003). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Proc. 20th Intl. Conf. Mach. Learn. (ICML-2003), Washington DC.


rowanjh/stepFCBF documentation built on April 8, 2023, 4:28 a.m.