feature_search: General Feature Searching Engine
In FSelectorRcpp: 'Rcpp' Implementation of 'FSelector' Entropy-Based Feature Selection Algorithms with a Sparse Matrix Support

feature_search

R Documentation

General Feature Searching Engine

Description

A convenience wrapper for greedy and exhaustive feature selection algorithms that extract valuable attributes depending on the evaluation method (called evaluator). This function is a reimplementation of FSelector's exhaustive.search and greedy.search.

Usage

feature_search(
  attributes,
  fun,
  data,
  mode = c("greedy", "exhaustive"),
  type = c("forward", "backward"),
  sizes = 1:length(attributes),
  parallel = TRUE,
  ...
)

Arguments

`attributes`	A character vector with attributes' names to be used to extract the most valuable features.
`fun`	A function (evaluator) to be used to score features' sets at each iteration of the algorithm passed via `mode`. See Examples.
`data`	A data set for `fun` function (evaluator).
`mode`	A character that determines which search algorithm to perform. Defualt is `"greedy"`.
`type`	Used when `mode = "greedy"` - whether to use the `backward` or the `forward` multiple-way search. Default is `"forward"`.
`sizes`	Used when `mode = "exhaustive"` - a vector of sizes of attributes subsets.
`parallel`	Allow parallelization.
`...`	Other arguments passed to foreach function.

Details

The evaluator function passed with fun is used to determine the importance score of current features' subset. The score is used in a multiple-way (backward or forward) greedy algorithm as a stopping moment or as a selection criterion in the exhaustive search that checks all possible attributes' subset combinations (of sizes passed in sizes).

Value

A list with following components

best - a data.frame with the best subset and it's score (1 - feature used, 0 - feature not used),
all - a data.frame with all checked features' subsets and their score (1 - feature used, 0 - feature not used),
data - the data used in the feature selection,
fun - the evaluator used to compute the score of importance for features' subsets,
call - an origin call of the feature_search,
mode - the mode used in the call.

Note

Note that score depends on the evaluator you provide in the fun parameter.

Author(s)

Zygmunt Zawadzki zygmunt@zstat.pl

Krzysztof Slomczynski krzysztofslomczynski@gmail.com

Examples


# Enable parallelization in examples
## Not run: 
 library(doParallel)
 cl <- makeCluster(2)
 registerDoParallel(cl)

## End(Not run)
# Close at the end
# stopCluster(cl) #nolint
# registerDoSEQ() #nolint

if(require("rpart")) {
# 1) Evaluator from FSelector package.
evaluator <- function(subset, data, dependent = names(iris)[5]) {
  library(rpart)
  k <- 5
  splits <- runif(nrow(data))
  results <- sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- data[test.idx, , drop = FALSE]
    train <- data[train.idx, , drop = FALSE]
    tree <- rpart(to_formula(subset, dependent), train)
    error.rate <- sum(test[[dependent]] != predict(tree, test, type = "c")) /
    nrow(test)
    return(1 - error.rate)
  })
  return(mean(results))
}

set.seed(123)
# Default greedy search.
system.time(
  feature_search(attributes = names(iris)[-5],
                 fun = evaluator,
                 data = iris)
)
system.time(
  feature_search(attributes = names(iris)[-5],
                 fun = evaluator,
                 data = iris,
                 parallel = FALSE)
)

# Optional exhaustive search.
system.time(
  feature_search(attributes = names(iris)[-5],
                 fun = evaluator,
                 data = iris,
                 mode = "exhaustive")
)
system.time(
  feature_search(attributes = names(iris)[-5],
                 fun = evaluator,
                 data = iris,
                 mode = "exhaustive",
                 parallel = FALSE)
)
}

# 2) Maximize R^2 statistics in the linear regression model/problem.

evaluator_R2_lm <- function(attributes, data, dependent = names(iris)[1]) {
  summary(
    lm(to_formula(attributes, dependent), data = data)
  )$r.squared
}

feature_search(attributes = names(iris)[-1],
               fun = evaluator_R2_lm, data = iris,
               mode = "exhaustive")

# 3) Optimize BIC crietion in generalized linear model.
# Aim of Bayesian approach it to identify the model with the highest
# probability of being the true model. - Kuha 2004
if(require("MASS")) {
utils::data(anorexia, package = "MASS")

evaluator_BIC_glm <- function(attributes, data, dependent = "Postwt") {
  extractAIC(
    fit = glm(to_formula(attributes, dependent), family = gaussian,
              data = data),
    k = log(nrow(data))
  )[2]
}

feature_search(attributes = c("Prewt", "Treat", "offset(Prewt)"),
               fun = evaluator_BIC_glm,
               data = anorexia,
               mode = "exhaustive")
}
# Close parallelization
## Not run: 
stopCluster(cl)
registerDoSEQ()

## End(Not run)

FSelectorRcpp documentation built on Oct. 3, 2024, 1:08 a.m.