varrank: Heuristics Tools Based on Mutual Information for Variable...

View source: R/varrank.R

varrankR Documentation

Heuristics Tools Based on Mutual Information for Variable Ranking and Feature Selection

Description

This function heuristically estimates the variables ranks based on mutual information with multiple model and multiple search schemes.

Usage

varrank(data.df = NULL, variable.important = NULL, method =
                 c("battiti", "kwak", "peng", "estevez"), algorithm =
                 c("forward", "backward"),
                 scheme = c("mid", "miq"),
                 discretization.method = NULL, ratio = NULL, n.var =
                 NULL, verbose = TRUE)
                 

Arguments

data.df

a named data frame with either numeric or factor variables.

variable.important

a list of variables that is the target variables.

method

the method to be used. See ‘Details’. Default is "estevez".

algorithm

the algorithm scheme to be used. Default is '"forward".

scheme

the scheme search to be used. "mid" and "miq" stands for the Mutual Information Difference and Quotient schemes, respectively. Those are two ways to combine the relevance and redundancy. They are the two most used mRMRe schemes

discretization.method

a character vector giving the discretization method to use. See discretization.

ratio

parameter to be used in "battiti" and "kwak".

n.var

number of variables to be returned.

verbose

logical. Should a progress bar be plotted? As the search scheme.

Details

By default varrank performs a variable ranking based on forward search algorithm using mutual information. The scoring is based on one of the four models implemented. The input dataset can be discrete, continuous or mixed variables. The target variables can be a set. The filter approach based on mutual information is the Minimum Redundancy Maximum Relevance (mRMRe) algorithm. A general formulation of the ensemble of mRMRe techniques is, given a set of features F, a subset of important features C, a candidate feature f_i and possibly some already selected features f_s in S. The local score function for a mid scheme (Mutual Information Difference) is expressed as:

g_i(A, C, S, F) = MI(f_i;C) - ∑_{f_s} A(f_i, f_s, C) MI(f_i; f_s)

Depending of the value method, the value of A and B will be set accordingly to:

battiti defined A=B, where B is a user defined parameter (called ratio). Battiti (1994).

kwak defined A = B MI(f_s;C)/H(f_s), where B a user defined parameter (called ratio). Kwak et al. (2002).

peng defined A=1/|S|. Peng et al. (2005).

estevez defined A = 1/|S| min(H(f_i), H(f_s)). Estévez et al. (2009).

The search algorithm implemented are a forward search i.e. start with an empty set S and fill in. The the returned list is ranked by decreasing order of relevance. A backward search which start with a full set i.e. all variables except variable.importance are considered as selected and the returned list is in increasing order of relevance based solely on mutual information estimation. Thus a backward search will only check for relevance and not for redundancy. n.var is optional if it is not provided, all candidate variables will be ranked.

Value

A list with an entry for the variables ranked and another entry for the score for each ranked variables.

Author(s)

Gilles Kratzer

References

Kratzer, G. and Furrer, R. "varrank: an R package for variable ranking based on mutual information with applications to system epidemiology"

Examples


if (requireNamespace("mlbench", quietly = TRUE)) {
library(mlbench)
data(PimaIndiansDiabetes)

##forward search for all variables
out1 <- varrank(data.df = PimaIndiansDiabetes,
  method = "estevez",
  variable.important = "diabetes",
  discretization.method = "sturges",
  algorithm = "forward", scheme = "mid")

##forward search for 3 variables
out2 <- varrank(data.df = PimaIndiansDiabetes,
  method = "estevez",
  variable.important = "diabetes",
  discretization.method = "sturges",
  algorithm = "forward",
  scheme = "mid",
  n.var=3)

##backward search for all variables
out3 <- varrank(data.df = PimaIndiansDiabetes,
  method = "peng",
  variable.important = "diabetes",
  discretization.method = "sturges",
  algorithm = "backward",
  scheme = "mid",
  n.var=NULL)
  }


varrank documentation built on Oct. 12, 2022, 5:06 p.m.