# MRMR: Minimum redundancy maximal relevancy filter In mbq/praznik: Collection of Information-Based Feature Selection Filters

## Description

The method starts with the attribute of maximal mutual information with the decision Y. It then greedily adds the attribute X with the maximal value of the following criterion:

J(X) = I(X;Y) - \frac{1}{|S|} \sum_{W \in S} I(X;W),

where S is the set of already selected attributes.
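The greedy loop above can be sketched in a few lines. The following is a minimal illustration of the criterion using empirical mutual information on discrete, integer-coded columns; it is not the package's optimised C implementation, and the `mrmr` / `mutual_information` names are introduced here for illustration only.

```python
from collections import Counter
import math

def mutual_information(a, b):
    """Empirical I(a;b) in nats for two equal-length discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log((c / n) / (pa[x] / n * pb[y] / n))
               for (x, y), c in pab.items())

def mrmr(X, y, k):
    """X: dict of feature name -> discrete column; y: decision column."""
    relevance = {name: mutual_information(col, y) for name, col in X.items()}
    selected, scores = [], []
    while len(selected) < k:
        best, best_score = None, None
        for name in X:
            if name in selected:
                continue
            # Mean redundancy against the already selected set S (0 when S is empty).
            redundancy = (sum(mutual_information(X[name], X[w]) for w in selected)
                          / len(selected)) if selected else 0.0
            # J(X) = I(X;Y) - (1/|S|) * sum of I(X;W) over W in S
            j = relevance[name] - redundancy
            if best is None or j > best_score:
                best, best_score = name, j
        selected.append(best)
        scores.append(best_score)
    return selected, scores
```

Note that the first pick reduces to plain maximal relevance, matching the description above; redundancy only starts to matter from the second attribute onwards.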

## Usage

```r
MRMR(X, Y, k = 3, threads = 0)
```

## Arguments

- `X`: Attribute table, given as a data frame with either factors (preferred), booleans, integers (treated as categorical) or reals (which undergo automatic categorisation; see below for details). NAs are not allowed.
- `Y`: Decision attribute; should be given as a factor, but other options are accepted, exactly as for attributes. NAs are not allowed.
- `k`: Number of attributes to select. Must not exceed `ncol(X)`.
- `threads`: Number of threads to use; the default value, 0, means all available to OpenMP.

## Value

A list with two elements: `selection`, a vector of indices of the selected features in selection order, and `score`, a vector of the corresponding feature scores. The names of both vectors correspond to the names of the features in `X`. Both vectors have length k, or length zero when all features turn out to have zero mutual information with the decision.

## Note

The method requires discrete input in order to use empirical estimators of the underlying distributions and, consequently, of mutual information and entropy. For a smoother user experience, praznik automatically coerces non-factor vectors in X and Y; this requires additional time and space and may yield confusing results, so the best practice is to convert data to factors before feeding them to this function.

Real attributes are cut into roughly 10 equally-spaced bins, following a heuristic often used in the literature. The precise number of cuts depends on the number of objects; namely, it is n/3, but never less than 2 and never more than 10.

Integers (which technically are also numeric) are treated as categorical variables, for compatibility with similar software, so in a very different way. Be aware that an actually numeric attribute which happens to be integer-valued could be coerced into an n-level categorical, which would have a perfect mutual information score and would likely become a very disruptive false positive.
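The binning rule above (equal-width bins, with the bin count n/3 clamped to the range 2–10) can be sketched as follows. The `discretise` name is introduced here for illustration; the package's exact cut points and edge handling may differ.

```python
def discretise(values):
    """Equal-width binning with bin count min(10, max(2, n // 3))."""
    n = len(values)
    bins = min(10, max(2, n // 3))
    lo, hi = min(values), max(values)
    # Guard against a constant column, where the range collapses to zero.
    width = (hi - lo) / bins or 1.0
    # Map each value to its bin index, clamping the maximum onto the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]
```

For example, 30 uniformly spread values land in the full 10 bins, while a constant column collapses into a single bin.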

## References

"Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy" H. Peng et al. IEEE Pattern Analysis and Machine Intelligence (PAMI) (2005)

## Examples

```r
data(MadelonD)
MRMR(MadelonD$X, MadelonD$Y, 20)
```

mbq/praznik documentation built on May 9, 2018, 12:59 a.m.