R/praznik.R

#' Tools for information-based feature selection and scoring
#' 
#' Praznik is a collection of tools for information theory-based feature selection and scoring.
#'
#' The first part of the package consists of efficient implementations of several popular information filters: feature selection approaches based on greedy optimisation of a certain feature inclusion criterion.
#' In a nutshell, an algorithm of this class requires an information system \eqn{(X,Y)} and a predefined number of features to select, \eqn{k}, and works as follows.
#' To start, it estimates the mutual information between each feature and the decision, finds the feature with the maximal such score and stores it as the first element of the list of selected features, \eqn{S}.
#' Then, it estimates the value of a certain criterion function \eqn{J(X,Y,S)} for each feature \eqn{X}; this function also depends on \eqn{Y} and the set of already selected features \eqn{S}.
#' As in the first step, the previously unselected feature with the greatest value of the criterion function is selected next.
#' This is repeated until the method has gathered \eqn{k} features or, in the case of some methods, until no more informative features can be found.
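#'
#' For illustration, a minimal sketch of running one such filter on the bundled \code{\link{MadelonD}} data:
#' \preformatted{
#' data(MadelonD)
#' # Greedily select 5 features with the CMIM criterion; the result
#' # is a list with $selection (indices of the chosen features) and
#' # $score (their criterion values)
#' CMIM(MadelonD$X, MadelonD$Y, k = 5)
#' }
#'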
#' The methods implemented in praznik consider the following criteria.
#'
#' The mutual information maximisation filter, \code{\link{MIM}}, simply selects top-\eqn{k} features of best mutual information, that is
#' \deqn{J_{MIM}=I(X;Y).}
#'
#' The minimal conditional mutual information maximisation proposed by F. Fleuret, \code{\link{CMIM}}, uses
#' \deqn{J_{CMIM}(X)=\min(I(X;Y),\min_{W\in S} I(X;Y|W));}
#' this method is also effectively identical to the information fragments method.
#' 
#' The minimum redundancy maximal relevance proposed by H. Peng et al., \code{\link{MRMR}}, uses
#' \deqn{J_{MRMR}=I(X;Y)-\frac{1}{|S|}\sum_{W\in S} I(X;W).}
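#'
#' For instance, this criterion for a single candidate can be assembled by hand from the scoring functions described below; a minimal sketch, with a hypothetical helper \code{mrmrCriterion} taking a candidate \code{x}, the decision \code{y} and a list \code{S} of already selected feature vectors:
#' \preformatted{
#' mrmrCriterion <- function(x, y, S) {
#'  relevance <- miScores(data.frame(x = x), y)
#'  redundancy <- mean(sapply(S, function(w)
#'   miScores(data.frame(x = x), w)))
#'  relevance - redundancy
#' }
#' }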
#'
#' The joint mutual information filter by H. Yang and J. Moody, \code{\link{JMI}}, uses
#' \deqn{J_{JMI}=\sum_{W\in S} I(X,W;Y).}
#'
#' The double input symmetrical relevance filter by P. Meyer and G. Bontempi, \code{\link{DISR}}, uses
#' \deqn{J_{DISR}(X)=\sum_{W\in S} \frac{I(X,W;Y)}{H(X,W,Y)}.}
#'
#' The minimal joint mutual information maximisation filter by M. Bennasar, Y. Hicks and R. Setchi, \code{\link{JMIM}}, uses
#' \deqn{J_{JMIM}=\min_{W\in S} I(X,W;Y).}
#'
#' The minimal normalised joint mutual information maximisation filter by the same authors, \code{\link{NJMIM}}, uses
#' \deqn{J_{NJMIM}=\min_{W\in S} \frac{I(X,W;Y)}{H(X,W,Y)}.}
#'
#' The third-order joint mutual information filter by Sechidis et al., \code{\link{JMI3}}, uses
#' \deqn{J(X)=\frac{1}{2}\sum_{(U,W)\in S^2; U\neq W} I(X,U,W;Y).}
#'
#' Because \code{\link{CMIM}}, \code{\link{JMIM}} and \code{\link{NJMIM}} take a minimum over the already selected features, they can use a somewhat more sophisticated and faster algorithm, sketched below.
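#'
#' The idea is to cache, for each candidate, a partial minimum over only the first few selected features, and to refine it lazily, only while the candidate could still beat the current best. A hedged, simplified R sketch of this trick for CMIM, assuming a feature data frame \code{X}, a decision \code{Y} and a target count \code{k} (the shipped implementation is in C and differs in details):
#' \preformatted{
#' ps <- miScores(X, Y)   # cached partial minima, initially I(X;Y)
#' u <- rep(0, ncol(X))   # selected features accounted for in ps
#' S <- c()               # selection so far
#' for (step in 1:k) {
#'  best <- 0; bestScore <- -Inf
#'  for (i in setdiff(seq_len(ncol(X)), S)) {
#'   # refine only while this candidate can still win the step
#'   while (u[i] < length(S) && ps[i] > bestScore) {
#'    u[i] <- u[i] + 1
#'    ps[i] <- min(ps[i],
#'     cmiScores(X[, i, drop = FALSE], Y, X[[S[u[i]]]]))
#'   }
#'   if (ps[i] > bestScore) { best <- i; bestScore <- ps[i] }
#'  }
#'  S <- c(S, best)
#' }
#' }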
#'
#' The second part of the package provides methods for scoring features, useful both on their own and as building blocks for more sophisticated algorithms.
#' In particular, the package exposes the following functions:
#' 
#' \code{\link{hScores}} returns 
#' \deqn{H(X).}
#'
#' \code{\link{jhScores}} returns 
#' \deqn{H(X,Y).}
#'
#' \code{\link{miScores}} returns 
#' \deqn{I(X;Y).}
#'
#' \code{\link{cmiScores}} returns, for a given condition vector \eqn{Z},
#' \deqn{I(X;Y|Z).}
#'
#' \code{\link{jmiScores}} returns
#' \deqn{I(X,Z;Y).}
#'
#' \code{\link{njmiScores}} returns
#' \deqn{\frac{I(X,Z;Y)}{H(X,Y,Z)}.}
#'
#' \code{\link{minCmiScores}}, \code{\link{maxCmiScores}} and \code{\link{minMaxCmiScores}} return
#' \deqn{\min_j I(X_i;Y|X_j)} and/or \deqn{\max_j I(X_i;Y|X_j).}
#'
#' \code{\link{maxJmiScores}} returns
#' \deqn{\max_{j\neq i} I(X_i,X_j;Y).}
#'
#' \code{\link{triScores}} returns, for every triple of features,
#' \deqn{I(X_i;X_j;X_k).}
#'
#' These functions generally also have their \code{*Matrix} counterparts, which efficiently build a matrix of scores between all pairs of features.
#' This is especially useful for network inference tasks.
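#'
#' A short sketch of the scoring interface on the bundled data (conditioning on the first feature is an arbitrary choice for illustration):
#' \preformatted{
#' data(MadelonD)
#' miScores(MadelonD$X, MadelonD$Y)    # I(X;Y) for every feature
#' cmiScores(MadelonD$X, MadelonD$Y,
#'           MadelonD$X[, 1])          # I(X;Y|Z) given one feature
#' miMatrix(MadelonD$X[, 1:5])         # pairwise mutual information
#' }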
#'
#' Estimation of mutual information and its generalisations is a hard task; still, praznik aims at speed and simplicity, and hence offers only a basic maximum likelihood estimator, applicable to discrete data.
#' For convenience, praznik automatically and silently coerces non-factor inputs into about ten equally-spaced bins, following a heuristic often used in the literature.
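#'
#' For example, a continuous data set can be passed directly; a quick sketch:
#' \preformatted{
#' # numeric columns are silently cut into about ten bins before
#' # the maximum likelihood estimation
#' miScores(iris[, -5], iris$Species)
#' }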
#'
#' Furthermore, praznik provides the \code{\link{kTransform}} function for converting continuous features into discrete ones with the Kendall transformation, a novel approach based on the Kendall correlation coefficient which allows for multivariate reasoning based on monotonicity agreement.
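#'
#' A minimal sketch on a plain numeric vector:
#' \preformatted{
#' # each ordered pair of observations becomes a single outcome
#' # reflecting the direction of change between them
#' kTransform(c(3, 1, 2))
#' }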
#'
#' Additionally, praznik has limited, experimental support for replacing entropic statistics with their Gini impurity-based counterparts; in this framework, entropy is replaced by the Gini impurity
#' \deqn{g(X):=1-\sum_x p_x^2,}
#' which leads to an impurity gain
#' \deqn{G(X;Y):=g(Y)-E(g(Y)|X)=\sum_{xy}\frac{p_{xy}^2}{p_x}-\sum_{y} p_y^2,}
#' a counterpart of mutual information or information gain.
#' It does not possess most of the elegant properties of mutual information, yet the values of both are usually highly correlated; moreover, Gini gain is computationally easier to calculate, hence it often replaces MI in performance-sensitive applications, like optimising splits in decision trees.
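#'
#' A toy numerical check of these formulas (\code{\link{impScores}}, described below, computes \eqn{G} directly):
#' \preformatted{
#' x <- factor(c("a", "a", "b", "b"))
#' y <- factor(c(0, 1, 0, 0))
#' p_xy <- table(x, y) / length(x)   # joint probabilities
#' p_x <- rowSums(p_xy)
#' p_y <- colSums(p_xy)
#' sum(p_xy^2 / p_x) - sum(p_y^2)    # G(X;Y), here 0.125
#' }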
#'
#' In the present version, praznik includes \code{\link{impScores}} for generating values of \eqn{G} for all features (an analogue of \code{\link{miScores}}), as well as \code{\link{JIM}}, a Gini gain-based feature selection method otherwise identical to \code{\link{JMI}}.
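#'
#' A minimal usage sketch on the bundled data:
#' \preformatted{
#' data(MadelonD)
#' impScores(MadelonD$X, MadelonD$Y)   # Gini gain of each feature
#' JIM(MadelonD$X, MadelonD$Y, k = 5)  # Gini-based analogue of JMI
#' }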
#' @references "Praznik: High performance information-based feature selection" M.B. Kursa. SoftwareX 16, 100819 (2021).
#' @references "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection" G. Brown et al. JMLR (2012).
"_PACKAGE"


# Load the native code

#' @useDynLib praznik, .registration=TRUE
NULL

# Some example data

#' Pre-discretised Madelon dataset
#'
#' Madelon is a synthetic data set from the NIPS 2003 feature selection challenge, generated by Isabelle Guyon.
#' It contains 480 irrelevant and 20 relevant features, of which 5 are informative and 15 redundant.
#' In this version, the originally numerical features have been pre-cut into 10 bins, and their names have been altered to reveal the 20 relevant features (as identified by the Boruta method).
#' @format A list with two elements, \code{X} containing a data frame with predictors, and \code{Y}, the decision. 
#' Features are in the same order as in the original data; the names of the relevant ones start with \code{Rel}, while those of the irrelevant ones start with \code{Irr}.
#' @source \url{https://archive.ics.uci.edu/ml/datasets/Madelon}
#' @usage data(MadelonD)
"MadelonD"
