triScores: Mutual information of feature triples

View source: R/scorers.R

triScores R Documentation

Mutual information of feature triples

Description

Calculates mutual information of each triple of features, that is

I(X_i;X_j;X_k).

Usage

triScores(X, threads = 0)

Arguments

X

Attribute table, given as a data frame with either factors (preferred), booleans, integers (treated as categorical) or reals (which undergo automatic categorisation; see below for details). A single vector is interpreted as a data.frame with one column. NAs are not allowed.

threads

Number of threads to use; default value, 0, means all available to OpenMP.

Value

A data frame with four columns: the first three (Var1, Var2 and Var3) hold the names of the features, and the fourth, MI, holds the value of the mutual information. The order of features within a triple does not matter, hence only

n(n-1)(n-2)/6

unique, sorted triples are evaluated.
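The size of the result can be estimated before running the scorer. The following is a minimal base R sketch of the count above; the variable names are illustrative, and the order in which praznik itself lists the triples is not assumed to match the order produced by combn.

```r
# Number of unordered feature triples for the 5 columns of iris.
n <- ncol(iris)
n_triples <- choose(n, 3)   # same as n * (n - 1) * (n - 2) / 6

# Enumerate the triples of feature names (illustration only; the row
# order is that of combn, not necessarily that of triScores).
triples <- t(combn(names(iris), 3))
colnames(triples) <- c("Var1", "Var2", "Var3")

n_triples
head(triples)
```

The count grows cubically with the number of features, so it is worth checking choose(ncol(X), 3) on wide tables before calling the function.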

Note

The method requires input to be discrete in order to use empirical estimators of the distribution and, consequently, of information gain or entropy. To allow a smoother user experience, praznik automatically coerces non-factor vectors in inputs, which requires additional time and memory and may yield confusing results; the best practice is to convert data to factors prior to feeding them into this function. Real attributes are cut into about 10 equally-spaced bins, following the heuristic often used in the literature. The precise number of cuts depends on the number of objects, n; namely, it is n/3, but never less than 2 and never more than 10. Integers (which technically are also numeric) are treated as categorical variables (for compatibility with similar software), so in a very different way; one should be aware that an actually numeric attribute which happens to be an integer could be coerced into an n-level categorical, which would have a perfect mutual information score and would likely become a very disruptive false positive.
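Explicit conversion to factors, as recommended above, might look like the sketch below. It approximates the described heuristic with base R's cut. The helpers bin_count and to_factor are hypothetical names, rounding n/3 down is an assumption, and the bin edges are not guaranteed to match those produced by praznik's internal coercion.

```r
# Hypothetical helper: number of bins, n/3 clamped to the range [2, 10].
# Rounding n/3 down is an assumption made for this sketch.
bin_count <- function(n) max(2L, min(10L, as.integer(n %/% 3)))

# Hypothetical helper: cut a numeric vector into equal-width bins.
to_factor <- function(x) cut(x, breaks = bin_count(length(x)))

# Convert every numeric column of iris to a factor up front.
iris_f <- iris
is_num <- vapply(iris_f, is.numeric, logical(1))
iris_f[is_num] <- lapply(iris_f[is_num], to_factor)

str(iris_f)

# Score the pre-discretised table if praznik is installed.
if (requireNamespace("praznik", quietly = TRUE)) {
  print(head(praznik::triScores(iris_f)))
}
```

Doing the conversion by hand makes the binning visible and lets integer columns be handled deliberately, for example by binning a genuinely numeric integer column instead of leaving it to be treated as categorical.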

In the current version, the maximal number of features accepted is 2345, which gives a bit less than 2^31 triples. The equation used for calculation is

I(X_i;X_j;X_k)=I(X_i;X_k)+I(X_j;X_k)-I(X_i,X_j;X_k).

Hence, as the result combines several separately computed terms, please mind that rounding errors may occur and influence reproducibility.
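The identity in this note can be traced by hand on a small example. The sketch below uses plug-in entropy estimates in base R; the helpers entropy, mi and tri_mi are hypothetical names, the use of natural logarithms is an assumption, and none of this is praznik's internal code.

```r
# Empirical (plug-in) entropy of one or more discrete vectors, in nats.
entropy <- function(...) {
  counts <- as.vector(table(...))
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Pairwise mutual information: I(A;B) = H(A) + H(B) - H(A,B).
mi <- function(a, b) entropy(a) + entropy(b) - entropy(a, b)

# I(X_i;X_j;X_k) = I(X_i;X_k) + I(X_j;X_k) - I(X_i,X_j;X_k)
tri_mi <- function(xi, xj, xk) {
  joint <- interaction(xi, xj, drop = TRUE)  # the pair (X_i, X_j) as one variable
  mi(xi, xk) + mi(xj, xk) - mi(joint, xk)
}

# XOR example: neither input alone informs about z, but the pair determines it.
x <- c(0, 0, 1, 1)
y <- c(0, 1, 0, 1)
z <- as.numeric(xor(x, y))
score <- tri_mi(x, y, z)

# Compare with a tolerance rather than `==`, given the rounding caveat.
isTRUE(all.equal(score, -log(2)))
```

In the XOR case both pairwise terms are zero and the joint term equals the entropy of z, so the score is negative. Comparing scores with all.equal instead of exact equality is one way to keep checks stable across runs and platforms.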

Examples

library(praznik)
triScores(iris)

praznik documentation built on May 20, 2022, 5:06 p.m.