computeMutualInfo: Compute (conditional) mutual information

View source: R/computeInformation.R

Description

For discrete or categorical variables, the (conditional) mutual information is computed from the empirical frequencies, minus a complexity cost computed with the Bayesian Information Criterion (BIC) or the Normalized Maximum Likelihood (NML). When continuous variables are present, each continuous variable is discretized, for each mutual information estimate, so as to maximize the mutual information minus the complexity cost (see Cabeli 2020).

Usage

computeMutualInfo(
  x,
  y,
  df_conditioning = NULL,
  maxbins = NULL,
  cplx = c("nml", "bic"),
  n_eff = -1,
  sample_weights = NULL,
  is_continuous = NULL,
  plot = FALSE
)

Arguments

x

[a vector] The X vector that contains the observational data of the first variable.

y

[a vector] The Y vector that contains the observational data of the second variable.

df_conditioning

[a data frame] The data frame of the observations of the conditioning variables.

maxbins

[an integer] When the data contain continuous variables, the maximum number of bins allowed during discretization. A smaller number makes the computation faster; a larger number allows a finer discretization (see the sketch after this section).

cplx

[a string] The complexity model:

  • ["bic"] Bayesian Information Criterion

  • ["nml"] Normalized Maximum Likelihood, more accurate complexity cost compared to BIC, especially on small sample size.

n_eff

[an integer] The effective number of samples. When there is significant autocorrelation between successive samples, you may want to specify an effective number of samples that is lower than the total number of samples.

sample_weights

[a vector of floats] Individual weights for each sample, used for the same reason as the effective number of samples but with individual weights.

is_continuous

[a vector of booleans] Specifies whether each variable is to be treated as continuous (TRUE) or discrete (FALSE). Must be of length 'ncol(df_conditioning) + 2', in the order X, Y, U1, U2, .... If not specified, factors and character vectors are treated as discrete, and numerical vectors as continuous.

plot

[a boolean] Specify whether the resulting XY optimum discretization is to be plotted (requires 'ggplot2' and 'gridExtra').
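
As a minimal sketch of how these optional arguments can be combined (all data and parameter values below are illustrative assumptions, not recommendations):

library(miic)
set.seed(42)
N <- 500
X <- rnorm(N)
Y <- X + rnorm(N, sd = 0.5)

# Cap the discretization at 20 bins per variable (faster, coarser)
res <- computeMutualInfo(X, Y, maxbins = 20)

# Declare an effective sample size below N, e.g. for autocorrelated samples
res <- computeMutualInfo(X, Y, n_eff = 250)

# Or down-weight samples individually instead
res <- computeMutualInfo(X, Y, sample_weights = rep(0.5, N))

# Force a numeric variable to be treated as discrete (order: X, Y)
Xd <- sample(0:2, N, replace = TRUE)
res <- computeMutualInfo(Xd, Y, is_continuous = c(FALSE, TRUE))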

Details

For a pair of continuous variables X and Y, the mutual information I(X;Y) is computed iteratively. In each iteration, the algorithm optimizes the partitioning of X and then of Y, in order to maximize

Ik(X_{d};Y_{d}) = I(X_{d};Y_{d}) - cplx(X_{d};Y_{d})

where cplx(X_{d}; Y_{d}) is the complexity cost of the corresponding partitioning (see Cabeli 2020). Upon convergence, the information terms I(X_{d};Y_{d}) and Ik(X_{d};Y_{d}), as well as the partitioning of X_{d} and Y_{d} in terms of cutpoints, are returned.
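
For example, the converged partitioning can be inspected directly from the returned list (a short sketch on simulated data):

set.seed(7)
X <- runif(1000)
Y <- X^2 + rnorm(1000, sd = 0.1)
res <- computeMutualInfo(X, Y)
res$n_iterations  # iterations until the estimated information converged
res$cutpoints1    # final cutpoints partitioning X
res$cutpoints2    # final cutpoints partitioning Y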

For conditional mutual information with a conditioning set U, the computation is done based on

Ik(X;Y|U) = 0.5*(Ik(X_{d};Y_{d},U_{d}) - Ik(X_{d};U_{d}) + Ik(Y_{d};X_{d},U_{d}) - Ik(Y_{d};U_{d})),

where each of the four summands is estimated separately.
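
In the MIIC framework, the sign of the resulting 'infok' can be read as an independence signature: a value at or below zero indicates that the observed dependence does not survive the complexity cost (see Cabeli 2020). A short sketch on simulated data:

set.seed(1)
N <- 1000
U <- runif(N)
X <- U + rnorm(N, sd = 0.1)
Y <- U + rnorm(N, sd = 0.1)
# Marginally dependent: infok is expected to be clearly positive
computeMutualInfo(X, Y)$infok
# Conditioning on U removes the dependence: infok falls to about zero or below
computeMutualInfo(X, Y, df_conditioning = data.frame(U))$infok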

Value

A list that contains the following elements (a short access sketch follows the list):

  • cutpoints1: Only when X is continuous, a vector containing the cutpoints for the partitioning of X.

  • cutpoints2: Only when Y is continuous, a vector containing the cutpoints for the partitioning of Y.

  • n_iterations: Only when at least one of the input variables is continuous, the number of iterations needed for the estimated information to converge.

  • iteration1, iteration2, ...: Only when at least one of the input variables is continuous, the vectors of cutpoints at each iteration.

  • info: The estimation of (conditional) mutual information without the complexity cost.

  • infok: The estimation of (conditional) mutual information with the complexity cost (Ik = I - cplx).

  • plot: Only when 'plot == TRUE', the plot object.
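
A brief sketch of accessing these elements (assuming continuous X and Y, and that 'ggplot2' and 'gridExtra' are installed for the plot):

res <- computeMutualInfo(X, Y, plot = TRUE)
res$info        # I(X;Y), without the complexity cost
res$infok       # Ik(X;Y) = I - cplx
res$cutpoints1  # partitioning of X
res$plot        # the optimum XY discretization plot object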

References

Cabeli V., Verny L., Sella N., Uguzzoni G., Verny M., Isambert H. (2020). Learning clinical networks from medical records based on information estimates in mixed-type data. PLoS Computational Biology, 16(5): e1007866. doi:10.1371/journal.pcbi.1007866

Examples

library(miic)
N <- 1000
# Dependence, conditional independence: X <- Z -> Y
Z <- runif(N)
X <- Z * 2 + rnorm(N, sd = 0.2)
Y <- Z * 2 + rnorm(N, sd = 0.2)
res <- computeMutualInfo(X, Y, plot = FALSE)
message("I(X;Y) = ", res$info)
res <- computeMutualInfo(X, Y, df_conditioning = matrix(Z, ncol = 1), plot = FALSE)
message("I(X;Y|Z) = ", res$info)


# Conditional independence with categorical conditioning variable: X <- Z -> Y
Z <- sample(1:3, N, replace = TRUE)
X <- -as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N)
Y <- as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N)
res <- miic::computeMutualInfo(X, Y, cplx = "nml")
message("I(X;Y) = ", res$info)
res <- miic::computeMutualInfo(X, Y, matrix(Z, ncol = 1), is_continuous = c(TRUE, TRUE, FALSE))
message("I(X;Y|Z) = ", res$info)


# Independence, conditional dependence: X -> Z <- Y
X <- runif(N)
Y <- runif(N)
Z <- X + Y + rnorm(N, sd = 0.1)
res <- computeMutualInfo(X, Y, plot = TRUE)
message("I(X;Y) = ", res$info)
res <- computeMutualInfo(X, Y, df_conditioning = matrix(Z, ncol = 1), plot = TRUE)
message("I(X;Y|Z) = ", res$info)

