discretizeMutual: Iterative dynamic programming for (conditional) mutual...
In miic: Learning Causal or Non-Causal Graphical Models Using Information Theory

discretizeMutual

R Documentation

Iterative dynamic programming for (conditional) mutual information through optimized discretization.

Description

This function chooses cutpoints in the input distributions by maximizing the mutual information minus a complexity cost (computed as BIC or with the Normalized Maximum Likelihood). The (conditional) mutual information computed on the optimized discretized distributions effectively estimates the mutual information of the original continuous variables.

Usage

discretizeMutual(
  x,
  y,
  matrix_u = NULL,
  maxbins = NULL,
  cplx = "nml",
  n_eff = NULL,
  sample_weights = NULL,
  is_continuous = NULL,
  plot = TRUE
)

Arguments

`x`	[a vector] The `X` vector that contains the observational data of the first variable.
`y`	[a vector] The `Y` vector that contains the observational data of the second variable.
`matrix_u`	[a numeric matrix] The matrix of observations of the conditioning variables, with one column per conditioning variable.
`maxbins`	[an integer] The maximum number of bins desired in the discretization. A lower number makes the computation faster; a higher number allows a finer discretization (default: 5 * cube root of N).
`cplx`	[a string] The complexity cost used in the dynamic programming: "bic" for the Bayesian Information Criterion, or "nml" for the Normalized Maximum Likelihood, which gives a more accurate complexity cost than BIC, especially for small sample sizes.
`n_eff`	[an integer] The effective number of samples. When there is significant autocorrelation between successive samples, you may want to specify an effective number of samples that is lower than the total number of samples.
`sample_weights`	[a vector of floats] Individual weights for each sample, used for the same purpose as `n_eff` but specified per sample.
`is_continuous`	[a vector of booleans] Specify if each variable is to be treated as continuous (TRUE) or discrete (FALSE) in a logical vector of length ncol(matrix_u) + 2, in the order [X, Y, U1, U2...]. By default, factors and character vectors are treated as discrete, and numerical vectors as continuous.
`plot`	[a boolean] Specify whether the resulting XY optimum discretization is to be plotted (requires 'ggplot2' and 'gridExtra').
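
As an illustration of the `maxbins` and `is_continuous` arguments described above, the following base R sketch builds both by hand. It is not taken from the package: `is_discrete` is a hypothetical helper, the data are simulated, and how the package rounds the default `maxbins` is not specified here, so the value is left unrounded.

```r
set.seed(7)
n <- 1000
x <- rnorm(n)                                         # numeric, so continuous by default
y <- factor(sample(c("a", "b"), n, replace = TRUE))   # factor, so discrete by default
matrix_u <- matrix(runif(2 * n), ncol = 2)            # two conditioning variables

# Documented default scale for maxbins: 5 * cube root of N (left unrounded).
maxbins_default <- 5 * n^(1 / 3)

# Hypothetical helper mirroring the documented default rule:
# factors and character vectors are discrete, numeric vectors are continuous.
is_discrete <- function(v) is.factor(v) || is.character(v)

# One flag per variable, in the order [X, Y, U1, U2, ...].
is_continuous <- c(!is_discrete(x), !is_discrete(y), rep(TRUE, ncol(matrix_u)))

print(maxbins_default)
print(is_continuous)
```

The resulting logical vector has length ncol(matrix_u) + 2 and could be passed as `is_continuous` when the default type detection needs to be overridden, for example for integer-coded categories.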

Details

For a pair of continuous variables X and Y, the algorithm iteratively chooses cutpoints on X, then on Y, maximizing I(X_{d};Y_{d}) - cplx(X_{d};Y_{d}), where cplx(X_{d};Y_{d}) is the complexity cost of the considered discretizations of X and Y (see Cabeli et al. 2020). Upon convergence, the discretization scheme of X_{d} and Y_{d} is returned, together with I(X_{d};Y_{d}) and I(X_{d};Y_{d}) - cplx(X_{d};Y_{d}).
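
The score being maximized can be illustrated for one fixed pair of discretizations. The following is a minimal base R sketch, not the package's implementation: `plugin_mi` and `bic_cost` are hypothetical helper names, the cutpoints are quartiles chosen by hand rather than optimized, and the per-sample BIC penalty 0.5 * (r_x - 1) * (r_y - 1) * log(N) / N (with the information in nats) is an assumption made for illustration.

```r
# Plug-in mutual information (in nats) of two discrete vectors.
plugin_mi <- function(a, b) {
  p_ab <- table(a, b) / length(a)      # joint frequencies
  p_a <- rowSums(p_ab)                 # marginal of a
  p_b <- colSums(p_ab)                 # marginal of b
  nz <- p_ab > 0                       # skip empty cells (0 * log 0 = 0)
  sum(p_ab[nz] * log(p_ab[nz] / outer(p_a, p_b)[nz]))
}

# Assumed BIC-style complexity cost per sample for an r_x by r_y table.
bic_cost <- function(r_x, r_y, n) {
  0.5 * (r_x - 1) * (r_y - 1) * log(n) / n
}

set.seed(1)
n <- 1000
x <- runif(n)
y <- x + rnorm(n, sd = 0.1)

# Hand-picked quartile cutpoints (hypothetical); the algorithm optimizes these.
probs <- seq(0, 1, length.out = 5)
x_d <- cut(x, breaks = quantile(x, probs = probs), include.lowest = TRUE)
y_d <- cut(y, breaks = quantile(y, probs = probs), include.lowest = TRUE)

info <- plugin_mi(x_d, y_d)
infok <- info - bic_cost(nlevels(x_d), nlevels(y_d), n)
message("I(X_d;Y_d) = ", info, ", I - cplx = ", infok)
```

Adding bins can only increase the plug-in information term, while the complexity term grows with the number of bins, which is why the difference rather than the raw information is maximized.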

With a set of conditioning variables U, the discretization scheme maximizes each term of the sum I(X;Y|U) ~ 0.5 * (I(X_{d};Y_{d},U_{d}) - I(X_{d};U_{d}) + I(Y_{d};X_{d},U_{d}) - I(Y_{d};U_{d})).
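
The symmetrized sum above can be checked numerically on variables that are already discrete. The sketch below uses base R only and simulated data; `plugin_mi` is a hypothetical helper, not a package function. For one fixed discretization, the chain rule makes each half of the sum equal to I(X_{d};Y_{d}|U_{d}), so the sum matches the direct stratified estimate. In the function itself the terms would presumably differ, since each is maximized over its own cutpoints.

```r
# Plug-in mutual information (in nats) between two discrete vectors.
plugin_mi <- function(a, b) {
  p_ab <- table(a, b) / length(a)
  nz <- p_ab > 0                       # skip empty cells (0 * log 0 = 0)
  sum(p_ab[nz] * log(p_ab[nz] / outer(rowSums(p_ab), colSums(p_ab))[nz]))
}

set.seed(42)
n <- 2000
u_d <- sample(1:3, n, replace = TRUE)
x_d <- (u_d + sample(0:1, n, replace = TRUE)) %% 3
y_d <- (x_d + u_d + sample(0:1, n, replace = TRUE)) %% 2

# Joint variables (Y_d, U_d) and (X_d, U_d) encoded as single discrete variables.
yu <- interaction(y_d, u_d, drop = TRUE)
xu <- interaction(x_d, u_d, drop = TRUE)

# Symmetrized sum from the text.
cmi_sym <- 0.5 * (plugin_mi(x_d, yu) - plugin_mi(x_d, u_d) +
                  plugin_mi(y_d, xu) - plugin_mi(y_d, u_d))

# Direct definition: weighted average over the strata of U_d of I(X_d;Y_d | U_d = u).
cmi_direct <- sum(sapply(split(seq_len(n), u_d), function(idx) {
  length(idx) / n * plugin_mi(x_d[idx], y_d[idx])
}))

message("symmetrized = ", cmi_sym, ", direct = ", cmi_direct)
```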

Discrete variables can be passed as factors and will be used "as is" to maximize each term.

Value

A list that contains:

cutpoints1 and cutpoints2, two vectors containing the cutpoints for each variable: cutpoints1 corresponds to x, cutpoints2 corresponds to y.
n_iterations is the number of iterations performed before convergence of the (C)MI estimation.
iteration1, iteration2, ..., lists containing the cutpoint vectors for each iteration.
info and infok, the estimated (C)MI value and (C)MI minus the complexity cost.
if plot == TRUE, a plot object (requires ggplot2 and gridExtra).
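
The returned cutpoints could be used to bin the original data, for instance with cut(). The sketch below is an assumption-laden illustration in base R: `res` is a hand-made stand-in for the returned list (only the field name cutpoints1 comes from the description above), and since it is not stated here whether the cutpoint vectors include the range boundaries, the observed minimum and maximum are added and duplicates removed.

```r
set.seed(3)
n <- 500
x <- runif(n)

# Hypothetical stand-in for the list returned by discretizeMutual();
# the cutpoint values are invented for illustration.
res <- list(cutpoints1 = c(0.25, 0.5, 0.75))

# Add the observed range so that every sample falls in a bin, then drop duplicates.
breaks_x <- sort(unique(c(min(x), res$cutpoints1, max(x))))
x_d <- cut(x, breaks = breaks_x, include.lowest = TRUE)

print(table(x_d))
```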

References

Cabeli et al., PLoS Comput. Biol. 2020, Learning clinical networks from medical records based on information estimates in mixed-type data

Examples

library(miic)
N <- 1000
# Dependence, conditional independence : X <- Z -> Y
Z <- runif(N)
X <- Z * 2 + rnorm(N, sd = 0.2)
Y <- Z * 2 + rnorm(N, sd = 0.2)
res <- discretizeMutual(X, Y, plot = FALSE)
message("I(X;Y) = ", res$info)
res <- discretizeMutual(X, Y, matrix_u = matrix(Z, ncol = 1), plot = FALSE)
message("I(X;Y|Z) = ", res$info)


# Conditional independence with categorical conditioning variable : X <- Z -> Y
Z <- sample(1:3, N, replace = TRUE)
X <- -as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N)
Y <- as.numeric(Z == 1) + as.numeric(Z == 2) + 0.2 * rnorm(N)
res <- miic::discretizeMutual(X, Y, cplx = "nml")
message("I(X;Y) = ", res$info)
res <- miic::discretizeMutual(X, Y, matrix(Z, ncol = 1), is_continuous = c(TRUE, TRUE, FALSE))
message("I(X;Y|Z) = ", res$info)


# Independence, conditional dependence : X -> Z <- Y
X <- runif(N)
Y <- runif(N)
Z <- X + Y + rnorm(N, sd = 0.1)
res <- discretizeMutual(X, Y, plot = TRUE)
message("I(X;Y) = ", res$info)
res <- discretizeMutual(X, Y, matrix_u = matrix(Z, ncol = 1), plot = TRUE)
message("I(X;Y|Z) = ", res$info)

miic documentation built on Sept. 18, 2024, 1:07 a.m.