computeThreePointInfo: Compute (conditional) three-point information

View source: R/computeInformation.R

computeThreePointInfo    R Documentation

Compute (conditional) three-point information

Description

Three-point information is defined and computed as the difference between mutual information and conditional mutual information, e.g.

I(X;Y;Z|U) = I(X;Y|U) - I(X;Y|U,Z)

For discrete or categorical variables, the three-point information is computed with the empirical frequencies minus a complexity cost (computed as BIC or with the Normalized Maximum Likelihood).
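As a minimal sketch of the definition above for discrete data, the following base R code computes the plug-in estimate of I(X;Y;Z) from empirical frequencies only. The helper names (entropy, mutual_info, cond_mutual_info, three_point_info) are hypothetical and not part of miic. No complexity cost is subtracted, and the package's own estimator may differ in its details.

```r
# Plug-in (empirical-frequency) estimate of I(X;Y;Z) for discrete data.
# All helper names are hypothetical; no complexity cost is applied.

# Shannon entropy (in nats) of the joint distribution of the given variables
entropy <- function(...) {
  counts <- table(...)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mutual_info <- function(x, y) {
  entropy(x) + entropy(y) - entropy(x, y)
}

# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
cond_mutual_info <- function(x, y, z) {
  entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)
}

# I(X;Y;Z) = I(X;Y) - I(X;Y|Z)
three_point_info <- function(x, y, z) {
  mutual_info(x, y) - cond_mutual_info(x, y, z)
}

# Toy data: Z is the XOR of X and Y, so X and Y are independent
# but become dependent once Z is known
x <- c(0, 0, 1, 1)
y <- c(0, 1, 0, 1)
z <- as.integer(xor(x, y))
i3_xor <- three_point_info(x, y, z)

# Toy data: X and Y are both copies of Z, so their dependence
# vanishes once Z is known
z2 <- c(0, 0, 1, 1)
i3_copy <- three_point_info(z2, z2, z2)
```

In the XOR case I(X;Y) is zero while I(X;Y|Z) is log(2), so the difference is negative. In the copy case I(X;Y) is log(2) while I(X;Y|Z) is zero, so the difference is positive.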

Usage

computeThreePointInfo(
  x,
  y,
  z,
  df_conditioning = NULL,
  maxbins = NULL,
  cplx = c("nml", "bic"),
  n_eff = -1,
  sample_weights = NULL,
  is_continuous = NULL
)

Arguments

x

[a vector] The X vector that contains the observational data of the first variable.

y

[a vector] The Y vector that contains the observational data of the second variable.

z

[a vector] The Z vector that contains the observational data of the third variable.

df_conditioning

[a data frame] The data frame of the observations of the set of conditioning variables U.

maxbins

[an integer] When the data contain continuous variables, the maximum number of bins allowed during the discretization. A smaller number makes the computation faster, a larger number allows finer discretization.

cplx

[a string] The complexity model:

  • ["bic"] Bayesian Information Criterion

  • ["nml"] Normalized Maximum Likelihood, a more accurate complexity cost than BIC, especially for small sample sizes.

n_eff

[an integer] The effective number of samples. When there is significant autocorrelation between successive samples, you may want to specify an effective number of samples that is lower than the total number of samples.

sample_weights

[a vector of floats] Individual weights for each sample. They serve the same purpose as the effective number of samples, but are set per sample.

is_continuous

[a vector of booleans] Specifies whether each variable is to be treated as continuous (TRUE) or discrete (FALSE). Must be of length 'ncol(df_conditioning) + 3', in the order X, Y, Z, U1, U2, .... If not specified, factors and character vectors are considered discrete, and numerical vectors continuous.
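The following sketch shows one way the arguments above might be combined when a conditioning set is present. The simulated data, the variable names (u1, u2, df_cond, is_cont) and the value maxbins = 20 are illustrative assumptions, not recommendations. The call to computeThreePointInfo is guarded so that the snippet also runs when miic is not installed.

```r
set.seed(1)
n <- 500

# Illustrative conditioning variables: one discrete, one continuous
u1 <- factor(sample(c("a", "b"), n, replace = TRUE))
u2 <- runif(n)

# Illustrative data: X and Y both depend on Z, which depends on U2
z <- u2 + rnorm(n, sd = 0.2)
x <- z + rnorm(n, sd = 0.2)
y <- z + rnorm(n, sd = 0.2)

df_cond <- data.frame(U1 = u1, U2 = u2)

# One flag per variable, in the order X, Y, Z, U1, U2
is_cont <- c(TRUE, TRUE, TRUE, FALSE, TRUE)
stopifnot(length(is_cont) == ncol(df_cond) + 3)

if (requireNamespace("miic", quietly = TRUE)) {
  res <- miic::computeThreePointInfo(
    x, y, z,
    df_conditioning = df_cond,
    maxbins = 20,            # assumed value, chosen only for illustration
    cplx = "nml",
    is_continuous = is_cont
  )
  message("Ik(X;Y;Z|U) = ", res$i3k)
}
```

Passing is_continuous explicitly is optional here, since the default rule would already treat the factor U1 as discrete and the numeric vectors as continuous.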

Details

For variables X, Y, Z and a set of conditioning variables U, the conditional three-point information is defined as

Ik(X;Y;Z|U) = Ik(X;Y|U) - Ik(X;Y|U,Z)

where Ik is the shifted or regularized conditional mutual information. See computeMutualInfo for the definition of Ik.

Value

A list that contains:

  • i3: The estimation of (conditional) three-point information without the complexity cost.

  • i3k: The estimation of (conditional) three-point information with the complexity cost (i3k = i3 - cplx).

  • i2: For reference, the estimation of (conditional) mutual information I(X;Y|U) used in the estimation of i3.

  • i2k: For reference, the estimation of regularized (conditional) mutual information Ik(X;Y|U) used in the estimation of i3k.
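Since the returned list satisfies i3k = i3 - cplx, the complexity cost can be recovered from the two three-point estimates. The helper complexity_cost below is a hypothetical convenience function, not part of miic, and the simulated data are for illustration only. The call to computeThreePointInfo is guarded so that the snippet also runs when miic is not installed.

```r
# Hypothetical helper: recover the complexity cost from a result list,
# using the documented relation i3k = i3 - cplx
complexity_cost <- function(res) {
  res$i3 - res$i3k
}

if (requireNamespace("miic", quietly = TRUE)) {
  set.seed(1)
  n <- 1000
  z <- runif(n)
  x <- z * 2 + rnorm(n, sd = 0.2)
  y <- z * 2 + rnorm(n, sd = 0.2)
  res <- miic::computeThreePointInfo(x, y, z)
  message("i3   = ", res$i3)
  message("i3k  = ", res$i3k)
  message("cplx = ", complexity_cost(res))
  message("i2   = ", res$i2)
  message("i2k  = ", res$i2k)
}
```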

Examples

library(miic)
N <- 1000
# Dependence, conditional independence: X <- Z -> Y
Z <- runif(N)
X <- Z * 2 + rnorm(N, sd = 0.2)
Y <- Z * 2 + rnorm(N, sd = 0.2)
res <- computeThreePointInfo(X, Y, Z)
message("I(X;Y;Z) = ", res$i3)
message("Ik(X;Y;Z) = ", res$i3k)


# Independence, conditional dependence: X -> Z <- Y
X <- runif(N)
Y <- runif(N)
Z <- X + Y + rnorm(N, sd = 0.1)
res <- computeThreePointInfo(X, Y, Z)
message("I(X;Y;Z) = ", res$i3)
message("Ik(X;Y;Z) = ", res$i3k)


miic documentation built on Sept. 18, 2024, 1:07 a.m.