miic: MIIC, causal network learning algorithm including latent...

View source: R/miic.R

Description

MIIC (Multivariate Information based Inductive Causation) combines constraint-based and information-theoretic approaches to disentangle direct from indirect effects amongst correlated variables, including cause-effect relationships and the effect of unobserved latent causes.

Usage

miic(
  input_data,
  state_order = NULL,
  true_edges = NULL,
  black_box = NULL,
  n_threads = 1,
  cplx = c("nml", "mdl"),
  orientation = TRUE,
  ori_proba_ratio = 1,
  propagation = TRUE,
  latent = c("no", "yes", "orientation"),
  n_eff = -1,
  n_shuffles = 0,
  conf_threshold = 0,
  sample_weights = NULL,
  test_mar = TRUE,
  consistent = c("no", "orientation", "skeleton"),
  max_iteration = 100,
  consensus_threshold = 0.8,
  verbose = FALSE
)

Arguments

input_data

[a data frame] An n*d data frame (n samples, d variables) that contains the observational data. Each column corresponds to one variable and each row is a sample that gives the values for all the observed variables. The column names correspond to the names of the observed variables. Numeric columns will be treated as continuous values, factor and character columns as categorical.
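As a small illustration of the column-type rule above, the following sketch builds a toy data frame and uses base R to show which columns match the numeric (continuous) versus factor / character (categorical) description. The column names and values are invented for illustration, and the code only inspects column classes; it does not call miic.

```r
# Toy data: names and values are hypothetical, for illustration only.
toy_data <- data.frame(
  expression = c(0.12, 1.50, 0.87, 2.31),      # numeric
  cell_type  = factor(c("A", "B", "A", "C")),  # factor
  batch      = c("b1", "b1", "b2", "b2"),      # character
  stringsAsFactors = FALSE
)

# Classify each column following the rule stated above:
# numeric -> continuous, anything else (factor, character) -> categorical.
column_kind <- vapply(
  toy_data,
  function(col) if (is.numeric(col)) "continuous" else "categorical",
  character(1)
)
print(column_kind)
```

Checking column classes this way before a run can help catch integer-coded categories that would otherwise be read as continuous.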

state_order

[a data frame] An optional d*(2-3) data frame giving the order of the ordinal categorical variables. It will be used during post-processing to compute the signs of the edges using partial linear correlation. If specified, the data frame must have at least a "var_names" column, containing the names of each variable as specified by colnames(input_data). A "var_type" column may specify if each variable is to be considered as discrete (0) or continuous (1). And the "levels_increasing_order" column contains a single character string with all of the unique levels of the ordinal variable in increasing order, delimited by a comma. If the variable is categorical but not ordinal, the "levels_increasing_order" column may instead contain NA.
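The layout described above can be sketched in base R as follows. The variable names and levels are hypothetical, and using NA in "levels_increasing_order" for the continuous variable is an assumption (the text above only states NA for categorical, non-ordinal variables).

```r
# Hypothetical variables: "dose" is ordinal, "tissue" is categorical but
# not ordinal, "age" is continuous.
state_order <- data.frame(
  var_names = c("dose", "tissue", "age"),
  var_type = c(0, 0, 1),  # 0 = discrete, 1 = continuous
  # One comma-delimited string per ordinal variable, NA otherwise
  # (NA for the continuous variable is an assumption).
  levels_increasing_order = c("low,medium,high", NA, NA),
  stringsAsFactors = FALSE
)

# Recover the ordered levels of one ordinal variable from its string.
dose_levels <- strsplit(
  state_order$levels_increasing_order[state_order$var_names == "dose"],
  split = ",", fixed = TRUE
)[[1]]
print(dose_levels)
```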

true_edges

[a data frame] An optional E*2 data frame containing the E edges of the true graph for computing performance after the run.

black_box

[a data frame] An optional E*2 data frame containing E pairs of variables that will be considered as independent during the network reconstruction. In practice, these edges will not be included in the skeleton initialization and cannot be part of the final result. Variable names must correspond to the input_data data frame.
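A minimal sketch of such an E*2 data frame is shown below. The variable names are hypothetical, and the column names "from" and "to" are an assumption, since the text above only specifies the E*2 shape. The final check mirrors the requirement that every name must exist in input_data.

```r
# Hypothetical variable names standing in for colnames(input_data).
input_vars <- c("geneA", "geneB", "geneC", "geneD")

# Two pairs to be treated as independent; the column names are an
# assumption, only the E*2 shape comes from the description above.
black_box <- data.frame(
  from = c("geneA", "geneB"),
  to   = c("geneC", "geneD"),
  stringsAsFactors = FALSE
)

# Every name used in black_box should be a column of the input data.
unknown <- setdiff(unlist(black_box, use.names = FALSE), input_vars)
if (length(unknown) > 0) {
  stop("Unknown variables in black_box: ", paste(unknown, collapse = ", "))
}
```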

n_threads

[a positive integer] When set greater than 1, n_threads parallel threads will be used for computation. Make sure your compiler is compatible with OpenMP if you wish to use multithreading.

cplx

[a string; c("nml", "mdl")] Because the input dataset is finite, the 2-point and 3-point information measures must be shifted by a complexity term. These finite-size corrections can be based on the Minimum Description Length (MDL) criterion (set the option to "mdl"). In practice, the MDL complexity criterion tends to underestimate the relevance of edges connecting variables with many different categories, leading to the removal of true edges (false negatives). To avoid such biases with finite datasets, the (universal) Normalized Maximum Likelihood (NML) criterion can be used (set the option to "nml"). The default is "nml" (see Affeldt et al., UAI 2015).

orientation

[a boolean value] The miic network skeleton can be partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. The propagation procedure relies on probabilities (for more details, see Verny et al., PLoS Comp. Bio. 2017). If set to FALSE, the orientation step is not performed.

ori_proba_ratio

[a floating point between 0 and 1] The threshold used to accept an orientation based on its probability. For a given edge, let p > 0.5 denote the probability of orientation; the orientation is accepted if (1 - p) / p < ori_proba_ratio. A value of 0 rejects all orientations; a value of 1 accepts all orientations.
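The acceptance rule above can be written out directly. The helper name accept_orientation is hypothetical and is not part of the miic package; the sketch only evaluates the stated inequality for a few example probabilities.

```r
# Hypothetical helper (not a miic function): applies the rule
# (1 - p) / p < ori_proba_ratio to orientation probabilities p > 0.5.
accept_orientation <- function(p, ori_proba_ratio) {
  stopifnot(all(p > 0.5 & p <= 1),
            ori_proba_ratio >= 0, ori_proba_ratio <= 1)
  (1 - p) / p < ori_proba_ratio
}

p <- c(0.55, 0.75, 0.95)
accept_all  <- accept_orientation(p, 1)    # ratio 1: every orientation passes
accept_none <- accept_orientation(p, 0)    # ratio 0: no orientation passes
accept_half <- accept_orientation(p, 0.5)  # only p = 0.75 and p = 0.95 pass
print(accept_half)
```

With ori_proba_ratio = 0.5, an orientation is accepted only when p exceeds 2/3, which is why the first probability in the example is rejected.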

propagation

[a boolean value] If set to FALSE, the skeleton is partially oriented with only the v-structure orientations. Otherwise, the v-structure orientations are propagated to downstream undirected edges in unshielded triples following the orientation method.

latent

[a string; c("no", "yes", "orientation")] When set to "yes", the network reconstruction takes hidden (latent) variables into account. When set to "orientation", latent variables are not considered during the skeleton reconstruction, but bi-directed edges are allowed during the orientation step. Dependence between two observed variables due to a latent variable is indicated with a '6' in the adjacency matrix and in the network edges.summary, and by a bi-directed edge in the (partially) oriented graph.

n_eff

[a positive integer] The n samples given in the input_data data frame are expected to be independent. In case of correlated samples such as in time series or Monte Carlo sampling approaches, the effective number of independent samples n_eff can be estimated using the decay of the autocorrelation function (Verny et al., PLoS Comp. Bio. 2017). This effective number n_eff of independent samples can be provided using this parameter.
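As a rough illustration of how an effective sample size might be obtained from autocorrelation, the sketch below uses the common first-order approximation n * (1 - r1) / (1 + r1), where r1 is the lag-1 autocorrelation. This is a textbook approximation for AR(1)-like series and is an assumption here; it is not necessarily the estimator used in Verny et al. The simulated series is also purely illustrative.

```r
set.seed(1)
n <- 2000
noise <- rnorm(n)

# Simulate a correlated series: x[t] = 0.8 * x[t - 1] + noise[t].
x <- as.numeric(stats::filter(noise, filter = 0.8, method = "recursive"))

# Lag-1 autocorrelation (element 1 of $acf is lag 0, element 2 is lag 1).
r1 <- stats::acf(x, lag.max = 1, plot = FALSE)$acf[2]

# First-order approximation of the effective number of independent samples.
n_eff_estimate <- max(1L, as.integer(round(n * (1 - r1) / (1 + r1))))
print(c(n = n, n_eff = n_eff_estimate))
```

The resulting integer could then be passed as n_eff in place of the default.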

n_shuffles

[a positive integer] The number of shufflings of the original dataset used to evaluate the edge-specific confidence ratio of all inferred edges.

conf_threshold

[a positive floating point] The threshold used to filter the less probable edges following the skeleton step. See Verny et al., PLoS Comp. Bio. 2017.

sample_weights

[a numeric vector] An optional vector containing the weight of each observation.

test_mar

[a boolean value] If set to TRUE, distributions with missing values will be tested with the Kullback-Leibler divergence: a conditioning variable Z for the given link X → Y will be considered only if the divergence between the full distribution and the non-missing distribution, KL(P(X,Y) | P(X,Y)_{!NA}), is low enough (with P(X,Y)_{!NA} the joint distribution of X and Y on samples that are not missing on Z). This is a way to ensure that data are missing at random for the considered interaction and to avoid selection bias. Set to TRUE by default.
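To make the quantity KL(P(X,Y) | P(X,Y)_{!NA}) concrete, the sketch below computes it in base R for two toy missingness patterns. The function names and data are hypothetical, and this is only an illustration of the divergence itself, not of how miic implements the test or chooses what counts as low enough.

```r
# KL divergence between two discrete distributions given as aligned vectors.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Hypothetical helper: divergence between the joint distribution of (x, y)
# on all samples and on the samples where z is observed.
joint_kl_given_missing <- function(x, y, z) {
  x <- factor(x)
  y <- factor(y)
  observed <- !is.na(z)
  p_full <- prop.table(table(x, y))
  p_obs  <- prop.table(table(x[observed], y[observed]))
  kl_divergence(as.vector(p_full), as.vector(p_obs))
}

# Toy data: the four (x, y) combinations are equally frequent.
x <- rep(c("a", "b"), times = 50)
y <- rep(c("u", "u", "v", "v"), times = 25)

# Case 1: z is missing on a block that contains each combination equally.
z_random <- rep(1, 100)
z_random[1:20] <- NA

# Case 2: z is missing mostly when x == "a" and y == "u" (selection bias).
z_biased <- rep(1, 100)
z_biased[which(x == "a" & y == "u")[1:20]] <- NA

kl_random <- joint_kl_given_missing(x, y, z_random)
kl_biased <- joint_kl_given_missing(x, y, z_biased)
print(c(random = kl_random, biased = kl_biased))
```

In the first case the two distributions coincide and the divergence is zero; in the second the missingness depends on X and Y, so the divergence is clearly positive.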

consistent

[a string; c("no", "orientation", "skeleton")] If "orientation": iterate over the skeleton and orientation steps to ensure consistency of the network. If "skeleton": iterate over the skeleton step to get a consistent skeleton, then orient edges and discard inconsistent orientations to ensure consistency of the network. See Li et al., NeurIPS 2019 for details.

max_iteration

[a positive integer] When the consistent parameter is set to "skeleton" or "orientation", the maximum number of iterations allowed when trying to find a consistent graph. Set to 100 by default.

consensus_threshold

[a floating point between 0.5 and 1.0] When the consistent parameter is set to "skeleton" or "orientation", and the resulting graph is inconsistent or is a union of more than one inconsistent graph, a consensus graph is produced from a pool of graphs. If the resulting graph is inconsistent, the pool consists of the [max_iteration] graphs from the iterations; otherwise it consists of the graphs in the union. In the consensus graph, the status of each edge is determined as follows: choose the most probable (most frequent) status in the pool. For example, if the pool contains [A, B, B, B, C], choose status B. If the frequency of B (0.6 in the example) is equal to or higher than [consensus_threshold], B is set as the status of the edge in the consensus graph; otherwise the edge is set to undirected. Set to 0.8 by default.
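The per-edge consensus rule above can be sketched as a small base R function. The function name consensus_status and the label "undirected" used for the fallback are hypothetical; the pool reuses the [A, B, B, B, C] example from the description.

```r
# Hypothetical helper (not a miic function): pick the most frequent status
# in the pool and keep it only if its frequency reaches the threshold.
consensus_status <- function(pool, consensus_threshold = 0.8,
                             fallback = "undirected") {
  freq <- table(pool) / length(pool)
  best <- names(freq)[which.max(freq)]
  if (freq[[best]] >= consensus_threshold) best else fallback
}

pool <- c("A", "B", "B", "B", "C")  # B has frequency 0.6

strict  <- consensus_status(pool, consensus_threshold = 0.8)
lenient <- consensus_status(pool, consensus_threshold = 0.5)
print(c(strict = strict, lenient = lenient))
```

With the default threshold of 0.8 the frequency of B (0.6) is too low and the edge falls back to undirected, whereas a threshold of 0.5 keeps status B.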

verbose

[a boolean value] If TRUE, debugging output is printed.

Details

Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature of causality in observational data.

The method relies on an information-theoretic (conditional) independence test, which is described in (Verny et al., PLoS Comp. Bio. 2017; Cabeli et al., PLoS Comp. Bio. 2020). It handles both categorical and continuous variables by performing optimal context-dependent discretization. As such, the input data frame may contain both numerical columns, which will be treated as continuous, and character / factor columns, which will be treated as categorical. For further details on the optimal discretization method and the conditional independence test, see the function discretizeMutual. The user may also choose to run miic with the scheme presented in (Li et al., NeurIPS 2019) to improve the interpretability of the end result by ensuring consistent separating sets during the skeleton iterations.

Value

A miic-like object that contains:

References

See Also

discretizeMutual for optimal discretization and (conditional) independence test.

Examples

library(miic)

# EXAMPLE HEMATOPOIESIS
data(hematoData)

# execute MIIC (reconstruct graph)
miic.res <- miic(
  input_data = hematoData[1:1000,], latent = "yes",
  n_shuffles = 10, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic.res, method="igraph")
}


# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).

miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

# EXAMPLE CANCER
data(cosmicCancer)
data(cosmicCancer_stateOrder)
# execute MIIC (reconstruct graph)
miic.res <- miic(
  input_data = cosmicCancer, state_order = cosmicCancer_stateOrder, latent = "yes",
  n_shuffles = 100, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic.res)
}

# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

# EXAMPLE OHNOLOGS
data(ohno)
data(ohno_stateOrder)
# execute MIIC (reconstruct graph)
miic.res <- miic(
  input_data = ohno, latent = "yes", state_order = ohno_stateOrder,
  n_shuffles = 100, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic.res)
}

# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

miic documentation built on Jan. 13, 2021, 10:34 a.m.
