miic | R Documentation |
MIIC (Multivariate Information-based Inductive Causation) combines constraint-based and information-theoretic approaches to disentangle direct from indirect effects amongst correlated variables, including cause-effect relationships and the effect of unobserved latent causes.
miic(
input_data,
state_order = NULL,
true_edges = NULL,
black_box = NULL,
n_threads = 1,
cplx = "nml",
orientation = TRUE,
ort_proba_ratio = 1,
ort_consensus_ratio = NULL,
propagation = FALSE,
latent = "orientation",
n_eff = -1,
n_shuffles = 0,
conf_threshold = 0,
sample_weights = NULL,
test_mar = TRUE,
consistent = "no",
max_iteration = 100,
consensus_threshold = 0.8,
negative_info = FALSE,
mode = "S",
n_layers = NULL,
delta_t = NULL,
mov_avg = NULL,
keep_max_data = FALSE,
max_nodes = 50,
verbose = FALSE
)
input_data |
[a data frame, required] A n*d data frame (n samples, d variables) that contains the observational data. In standard mode, each column corresponds to one variable and each row is a sample that gives the values for all the observed variables. The column names correspond to the names of the observed variables. Numeric columns with at least 5 distinct values will be treated as continuous by default whilst numeric columns with less than 5 distinct values, factors and characters will be considered as categorical. In temporal mode, the expected data frame layout is variables as columns and time series/time steps as rows. The time step information must be supplied in the first column and, for each time series, be consecutive and in ascending order (increment of 1). Multiple trajectories can be provided, miic will consider that a new trajectory starts each time a smaller time step than the one of the previous row is encountered. |
state_order |
[a data frame, optional, NULL by default] A data frame providing extra information for variables. It must have d rows where d is the number of input variables and possible columns are described below. For optional columns, if they are not provided or contain missing values, default values suitable for input_data will be used. "var_names" (required) contains the name of each variable as specified by colnames(input_data). In temporal mode, the time steps column should not be mentioned in the variables list. "var_type" (optional) contains a binary value that specifies if each variable is to be considered as discrete (0) or continuous (1). "levels_increasing_order" (optional) contains a single character string with all of the unique levels of the ordinal variable in increasing order, delimited by comma ','. It will be used during the post-processing to compute the sign of an edge using Spearman's rank correlation. If a variable is continuous or is categorical but not ordinal, this column should be NA. "is_contextual" (optional) contains a binary value that specifies if a variable is to be considered as a contextual variable (1) or not (0). Contextual variables cannot be the child node of any other variable (cannot have edge with arrowhead pointing to them). "is_consequence" (optional) contains a binary value that specifies if a variable is to be considered as a consequence variable (1) or not (0). Edges between consequence variables are ignored, consequence variables cannot be the parent node of any other variable and cannot be used as contributors. Edges between a non consequence and consequence variables are pre-oriented toward the consequence. Several other columns are possible in temporal mode: "n_layers" (optional) contains an integer value that specifies the number of layers to be considered for the variable. Note that if a "n_layers" column is present in the state_order, its values will overwrite the function parameter. "delta_t" (optional) contains an integer value that specifies the number of time steps between each layer for the variable. Note that if a "delta_t" column is present in the state_order, its values will overwrite the function parameter. "mov_avg" (optional) contains an integer value that specifies the size of the moving average window to be applied to the variable. Note that if "mov_avg" column is present in the state_order, its values will overwrite the function parameter. |
true_edges |
[a data frame, optional, NULL by default] A data frame containing the edges of the true graph for
computing performance after the run. |
black_box |
[a data frame, optional, NULL by default] A data frame containing pairs of variables that will be considered
as independent during the network reconstruction. In practice, these edges
will not be included in the skeleton initialization and cannot be part of
the final result. |
n_threads |
[a positive integer, optional, 1 by default] When set greater than 1, n_threads parallel threads will be used for computation. Make sure your compiler is compatible with openmp if you wish to use multithreading. |
cplx |
[a string, optional, "nml" by default, possible values: "nml", "bic"] In practice, the finite size of the input dataset requires that the 2-point and 3-point information measures should be shifted by a complexity term. The finite size corrections can be based on the Bayesian Information Criterion (BIC). However, the BIC complexity term tends to underestimate the relevance of edges connecting variables with many different categories, leading to the removal of false negative edges. To avoid such biases with finite datasets, the (universal) Normalized Maximum Likelihood (NML) criterion can be used (see Affeldt 2015). |
orientation |
[a boolean value, optional, TRUE by default] The miic network skeleton can be partially directed by orienting edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples and, in temporal mode, using time. If set to FALSE, the orientation step is not performed. |
ort_proba_ratio |
[a floating point between 0 and 1, optional, 1 by default] The threshold when deducing the type of an edge tip (head/tail) from the probability of orientation. For a given edge tip, denote by p the probability of it being a head, the orientation is accepted if (1 - p) / p < ort_proba_ratio. 0 means reject all orientations, 1 means accept all orientations. |
ort_consensus_ratio |
[a floating point between 0 and 1, optional,
NULL by default]
Used to determine if orientations correspond to genuine causal edges
and, when consistency is activated, to deduce the orientations in
the consensus graph. |
propagation |
[a boolean value, optional, FALSE by default] If set to FALSE, the skeleton is partially oriented with only the v-structure orientations. Otherwise, the v-structure orientations are propagated to downstream un-directed edges in unshielded triples following the propagation procedure, relying on probabilities (for more details, see Verny 2017). |
latent |
[a string, optional, "orientation" by default, possible values: "orientation", "no", "yes"] When set to "yes", the network reconstruction is taking into account hidden (latent) variables. When set to "orientation", latent variables are not considered during the skeleton reconstruction but allows bi-directed edges during the orientation. Dependence between two observed variables due to a latent variable is indicated with a '6' in the adjacency matrix and in the network edges.summary and by a bi-directed edge in the (partially) oriented graph. |
n_eff |
[a positive integer, optional, -1 by default] In standard mode, the n samples given in the input_data data frame are expected to be independent. In case of correlated samples such as in Monte Carlo sampling approaches, the effective number of independent samples n_eff can be estimated using the decay of the autocorrelation function (see Verny 2017). This effective number n_eff of independent samples can be provided using this parameter. |
n_shuffles |
[a positive integer, optional, 0 by default] The number of shufflings of the original dataset in order to evaluate the edge specific confidence ratio of all retained edges. Default is 0: no confidence cut is applied. If the number of shufflings is set to an integer > 0, the confidence threshold must also be > 0 (e.g. n_shuffles = 100 and conf_threshold = 0.01). |
conf_threshold |
[a positive floating point, optional, 0 by default] The threshold used to filter the less probable edges following the skeleton step (see Verny 2017). Default is 0: no confidence cut is applied. If the confidence threshold is set > 0, the number of shufflings must also be > 0 (e.g. n_shuffles = 100 and conf_threshold = 0.01). |
sample_weights |
[a numeric vector, optional, NULL by default] An vector containing the weight of each observation. If defined, it must be a vector of floats in the range [0,1] of size equal to the number of samples. |
test_mar |
[a boolean value, optional, TRUE by default] If set to TRUE, distributions with missing values will be tested with
Kullback-Leibler divergence: conditioning variables for the given link
|
consistent |
[a string, optional, "no" by default, possible values: "no", "orientation", "skeleton"] If set to "orientation": iterate over skeleton and orientation steps to ensure consistency of the separating sets and all disconnected pairs in the final network. If set to "skeleton": iterate over skeleton step to get a consistent skeleton, then orient edges including inconsistent orientations (see Li 2019 for details). |
max_iteration |
[a positive integer, optional, 100 by default] When the consistent parameter is set to "skeleton" or "orientation", the maximum number of iterations allowed when trying to find a consistent graph. |
consensus_threshold |
[a floating point between 0.5 and 1.0, optional, 0.8 by default] When the consistent parameter is set to "skeleton" or "orientation" and when the result graph is inconsistent or is a union of more than one inconsistent graphs, a consensus graph will be produced based on a pool of graphs. If the result graph is inconsistent, then the pool is made of max_iteration graphs from the iterations, otherwise it is made of those graphs in the union. In the consensus graph, an edge is present when the proportion of non-zero status in the pool is above the threshold. For example, if the pool contains [A, B, B, 0, 0], where "A", "B" are different status of the edge and "0" indicates the absence of the edge. Then the edge is set to connected ("1") if the proportion of non-zero status (0.6 in the example) is equal to or higher than consensus_threshold. (When set to connected, the orientation of the edge will be further determined by the average probability of orientation.) |
negative_info |
[a boolean value, optional, FALSE by default] If TRUE, negative shifted mutual information is allowed during the computation when mutual information is inferior to the complexity term. For small dataset with complicated structures, e.g. discrete variables with many levels, allowing for negative shifted mutual information may help identifying weak v-structures related to those discrete variables, as the negative three-point information in those cases will come from the difference between two negative shifted mutual information terms (expected to be negative due to the small sample size). However, under this setting, a v-structure (X -> Z <- Y) in the final graph does not necessarily imply that X is dependent on Y conditioning on Z, As a consequence, the reliability of certain orientations is not guaranteed. By contrast, keeping this parameter as FALSE is more conservative and leads to more reliable orientations (see Cabeli 2021 and Ribeiro-Dantas 2024). |
mode |
[a string, optional, "S" by default, possible values are "S": Standard (non temporal data) or "TS": Temporal Stationary data] When temporal mode is activated, the time information must be provided in the first column of input_data. For more details about temporal stationary mode (see Simon 2024). |
n_layers |
[an integer, optional, NULL by default, must be >= 2 if supplied] Used only in temporal mode, n_layers defines the number of layers that will be considered for the variables in the time unfolded graph. The layers will be distant of delta_t time steps. If not supplied, the number of layers is estimated from the dynamic of the dataset and the maximum number of nodes max_nodes allowed in the final lagged graph. |
delta_t |
[an integer, optional, NULL by default, must be >= 1 if supplied] Used only in temporal mode, delta_t defines the number of time steps between each layer. i.e. on 1000 time steps with n_layers = 3 and delta_t = 7, the time steps kept for the samples conversion will be 1, 8, 15 for the first sample, the next sample will use 2, 9, 16 and so on. If not supplied, the number of time steps between layers is estimated from the dynamic of the dataset and the number of layers. |
mov_avg |
[an integer, optional, NULL by default, must be >= 2 if supplied] Used only in temporal mode. When supplied, a moving average operation is applied to all integer and numeric variables that are not contextual variables. |
keep_max_data |
[a boolean value, optional, FALSE by default] Used only in temporal mode. If TRUE, rows where some NAs have been introduced during the moving averages and lagging will be kept whilst they will be dropped if FALSE. |
max_nodes |
[an integer, optional, 50 by default] Used only in temporal mode and if the n_layers or delta_t parameters are not supplied. max_nodes is used as the maximum number of nodes in the final time-unfolded graph to compute n_layers and/or delta_t. The default is 50 to produce quick runs and can be increased up to 200 or 300 on recent computers to produce more precise results. |
verbose |
[a boolean value, optional, FALSE by default] If TRUE, debugging output is printed. |
Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature of causality in observational data. Miic distinguishes genuine causal edges (with both reliable arrow heads and tails) from putative causal edges (with one reliable arrow head only) and latent causal edges (with both reliable arrow heads). (see Ribeiro-Dantas 2024)
In temporal mode, miic reorganizes the dataset using the n_layers and delta_t parameters to transform the time steps into lagged samples. As starting point, a lagged graph is created with only edges having at least one node laying on the last time step. Then, miic standard algorithm is applied to remove dispensable edges. The remaining edges are then duplicated to ensure time invariance (stationary dynamic) and oriented using the temporality and the signature of causality in observational data. The use of temporal mode is presented in Simon 2024.
The method relies on information theoretic principles which replace (conditional) independence tests as described in Affeldt 2015, Cabeli 2020, Cabeli 2021 and Ribeiro-Dantas 2024. It deals with both categorical and continuous variables by performing optimal context-dependent discretization. As such, the input data frame may contain both numerical columns which will be treated as continuous, or character / factor columns which will be treated as categorical. For further details on the optimal discretization method and the conditional independence test, see the function discretizeMutual. The user may also choose to run miic with scheme presented in Li 2019 and Ribeiro-Dantas 2024 to improve the end result's interpretability by ensuring consistent separating sets.
A miic-like object that contains:
summary: a data frame with information about the relationship between relevant pair of variables.
As returning the information on all possible pairs of variables could lead to an huge data frame, by convention, the summary does not include pair of variables not sharing information at all (I'(x,y) <= 0). However, as exception to this convention, when a ground truth is supplied (using the true_edges parameter), the edges that are not retained by MIIC because the variables does not share information at all but are present in the true edges will be included in the summary to report correctly all the false negative edges.
So, the summary contains these categories of edges:
edges retained
edges not retained after conditioning on some contributor(s)
edges not retained without conditioning but present in true edges
while these edges are not considered as relevant and are not included:
edges not retained without conditioning and not in true edges
Information available in the summary are:
x: X node name
y: Y node name
type: contains 'N' if the edge has been removed or 'P' for retained edges. If the true graph is supplied in the true_edges parameter, 'P' becomes 'TP' (True Positive) or 'FP' (False Positive), while 'N' becomes 'TN' (True Negative) or 'FN' (False Negative). Note that, as the summary does not contain all the removed edges, edges not present have to be considered as 'N' and, if the true graph is supplied, as 'TN'.
ai: the contributing nodes found by the method which contribute to the mutual information between x and y, and possibly separate them.
raw_contributions: describes the share of total mutual information between x and y explained by each contributor, measured by I'(x;y;ai|{aj}) / I'(x;y), where {aj} is the separating set before adding ai.
contributions: describes the share of remaining mutual information between x and y explained by each successive contributors, measured by I'(x;y;ai|{aj}) / I'(x;y|{aj}), where {aj} is the separating set before adding ai.
info: the mutual information I(x;y) times n_xy, the number of samples without missing or NA values for both x and y.
n_xy: gives the number of samples on which the information without conditioning has been computed. If the input dataset has no missing value, the number of samples is the same for all pairs and corresponds to the total number of samples.
info_cond: the conditional mutual information I(x;y|ai) times the number of samples without NA n_xy_ai used in the computation. info_cond is equal to info when ai is an empty set.
cplx: the complexity term for the pair (x, y) taking into account the contributing nodes ai.
n_xy_ai: the number of samples without NA in x, y and all nodes in ai on which the information and the complexity terms are computed. If the input dataset has no missing value, the number of samples is the same for all pairs and corresponds to the total number of samples.
info_shifted: value equal to info_cond - cplx. Used to decide whether the edge is retained (when positive), or removed (when zero or possibly negative when the parameter negative_info is set to TRUE).
ort_inferred: the orientation of the edge (x, y).
0: edge removed, 1: un-directed, 2: directed from X to Y, -2: directed
from Y to X, 6: bi-directed.
When the consistent option is turned on and there is more than
one graph in the consistent cycle, this is the inferred orientation
of the edge in the last graph in the cycle.
ort_ground_truth: the orientation of the edge (x, y) in the ground truth graph when true edges are provided.
is_inference_correct: indicates if the inferred orientation agrees with the provided ground truth. TRUE: agrees, FALSE: disagrees and set to NA when no ground truth is supplied.
is_causal: boolean value indicating the causal nature of the
arrow tips of an edge, based on the probabilities given in the columns
p_y2x and p_x2y. TRUE: when the edges is directed
and both the head and the tail are set with high confidence
(adjustable with the ort_consensus_ratio parameter),
FALSE otherwise or NA if the edge is not retained.
More formally, an oriented edge is marked as genuine causal when
(1 - p_{head}) / p_{head} <
ort_consensus_ratio
and p_{tail} / (1 - p_{tail}) <
ort_consensus_ratio.
A directed edge not marked as genuine causal indicates that only
the head is set with high confidence, while the tail remains uncertain.
This corresponds to a putative causal edge, which could either be
a genuine causal edge or a bi-directed edge from a latent confounder.
Note that the genuine causality is deducible only when latent variables
are allowed and propagation is not allowed.
ort_consensus: Not computed (NAs) when consistency is not activated or, when consistency is on, if there is only one graph returned (no cycle). When computed, indicates the consensus orientation of the edge determined from the consensus skeleton and the ort_consensus_ratio threshold on averaged orientation probabilities over the cycle of graphs. Possible values are 0: not connected, 1: un-directed, -2 or 2: directed and 6: bi-directed (latent variable).
is_causal_consensus: Not computed (NAs) when consistency is not activated or, when consistency is on, if there is only one graph returned (no cycle). When computed, work in the same way as is_causal but on the consensus graph.
edge_stats: Not computed (NAs) when consistency is not activated or, when consistency is on, if there is only one graph returned (no cycle). When computed, contains the frequencies of all ort_inferred values present in the cycle of graphs for the edge (x, y), in the format [percentage(orientation)], separated by ";". e.g. In a cycle of 4 graphs, if an edge is three times marked as 2 (directed) and one time marked as 1 (un-directed), edge_stats will contain "75%(2);25%(1)".
sign: the sign of the partial correlation between variables x and y, conditioned on the contributing nodes ai.
partial_correlation: value of the partial correlation for the edge (x, y) conditioned on the contributing nodes ai.
p_y2x: probability of the arrowhead from y to x, of the inferred orientation, derived from the three-point mutual information (see Verny 2017 and Ribeiro-Dantas 2024). NA if the edge is removed.
p_x2y: probability of the arrowhead from x to y, of the inferred orientation, derived from the three-point mutual information (see Verny 2017 and Ribeiro-Dantas 2024). NA if the edge is removed.
confidence: computed only when the confidence cut is activated, NA otherwise. When computed, it corresponds to a measure of the strength of the retained edges: it is the ratio between the probability to reject the edge exp(-info_shifted(x;y|ai)) in the original dataset and the mean probability to do the same in n_shuffles number of randomized datasets. Edges with confidence > conf_threshold will be filtered out from the graph. (see parameters n_shuffles and conf_threshold)
edges: a data frame with the raw edges output coming from the C++ core function. This data frame is used internally by MIIC to produce the summary and contains all pairs of variables (x, y).
triples: this data frame lists the orientation probabilities of the two edges of all unshielded triples of the reconstructed network with the structure: node1 – mid-node – node2:
node1: node at the end of the unshielded triplet
p1: probability of the arrowhead node1 <- mid-node
p2: probability of the arrowhead node1 -> mid-node
mid-node: node at the center of the unshielded triplet
p3: probability of the arrowhead mid-node <- node2
p4: probability of the arrowhead mid-node -> node2
node2: node at the end of the unshielded triplet
ni3: 3 point (conditional) mutual information * N
conflict: indicates if there is a conflict between the computed probabilities and the ni3 value
adj_matrix: the adjacency matrix is a square matrix used to represent the inferred graph. The entries of the matrix indicate whether pairs of vertices are adjacent or not in the graph. The matrix can be read as a (row, column) set of couples where the row represents the source node and the column the target node. Since miic can reconstruct mixed networks (including directed, un-directed and bi-directed edges), we will have a different digit for each case:
1: (x, y) edge is un-directed
2: (x, y) edge is directed as x -> y
-2: (x, y) edge is directed as x <- y
6: (x, y) edge is bi-directed
proba_adj_matrix: the probability adjacency matrix is a square matrix used to represent the orientation probabilities associated to the edges tips. The value at ("row", "column") is the probability, for the edge between "row" and "column" nodes, of the edge tip on the "row" side. A probability less than 0.5 is an indication of a possible tail (cause) and a probability greater than 0.5 a possible head (effect).
adj_matrices: present only when consistency is activated. The list of the adjacency matrices, one for each graph which is part of the resulting cycle of graphs. Each item is a square matrix with the same layout as adj_matrix.
proba_adj_matrices: present only when consistency is activated. The list of the probability adjacency matrices, one for each graph which is part of the resulting cycle of graphs. Each item is a square matrix with the same layout as proba_adj_matrix.
proba_adj_average: present only when consistency is activated. The average probability adjacency matrix is a square matrix used to represent the orientation probabilities associated to the edges tips of the consensus graph. Its layout is the same as proba_adj_matrix and it contains the averaged probability of edges tips over the resulting cycle of graphs.
is_consistent: present only when consistency is activated. TRUE if the returned graph is consistent, FALSE otherwise.
time: execution time of the different steps and total run-time of the causal graph reconstruction by MIIC.
interrupted: TRUE if causal graph reconstruction has been interrupted, FALSE otherwise.
scores: present only when true edges have been supplied. Contains the scores of the returned graph in regard of the ground truth:
tp: number of edges marked as True Positive
fp: number of edges marked as False Positive
fn: number of edges marked as False Negative
precision: Precision
recall: Recall
fscore: F1-Score
params: the list of parameters used for the network reconstruction. The parameters not supplied are initialized to their default values. Otherwise, the parameters are checked and corrected if necessary.
state_order: the state order used for the network reconstruction. If no state order is supplied, it is generated by using default values. Otherwise, it is the state order checked and corrected if necessary.
black_box: present only if a black box has been supplied, the black box, checked and corrected if necessary, used for the network reconstruction.
true_edges: present only if the true edges have been supplied, the true edges, checked and corrected if necessary, used for the network evaluation.
tmiic: present only in temporal mode. Named list containing the full list of edges completed by stationarity, the lagged state order and, if a black box or true edges have been supplied, the lagged versions of these inputs.
Simon et al., eLife 2024, CausalXtract: a flexible pipeline to extract causal effects from live-cell time-lapse imaging data
Ribeiro-Dantas et al., iScience 2024, Learning interpretable causal networks from very large datasets, application to 400,000 medical records of breast cancer patients
Cabeli et al., NeurIPS 2021, Reliable causal discovery based on mutual information supremum principle for finite dataset
Cabeli et al., PLoS Comput. Biol. 2020, Learning clinical networks from medical records based on information estimates in mixed-type data
Li et al., NeurIPS 2019, Constraint-based causal structure learning with consistent separating sets
Verny et al., PLoS Comput. Biol. 2017, Learning causal networks with latent variables from multivariate information in genomic data
Affeldt et al., UAI 2015, Robust Reconstruction of Causal Graphical Models based on Conditional 2-point and 3-point Information
discretizeMutual
for optimal discretization and
(conditional) independence test.
library(miic)
# EXAMPLE HEMATOPOIESIS
data(hematoData)
# execute MIIC (reconstruct graph)
miic_obj <- miic(
input_data = hematoData[1:1000,], latent = "yes",
n_shuffles = 10, conf_threshold = 0.001
)
# plot graph
if(require(igraph)) {
plot(miic_obj, method="igraph")
}
# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp"))
# EXAMPLE CANCER
data(cosmicCancer)
data(cosmicCancer_stateOrder)
# execute MIIC (reconstruct graph)
miic_obj <- miic(
input_data = cosmicCancer, state_order = cosmicCancer_stateOrder, latent = "yes",
n_shuffles = 100, conf_threshold = 0.001
)
# plot graph
if(require(igraph)) {
plot(miic_obj)
}
# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
writeCytoscapeNetwork(miic_obj, file = file.path(tempdir(), "temp"))
# EXAMPLE COVID CASES (time series demo)
data(covidCases)
# execute MIIC (reconstruct graph in temporal mode)
tmiic_obj <- miic(input_data = covidCases, mode = "TS", n_layers = 3, delta_t = 1, mov_avg = 14)
# to plot the default graph (compact)
if(require(igraph)) {
plot(tmiic_obj)
}
# to plot the raw temporal network
if(require(igraph)) {
plot(tmiic_obj, display="raw")
}
# to plot the full temporal network
if(require(igraph)) {
plot(tmiic_obj, display="lagged")
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.