ModiSkeleton: Infer a graph skeleton (undirected graph)

ModiSkeletonR Documentation

Infer a graph skeleton (undirected graph)

Description

This function implements the MRPC algorithm in Badsha and Fu (2019) and Badsha et al. (2021) to infers a graph skeleton (i.e., an undirected graph). It is based on the function skeleton from the pcalg package (Kalisch et al., 2012). Both functions perform marginal and conditional indpenendence tests. However, ModiSkeleton implements an online false discovery rate (FDR) control method in order to control the overall FDR, whereas skeleton controls only the type I error rate for each individual test. See details below.

Usage

ModiSkeleton(data, suffStat, FDR, alpha, indepTest, labels, p,
             method = c("stable", "original", "stable.fast"),
             m.max = Inf, fixedGaps = NULL, fixedEdges = NULL,
             NAdelete = TRUE, FDRcontrol = c("LOND", "ADDIS", "NONE"),
             tau, lambda, verbose = FALSE)

Arguments

Many arguments are similar to those in skeleton and pc in the pcalg package. Several arguments here are also arguments for the function MRPC.

data

Data matrix, where the rows are samples and the columns are features (e.g., genetic variants (GVs) and phenotypes). Columns are for GVs, if available, appear before other columns for phenotypes (e.g., gene expression). For example, if there is one GV, then the first column of the data matrix is the GV and the remaining columns are the gene expression data.

suffStat

A list of sufficient statistics. When the data is continuous or can be viewed as continuous, this list contains the correlation matrix of the data and the sample size, which are the necessary elements for the conditional independence tests in gaussCItest. When the data is discrete, this list contains the entire dataset.

FDR

Desired overall FDR level.

alpha

significance level (number in (0,1) for the individual tests.

indepTest

Name of the statistical test. It is used to test the independence of x and y given S, where x and y are variables and S is a vector, possibly empty, of variables. The return value of indepTest is the p-value of the test for conditional independence. Different tests may used for different data types. For example, indepTest='gaussCItest' for Gaussian data, indepTest='disCItest' for discrete data, and indepTest='binCItest' for binary data. See additional details in help(gaussCItest).

ci.test in the bnlearn package (Marco Scutari, 2010) may also be used for testing conditional independence and return a p-value. The default test statistic is the mutual information for categorical variables, the Jonckheere-Terpstra test for ordered factors and the linear correlation for continuous variables. See help(ci.test).

labels

A character vector of names of variables (nodes). These are typically the column names of the data matrix.

p

(optional) The number of variables (nodes). Need to be specified if the labels are not provided, in which case the labels are set to 1:p.

method

(optional) Character string specifying method. The default, "stable" provides an order-independent skeleton.

m.max

(optional) Maximum size of the conditioning sets that are considered in the conditional independence tests.

fixedGaps

(optional) A logical matrix of dimension p*p. If entry [x, y], [y, x], or both are TRUE, the edge x—y is removed before starting the algorithm. Therefore, this edge is guaranteed to be absent in the resulting graph.

fixedEdges

(optional) A logical matrix of dimension p*p. If entry [x, y], [y, x], or both are TRUE, the edge x—y is never considered for removal. Therefore, this edge is guaranteed to be present in the resulting graph.

NAdelete

(optional) If indepTest returns NA and this option is TRUE, the corresponding edge is deleted. If this option is FALSE, the edge is not deleted.

FDRcontrol

A character string specifying whether online FDR control should be applied, and if so, what method to use. The two FDR control options are "LOND" (Javanmard and Montanari, 2015) or "ADDIS" (Tian and Ramdas, 2019). If "NONE" is specified, the type I error rate "alpha" will be used for each test.

tau

A number between 0 and 1. This value is used to determine if a p-value will be considered for testing. For example, if a p-value is greater than tau then it is discarded and no test will be performed.

lambda

A number between 0 and tau. This value is used to determine if a p-value is a candidate for rejection. For example, if a p-value is smaller than lambda then it can be rejected when testing the hypothesis (if the p-value is smaller than alphai).

verbose

(optional) If TRUE, detailed output is provided. Default is FALSE for no output details

Details

The ModiSkeleton function incorporates sequential hypothesis testing to infer the graph skeleton. This function starts with a complete graph (all nodes are connected with undirected edges) and performs a series of marginal and conditional independence tests, removing the corresponding edge if the test is not rejected.

First, all pairs of nodes are tested for marginal independence. If two nodes x and y are judged to be marginally independent at a type I error rate alpha, the edge between them is deleted and the empty set is saved as separation sets S[x, y] and S[y, x]. After all pairs have been tested for marginal independence, some edges may be removed.

Second, nodes (x, y) with an edge are tested for conditional independence given all subsets of the neighboring nodes. If there is any node z such that x and y are conditionally independent given z, the edge between x and y is removed and node z is saved as separation set, sepset, S[x, y] and S[y, x]. The algorithm continues in this way by increasing the size of the conditioning set step by step. The algorithm stops if all adjacency sets in the current graph are smaller than the size of the conditioning set. The result is the skeleton in which every edge is still undirected.

Unlike existing algorithms, which control only the type I error rate for each individual test, MRPC implements the LOND (Level On the Number of Discoveries) method (Javanmard and Montanari, 2015), which is a sequential hypothesis testing procedure and sets value of alpha for each test based on the number of discoveries (i.e., rejections), to control the overall false discovery rate.

Value

An object containing an estimate of the skeleton of the underlying DAG as follow:

call:

A call object: the original function call.

n:

The sample size used to estimate the graph.

max.ord:

The maximum size of the conditioning set used in the conditional independence tests of the first part of the algorithm.

n.edgetests:

The number of conditional independence tests performed by the first part of the algorithm.

sepset:

Separation sets.

pMax:

A square matrix , where the (i, j)th entry contains the maximum p-value of all conditional independence tests for edge i–j.

graph:

Object of class "graph": The undirected or partially directed graph that was estimated.

zMin:

Deprecated.

test:

The number of tests that have been performed.

alpha:

The level of significance for the current test.

R:

All of the decisions made from tests that have been performed. A 1 indicates a rejected null hypothesis and 0 represents a null hypothesis that was not rejected.

K:

The total number of rejections.

pval:

A vector of p-values calculated for each test.

normalizer:

The value that ensures the vector gammai sums to one.

exponent:

The exponent of the p-series used to calculate each value of the gammai vector.

alphai:

A vector containing the alpha value calculated for each test.

kappai:

A vector containing the iteration at which each rejected test occurs.

kappai_star:

Each element of this vector is the sum of the Si vector up to the iteration at which each rejection occurs.

Ci:

A vector indicating whether or not a p-value is a candidate for being rejected.

Si:

A vector indicating whether or not a p-value was discarded.

Ci_plus:

Each element of this vector represents the number of times each kappai value was counted when calculating each alphai value.

gammai:

The elements of this vector are the values of the p-series 0.4374901658/(m^(1.6)), where m is the iteration at which each test is performed.

Author(s)

Md Bahadur Badsha (mbbadshar@gmail.com)

References

1. Badsha MB and Fu AQ (2019). Learning causal biological networks with the principle of Mendelian randomization. Frontiers in Genetics, 10:460.

2. Badsha MB, Martin EA and Fu AQ (2021). MRPC: An R package for inference of causal graphs. Frontiers in Genetics, 10:651812.

3. Javanmard A and Montanari A (2015). On Online Control of False Discovery Rate. arXiv:150206197 [statME].

4. Kalisch M, Machler M, Colombo D, Maathuis MH and Buhlmann P (2012). Causal Inference Using Graphical Models with the R Package pcalg. Journal of Statistical Software, 47, 26.

5. Tian J and Ramdas A (2019). ADDIS: an adaptive discarding algorithm for online FDR control with conservative nulls. In Advances in Neural Information Processing Systems (pp. 9388-9396).

See Also

MRPC; EdgeOrientation; SimulateData.

Examples

## Not run: 
# Model 1 (mediation)

# The 1st column of the data matrix is a genetic variant
# and the remaining columns are gene expression nodes.
data <- simu_data_M1 # load data for model 1
n <- nrow(data)      # Number of row
V <- colnames(data)  # Column names

# Calculate Pearson correlation
suffStat_C <- list(C = cor(data),
                   n = n)

# Infer a graph skeleton
Skel.fit <- ModiSkeleton(data, 
                         suffStat = suffStat_C,
                         FDR = 0.05, 
                         indepTest = 'gaussCItest',
                         labels = V, 
                         FDRcontrol = 'LOND', 
                         verbose = FALSE)

# Plot the results

plot(Skel.fit@graph,
     main ="Estimated Skeleton")

# Other models are available and may be called as follows:
# Model 0
# data <- simu_data_M0

# Model 2
# data <- simu_data_M2

# Model 3
# data <- simu_data_M3

# Model 4
# data <- simu_data_M4

# Model Multiparent
# data <- simu_data_multiparent

# Model Star
# data <- simu_data_starshaped

# Model Layered
# data <- simu_data_layered


## End(Not run)

audreyqyfu/mrpc2 documentation built on April 17, 2022, 7:36 a.m.