knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette comes with the GADAG R-package. It can be used either to get some additional information about the model and methods or to get examples of how to use the functions.
The GADAG R-package aims at learning sparse large Directed Acyclic Graphs (DAGs) from observational data. The particularity of this algorithm is that it separates the exploration of the set of DAGs into two easier sub-searches: order of the nodes of the graph, modelled as a permutation, and graph structure, modelled as a lower triangular matrix. It is thus encoded as a combination of a convex program and a tailored genetic algorithm. Typically, the DAG we aim at recovering is a Gene Regulatory Network, which nodes are genes and edges represent regulations between genes. In this case, observational data are gene expression.
The package can be tested on the provided data toy_data
. This small dataset contains two variables: a "star" DAG $\mathcal{G}$ with 10 nodes, defined by its adjacency matrix $G$ (node 1 activates all other 9 nodes) and a design matrix $X$ with 100 observations of the distribution of the 10 nodes.
library(igraph) library(MASS) library(cvTools) library(lattice)
library(GADAG2) data(toy_data) # This is the design matrix dim(toy_data$X) # This is the adjacency matrix of the graph dim(toy_data$G)
Here are the main functions of GADAG
GADAG2_CV_results <- GADAG2_CV(X=toy_data$X)
GADAG2_results <- GADAG2_Run(X=toy_data$X, lambda=GADAG2_CV_results$lambda.1se)
GADAG2_results <- GADAG2_Analyze(GADAG2_results, G=toy_data$G, X=toy_data$X)
The inference algorithm implemented in the package is based on the minimization of the penalized negative log-likelihood of the model. To produce sparse graphs (with a limited number of edges), the optimization problem to solve is penalized using a $l_1$-penalty, controlled by a parameter lambda
, to be defined.
Here, the main issue consists in exploring the set of DAGs, which can be extremly large (exponentially increasing with the number of nodes), and which is not convex. To avoid its whole exploration, a DAG is modelled as a combination of a permutation, which controls the nodes ordering, and a lower triangular matrix, which controls the graph structure. GADAG then works as follows:
it looks for the best permutation using a genetic algorithm (outer loop)
it computes the best lower triangular matrix associated to each visited permutation using a gradient descent approach (inner loop)
The minimal inputs for running GADAG are:
a design matrix $X$, with samples $n$ in rows and variables $p$ in columns
a non-negative parameter of penalization lambda
, which control the number of edges of the inferred graph
# An example of how to infer a DAG GADAG2_results <- GADAG2_Run(X=toy_data$X, lambda=0.1) GADAG2_results$G.best # inferred DAG in a matrix form
Internal parameters can be specified to refine the algorithm, such that GADAG.control
and grad.control
, which mainly control the convergence of the genetic algorithm and the inner convex program. However, it makes the algorithm slower.
# An example of how to infer a DAG, with refinments n.gen <- 1e10 # we allow a very large number of iterations max.eval <- 1e10 # maximal number of nested evaluations GADAG2_results <- GADAG2_Run(X=toy_data$X,lambda=0.1, GADAG.control=list(n.gen=n.gen,max.eval=max.eval)) print(GADAG_results$G.best) # inferred DAG in a matrix form
For each potential solution, in the form of a permutation matrix and a lower triangular matrix, its fitness (penalized likelihood) can be computed and then compared to another solution. This function is highly internally used.
p=10 Perm <- sample(p) # permutation in a vector form P <- matrix(0,p,p) P[p*0:(p-1) + Perm] <- 1 # Perm is tranformed into a matrix form T <- matrix(rnorm(p),p,p) T[upper.tri(T,diag=TRUE)]<-0 Fitness <- fitness(P=P, X=toy_data$X, T=T, lambda=0.1) print(Fitness) # here is the fitness of the candidate solution (P,T)
Instead of setting the parameter of penalization lambda
to a predefined value, GADAG includes a function that runs k
-fold cross-validation to optimally tune it.
The usage of the fonction GADAG2_CV
is exactly the same as GADAG2_Run
. Given an optional sequence of lambdas
, data are split into k
groups. The k-1
first groups are iteratively used as training set for running GADAG2_Run
with each of the lambdas
, whereas the last group is used to test the performances.
The best lambda
, in the sense that it minimizes the mean squared error ($\pm$ 1\%) is returned.
# optimally tune the parameter of penalization GADAG2_CV_results <- GADAG2_CV(X=toy_data$X) print(GADAG2_CV_results$lambda.1se) # best lambda
In addition, the averaged cross validation error given sequence of lambdas can be plotted with the plot.CV
argument.
GADAG2_CV_results <- GADAG2_CV(X=toy_data$X,plot.CV=1) # plot the results
knitr::include_graphics("erreurCV.eps")
The goal of the GADAG R-package is to infer a DAG that explains how the nodes are connected to each other. If the true DAG is known, the GADAG2_Analyse
function offers the possibility to compute classical graph performances:
TP: number of true positives, inferred edges that are in the true DAG
FP: number of false positives, inferred edges by mistake
FN: number of false negatives, forgotten edges
TN: number of true negatives, not inferred edges that are not in the true DAG
precision = $\frac{TP}{TP+FP}$
recall = $\frac{TP}{TP+FN}$
# Run the main code GADAG2_results <- GADAG2_Run(X=toy_data$X, lambda=0.1) # Analyze the results GADAG2_analysis <- GADAG2_Analyze(GADAG2_results, G=toy_data$G, X=toy_data$X) print(GADAG2_analysis) # here are the results
In addition, graphs (estimated and true graph) can be plotted and the evolution of the algorithm can be represented using various graphical representations.
# Plot the graphs plot.graph <- 1 GADAG2_analysis <- GADAG2_Analyze(GADAG2_results, G=toy_data$G, X=toy_data$X, plot.control = list(plot.graph= plot.graph)) # run GADAG2 with return.level set to 1 beforehand GADAG2_results <- GADAG2_Run(X=toy_data$X, lambda=0.1,return.level=1) # Plot the evolution of the algorithm plot.evol <- 1 GADAG2_analysis <- GADAG2_Analyze(GADAG2_results, G=toy_data$G, X=toy_data$X, plot.control = list(plot.evol=1))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.