```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
oldpar <- par(no.readonly = TRUE)
oldoptions <- options()
```

```{css, echo = FALSE}
.math.inline { font-size: 11px; }
```

```r
library(BCDAG)
```
This is the second of a series of three vignettes for the R package `BCDAG`. In this vignette we focus on the function `learn_DAG()`, which implements a Markov Chain Monte Carlo (MCMC) algorithm to sample from the joint posterior of DAG structures and DAG parameters under the Gaussian assumption.
The underlying Bayesian Gaussian DAG-model can be summarized as follows:
\begin{eqnarray}
X_1, \dots, X_q \,|\,\boldsymbol L, \boldsymbol D, \mathcal{D} &\sim& \mathcal{N}_q\left(\boldsymbol 0, (\boldsymbol{L}\boldsymbol{D}^{-1}\boldsymbol{L}^\top)^{-1}\right) \\
(\boldsymbol L, \boldsymbol D)\,|\,\mathcal{D} &\sim& \text{DAG-Wishart}(\boldsymbol{a}_{c}^{\mathcal{D}}, \boldsymbol U) \\
p(\mathcal{D}) &\propto& w^{|\mathcal{S}_{\mathcal{D}}|}(1-w)^{\frac{q(q-1)}{2} - |\mathcal{S}_{\mathcal{D}}|}
\end{eqnarray}
In particular, $\mathcal{D}=(V,E)$ denotes a DAG structure with set of nodes $V=\{1,\dots,q\}$ and set of edges $E\subseteq V \times V$. Moreover, $(\boldsymbol L, \boldsymbol D)$ are model parameters providing the decomposition of the precision (inverse-covariance) matrix $\boldsymbol{\Omega} = \boldsymbol{L}\boldsymbol{D}^{-1}\boldsymbol{L}^\top$; specifically, $\boldsymbol{L}$ is a $(q, q)$ matrix of coefficients such that for each $(u, v)$-element $\boldsymbol{L}_{uv}$ with $u \ne v$, $\boldsymbol{L}_{uv} \ne 0$ if and only if $(u, v) \in E$, while $\boldsymbol{L}_{uu} = 1$ for each $u = 1,\dots, q$; also, $\boldsymbol{D}$ is a $(q, q)$ diagonal matrix with $(u, u)$-element $\boldsymbol{D}_{uu}$.
The latter decomposition follows from the equivalent Structural Equation Model (SEM) representation of a Gaussian DAG-model; see also Castelletti \& Mascaro (2021).
Conditionally on $\mathcal{D}$, a prior on $(\boldsymbol{L}, \boldsymbol{D})$ is assigned through a compatible DAG-Wishart distribution with rate hyperparameter $\boldsymbol{U}$, a $(q,q)$ s.p.d. matrix, and shape hyperparameter $\boldsymbol{a}^{\mathcal{D}}_{c}$, a $(q,1)$ vector; see also Cao et al. (2019) and Peluso \& Consonni (2020).
Finally, a prior on DAG $\mathcal{D}$ is assigned through a Binomial distribution on the number of edges in the graph; in $p(\mathcal{D})$, $w \in (0,1)$ is a prior probability of edge inclusion, while $|\mathcal{S_{\mathcal{D}}}|$ denotes the number of edges in $\mathcal{D}$; see again Castelletti \& Mascaro (2021) for further details.
Target of the MCMC scheme is therefore the joint posterior of $(\boldsymbol{L},\boldsymbol{D},\mathcal{D})$, \begin{equation} p(\boldsymbol L, \boldsymbol D, \mathcal{D}\,|\, \boldsymbol X) \propto p(\boldsymbol{X}\,|\,\boldsymbol L, \boldsymbol D, \mathcal{D})p(\boldsymbol{L},\boldsymbol{D}\,|\,\mathcal{D}) \,p(\mathcal{D}), \end{equation} where $\boldsymbol{X}$ denotes a $(n,q)$ data matrix as obtained through $n$ i.i.d. draws from the Gaussian DAG-model and $p(\boldsymbol{X}\,|\,\boldsymbol L, \boldsymbol D, \mathcal{D})$ is the likelihood function. See also Castelletti \& Mascaro (2022+) for full details.
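To make the DAG prior concrete, here is a minimal sketch (not part of `BCDAG`; the helper `log_dag_prior` is ours) that evaluates $\log p(\mathcal{D})$ up to an additive constant for a DAG given as an adjacency matrix and a prior probability of edge inclusion $w$:

```r
# Minimal sketch (not part of BCDAG): log p(D) up to an additive constant,
# for a DAG given as a (q,q) adjacency matrix A and edge-inclusion probability w
log_dag_prior <- function(A, w) {
  q <- ncol(A)
  nedges <- sum(A)  # |S_D|: number of edges in D
  nedges * log(w) + (q * (q - 1) / 2 - nedges) * log(1 - w)
}
A <- matrix(0, 3, 3); A[1, 2] <- 1; A[1, 3] <- 1  # toy DAG: 1 -> 2, 1 -> 3
log_dag_prior(A, w = 0.2)
```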
We first randomly generate a DAG $\mathcal{D}$, the DAG parameters $(\boldsymbol{L},\boldsymbol{D})$ and then $n=1000$ i.i.d. observations from a Gaussian DAG-model as follows:
```r
set.seed(1)
q <- 8
w <- 0.2
DAG <- rDAG(q, w)
a <- q
U <- diag(1, q)
outDL <- rDAGWishart(n = 1, DAG, a, U)
L <- outDL$L; D <- outDL$D
Omega <- L %*% solve(D) %*% t(L)
Sigma <- solve(Omega)
n <- 1000
X <- mvtnorm::rmvnorm(n = n, sigma = Sigma)
```
See also our vignette about data generation from Gaussian DAG-models.
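As a quick optional sanity check on the simulated data (not required by `learn_DAG()`), one can inspect the dimensions of $\boldsymbol{X}$ and compare the sample covariance with the true $\boldsymbol{\Sigma}$:

```r
# Optional check on the simulated data
dim(X)                       # should be n x q, i.e. 1000 x 8
round(cov(X)[1:3, 1:3], 2)   # sample covariance (upper-left block)
round(Sigma[1:3, 1:3], 2)    # true covariance (upper-left block)
```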
## `learn_DAG()`

Function `learn_DAG()` implements an MCMC algorithm to sample from the joint posterior of DAGs and DAG parameters. This is based on a Partial Analytic Structure (PAS) algorithm (Godsill, 2012) which, at each iteration:

1. updates the DAG through a Metropolis-Hastings step, where a candidate DAG is drawn from a suitable proposal distribution and accepted with a probability given by the MH acceptance rate (see also [A note on `fast = TRUE`]);
2. samples the DAG parameters $(\boldsymbol{L}, \boldsymbol{D})$ from their posterior distribution conditionally on the accepted DAG.

We implement it as follows:
```r
out <- learn_DAG(S = 5000, burn = 1000, data = X, a, U, w,
                 fast = FALSE, save.memory = FALSE, collapse = FALSE)
```
Inputs of `learn_DAG()` correspond to three different sets of arguments:

- `S`, `burn` and `data` are standard inputs required by any MCMC algorithm. In particular, `S` defines the desired length of the chain, which is obtained by discarding the first `burn` draws (the total number of sampled draws is therefore `S + burn`); `data` is instead the $(n,q)$ data matrix $\boldsymbol{X}$;
- `a`, `U` and `w` are hyperparameters of the priors on DAGs (`w`) and DAG parameters (`a`, `U`); see also Equation [REF]. The same hyperparameters appear in functions `rDAG()` and `rDAGWishart()`, which were introduced in our vignette [ADD REF TO THE VIGNETTE];
- `fast`, `save.memory` and `collapse` are Boolean arguments which allow the user to: implement an approximate proposal distribution within the MCMC scheme (`fast = TRUE`); change the array structure of the stored MCMC output into strings in order to save memory (`save.memory = TRUE`); implement an MCMC for DAG structure learning only, without sampling from the posterior of the parameters (`collapse = TRUE`).

See also [A note on `fast = TRUE`] and Castelletti \& Mascaro (2022+) for full details.

The output of `learn_DAG()` is an object of class `bcdag`:
```r
class(out)
```
`bcdag` objects include the output of the MCMC algorithm together with a collection of meta-data representing the input arguments of `learn_DAG()`; these are stored in the attributes of the object:
```r
str(out)
```
The attribute `type` refers to the output of `learn_DAG()`, whose structure depends on the choice of the arguments `save.memory` and `collapse`.
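For instance, `type` can be retrieved directly, and the names of the remaining attributes show which meta-data (the input arguments mentioned above) are stored with the object:

```r
# Meta-data are stored as attributes of the bcdag object;
# "type" records which kind of bcdag object was returned
attributes(out)$type
names(attributes(out))
```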
\vspace{0.2cm}
When both are set equal to `FALSE`, as in the previous example, the output of `learn_DAG()` is a complete `bcdag` object, collecting three $(q,q,S)$ arrays with the DAG structures (in the form of $q \times q$ adjacency matrices) and the DAG parameters $\boldsymbol{L}$ and $\boldsymbol{D}$ (both as $q \times q$ matrices) sampled by the MCMC:
```r
out$Graphs[,,1]
round(out$L[,,1], 2)
round(out$D[,,1], 2)
```
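Because the full arrays are available, posterior summaries of the parameters can be computed directly as Monte Carlo averages over the third (MCMC) dimension; a minimal example:

```r
# Monte Carlo estimates of the posterior means of L and D,
# averaging over the S retained MCMC draws
L_hat <- apply(out$L, c(1, 2), mean)
D_hat <- apply(out$D, c(1, 2), mean)
round(L_hat, 2)
round(D_hat, 2)
```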
\vspace{0.2cm}
When `collapse = TRUE` and `save.memory = FALSE`, the output of `learn_DAG()` is a collapsed `bcdag` object, consisting of a $(q,q,S)$ array with the adjacency matrices of the DAGs sampled by the MCMC:
```r
collapsed_out <- learn_DAG(S = 5000, burn = 1000, data = X, a, U, w,
                           fast = FALSE, save.memory = FALSE, collapse = TRUE)
```
```r
names(collapsed_out)
class(collapsed_out)
attributes(collapsed_out)$type
collapsed_out$Graphs[,,1]
```
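From the array of sampled adjacency matrices, posterior probabilities of edge inclusion can also be estimated by hand as the average of each entry across the MCMC draws (function `get_edgeprobs()`, used later in this vignette, performs an analogous computation):

```r
# Estimated posterior probabilities of edge inclusion, computed by hand
# as the proportion of MCMC draws in which each edge u -> v is present
edgeprobs_manual <- apply(collapsed_out$Graphs, c(1, 2), mean)
round(edgeprobs_manual, 2)
```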
\vspace{0.2cm}
When `save.memory = TRUE` and `collapse = FALSE`, the output is a compressed `bcdag` object, collecting samples from the joint posterior on DAGs and DAG parameters in the form of vectors of strings:
```r
compressed_out <- learn_DAG(S = 5000, burn = 1000, data = X, a, U, w,
                            fast = FALSE, save.memory = TRUE, collapse = FALSE)
```
```r
names(compressed_out)
class(compressed_out)
attributes(compressed_out)$type
```
In this case, we can access the MCMC draws as follows:
```r
compressed_out$Graphs[1]
compressed_out$L[1]
compressed_out$D[1]
```
In addition, we implement `bd_decode`, an internal function that can be used to visualize the previous objects as matrices:
```r
BCDAG:::bd_decode(compressed_out$Graphs[1])
round(BCDAG:::bd_decode(compressed_out$L[1]), 2)
round(BCDAG:::bd_decode(compressed_out$D[1]), 2)
```
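If needed, the whole compressed output can be converted back to array form by decoding each stored draw; a minimal sketch using `sapply()` (this simply applies `bd_decode` to every string, it is not a dedicated BCDAG utility):

```r
# Rebuild the (q,q,S) array of sampled DAGs from the compressed strings;
# the same idea applies to compressed_out$L and compressed_out$D
Graphs_array <- sapply(compressed_out$Graphs, BCDAG:::bd_decode,
                       simplify = "array")
dim(Graphs_array)  # q x q x S, i.e. 8 x 8 x 5000
```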
\vspace{0.2cm}
Finally, if `save.memory = TRUE` and `collapse = TRUE`, the output of `learn_DAG()` is a compressed and collapsed `bcdag` object, collecting only the sampled DAGs represented as a vector of strings:
```r
comprcoll_out <- learn_DAG(S = 5000, burn = 1000, data = X, a, U, w,
                           fast = FALSE, save.memory = TRUE, collapse = TRUE)
```
```r
names(comprcoll_out)
class(comprcoll_out)
attributes(comprcoll_out)$type
BCDAG:::bd_decode(comprcoll_out$Graphs[1])
```
## A note on `fast = TRUE`
Step 1. of the MCMC scheme implemented by `learn_DAG()` updates DAG $\mathcal{D}$ by randomly drawing a new candidate DAG $\mathcal{D}'$ from a proposal distribution and then accepting it with probability given by the Metropolis-Hastings (MH) acceptance rate; see also Castelletti \& Mascaro (2021).
For a given DAG $\mathcal{D}$, the proposal distribution $q(\mathcal{D}'\,|\,\mathcal{D})$ is built over the set $\mathcal{O}_{\mathcal{D}}$ of *all* direct successor DAGs that can be reached from $\mathcal{D}$ by inserting, deleting or reversing a single edge in $\mathcal{D}$. A DAG $\mathcal{D}'$ is then proposed uniformly from the set $\mathcal{O}_{\mathcal{D}}$, so that $q(\mathcal{D}'\,|\,\mathcal{D})=1/|\mathcal{O}_{\mathcal{D}}|$. Moreover, the MH rate requires evaluating the ratio of proposals $q(\mathcal{D}'\,|\,\mathcal{D})/q(\mathcal{D}\,|\,\mathcal{D}') = |\mathcal{O}_{\mathcal{D}'}|/|\mathcal{O}_{\mathcal{D}}|$, and accordingly the construction of both $\mathcal{O}_{\mathcal{D}}$ and $\mathcal{O}_{\mathcal{D}'}$.
If `fast = FALSE`, the proposal ratio is computed exactly; this requires the enumeration of $\mathcal{O}_{\mathcal{D}}$ and $\mathcal{O}_{\mathcal{D}'}$, which may become computationally expensive, especially when $q$ is large. However, the ratio approaches $1$ as the number of variables $q$ increases: option `fast = TRUE` implements this approximation, which therefore avoids the construction of $\mathcal{O}_{\mathcal{D}}$ and $\mathcal{O}_{\mathcal{D}'}$.
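To give a concrete idea of what the exact computation involves, the following is a naive sketch (ours, not the routine used internally by `learn_DAG()`) that counts the direct successors $|\mathcal{O}_{\mathcal{D}}|$ of a DAG by trying every single-edge insertion, deletion and reversal and keeping only the operations that preserve acyclicity:

```r
# Naive sketch (not the BCDAG internal routine): count the direct successors
# |O_D| of a DAG given as a (q,q) adjacency matrix A
is_acyclic <- function(A) {
  # iteratively strip nodes with no outgoing edges; if none remain to strip
  # while nodes are left, the graph contains a cycle
  while (nrow(A) > 0) {
    sinks <- which(rowSums(A) == 0)
    if (length(sinks) == 0) return(FALSE)
    A <- A[-sinks, -sinks, drop = FALSE]
  }
  TRUE
}
n_successors <- function(A) {
  q <- nrow(A); count <- 0
  for (u in seq_len(q)) for (v in seq_len(q)) {
    if (u == v) next
    if (A[u, v] == 0 && A[v, u] == 0) {    # try inserting edge u -> v
      B <- A; B[u, v] <- 1
      if (is_acyclic(B)) count <- count + 1
    } else if (A[u, v] == 1) {             # existing edge u -> v
      count <- count + 1                   # deletion always yields a DAG
      B <- A; B[u, v] <- 0; B[v, u] <- 1   # try reversing u -> v
      if (is_acyclic(B)) count <- count + 1
    }
  }
  count
}
n_successors(DAG)  # |O_D| for the DAG generated at the beginning of the vignette
```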
A comparison between `fast = FALSE` and `fast = TRUE` in the execution of `learn_DAG()` produces the following results in terms of computational time:
```r
# No approximation
time_nofast <- system.time(
  out_nofast <- learn_DAG(S = 5000, burn = 1000, data = X, a, U, w,
                          fast = FALSE, save.memory = FALSE, collapse = FALSE)
)
# Approximation
time_fast <- system.time(
  out_fast <- learn_DAG(S = 5000, burn = 1000, data = X, a, U, w,
                        fast = TRUE, save.memory = FALSE, collapse = FALSE)
)
```
```r
time_nofast
time_fast
```
Finally, the corresponding estimated posterior probabilities of edge inclusion are the following:
```r
round(get_edgeprobs(out_nofast), 2)
round(get_edgeprobs(out_fast), 2)
```
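As a simple check of how much the approximation affects the estimates, one can look at the largest absolute difference between the two matrices of edge-inclusion probabilities:

```r
# Largest discrepancy between edge-probability estimates obtained with and
# without the fast approximation; small values indicate close agreement
max(abs(get_edgeprobs(out_nofast) - get_edgeprobs(out_fast)))
```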
## References

Ben-David E, Li T, Massam H, Rajaratnam B (2015). "High dimensional Bayesian inference for Gaussian directed acyclic graph models." arXiv pre-print.

Cao X, Khare K, Ghosh M (2019). "Posterior graph selection and estimation consistency for high-dimensional Bayesian DAG models." The Annals of Statistics, 47(1), 319–348.

Castelletti F, Consonni G (2021). "Bayesian causal inference in probit graphical models." Bayesian Analysis, In press.

Castelletti F, Mascaro A (2021). "Structural learning and estimation of joint causal effects among network-dependent variables." Statistical Methods & Applications, 30, 1289–1314.

Castelletti F, Mascaro A (2022). "BCDAG: An R package for Bayesian structural and causal learning of Gaussian DAGs." arXiv pre-print.

Godsill SJ (2012). "On the relationship between Markov chain Monte Carlo methods for model uncertainty." Journal of Computational and Graphical Statistics, 10(2), 230–248.

Peluso S, Consonni G (2020). "Compatible priors for model selection of high-dimensional Gaussian DAGs." Electronic Journal of Statistics, 14(2), 4110–4132.
```r
par(oldpar)
options(oldoptions)
```