```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7,
                      fig.height = 5, fig.dpi = 200, fig.align = "center",
                      warning = FALSE, message = FALSE)
options(digits = 3)
```

# What is the package designed for?

The three main problems addressed by this R package are:

1. Selecting the most probable structure based on a cache of pre-computed scores.
2. Controlling for overfitting.
3. Sampling the landscape of high-scoring structures.

The latter can be beneficial from an applied perspective, as it avoids reducing the richness of Bayesian network modeling to a single structure. Indeed, it allows users to quantify the marginal impact of relationships (arcs in a network) of interest by marginalizing out over structures or nuisance dependencies (i.e., all other possible relationships). Structural Markov chain Monte Carlo (MCMC) is an elegant and natural way to estimate the true marginal impact, so one can determine whether its magnitude is large enough to be considered a worthwhile intervention.

Note: in this vignette, structural MCMC means that the MCMC moves are performed at the structure level (i.e., Bayesian network or Directed Acyclic Graph (DAG)) and not, for example, at the node-ordering level. The advantage is that explicit priors can be formulated. A DAG is a graph used to encode the joint probability of a set of variables. It is made of nodes (random variables) and arrows or arcs (directed links representing the relationships among variables).

# How does it work?

The mcmcabn R package takes a cache of pre-computed network scores as input (possibly computed using the buildScoreCache() function from the abn R package). Then, the mcmcabn() function generates MCMC samples from the posterior distribution of the most supported DAGs. To do so, the user chooses a structural prior and a learning scheme.
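The core of such a structure-level sampler can be sketched in a few lines of base R. This is an illustration of the Metropolis acceptance step only, not mcmcabn internals: scores are taken to be log marginal likelihoods, and the proposal is assumed symmetric, so the Hastings ratio reduces to a score difference on the log scale.

```r
# Illustrative Metropolis acceptance step for structural MCMC (base R sketch,
# not mcmcabn code): with symmetric proposals and log-scale scores, a proposed
# DAG is accepted with probability min(1, exp(score_proposed - score_current)).
accept_move <- function(score_current, score_proposed) {
  log(runif(1)) < (score_proposed - score_current)
}

set.seed(7)
accept_move(-100, -90)  # an uphill move is always accepted
```

In practice, the sampler repeats this step over structure moves (arc addition, deletion, reversal) while rejecting any proposal that creates a cycle.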
mcmcabn is equipped with multiple structural priors: the so-called Koivisto prior (an uninformative prior), the Ellis prior (a modular flat prior), and a possibly user-defined prior, and with three algorithms for inferring the structure of Bayesian networks: the classical Markov chain Monte Carlo Model Choice ((MC)^3), the new edge reversal move (REV) from Grzegorczyk and Husmeier (2008), and the Markov blanket resampling (MBR) from Su and Borsuk (2016).

# What are the R package functionalities?

The workhorse of the R package is the mcmcabn() function. It generates MCMC samples from the posterior distribution of the DAGs. The resulting object of class mcmcabn can be summarized with a comprehensive verbal description or a plot. From those samples, one can either compute the most probable structure and plot the cumulative maximum score achieved, or analyze them with the query() R function, which recognizes formula statements. For example: what is the probability of having an arc between two nodes, of one node being in the Markov blanket of another, or of not having an arc between two nodes? Alternatively, one can query the frequency of an arc in one direction relative to the opposite direction.

# Alternative R packages

BiDAG is an MCMC sampler that combines partition MCMC with constraint-based methods. This R package is efficient for sampling large DAGs.

# What is the structure of this document?

We first illustrate the package with a simple example. Then some more technical details are provided.

# Simple mcmcabn example

The package is available on CRAN and can be installed directly in R. However, it needs two R packages, abn and gRbase, that require libraries stored not on CRAN but on Bioconductor.
Hence those two packages must be installed before installing mcmcabn:

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("RBGL", "Rgraphviz", "graph"), version = "3.8")
install.packages("mcmcabn")
```

Once installed, the mcmcabn R package can be loaded using:

```r
library(mcmcabn)
```

Let us start with an example from the bnlearn R package from Scutari (2010). It is a small synthetic dataset from Lauritzen and Spiegelhalter (1988) about lung diseases (tuberculosis, lung cancer, or bronchitis) and visits to Asia (8 nodes and 8 arcs). One needs to pre-compute a cache of scores. We use the R package abn to do so. But first, let us define the list of distributions for the nodes and plot the network.

```r
data(asia, package = 'bnlearn') # for the dataset

library(abn)     # to pre-compute the scores
library(ggplot2) # plotting
library(ggpubr)  # plotting
library(cowplot) # plotting
library(ggdag)   # to plot the DAG

# renaming columns of the dataset
colnames(asia) <- c("Asia", "Smoking", "Tuberculosis", "LungCancer",
                    "Bronchitis", "Either", "XRay", "Dyspnea")

# let us define the distribution list
dist.asia <- list(Asia = "binomial", Smoking = "binomial",
                  Tuberculosis = "binomial", LungCancer = "binomial",
                  Bronchitis = "binomial", Either = "binomial",
                  XRay = "binomial", Dyspnea = "binomial")

# plot the BN as described in the paper
dag <- dagify(Tuberculosis ~ Asia, Either ~ Tuberculosis, XRay ~ Either,
              Dyspnea ~ Either, Bronchitis ~ Smoking, LungCancer ~ Smoking,
              Either ~ LungCancer, Dyspnea ~ Bronchitis)
ggdag_classic(dag, size = 6) + theme_dag_blank()
```

For the first search, it is advised to limit the number of parents per node to 2, which is the maximum number of parents in this network.

```r
# log-likelihood scores
bsc.compute.asia <- buildScoreCache(data.df = asia,
                                    data.dists = dist.asia,
                                    max.parents = 2)
```

We use the mcmcabn() function on the cache of pre-computed network scores.
One needs to define the type of score used (here the marginal likelihood mlik; other choices are bic, aic, or mdl), the maximum number of parents per node (the same as the one used in buildScoreCache()), and the MCMC learning scheme: the number of MCMC samples, the number of thinned samples (to avoid autocorrelation), and the length of the burn-in phase. Here we choose 1000 returned MCMC steps and 99 thinned steps in between; this means that for each returned MCMC move, 99 have been thinned. The burn-in phase is set to 10000 steps. We decide to initiate the algorithm with a random DAG (start.dag = "random"), which means that the function randomly selects a valid DAG. In this context, valid means acyclic and with an appropriate maximum number of parents. The frequencies of selecting a REV or an MBR jump are set to 3%. It is usually a good idea to keep those frequencies low: REV and MBR are efficient at producing high-scoring structures but computationally costly compared to classical MCMC jumps. We select the Koivisto prior (prior.choice = 2).

```r
mcmc.out.asia <- mcmcabn(score.cache = bsc.compute.asia,
                         score = "mlik",
                         data.dists = dist.asia,
                         max.parents = 2,
                         mcmc.scheme = c(1000, 99, 10000),
                         seed = 42,
                         verbose = FALSE,
                         start.dag = "random",
                         prob.rev = 0.03,
                         prob.mbr = 0.03,
                         prior.choice = 2)
```

The function mcmcabn() returns a list with multiple entries:

• dags is the list of sampled DAGs.
• scores is the list of DAG scores.
• alpha is the acceptance probability of the Metropolis-Hastings sampler.
• method is the algorithm chosen for the indexed MCMC step.
• rejection indicates whether an MCMC jump was rejected or not.
• iterations is the total number of MCMC iterations.
• thinning is the number of thinned steps for one returned MCMC jump.
• burnin is the length of the burn-in phase.
• data.dist is the list giving the distribution for each node in the network.

The object mcmc.out.asia is of class mcmcabn.
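To see what this learning scheme implies in raw sampler steps, here is a small back-of-the-envelope helper. It is our own illustration (total_iterations is a hypothetical name, not part of mcmcabn), assuming one kept move per (thinned + 1) steps after the burn-in:

```r
# Hypothetical helper (not part of mcmcabn) showing how a learning scheme of
# the form c(returned, thinned, burnin) translates into raw MCMC steps,
# assuming one kept move per (thinned + 1) steps after the burn-in.
total_iterations <- function(scheme) {
  returned <- scheme[1]
  thinned  <- scheme[2]
  burnin   <- scheme[3]
  burnin + returned * (thinned + 1)
}

total_iterations(c(1000, 99, 10000))  # 110000 raw steps
```

Under this reading, the scheme above amounts to 110000 raw MCMC steps, of which only 1000 are returned.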
We can check that we reach the maximum possible posterior DAG score even with a small number of MCMC steps. To do so, we use an exact search from Koivisto (2004) implemented in mostprobable() from abn.

```r
## to speed up building of the vignette, we store the output
data("mcmc_run_asia")

# maximum score obtained by the MCMC sampler
max(mcmc.out.asia$scores)

# maximum scoring network using exact search (not MCMC based)
dag <- mostprobable(score.cache = bsc.compute.asia)
```
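What an arc query over these samples computes can be mimicked in base R: the posterior probability of an arc is simply its relative frequency among the sampled adjacency matrices. The sketch below uses simulated placeholder matrices rather than the actual mcmc.out.asia$dags, and the row = child, column = parent convention is assumed purely for illustration:

```r
# Base R sketch of an arc query: estimate P(Smoking -> LungCancer) as the
# relative frequency of that arc across sampled DAGs. The random matrices
# below stand in for the sampler output; row = child, column = parent is
# an assumed convention for this illustration.
nodes <- c("Asia", "Smoking", "Tuberculosis", "LungCancer",
           "Bronchitis", "Either", "XRay", "Dyspnea")
set.seed(1)
sampled.dags <- replicate(200, {
  m <- matrix(rbinom(64, 1, 0.2), 8, 8, dimnames = list(nodes, nodes))
  diag(m) <- 0  # no self-loops
  m
}, simplify = FALSE)

arc.freq <- mean(sapply(sampled.dags, function(m) m["LungCancer", "Smoking"]))
arc.freq  # estimated posterior probability of the arc Smoking -> LungCancer
```

This is exactly the kind of marginal quantity discussed at the start of the vignette: the arc's support is averaged over all sampled structures rather than read off a single best DAG.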
# References

• Kratzer, Gilles, et al. (2020). "Bayesian Networks modeling applied to Feline Calicivirus infection among cats in Switzerland." Frontiers in Veterinary Science 7: 73.
• Madigan, D. and York, J. (1995) "Bayesian graphical models for discrete data". International Statistical Review, 63:215–232.
• Giudici, P. and Castelo, R. (2003). "Improving Markov Chain Monte Carlo model search for data mining". Machine Learning, 50:127–158.
• Kratzer, G. and Furrer, R. (2019) "Is a single unique Bayesian network enough to accurately represent your data?". arXiv preprint arXiv:1902.06641.
• Friedman, N. and Koller, D. (2003). "Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks." Machine Learning, 50:95–125, 2003.
• Grzegorczyk, M. and Husmeier, D. (2008). "Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move". Machine Learning, vol. 71(2-3), pp. 265.
• Su, C. and Borsuk, M. E. (2016). "Improving structure MCMC for Bayesian networks through Markov blanket resampling". The Journal of Machine Learning Research, vol. 17(1), pp. 4042-4061.
• Koivisto, M. V. (2004). Exact Structure Discovery in Bayesian Networks, Journal of Machine Learning Research, vol 5, 549-573.
• Werhli, A. V., and Husmeier, D. (2007). "Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge". Statistical Applications in Genetics and Molecular Biology, 6 (Article 15).
• Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., and Miyano, S. (2003). Using Bayesian networks for estimating gene networks from microarrays and biological knowledge. In Proceedings of the European Conference on Computational Biology.
• Goudie, R. J., and Mukherjee, S. (2016). A Gibbs Sampler for Learning DAGs. Journal of machine learning research: JMLR, 17(30), 1-39.
• Kuipers, J. and Moffa, G. (2017). Partition MCMC for Inference on Acyclic Digraphs, Journal of the American Statistical Association, 112:517, 282-299, DOI: 10.1080/01621459.2015.1133426
• Scutari, M. (2010). "Learning Bayesian Networks with the bnlearn R Package". Journal of Statistical Software, 35(3), 1-22. DOI: 10.18637/jss.v035.i03.
• Lauritzen, S. L. and Spiegelhalter, D. J. (1988). "Local computations with probabilities on graphical structures and their application to expert systems". Journal of the Royal Statistical Society, Series B, 50(2), 157-224.
