knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
In this vignette, we introduce a specification for augmented Directed Acyclic Graphs (aDAGs).
We will overload regular DAGs as specified using the dagitty R-package.
While dagitty provides standard facilities for declaring nodes, edges, exposures, and outcomes in causal frameworks,
we augment the DAG with additional metadata fields to make it more conducive to theory specification.
These metadata fields do not interfere with regular use of the DAG in dagitty.
The metadata fields in an aDAG include:
tags for nodes: Identifies nodes of interest in causal inference. This field can take on values like exposure, outcome, and unobserved.
pos for nodes: Defines the layout position in the X and Y dimensions, e.g., a node with pos="0,1" is positioned at coordinates X = 0 and Y = 1. This metadata field is used by dagitty.
label for nodes or edges: A descriptive label used for visualization and reporting. This is a new metadata field.
distribution for nodes: The assumed distribution-generating function for the variable associated with a node. For exogenous nodes, this constitutes the distribution of the variable associated with the node itself; for endogenous nodes, this constitutes the residual distribution of the associated variable. This is a new metadata field.
form for edges: A function specification (in a form interpretable by as.formula()) that describes how the variable associated with a child node is calculated from its parents. This is a new metadata field.
Throughout the vignette, we will illustrate how to write an augmented DAG, how to parse and inspect it with dagitty, theorytools, and tidySEM (for plotting), and how to leverage these additional properties for further modeling or simulation tasks.
library(theorytools)
library(dagitty)
library(tidySEM)
The usual syntax for specifying a DAG in the dagitty R-package is something like:
library(dagitty)
dagitty("dag {
  X -> Y
  Z -> X
  Z -> Y
}")
There are several tags that can be used in dagitty. Note that quotation marks used in tags must be double quotes ("), so it makes sense to wrap the whole DAG syntax in single quotes ('):
library(dagitty)
dagitty('dag {
  X [exposure, pos="0,1"]
  Y [outcome, pos="1,1"]
  Z [unobserved, pos="1,0"]
  X -> Y
  Z -> X
  Z -> Y
}')
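As a quick check that the standard tags were parsed, they can be read back with dagitty's accessor functions. The following is a minimal sketch; the object name g_tags is only an illustrative choice and does not appear elsewhere in this vignette.

```r
library(dagitty)

# Same DAG as above, stored under an illustrative name
g_tags <- dagitty('dag {
  X [exposure, pos="0,1"]
  Y [outcome, pos="1,1"]
  Z [unobserved, pos="1,0"]
  X -> Y
  Z -> X
  Z -> Y
}')

exposures(g_tags)    # names of nodes tagged as exposure
outcomes(g_tags)     # names of nodes tagged as outcome
coordinates(g_tags)  # list with the x and y layout positions per node
```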
In our augmented specification, we add additional properties as metadata fields.
Below, we detail each new property:
label (Nodes/Edges)
Example: X [label="Study hours"]
The label is used, for example, by tidySEM to label nodes and edges:
library(tidySEM)
g <- dagitty('dag {
  X [label="Predictor", pos="0,0"]
  Y [label="Outcome", pos="1,0"]
  X -> Y [label="effect"]
}')
graph_sem(g, text_size = 2)
library(tidySEM)
library(ggplot2)
g <- dagitty('dag {
  X [label="Predictor", pos="0,0"]
  Y [label="Outcome", pos="1,0"]
  X -> Y [label="effect"]
}')
p <- graph_sem(g, text_size = 4)
ggsave("dag_basic.png", p, device = "png", width = 4, height = 1)
knitr::include_graphics("dag_basic.png")
distribution (Nodes)
Usage: References a function that generates data for exogenous variables, or that describes the residual distribution for endogenous variables.
The function can reference the argument n to determine sample size.
For example, to specify a node comprising five groups with total sample size n, one could use sample.int(n = 5, size = n, replace=TRUE).
If the argument n is not explicitly provided, theorytools checks whether n is a formal argument of the function and, if so, assigns it.
Examples:
X [distribution="rnorm()"]: Node X is an exogenous variable drawn from a normal distribution with default arguments.
Y [distribution="rnorm()"]: Node Y has residuals assumed to be normally distributed with default arguments.
g <- dagitty('dag {
  X [distribution="rbinom(size = 2, prob = .5)"]
  Y [distribution="rnorm()"]
  X -> Y [form=".2*X"]
}')
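To make the handling of n more concrete, the sketch below shows one way such a check could be written in base R. It is an illustration only: fill_n() is a hypothetical helper, not a theorytools function, and the package's internal implementation may differ.

```r
# Hypothetical helper (NOT part of theorytools): evaluate a distribution
# string, supplying `n` when the call does not set it and the called
# function has `n` among its formal arguments.
fill_n <- function(call_text, n) {
  cl <- str2lang(call_text)            # parse e.g. "rnorm()" into a call
  fn <- eval(cl[[1]])                  # look up the function being called
  has_n_formal <- "n" %in% names(formals(fn))
  n_supplied <- "n" %in% names(cl)
  if (has_n_formal && !n_supplied) {
    cl$n <- n                          # add n = <sample size> to the call
  }
  # Evaluate with `n` visible, so that arguments like `size = n` resolve
  eval(cl, list(n = n))
}

x <- fill_n("rnorm()", n = 10)
grp <- fill_n("sample.int(n = 5, size = n, replace = TRUE)", n = 50)
length(x)
table(grp)
```

In the second call, n is already set to 5 (the number of groups), so it is left alone, while size = n picks up the sample size of 50.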
form (Edges)
Usage: A function specification that as.formula() can parse, describing how the variable associated with a child node is calculated from its parents.
Examples:
X -> Y [form=".2*X"] indicates that Y is a linear function of X, namely .2 times X.
X -> Y [form="X:Z"] indicates that Y depends on an interaction between X and Z.
X -> Y [form="X^2"] indicates that Y depends on a quadratic function of X.
g <- dagitty('dag {
  X [distribution="rbinom(size = 2, prob = .5)"]
  Y [distribution="rnorm()"]
  X -> Y [form=".2*X"]
}')
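Before adding a form string to an edge, it can help to confirm that as.formula() accepts it and to see which variables it references. The base R sketch below does this; form_vars() is a hypothetical helper introduced for illustration, and how theorytools processes form internally is not shown here.

```r
# Hypothetical helper: parse a `form` string as the right-hand side of a
# formula and return the variable names it references.
form_vars <- function(form_text) {
  f <- as.formula(paste("~", form_text))  # errors if the string cannot be parsed
  all.vars(f)
}

form_vars(".2*X")
form_vars("X:Z")
form_vars("X^2")
```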
Below is a simple, hypothetical DAG showing how to combine these ideas. This DAG posits:
X: Number of study hours, an exposure. Values are randomly sampled from 1-20 hours.
Z: Stress level, an exogenous covariate, exponentially distributed (i.e., right-skewed, most people are not very stressed).
Y: Exam performance, an outcome depending on X and Z, with normally distributed residuals.
g <- dagitty('dag {
  X [exposure, pos="0,0", label="Study Hours", distribution="sample.int(n = 20, size = n, replace = TRUE)"]
  Z [label="Stress Level", pos=".5,1", distribution="rexp()"]
  Y [outcome, pos="1,.2", label="Exam Performance", distribution="rnorm()"]
  X -> Y [label="direct", form="0.5+X"]
  X -> Z
  Z -> Y [label="indirect", form="2*Z"]
}')
graph_sem(g, text_size = 3)
g <- dagitty('dag {
  X [exposure, pos="0,0", label="Study Hours", distribution="sample.int(n = 20, size = n, replace = TRUE)"]
  Z [label="Stress Level", pos=".5,1", distribution="rexp()"]
  Y [outcome, pos="1,.2", label="Exam Performance", distribution="rnorm()"]
  X -> Y [label="direct", form="-X^2+4*X"]
  X -> Z
  Z -> Y [label="indirect", form="2*Z"]
}')
p <- graph_sem(g, text_size = 4)
ggsave("dag_three.png", p, device = "png", width = 6, height = 3)
knitr::include_graphics("dag_three.png")
Augmented DAGs are interoperable with dagitty, but the dagitty package is not natively aware of the additional metadata fields used in theorytools, like distribution or form.
To access the augmented properties of aDAGs, the theorytools package uses tidySEM.
The purpose of the tidySEM package is to plot graphs (structural equation models and DAGs) as ggplot objects, which can be further customized using regular ggplot2 code.
It contains parsing functions to extract nodes and edges from a variety of objects, including dagitty graphs.
The functions get_nodes() and get_edges() parse the nodes and edges of aDAGs, respectively:
get_nodes(g)
get_edges(g)
distribution and form in Simulation
A primary motivation for these augmented properties is simulation. For example, you might simulate data by:
Drawing X from sample.int(n).
Drawing Z from rexp(n).
Computing Y using a formula that includes X and Z plus a residual from rnorm(n).
Code to simulate data in line with these metadata can be automatically generated:
set.seed(1)
cat(simulate_data(g, run = FALSE), sep = "\n")
To illustrate, we show a scatter plot of data simulated using this code:
df <- simulate_data(g, run = TRUE)
ggplot2::ggplot(df, ggplot2::aes(x = X, y = Y, color = Z)) +
  ggplot2::geom_point()
You can use this script, for example, to generate synthetic data and build a reproducible analysis pipeline for a Preregistration-As-Code [@peikertReproducibleResearchTutorial2021; @vanlissaComplementingPreregisteredConfirmatory2022].
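For orientation, the simulation steps described in this section can also be written out by hand in base R. This is a rough sketch, not the output of simulate_data(): the sample size of 100 is an arbitrary assumption, the object name df_manual is illustrative, and Z is drawn independently of X because the X -> Z edge in the example carries no form.

```r
set.seed(1)
n <- 100  # assumed sample size, chosen only for illustration

# Exogenous draws, following the distribution fields of the example DAG
X <- sample.int(n = 20, size = n, replace = TRUE)  # study hours, 1 to 20
Z <- rexp(n)                                       # stress level, right-skewed

# Y combines the two form fields ("-X^2+4*X" and "2*Z") with a normal residual
Y <- (-X^2 + 4 * X) + 2 * Z + rnorm(n)

df_manual <- data.frame(X = X, Y = Y, Z = Z)
head(df_manual)
```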
The dagitty package only recognizes double quotes (" ") inside graph specifications. This means you must wrap the graph specification text in single quotes (' '). Alternatively, you can escape every double quote inside the graph specification, which is not recommended because it is a hassle.
The function for a child node can be given as several edges with form properties or as a single edge with a combined formula. They are combined, and unique terms are retained.
dagitty does not mind the order in which nodes are declared, but the graph must be acyclic, so that a topological order exists, for valid DAG generation and simulation.
dagitty Functions: The standard dagitty functions (e.g., adjustmentSets()) only look for recognized tags like exposure and outcome. They ignore custom properties like distribution and form, but these do not interfere with normal usage.
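The requirement of a topological order can be illustrated with a small base R sketch of Kahn's algorithm. The helper topo_order() is hypothetical and written only for this illustration; it is not how dagitty or theorytools determine node order. It takes the parent and child of every edge and stops with an error if the graph contains a cycle.

```r
# Hypothetical helper: Kahn's algorithm on a parent -> child edge list.
topo_order <- function(from, to) {
  nodes <- unique(c(from, to))
  # Count incoming edges per node
  indegree <- setNames(rep(0L, length(nodes)), nodes)
  for (child in to) indegree[[child]] <- indegree[[child]] + 1L

  ordered <- character(0)
  ready <- names(indegree)[indegree == 0L]  # nodes without parents
  while (length(ready) > 0) {
    node <- ready[1]
    ready <- ready[-1]
    ordered <- c(ordered, node)
    # Removing `node` frees any child whose parents are now all placed
    for (child in to[from == node]) {
      indegree[[child]] <- indegree[[child]] - 1L
      if (indegree[[child]] == 0L) ready <- c(ready, child)
    }
  }
  if (length(ordered) < length(nodes)) stop("Graph contains a cycle")
  ordered
}

# Edges of the study-hours example: X -> Y, X -> Z, Z -> Y
ord <- topo_order(from = c("X", "X", "Z"), to = c("Y", "Z", "Y"))
ord
```

Simulating variables in this order ensures that every parent is generated before its children.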