knitr::opts_chunk$set( collapse = TRUE, eval = all(c( require("tidySEM") )), comment = "#>" )
Directed acyclic graphs (DAGs) are a powerful tool for expressing and testing causal assumptions.
They allow researchers to identify potential confounders or colliders,
and guide decisions about which variables to control for (or not) in statistical analyses [@cinelliCrashCourseGood2022].
DAGs can be implemented as FAIR theories, or can be derived from FAIR theories.
In this vignette, we'll illustrate how to use DAGs for causal inference in R
,
inspired by the Tripartite Model of the impact of the family on children’s emotion regulation and adjustment [@morrisRoleFamilyContext2007].
By the end of this tutorial, you will be able to:
```{asis echo = knitr::is_html_output()}
This video lecture gives an introduction to causal inference, with a focus on DAGs comprising three variables, including the distinction between confounders and colliders.
## Install and Load Required Packages We'll use the following packages: - `theorytools` for functions pertaining to FAIR theory and a sample dataset - `dagitty` for interacting with Directed Acyclic Graphs (DAGs) - `tidySEM` for visualizing DAGs Only run this code if you haven't already installed these packages: ```r install.packages("theorytools") install.packages("dagitty") install.packages("tidySEM")
library(theorytools) library(dagitty) library(tidySEM) library(ggplot2)
The tripartite model identifies three major familial influences on children's emotion regulation (ER):
These three factors, together with parent characteristics (PC) and child characteristics (CC), shape the child's emotion regulation (ER), which in turn influences the child's adjustment (A) (e.g., internalizing/externalizing problems, social competence).
The model is visually represented below:
lo <- get_layout( "", "O", "", "", "", "", "PP", "", "ER", "A", "", "EC", "", "", "", "PC", "", "CC", "", "", rows = 4 ) edg <- data.frame( from = c("EC", "EC", "EC", "ER", "O", "PC", "PC", "PC", "PC", "PP", "PP", "PP", "O"), to = c("A", "ER", "O", "A", "ER", "CC", "EC", "O", "PP", "EC", "ER", "O", "A"), curvature = c(NA, NA, 60, NA, NA, NA, NA, NA, NA, -60, NA, 60, NA)) p <- prepare_graph(edges = edg, layout = lo) p$edges$connect_from = c("right", "right", "left", "right", "right", "right", "top", "top", "top", "left", "right", "left", "right") p$edges$connect_to = c("left", "left", "left", "left", "left", "left", "left", "left", "left", "left", "left", "left", "left") p$edges$arrow <- "both" p$edges$arrow[which(p$edges$from == "PC")] <- "last" g <- plot(p) + geom_segment(aes( x = p$nodes$x[p$nodes$name == "CC"], y = p$nodes$node_ymax[p$nodes$name == "CC"]+.2, yend = (p$nodes$node_ymax[p$nodes$name == "CC"]+2), linewidth = 4), arrow = arrow(length = unit(0.03, "npc"), ends = "both")) ggsave("morris.png", g, device = "png", width = 6, height = 4) knitr::include_graphics("morris.png")
This figure, loosely based on the visual representation of the theory in the paper by @morrisRoleFamilyContext2007 (see their Figure 1, p.362). is relatively well formalized compared to the standard for developmental psychological theories. Before we can FAIRify this theory, however, we have to clarify a few things, and make some decisions.
First, notice that the model does not explicitly follow any graphing convention. It resembles a path diagram, as used to visualize a structural equation model - but notice that there is an unconnected bold double-headed arrow near CC (Child Characteristics). The written description of the text clarifies the meaning of the bold arrow: the authors argue that Child Characteristics "moderate relations between family context [and] ER" (Morris et al., 2007, p. 364).
Second, note that most arrows are bidirectional. This would be fine if we wished to specify the model as a structural equation model where associations can be undirected. However, since our goal is to perform causal inference, we need a directed acyclic graph. This means we have to assume a direction of causality for all bidirectional arrows. Reading the text helps us direct some arrows: we read that "children learn about ER through [Observation]", "parenting practices affect ER", and "ER is affected by the emotional climate". This is all causal language [@norouziCapturingCausalClaims2024]. Even though we later read that "ER and familial influences are bidirectional processes in our model", and even though some recent publications have argued that child effects on parents outweigh parents' effects on children in emotional development [@vanlissaMothersFathersQuantitative2020], for the purpose of this tutorial, we will assume only parent effects on children. It is fine to make strong assumptions when specifying theory; if they are inconsistent with the data, we can revise them later.
Third, note that it is not clear if there is a path from Parenting Practices to Adjustment. While there are paths from Observation and Emotional Climate to Adjustment, we read in the text: "although there are direct effects of the family context on children’s adjustment [...] a mediational model is proposed". Elsewhere in the text, "family context" is defined in terms of all three predictors (O, PP, and EC). We can thus either include all three direct effects, or omit them and assume full mediation. Since the latter option results in a much simpler DAG, we opt for full mediation.
We can implement the model as a DAG using the dagitty
package.
We start with an "ontology", or simply, by stating which constructs exist in our DAG:
tripartite <- dagitty('dag { O PP EC PC CC ER A
Next, we add the directed edges that were present in the original theory:
PC -> CC PC -> EC PC -> PP PC -> O
Next, we direct the edges that run from Observation, Parenting Practices, and Emotional Climate to Emotion Regulation. We omit direct effects to Adjustment, and include an effect from ER to Adjustment.
O -> ER PP -> ER EC -> ER ER -> A
Next, we incorporate the effects of Child Characteristics. These are moderation effects. In a DAG, the interaction effect of two predictors on an outcome is simply represented by specifying both predictors as common causes of the outcome. When specifying a model, however, we may have to explicitly include interaction effects. We can annotate the graph in such a way that it is clear which variables are involved in interactions with O, PP, and EC, like this:
CC -> ER [form="CC:O+CC:PP+CC:EC"];
Note that we use R's formula syntax, which you may already know from linear regression models.
Here, we specified three interaction terms, which are specified using the interaction operator :
and which are connected by the +
operator, which means that the interaction effects form a weighted sum.
Finally, we have to make a decision about the interrelations between the predictors of emotion regulation.
We could omit them, or delve further into the literature to direct them.
For now, I made the following assumptions:
PP -> O EC -> O PP -> EC }')
Note that the final line closes our DAG specification, which is stored in a variable called tripartite
.
Now, we have a complete DAG!
tripartite <- dagitty('dag { O PP EC PC CC ER A PC -> CC PC -> EC PC -> PP PC -> O O -> ER [form="CC:O"]; PP -> ER [form="CC:PP"]; EC -> ER [form="CC:EC"]; ER -> A CC -> ER [form="CC:O+CC:PP+CC:EC"]; PP -> O EC -> O PP -> EC }')
theorytools:::quizz( "A DAG can have arrows that run left to right, arrows that run right to left, and bidirectional arrows." = FALSE, "An interaction effect of X1 and X2 on Y is represented in a DAG by drawing a directed arrow from X1 to Y, and from X2 to Y." = TRUE, "Information about functional form, like `[form=\"CC:O+CC:PP+CC:EC\"]`, is part of the DAG." = FALSE)
A DAG is not yet a FAIR theory. We can FAIRify the DAG we specified above by following the steps outlined in the Making a Theory FAIR vignette. We briefly go over the steps here. First, let's save the DAG to a text file:
writeLines(tripartite, "tripartite_model.txt")
Your GitHub integration must be set up for the next step to work. You can check if everything is set up correctly by running:
worcs::check_git() worcs::check_github()
Next, we can create a FAIR theory repository:
create_fair_theory( path = file.path("c:/theories", "tripartite_model"), title = "Tripartite Model", theory_file = "tripartite_model.txt", remote_repo = "tripartite_model", add_license = "cc0")
Update the README.md
file as necessary.
Then, go to Zenodo, and switch on archiving for this GitHub repository.
Once this is done, run:
worcs::git_release_publish(repo = file.path("c:/theories", "tripartite_model"))
Finally, you can edit the metadata for the archived version on Zenodo.
The following DOI shows my version of the FAIR Tripartite Model: https://doi.org/10.5281/zenodo.14921521
There are several ways in which we can access an existing FAIR theory. If our goal is simply to use it in our analysis workflow (as is the case in this tutorial), then we can access the static archived version on Zenodo by running:
download_theory( id = "https://doi.org/10.5281/zenodo.14921521", path = "c:/theories/tripartite_downloaded")
If we want to contribute to further theory development,
we might instead prefer to create a local clone of the Git repository.
Note that it is typically a good idea to first fork the original GitHub repository to your own GitHub account,
and then clone your fork, instead of the original author's version.
Regardless, the function download_theory()
takes a Git remote address too, in which case it clones the theory:
download_theory( id = "https://github.com/cjvanlissa/tripartite_model.git", path = "c:/theories/tripartite_clone")
The difference between these two approaches is that the former only copies the statically archived files to your device, whereas the latter copies the entire Git repository with its history of commits and all branches, and continues to version control changes you make.
For this tutorial, please download my version of the Tripartite Model using download_zenodo()
.
Interoperability pertains to the ability to use a theory in scientific workflows. X-interoperability refers to the ability to use a theory for specific operations in scientific workflows. When we go through the remainder of the tutorial, we are demonstrating that our FAIR tripartite model is X-interoperable for visualization in R, for selecting control variables, for performing causal inference, et cetera. That is to say: we can directly ingest the FAIR theory in R, and interact with it in our analysis environment and use it to construct fully reproducible analysis workflows, including for causal inference.
First, let's ingest the theory into our R environment:
tripartite <- dagitty(paste(readLines("c:/theories/tripartite_downloaded/tripartite_model.txt"), collapse = "\n"))
Then, we can plot the model using the tidySEM
package:
graph_sem(tripartite)
We can optionally specify a layout for the graph, so that it resembles the model as visualized by Morris and colleagues:
lo <- get_layout( "", "O", "", "", "", "", "PP", "", "ER", "A", "", "EC", "", "", "", "PC", "", "CC", "", "", rows = 4 ) graph_sem(tripartite, layout = lo)
The tidySEM vignette on plotting structural equation models further explains how to customize this figure, which is a ggplot
object.
theorytools:::quizz( "Which of the following describes the property of X-interoperability of a FAIR theory?" = c(answer = "It can be reused for a specific operation.", "It can be reused by any person.", "It can be reused for any operation.", "It is licenced to be reused."), "If you are contributing to theory development, which platform provides infrastructure to coordinate collaborate with known others and strangers?" = c(answer = "GitHub", "Zenodo", "RStudio"))
Simulation studies allow us to explore the implications of model assumptions, to plan our analyses before data collection, to conduct power analysis and plan our sample size, and to preregister a fully reproducible analysis pipeline [Preregistration-As-Code, @peikertReproducibleResearchTutorial2021; @vanlissaComplementingPreregisteredConfirmatory2022].
Below is a simple code snippet to generate synthetic data using the theorytools::simulate_data()
function.
This function interprets the DAG as a structural equation model,
samples random values for the path coefficients (unless specific values are provided),
and assumes normal residual variances.
Note that many other functions for simulating data exist,
and some may be better suited to particular use cases.
set.seed(1) df_sim <- simulate_data(tripartite, n = 497) head(df_sim)
This synthetic dataset is consistent with the structure encoded in our Tripartite Model, though the parameter values are arbitrary. In real research, you might use prior studies or expert knowledge to set more realistic parameter values. For this tutorial, we select some ad-hoc values, and assign zero (0), small (.2), or medium (.4) effect sizes to the paths in our DAG:
tripartite_coef <- dagitty('dag { O PP EC PC CC ER A PC -> CC [beta=.4] PC -> EC [beta=.2] PC -> PP [beta=.2] PC -> O [beta=0] O -> ER [beta=.2] PP -> ER [beta=0] EC -> ER [beta=.2] ER -> A [beta=.4] CC -> ER [beta=.4]; PP -> O [beta=0] EC -> O [beta=0] PP -> EC [beta=.2] }') set.seed(51) df_sim <- simulateSEM(tripartite_coef, N = 497)
One advantage of a DAG is the ability to identify which variables should be controlled for (e.g., to avoid confounding) and which variables should not be controlled for (e.g., colliders). In dagitty
, you can use the function adjustmentSets()
to find minimal sufficient adjustment sets for estimating specific causal effects.
For instance, let’s say we want to examine the causal effect of Observation on Emotion Regulation. We can run:
adjustmentSets(tripartite, exposure="O", outcome="ER")
The DAG-based algorithm identifies which sets of variables are sufficient to block backdoor paths, given some strong assumptions [@pearlCausalDiagramsEmpirical1995]. The result suggests that it is enough to control for Child Characteristics, Emotional Climate, and Parenting Practices if we wish to obtain an unbiased estimate of the effect of Observation on Emotion Regulation. Alternatively, we can control for Parent Characteristics, Emotional Climate, and Parenting Practices.
theorytools:::quizz( "When estimating the effect of Observation on Emotion Regulation, It is fine to control for both Child Characteristics and Parent Characteristics." = FALSE, "What is the smallest possible simple adjustment set for the effect of Child Characteristics on Adjustment?" = c(answer = "PC", "EC, O, PP", "EC, PC, PP"))
Once we know which variables to control for, we can use standard regression to estimate the causal effect.
Using regression assumes that all causal effects are linear and additive, with normally distributed residuals and predictors without measurement error.
Other methods exist which do not make these assumptions.
We can use the function select_controls()
to construct a data.frame
with our exposure, outcome, and the relevant control variables.
This facilitates conducting causal inference with the appropriate control variables,
as you can just use the model formula outcome ~ exposure
to obtain the uncontrolled effect, and outcome ~ .
to obtain the causal estimate.
df_controls <- select_controls(tripartite, df_sim, exposure = "O", outcome = "ER") model_bivariate <- lm(ER ~ O, df_controls) model_causal <- lm(ER ~., df_controls) summary(model_bivariate) summary(model_causal)
theorytools:::quizz( "What is the causal effect of Observation on Emotion Regulation?" = c(0.14, .01), "There is a significant causal effect of Observation on Emotion Regulation." = TRUE, "If the DAG is correct, then `model_bivariate` gives us an unbiased estimate of the effect of Observation on Emotion Regulation." = FALSE)
When working with real data, causal inference quickly becomes more complicated.
Therefore, we provide a "real data" example here, to practice the relevant skills.
The theorytools
package includes a dataset that takes inspiration from "Growing Up in Australia - the Longitudinal Study of Australian Children" [LSAC, @australianinstituteoffamilystudiesGrowingAustralia2020].
As the LSAC dataset is accessible by explicit permission only,
this is a synthetic dataset with similar properties to the real data.
Let's access the data:
head(lsac)
Using real data, we need to find operationalizations of the theoretical constructs in our DAG, that is, we need to map the theoretical constructs onto measured variables. If we inspect the documentation of the data, we can conclude that the most likely mapping of constructs to variables is:
operationalizations <- c(PP = "warmth", EC = "relationship_quality", CC = "temperament_negreact", ER = "emotion_regulation", A = "social_functioning", PC = "coping")
Let's rename the variables, so the data and the DAG are consistent. We will also perform a rudimentary imputation, as missing data can cause problems later on. Note that there is abundant literature on best practices in handling missing data; here, we use single imputation for pragmatic reasons.
# Impute missing data df_real <- VIM::kNN(lsac, numFun = median) names(df_real) <- names(operationalizations)[match(operationalizations, names(df_real))]
# saveRDS(df_real, "df_real.RData") df_real <- readRDS("df_real.RData")
Note that one variable is missing: Observation (O). This is unfortunate, as for the previous example we used O as our exposure variable. For the remainder of the examples, we will use Parenting Practices as our exposure variable. Obtain the adjustment set for the effect of Parenting Practices on Emotion Regulation:
adjustmentSets(tripartite, exposure = "EC", outcome = "ER")
Any DAG implies conditional independencies: variables that should be statistically independent from one another, after controlling for a specific adjustment set. As a kind of assumption check, we could test if the variables in our dataset indeed show these statistical independencies. If our data show dependencies where the DAG implies independence, this can be taken as evidence against the veracity of our DAG. Of course, there are alternative explanations: how we operationalized the construct, the presence of measurement error, sampling bias, et cetera.
The function localTests
applies the d-separation criterion to determine all conditional independencies implied by a DAG,
and then performs tests for each of them.
Different types of tests are available for different types of variables.
For continuous variables (which we have),
the "cis.loess"
method provides non-parametric conditional independence tests using bootstrapped loess regression.
Let's first conduct conditional independence tests for the dataset that we simulated from the DAG. Because we're conducting a lot of significance tests, we can control the overall probability of making a Type I error (i.e., drawing false positive conclusions about discrepancies between DAG and data). We can use Bonferroni-corrected bootstrapped confidence intervals for testing, as demonstrated in the following code block:
# Get all DAG-implied conditional independencies cis <- impliedConditionalIndependencies(tripartite) # Bonferroni-corrected confidence interval bonferroni <- 1-(.05/length(cis)) # Conduct the tests ci_tests <- localTests(tripartite, df_sim, type = "cis.loess", R = 1000, tests = cis, conf.level = bonferroni) # Print result, with added significance asterisks add_significance(ci_tests)
Now, let's perform the same tests for the real data. Because we have a missing variable, we cannot test the complete set. We can filter all conditional independencies that cannot be tested. Don't forget to use a different Bonferroni correction for the resulting (smaller number of) tests!
cis_real <- filter_conditional_independencies(cis, df_real) bonferroni <- 1-(.05/length(cis_real)) # Conduct the tests ci_tests <- localTests(tripartite, df_real, type = "cis.loess", R = 1000, tests = cis_real, conf.level = bonferroni) # Print result, with added significance asterisks add_significance(ci_tests)
theorytools:::quizz( "We can use Parenting Practices because O is not in its adjustment set." = TRUE, "Which of these causal effects can we estimate using these data?" = c(answer = "All of these ", "PP -> A", "ER -> PC", "CC -> A"), "The `ci_tests` give us reasons to doubt that `df_sim` is consistent with the DAG." = FALSE, "The `ci_tests` give us reasons to doubt that `df_real` is consistent with the DAG." = TRUE, "Consider the sample size of `df_real`. Could this be related to your answers in the previous two questions?" = c(answer = "Yes", "No") )
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.