In rje42/causl: Methods for Specifying, Simulating from and Fitting Causal Models

knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
options(digits=3)

Introduction

This document describes how to use the causl package to perform plasmode simulation; that is, producing datasets that combining real covariates (and possibly treatments) with synthetic outcomes. This enables us to have realistic datasets, but also to have knowledge of the underlying causal effect, making them very useful for testing the relative effectiveness of different inference methods.

Start by loading the causl and dplyr packages.

library(dplyr)
library(causl)

Dataset

We will use the dataset from a competition held at the Atlantic Causal Inference Conference in 2016. To obtain this, first run the following commands:

install.packages("remotes")
remotes::install_github("vdorie/aciccomp/2016")

Then load the aciccomp2016 package.

library(aciccomp2016)
(dat <- as_tibble(input_2016)) # show 10 rows of first few variables

Model

Let us consider a model for the causal effect of smoking on birthweight. We will start with a binary variable $A$ to indicate whether the mother smoked during the pregnancy, and later extend this to a zero-inflated continuous variable. We start by setting up the model inputs:

forms <- list(list(),
              list(A ~ x_1 + x_3 + x_4),
              list(Y ~ A),
              list(~ 1))
# fams <- list(integer(0), 5, 1, 1)
fams <- list(integer(0), "binomial", "gaussian", 1)
pars <- list(A = list(beta=c(-1.5,0.03,0.02,0.05)),
             Y = list(beta=c(3200, -500), phi=400^2),
             cop = list(beta=-1))

Then call rfrugalParam to simulate $A$ and $Y$.

set.seed(123)   # to obtain consistent results
datAY <- rfrugalParam(formulas=forms, family=fams, pars=pars, dat=dat)

We can now check that this basic simulation was performed correctly. First we fit the treatment variable using the correct model.

glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY)
summary(glmA)$coefficients

Indeed, the parameters appear correct. Then we can use inverse probability weighting (IPW) to estimate the parameters for the outcome model. We will need to load the survey package to get robust standard errors after the weighting.

library(survey)
ps <- predict(glmA, type="response")
wt <- datAY$A/ps + (1-datAY$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY))
summary(glmY)$coef

We can also compare with a naïve method that ignores reweighting.

glmYn <- glm(Y ~ A, data=datAY)
summary(glmYn)$coef

Automatic parameter generation

We can also use the function gen_cop_pars() to sample parameters for the copula, rather than having to specify a vector for every bivariate copula. To do this, we simply provide the list of formulas, the dataset, and a range that we would like the (partial) correlations to ultimately fall into.

For example, if we want the correlations in the range $(-0.5,0)$, we can call

pars2 <- pars
pars2$cop <- gen_cop_pars(formulas = forms, data=dat, range=c(-0.5,0))

cors <- 2*expit(unlist(pars2$cop)) - 1

This method gives (partial) correlations r cors[1], r cors[2], and r cors[3].

We can then simulate using this set of parameters.

set.seed(124)
datAY2 <- rfrugalParam(formulas=forms, family=fams, pars=pars2, dat=dat)

glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY2)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY2$A/ps + (1-datAY2$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY2))
# summary(glmY)$coef
est2 <- summary(glmY)$coef[1:2,1:2]

We can check the parameters in the same manner as above, and indeed we obtain an intercept of r round(est2[1,1]) (se $=$ r round(est2[1,2],1)) and a causal effect of r round(est2[2,1],1) (r round(est2[2,2],1)).

We can also use a more complicated method for generating copula parameters. Suppose we want these to depend upon x_2, which is a factor with six levels. Then we specify

forms3 <- list(list(),
              list(A ~ x_1 + x_3 + x_4),
              list(Y ~ A),
              list(~ x_2))

and simulate as:

set.seed(125)
pars3 <- pars
pars3$cop <- gen_cop_pars(formulas = forms3, data=dat, range=c(-0.5,0))
datAY3 <- rfrugalParam(formulas=forms3, family=fams, pars=pars3, dat=dat)

glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY3)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY3$A/ps + (1-datAY3$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY3))
# summary(glmY)$coef
est3 <- summary(glmY)$coef[1:2,1:2]

This time we obtain an intercept of r round(est3[1,1]) (r round(est3[1,2],1)) and a causal effect of r round(est3[2,1],1) (r round(est3[2,2],1)). In this case the copula parameter is a vector of six entries; for example for the $Y$-$X_1$ effect this is (r round(pars3$cop$Y$x_1$beta, 2)).

Other copula families

Until now we have only used Gaussian copulas, but we can easily specify a different family. For example, suppose we wish to consider Student t-copulas. In this case we need to choose a degrees of freedom parameter, though this is easily managed with gen_cop_pars() which allows additional arguments as ....

set.seed(126)
pars4 <- pars3
(pars4$cop <- gen_cop_pars(forms3, data=dat, range=c(-0.5,0), par2=4))
fams4 <- fams
fams4[[4]] <- 2
datAY4 <- rfrugalParam(formulas=forms3, family=fams4, pars=pars4, dat=dat)

glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY4)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY4$A/ps + (1-datAY4$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY4))
# summary(glmY)$coef
est4 <- summary(glmY)$coef[1:2,1:2]

Now the estimates are r round(est4[1,1]) (r round(est4[1,2],1)) and a causal effect of r round(est4[2,1],1) (r round(est4[2,2],1)).

Strength of relationships

We also introduce the function adj_vars(), which allows users to modify the strength of the partial correlations (or other parameters) by scaling the parameters of the linear predictor. The argument factor can be modified to control the strength of the strong and weak variables respectively; the default values are 5 and 0.2. Returning to the example above, suppose that we want x_4 to be much more closely related than x_1 or x_3. Then we can apply:

set.seed(127)
pars5 <- pars3
pars_tmp <- gen_cop_pars(forms3, data=dat, range=c(-0.25,0))
(pars5$cop <- causl:::adj_vars(pars_tmp, strong="x_4", weak=c("x_1","x_3"), factor=c(2,0.25)))
fams5 <- fams
fams5[[4]] <- 5  # use Frank copula
datAY5 <- rfrugalParam(formulas=forms3, family=fams5, pars=pars5, dat=dat)

glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY5)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY5$A/ps + (1-datAY5$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY5))
# summary(glmY)$coef
est5 <- summary(glmY)$coef[1:2,1:2]