knitr::opts_chunk$set(echo = TRUE, cache = TRUE) options(digits=3)
This document describes how to use the causl
package to perform plasmode
simulation; that is, producing datasets that combining real covariates (and
possibly treatments) with synthetic outcomes. This enables us to have realistic
datasets, but also to have knowledge of the underlying causal effect, making
them very useful for testing the relative effectiveness of different inference
methods.
Start by loading the causl
and dplyr
packages.
library(dplyr) library(causl)
We will use the dataset from a competition held at the Atlantic Causal Inference Conference in 2016. To obtain this, first run the following commands:
install.packages("remotes") remotes::install_github("vdorie/aciccomp/2016")
Then load the aciccomp2016
package.
library(aciccomp2016) (dat <- as_tibble(input_2016)) # show 10 rows of first few variables
Let us consider a model for the causal effect of smoking on birthweight. We will start with a binary variable $A$ to indicate whether the mother smoked during the pregnancy, and later extend this to a zero-inflated continuous variable. We start by setting up the model inputs:
forms <- list(list(), list(A ~ x_1 + x_3 + x_4), list(Y ~ A), list(~ 1)) # fams <- list(integer(0), 5, 1, 1) fams <- list(integer(0), "binomial", "gaussian", 1) pars <- list(A = list(beta=c(-1.5,0.03,0.02,0.05)), Y = list(beta=c(3200, -500), phi=400^2), cop = list(beta=-1))
Then call rfrugalParam
to simulate $A$ and $Y$.
set.seed(123) # to obtain consistent results datAY <- rfrugalParam(formulas=forms, family=fams, pars=pars, dat=dat)
We can now check that this basic simulation was performed correctly. First we fit the treatment variable using the correct model.
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY) summary(glmA)$coefficients
Indeed, the parameters appear correct. Then we can use inverse probability
weighting (IPW) to estimate the parameters for the outcome model. We will need
to load the survey
package to get robust standard errors after the weighting.
library(survey) ps <- predict(glmA, type="response") wt <- datAY$A/ps + (1-datAY$A)/(1-ps) glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY)) summary(glmY)$coef
We can also compare with a naïve method that ignores reweighting.
glmYn <- glm(Y ~ A, data=datAY) summary(glmYn)$coef
We can also use the function gen_cop_pars()
to sample parameters for the
copula, rather than having to specify a vector for every bivariate copula. To
do this, we simply provide the list of formulas, the dataset, and a range that
we would like the (partial) correlations to ultimately fall into.
For example, if we want the correlations in the range $(-0.5,0)$, we can call
pars2 <- pars pars2$cop <- gen_cop_pars(formulas = forms, data=dat, range=c(-0.5,0))
cors <- 2*expit(unlist(pars2$cop)) - 1
This method gives (partial) correlations r cors[1]
, r cors[2]
, and r cors[3]
.
We can then simulate using this set of parameters.
set.seed(124) datAY2 <- rfrugalParam(formulas=forms, family=fams, pars=pars2, dat=dat)
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY2) # summary(glmA)$coefficients ps <- predict(glmA, type="response") wt <- datAY2$A/ps + (1-datAY2$A)/(1-ps) glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY2)) # summary(glmY)$coef est2 <- summary(glmY)$coef[1:2,1:2]
We can check the parameters in the same manner as above, and indeed we obtain
an intercept of r round(est2[1,1])
(se $=$ r round(est2[1,2],1)
) and a
causal effect of r round(est2[2,1],1)
(r round(est2[2,2],1)
).
We can also use a more complicated method for generating copula parameters. Suppose
we want these to depend upon x_2
, which is a factor with six levels. Then
we specify
forms3 <- list(list(), list(A ~ x_1 + x_3 + x_4), list(Y ~ A), list(~ x_2))
and simulate as:
set.seed(125) pars3 <- pars pars3$cop <- gen_cop_pars(formulas = forms3, data=dat, range=c(-0.5,0)) datAY3 <- rfrugalParam(formulas=forms3, family=fams, pars=pars3, dat=dat)
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY3) # summary(glmA)$coefficients ps <- predict(glmA, type="response") wt <- datAY3$A/ps + (1-datAY3$A)/(1-ps) glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY3)) # summary(glmY)$coef est3 <- summary(glmY)$coef[1:2,1:2]
This time we obtain an intercept of r round(est3[1,1])
(r round(est3[1,2],1)
) and a
causal effect of r round(est3[2,1],1)
(r round(est3[2,2],1)
). In this case
the copula parameter is a vector of six entries; for example for the $Y$-$X_1$
effect this is (r round(pars3$cop$Y$x_1$beta, 2)
).
Until now we have only used Gaussian copulas, but we can easily specify a
different family. For example, suppose we wish to consider Student t-copulas.
In this case we need to choose a degrees of freedom parameter, though this is
easily managed with gen_cop_pars()
which allows additional arguments as ...
.
set.seed(126) pars4 <- pars3 (pars4$cop <- gen_cop_pars(forms3, data=dat, range=c(-0.5,0), par2=4)) fams4 <- fams fams4[[4]] <- 2 datAY4 <- rfrugalParam(formulas=forms3, family=fams4, pars=pars4, dat=dat)
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY4) # summary(glmA)$coefficients ps <- predict(glmA, type="response") wt <- datAY4$A/ps + (1-datAY4$A)/(1-ps) glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY4)) # summary(glmY)$coef est4 <- summary(glmY)$coef[1:2,1:2]
Now the estimates are r round(est4[1,1])
(r round(est4[1,2],1)
) and a
causal effect of r round(est4[2,1],1)
(r round(est4[2,2],1)
).
We also introduce the function adj_vars()
, which allows users to modify the
strength of the partial correlations (or other parameters) by scaling the
parameters of the linear predictor. The argument factor
can be modified to
control the strength of the strong
and weak
variables respectively; the
default values are 5 and 0.2. Returning to the example above, suppose
that we want x_4
to be much more closely related than x_1
or x_3
. Then
we can apply:
set.seed(127) pars5 <- pars3 pars_tmp <- gen_cop_pars(forms3, data=dat, range=c(-0.25,0)) (pars5$cop <- causl:::adj_vars(pars_tmp, strong="x_4", weak=c("x_1","x_3"), factor=c(2,0.25))) fams5 <- fams fams5[[4]] <- 5 # use Frank copula datAY5 <- rfrugalParam(formulas=forms3, family=fams5, pars=pars5, dat=dat)
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY5) # summary(glmA)$coefficients ps <- predict(glmA, type="response") wt <- datAY5$A/ps + (1-datAY5$A)/(1-ps) glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY5)) # summary(glmY)$coef est5 <- summary(glmY)$coef[1:2,1:2]
This time the estimates are r round(est5[1,1])
(r round(est5[1,2],1)
) and a
causal effect of r round(est5[2,1],1)
(r round(est5[2,2],1)
).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.