```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.pos = "!H", out.extra = "")

# Plotting theme used throughout: theme_bw() without grid lines, with bold
# facet labels and a centered horizontal legend along the bottom.
theme_cot <- function(base_size = 11, base_family = "",
                      base_line_size = base_size / 22,
                      base_rect_size = base_size / 22,
                      legend.position = "bottom",
                      legend.box = "horizontal",
                      legend.justification = "center",
                      legend.margin = ggplot2::margin(0, 0, 0, 0),
                      legend.box.margin = ggplot2::margin(-10, -10, 0, -10)) {
  ggplot2::`%+replace%`(
    ggplot2::theme_bw(base_size = base_size,
                      base_family = base_family,
                      base_line_size = base_line_size,
                      base_rect_size = base_rect_size),
    ggplot2::theme(
      plot.title = ggplot2::element_text(hjust = 1),
      panel.grid.minor = ggplot2::element_blank(),
      panel.grid.major = ggplot2::element_blank(),
      strip.background = ggplot2::element_blank(),
      strip.text.x = ggplot2::element_text(face = "bold"),
      strip.text.y = ggplot2::element_text(face = "bold"),
      legend.position = legend.position,
      legend.box = legend.box,
      legend.justification = legend.justification,
      legend.margin = legend.margin,
      legend.box.margin = legend.box.margin
    )
  )
}
```
This R Markdown document will reproduce the tables and figures in the paper Optimal transport weights for causal inference. If this file is compiled in this directory, the tables and figures will be generated from the source files found in the `code` folder.
By default, this document will re-compile the tables and figures from the simulation studies using the already-run simulation data from the original paper. This document will also input the tables and figures from the case study. The case study can be re-run by setting the corresponding chunk to `eval = TRUE`.
This will be a lengthy process if not run on a cluster. The original files used to run the simulations on the cluster are found in the subfolder `code/original_sim` for both the convergence analysis and the Hainmueller simulations. How to run these on a cluster is described below.
To re-run these simulations and/or recompile the tables and figures, several libraries are needed. For just compiling the tables and figures, here are the libraries that should be loaded:
```r
library(causalOT)
library(dplyr)
library(ggplot2)
library(scales)
library(ggsci)
library(xtable)
library(cowplot)
library(tidyr)
```
For re-running the simulations, the following packages are needed:
```r
library(causalOT)
library(doRNG)
library(Rmosek)
```
as well as the Python libraries `numpy`, `scipy`, `pykeops`, and `geomloss`.
## Hainmueller simulations

To generate the figures and tables for the simulations, we can use the following code:

```r
source("code/Hainmueller.R")
```
This will take the simulations for the Hainmueller setting and calculate the summary statistics. It then generates the LaTeX table shown in Table 1. Some of the references in the table will not resolve since this document is not compiled with the larger paper's .bib file.
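For orientation, the table-writing step follows the usual xtable pattern. The sketch below is only a hedged illustration with placeholder column names and values; the actual summary statistics and formatting live in `code/Hainmueller.R`:

```r
# Hypothetical sketch of how a summary data frame becomes tables/hainmueller.tex;
# the column names and values here are placeholders.
summary_df <- data.frame(method = c("COT", "SBW"), rmse = c(NA_real_, NA_real_))
tab <- xtable::xtable(summary_df, caption = "Placeholder caption")
print(tab, file = "tables/hainmueller.tex")
```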
Note that this will download the simulation data if it is not already found in the `data` folder of this workflow.
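The download step follows the standard download-if-missing pattern, roughly as sketched below; the file name and URL here are placeholders, and the real ones are defined in `code/Hainmueller.R`:

```r
# Hypothetical download-if-missing pattern; the actual file name and URL
# live in code/Hainmueller.R.
data_file <- file.path("data", "hainmueller_results.rds")  # placeholder name
if (!file.exists(data_file)) {
  dir.create("data", showWarnings = FALSE)
  download.file("https://example.com/hainmueller_results.rds",  # placeholder URL
                destfile = data_file, mode = "wb")
}
sim_res <- readRDS(data_file)
```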
To re-run this data, I recommend using a cluster. There are several other files in `code/original_sim` to discuss:
- `hain_setting_array.R` generates a setting array that the cluster run will refer to in order to set up the simulation settings.
- `generate_seeds.R` generates a list of seeds to be used by the analysis (and also for the convergence simulations); see the sketch after this list for how a cluster task might use these files.
- `combine_sim_res.R` will combine the raw simulation results into one folder.
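As a hedged illustration of how the setting array and seed list fit together, a single cluster task might look roughly like this; the file names are assumptions, and the real logic lives in `code/original_sim/Hainmueller`:

```r
# Hypothetical sketch of a single SLURM array task; file names are placeholders.
task_id  <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
settings <- readRDS("data/setting_array.rds")  # assumed output of hain_setting_array.R
seeds    <- readRDS("data/seeds.rds")          # assumed output of generate_seeds.R
setting  <- settings[task_id, ]
set.seed(seeds[task_id])
# ... run one simulation replicate under `setting` and save the result ...
```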
The files in `code/original_sim/Hainmueller` will re-run the raw analysis. The command
``` {bash eval = FALSE}
sbatch --array=1-1500 hain.sh
```

should re-run the results as desired.

### Output

At the end we get the following table, which should match the paper.

```{=latex}
\input{tables/hainmueller.tex}
\newpage
```
## Convergence and confidence intervals

Similarly, the convergence and confidence interval simulations can be compiled with the following code chunk:

```r
source("code/convergence.R")
```
Note that this will download the simulation data if it is not already found in the `data` folder of this workflow.
To re-run this data, I again recommend using a cluster. There are several other files in `code/original_sim`, as mentioned previously:
- `generate_seeds.R` generates a list of seeds to be used by the analysis (and also for the convergence simulations).
- `combine_sim_res.R` will combine the raw simulation results into one folder.
Then the files in `code/original_sim/Convergence` will re-run the analysis. On a Slurm-based cluster, running
``` {bash eval = FALSE}
sbatch --array=1-1000 conv.sh
```

should be sufficient.

### Outputs for convergence

Then we can look at the plots demonstrating convergence in terms of the Sinkhorn divergence and the $L_2$ norm.

```{=latex}
\begin{figure}[htb]
\includegraphics[width=\textwidth]{figures/gray_compare_meth_conv_wass}
\caption{Convergence of the weights to the distributions specified by the empirical distributions (top) and the distributions specified by the true propensity score/Radon-Nikodym derivatives (bottom). Weights are Causal Optimal Transport (COT), Nearest Neighbor Matching (NNM), a probit model (GLM), and Stable Balancing Weights (SBW). Lines denote means across 1000 simulations. Both axes are on the log scale.}
\label{fig:wass_conv_B}
\end{figure}

\begin{figure}[htb]
\includegraphics[width=\textwidth]{figures/gray_compare_meth_conv_l2}
\caption{Convergence of the estimated weights to the values of the true inverse propensity score in terms of the $L_2$ norm. Weights are Causal Optimal Transport (COT), Nearest Neighbor Matching (NNM), a probit model (GLM), and Stable Balancing Weights (SBW). Lines denote means across 1000 simulations. Both axes are on the log scale.}
\label{fig:l2_conv_B}
\end{figure}
\newpage
```
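For reference, the $L_2$ metric in the second figure translates directly into R; a minimal sketch, assuming `w` holds the estimated weights and `w_star` the true inverse propensity scores:

```r
# L2 distance between estimated weights and true inverse propensity scores,
# as described in the figure caption above.
l2_dist <- function(w, w_star) {
  sqrt(sum((w - w_star)^2))
}
```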
This analysis also produces the following figures for confidence interval coverage.
```{=latex}
\begin{figure}[!tb]
\centering
\includegraphics[width=\linewidth]{figures/gray_ot_meth_ci_coverage_both}
\caption{Coverage of the true treatment effect}
\end{figure}

\begin{figure}[!tb]
\includegraphics[width=\linewidth]{figures/gray_ot_meth_ci_coverage_both_expectation}
\caption{Coverage of the estimated average treatment effect}
\end{figure}
\newpage
```
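As a hedged sketch of what these figures summarize: coverage is the fraction of simulation replicates whose confidence interval contains the target effect. The column names below are assumptions, not the ones actually used in `code/convergence.R`:

```r
# Hypothetical coverage calculation; `res` is assumed to have interval
# endpoints `lwr` and `upr`, and `tau` is the target treatment effect.
coverage <- function(res, tau) {
  mean(res$lwr <= tau & tau <= res$upr)
}
```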
## Algorithm check

In the paper, we offer an algorithm for tuning the hyperparameters of the optimal transport distances. We can generate the figures for the algorithm check with this code:

```r
source("code/algorithm.R")
```
To re-run this data, I used a cluster. There are several other files in `code/original_sim`, as mentioned previously:
- `generate_seeds.R` generates a list of seeds to be used by the analysis (and also for the convergence simulations).
- `combine_sim_res.R` will combine the raw simulation results into one folder.
Then the files in `code/original_sim/Algorithm` will re-run the analysis. On a Slurm-based cluster, run
``` {bash eval = FALSE}
sbatch --array=1-1000 algor.sh
```

### Outputs for algorithm

This generates the following figures.

```{=latex}
\begin{figure}[!tb]
\centering
\includegraphics[width=\linewidth]{figures/algorithm_check_treated}
\caption{Selection of penalty parameter for the treated}
\end{figure}

\begin{figure}[!tb]
\includegraphics[width=\linewidth]{figures/algorithm_check_control}
\caption{Selection of penalty parameter for the control}
\end{figure}
\newpage
```
A few notes: the Anderson-Darling statistic in this case measures the closeness of the estimated weights, $w$, to the Radon-Nikodym derivatives, $w^\star$:
$$
\frac{1}{n} \sum_{i=1}^n \frac{(w_i - w_i^\star)^2}{w_i^\star (1 - w_i^\star)}.
$$
This gives us a sense of how well the chosen penalty parameter, $\lambda$, yields weights that approximate the true Radon-Nikodym derivatives. As we can see, the algorithm finds the optimal hyperparameter on average as the sample size grows.
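The statistic above translates directly into R; a minimal sketch, with `w` the estimated weights and `w_star` the true Radon-Nikodym derivatives:

```r
# Anderson-Darling-type statistic from the formula above.
ad_stat <- function(w, w_star) {
  mean((w - w_star)^2 / (w_star * (1 - w_star)))
}
```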
## Case study

The data come from a study by Blum et al. examining the effect of misoprostol versus oxytocin in stopping post-partum hemorrhage. The data are described in the documentation for the `causalOT` package, which can be accessed by running `?causalOT::pph`. A more detailed description is found in the paper Optimal transport weights for causal inference.
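For a quick look at the data (assuming, as the help page suggests, that `pph` ships as a dataset with the package):

```r
# Load and inspect the post-partum hemorrhage data from causalOT.
data("pph", package = "causalOT")
head(pph)
```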
The analysis can be run by running the following chunk with `eval = TRUE`:

```r
source("code/misoprostol.R")
```
We now turn to the tables and figures generated by this analysis. We can see that the effect estimates look good for the optimal transport methods!

```{=latex}
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{figures/pph}
\caption{Results for treatment effect estimation averaged across treatment groups and study sites. The grey vertical line is the original treatment effect estimate for the entire study, while the dotted vertical lines are the original confidence interval. The weighting methods under examination are logistic regression (GLM), Covariate Balancing Propensity Score (CBPS), Stable Balancing Weights (SBW), Synthetic Control Method (SCM), Nearest Neighbour Matching (NNM), and Causal Optimal Transport (COT).}
\label{fig:pph}
\end{figure}
\newpage
```
COT also has the best performance in terms of coverage!

```{=latex}
\input{tables/pph.tex}
\newpage
```
This document reproduces the results for the paper Optimal transport weights for causal inference by Eric Dunipace. Any questions or concerns can be addressed by filing an issue at https://github.com/ericdunipace/causalOT/issues.