```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.pos = "!H", out.extra = "")

# Plotting theme used throughout: theme_bw() without grid lines, with bold
# facet labels and a centered horizontal legend along the bottom.
theme_cot <- function(base_size = 11, base_family = "",
                      base_line_size = base_size / 22,
                      base_rect_size = base_size / 22,
                      legend.position = "bottom",
                      legend.box = "horizontal",
                      legend.justification = "center",
                      legend.margin = ggplot2::margin(0, 0, 0, 0),
                      legend.box.margin = ggplot2::margin(-10, -10, 0, -10)) {
  ggplot2::`%+replace%`(
    ggplot2::theme_bw(base_size = base_size,
                      base_family = base_family,
                      base_line_size = base_line_size,
                      base_rect_size = base_rect_size),
    ggplot2::theme(
      plot.title = ggplot2::element_text(hjust = 1),
      panel.grid.minor = ggplot2::element_blank(),
      panel.grid.major = ggplot2::element_blank(),
      strip.background = ggplot2::element_blank(),
      strip.text.x = ggplot2::element_text(face = "bold"),
      strip.text.y = ggplot2::element_text(face = "bold"),
      legend.position = legend.position,
      legend.box = legend.box,
      legend.justification = legend.justification,
      legend.margin = legend.margin,
      legend.box.margin = legend.box.margin
    )
  )
}
```
This R Markdown document will reproduce the tables and figures in the paper Optimal transport weights for causal inference. If this file is compiled in this directory, the tables and figures will be generated from the source files found in the `code` folder.
By default, this document will re-compile the tables and figures from the simulation studies using the already-run simulation data from the original paper. This document will also input the tables and figures from the case study. The case study can be re-run by setting the corresponding chunk to `eval = TRUE`.
This will be a lengthy process if not run on a cluster. The original files used to run the simulations on the cluster are found in the subfolder `code/original_sim` for both the convergence analysis and the Hainmueller simulations. How to run these on a cluster is described below.
To re-run these simulations and/or recompile the tables and figures, several libraries are needed. For just compiling the tables and figures, here are the libraries that should be loaded:
```r
library(causalOT)
library(dplyr)
library(ggplot2)
library(scales)
library(ggsci)
library(xtable)
library(cowplot)
library(tidyr)
```
For re-running the simulations, the following packages are needed:
```r
library(causalOT)
library(doRNG)
library(Rmosek)
```
as well as the Python libraries `numpy`, `scipy`, `pykeops`, and `geomloss`.
## Hainmueller simulations

To generate the figures and tables for the simulations, we can use the following code:

```r
source("code/Hainmueller.R")
```
This will take the simulations for the Hainmueller setting and calculate the summary statistics. It then generates the LaTeX table shown in Table 1. Some of the references in the table will not resolve since this document is not compiled with the larger paper's .bib file.
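For orientation, the table-writing step follows the usual xtable pattern. The sketch below is only a hedged illustration with placeholder column names and values; the actual summary statistics and formatting live in `code/Hainmueller.R`:

```r
# Hypothetical sketch of how a summary data frame becomes tables/hainmueller.tex;
# the column names and values here are placeholders.
summary_df <- data.frame(method = c("COT", "SBW"), rmse = c(NA_real_, NA_real_))
tab <- xtable::xtable(summary_df, caption = "Placeholder caption")
print(tab, file = "tables/hainmueller.tex")
```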
Note that this will download the simulation data if it is not already found in the `data` folder of this workflow.
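The download step follows the standard download-if-missing pattern, roughly as sketched below; the file name and URL here are placeholders, and the real ones are defined in `code/Hainmueller.R`:

```r
# Hypothetical download-if-missing pattern; the actual file name and URL
# live in code/Hainmueller.R.
data_file <- file.path("data", "hainmueller_results.rds")  # placeholder name
if (!file.exists(data_file)) {
  dir.create("data", showWarnings = FALSE)
  download.file("https://example.com/hainmueller_results.rds",  # placeholder URL
                destfile = data_file, mode = "wb")
}
sim_res <- readRDS(data_file)
```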
To re-run this data, I recommend using a cluster. There are several other files in `code/original_sim` to discuss:
- `hain_setting_array.R` generates a setting array that the cluster run will refer to in order to set up the simulation settings.
- `generate_seeds.R` generates a list of seeds to be used by the analysis (and also for the convergence simulations); see the sketch after this list for how a cluster task might use these files.
- `combine_sim_res.R` will combine the raw simulation results into one folder.
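As a hedged illustration of how the setting array and seed list fit together, a single cluster task might look roughly like this; the file names are assumptions, and the real logic lives in `code/original_sim/Hainmueller`:

```r
# Hypothetical sketch of a single SLURM array task; file names are placeholders.
task_id  <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
settings <- readRDS("data/setting_array.rds")  # assumed output of hain_setting_array.R
seeds    <- readRDS("data/seeds.rds")          # assumed output of generate_seeds.R
setting  <- settings[task_id, ]
set.seed(seeds[task_id])
# ... run one simulation replicate under `setting` and save the result ...
```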
The files in `code/original_sim/Hainmueller` will re-run the raw analysis. The command
``` {bash eval = FALSE}
sbatch --array=1-1500 hain.sh
```

should re-run the results as desired.

### Output

At the end we get the following table, which should match the paper.

```{=latex}
\input{tables/hainmueller.tex}
\newpage
```
## Convergence and confidence intervals

Similarly, the convergence and confidence interval simulations can be compiled with the following code chunk:

```r
source("code/convergence.R")
```
Note that this will download the simulation data if it is not already found in the `data` folder of this workflow.
To re-run this data, I again recommend using a cluster. There are several other files in `code/original_sim`, as mentioned previously:
- `generate_seeds.R` generates a list of seeds to be used by the analysis (and also for the convergence simulations).
- `combine_sim_res.R` will combine the raw simulation results into one folder.
Then the files in `code/original_sim/Convergence` will re-run the analysis. On a Slurm-based cluster, running
``` {bash eval = FALSE}
sbatch --array=1-1000 conv.sh
```

should be sufficient.

### Outputs for convergence

Then we can look at the plots demonstrating convergence in terms of the Sinkhorn divergence and the $L_2$ norm.

```{=latex}
\begin{figure}[htb]
\includegraphics[width=\textwidth]{figures/gray_compare_meth_conv_wass}
\caption{Convergence of the weights to the distributions specified by the empirical distributions (top) and the distributions specified by the true propensity score/Radon-Nikodym derivatives (bottom). Weights are Causal Optimal Transport (COT), Nearest Neighbor Matching (NNM), a probit model (GLM), and Stable Balancing Weights (SBW). Lines denote means across 1000 simulations. Both axes are on the log scale.}
\label{fig:wass_conv_B}
\end{figure}

\begin{figure}[htb]
\includegraphics[width=\textwidth]{figures/gray_compare_meth_conv_l2}
\caption{Convergence of the estimated weights to the values of the true inverse propensity score in terms of the $L_2$ norm. Weights are Causal Optimal Transport (COT), Nearest Neighbor Matching (NNM), a probit model (GLM), and Stable Balancing Weights (SBW). Lines denote means across 1000 simulations. Both axes are on the log scale.}
\label{fig:l2_conv_B}
\end{figure}
\newpage
```
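For reference, the $L_2$ metric in the second figure translates directly into R; a minimal sketch, assuming `w` holds the estimated weights and `w_star` the true inverse propensity scores:

```r
# L2 distance between estimated weights and true inverse propensity scores,
# as described in the figure caption above.
l2_dist <- function(w, w_star) {
  sqrt(sum((w - w_star)^2))
}
```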
This analysis also produces the following figures for confidence interval coverage.
```{=latex}
\begin{figure}[!tb]
\centering
\includegraphics[width=\linewidth]{figures/gray_ot_meth_ci_coverage_both}
\caption{Coverage of the true treatment effect}
\end{figure}

\begin{figure}[!tb]
\includegraphics[width=\linewidth]{figures/gray_ot_meth_ci_coverage_both_expectation}
\caption{Coverage of the estimated average treatment effect}
\end{figure}
\newpage
```
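As a hedged sketch of what these figures summarize: coverage is the fraction of simulation replicates whose confidence interval contains the target effect. The column names below are assumptions, not the ones actually used in `code/convergence.R`:

```r
# Hypothetical coverage calculation; `res` is assumed to have interval
# endpoints `lwr` and `upr`, and `tau` is the target treatment effect.
coverage <- function(res, tau) {
  mean(res$lwr <= tau & tau <= res$upr)
}
```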
## Algorithm check

In the paper, we offer an algorithm for tuning the hyperparameters of the optimal transport distances. We can generate the figures for the algorithm check with this code:

```r
source("code/algorithm.R")
```
To re-run this data, I used a cluster. There are several other files in `code/original_sim`, as mentioned previously:
- `generate_seeds.R` generates a list of seeds to be used by the analysis (and also for the convergence simulations).
- `combine_sim_res.R` will combine the raw simulation results into one folder.
Then the files in `code/original_sim/Algorithm` will re-run the analysis. On a Slurm-based cluster, run
``` {bash eval = FALSE}
sbatch --array=1-1000 algor.sh
```

### Outputs for algorithm

This generates the following figures.

```{=latex}
\begin{figure}[!tb]
\centering
\includegraphics[width=\linewidth]{figures/algorithm_check_treated}
\caption{Selection of penalty parameter for the treated}
\end{figure}

\begin{figure}[!tb]
\includegraphics[width=\linewidth]{figures/algorithm_check_control}
\caption{Selection of penalty parameter for the control}
\end{figure}
\newpage
```
A few notes: the Anderson-Darling statistic in this case measures the closeness of the estimated weights, $w$, to the Radon-Nikodym derivatives, $w^\star$:
$$
\frac{1}{n} \sum_{i=1}^n \frac{(w_i - w_i^\star)^2}{w_i^\star (1 - w_i^\star)}.
$$
This gives us a sense of how well the chosen penalty parameter, $\lambda$, yields weights that approximate the true Radon-Nikodym derivatives. As we can see, the algorithm finds the optimal hyperparameter on average as the sample size grows.
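The statistic above translates directly into R; a minimal sketch, with `w` the estimated weights and `w_star` the true Radon-Nikodym derivatives:

```r
# Anderson-Darling-type statistic from the formula above.
ad_stat <- function(w, w_star) {
  mean((w - w_star)^2 / (w_star * (1 - w_star)))
}
```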
## Case study

The data come from a study by Blum et al. examining the effect of misoprostol versus oxytocin in stopping post-partum hemorrhage. The data are described in the documentation for the `causalOT` package, which can be accessed by running `?causalOT::pph`. A more detailed description is found in the paper Optimal transport weights for causal inference.
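For a quick look at the data (assuming, as the help page suggests, that `pph` ships as a dataset with the package):

```r
# Load and inspect the post-partum hemorrhage data from causalOT.
data("pph", package = "causalOT")
head(pph)
```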
The analysis can be run by running the following chunk with `eval = TRUE`:

```r
source("code/misoprostol.R")
```
We now turn to the tables and figures generated by this analysis. We can see that the effect estimates look good for the optimal transport methods!

```{=latex}
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{figures/pph}
\caption{Results for treatment effect estimation averaged across treatment groups and study sites. The grey vertical line is the original treatment effect estimate for the entire study, while the dotted vertical lines are the original confidence interval. The weighting methods under examination are logistic regression (GLM), Covariate Balancing Propensity Score (CBPS), Stable Balancing Weights (SBW), Synthetic Control Method (SCM), Nearest Neighbour Matching (NNM), and Causal Optimal Transport (COT).}
\label{fig:pph}
\end{figure}
\newpage
```
COT also has the best performance in terms of coverage!

```{=latex}
\input{tables/pph.tex}
\newpage
```
This document reproduces the results for the paper Optimal transport weights for causal inference by Eric Dunipace. Any questions or concerns can be addressed by filing an issue at https://github.com/ericdunipace/causalOT/issues.