```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```r
library(didgformula)
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(forcats)
library(splines)
library(glue)

theme_set(theme_minimal())
```
This vignette implements an example analysis using parallel trends g-formula estimators and data included in the package. The scientific question is, "what effect would a U.S. federal stay-at-home order have had on all-cause mortality in the early months of the COVID-19 pandemic?" Mathematically, this question can be expressed as
$$
\mathbb{E}[Y_t(\bar a)] - \mathbb{E}[Y_t]
$$
where $Y_t$ is an indicator (0/1) of mortality in week $t$ in the observable data, and $Y_t(\bar a)$ is that variable under intervention $\bar A=\bar a$, where $\bar a$ refers to the intervention "set all states to be under stay-at-home order during weeks 0 through $t$."
The data are included in the package as:
data("stayathome2020") stayathome2020
The dataset consists of state-week level observations for all 43 states that ever implemented a stay-at-home order during spring-summer 2020, for 12 weeks (`week` = 0, 1, ..., 11) starting April 4. The data consist of mortality counts (`mort`) from the National Death Index, state population estimates (`pop`) based on the 2015-2019 American Community Survey, counts of new COVID-19 cases in the previous week (`case_gr_1wk`) and the previous four weeks (`case_gr_4wk`) from the Johns Hopkins COVID-19 dashboard, and an indicator for whether a state was under a stay-at-home or shelter-in-place order (`stayathome`) from the COVID-19 US State Policy database (CUSP). See `?stayathome2020` for more information about sources.
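As a quick sanity check on that description (not essential to the analysis), we can count the states and weeks in the panel:

```r
# should show 43 states and 12 weeks, per the description above
stayathome2020 %>%
  summarise(n_states = n_distinct(state), n_weeks = n_distinct(week))
```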
First we do some recoding in preparation for the analysis:
```r
sah = stayathome2020

#the analysis is simpler if the exposure variable is 0 under the intervention in question
sah$open = 1 - sah$stayathome

#coding the outcome as an Nx2 matrix for binomial likelihoods
sah$mort_bin = cbind(sah$mort, sah$pop - sah$mort)

#coding lags
sah$mort_lag = sah %>% group_by(state) %>% mutate(x = dplyr::lag(mort)) %>% pull()
sah$mort_bin_lag = cbind(sah$mort_lag, sah$pop - sah$mort_lag)

sah = sah %>%
  group_by(state) %>%
  append_lags(n_lags = 11, lag_vars = c('open', 'case_gr_4wk'), default = NA) %>%
  ungroup()

#most modeling uses a dataset starting with week=1, so create a separate dataset to simplify
sah_mod = sah[sah$week > 0, ]
```
`append_lags()` is a convenience function provided in `didgformula` that is essentially a wrapper around `dplyr`'s `lag()`, allowing the user to quickly add multiple lags of multiple variables to the dataset.
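As a quick illustration on a made-up toy tibble (not part of the analysis), `append_lags()` adds lagged copies of a variable within each group; the new columns follow the `{var}_lag{n}` naming pattern used throughout this vignette (e.g. `open_lag1`, `case_gr_4wk_lag11`):

```r
# Toy illustration only; the data here are invented.
toy = tibble(id   = rep(c('A', 'B'), each = 3),
             week = rep(0:2, times = 2),
             x    = 1:6)

toy %>%
  group_by(id) %>%
  append_lags(n_lags = 2, lag_vars = 'x', default = NA) %>%
  ungroup()
# this should add columns x_lag1 and x_lag2, computed within each id
```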
Here we estimate the counterfactual rates using three estimators: the iterated conditional expectation (ICE) estimator, based on outcome modeling; the inverse-probability-of-treatment-weighted (IPTW) estimator, based on treatment modeling; and the targeted maximum likelihood estimator (TMLE), which combines the two and is doubly robust. We attempt to fit reasonably flexible models (e.g., splines, interaction terms) for the outcome and treatment to the extent the data allow.
Note that we specify the estimators to assume parallel trends on the additive scale, which is the default in the `*_pipeline()` functions in `didgformula`; other scales are available by passing a function to the argument `pt_link_fun` (a hypothetical sketch of this option appears after the estimation code below). That is, we assume:
$$
\mathbb{E}[Y_t(\bar a)-Y_{t-1}(\bar a)|\bar A_{k-1}=\bar a_{k-1}, \bar W_k] = \mathbb{E}[Y_t(\bar a)-Y_{t-1}(\bar a)|\bar A_k=\bar a_k, \bar W_k]
$$

for $k=1,\ldots,t$. In this example, we consider $W$ to be the percentage growth in new COVID-19 cases in the state over the previous week. For simplicity we also assume that $\mathbb E[Y_t|\bar A_t, \bar W_t]=\mathbb E[Y_t|\bar A_t, W_t, W_{t-1}]$, i.e., that the outcome model depends only on the most recent values of $W$. This simplifies outcome modeling for the sake of exposition, but we may not want to assume it in realistic applications.
```r
# a function to set the exposure variable equal to the intervened status
intervene = function(df_obs) {
  df_obs %>%
    mutate(across(contains('open'), ~ifelse(is.na(.x), NA, 0)))
}

# convenience function for "positive part" of a variable, useful for keeping
# non-degenerate parts of time*exposure interaction terms
pos <- function(x) ifelse(x < 0, 0, x)

# append IPTW weights to the dataset for use in TMLE
sah_mod$w0 = 1/(1 - predict(glm(open ~ ns(week, df=3) + log(case_gr_4wk),
                                data = sah_mod[sah_mod$open_lag1 == 0, ],
                                family = binomial),
                            newdata = sah_mod, type = 'response'))
sah_mod$w0 = sah_mod %>% group_by(state) %>% mutate(w0 = cumprod(w0)) %>% pull(w0)

for (n in 1:11) {
  sah_mod[[glue('w{n}')]] = sah_mod %>%
    group_by(state) %>%
    mutate(w = lag(.data[[glue('w{n-1}')]])) %>%
    pull(w)
}

#currently, tmle needs to subset by vars A_lag{n} so we append these
sah_mod <- sah_mod %>%
  mutate(A = open) %>%
  group_by(state) %>%
  append_lags(11, 'A', default = NA) %>%
  ungroup()

ice_estimate = ice_pipeline_long(df_obs = sah_mod,
                                 df_interv = intervene(sah_mod),
                                 inside_formula_t = 'mort_bin ~ factor(open)/ns(week, df=3) + log(case_gr_4wk)',
                                 inside_formula_tmin1 = 'mort_bin_lag ~ factor(open)/ns(week, df=3) + lag(case_gr_4wk)',
                                 outside_formula = 'preds ~ factor(open_lag{n})/ns(week, df={min(3, pos(Tt-n-5)+1)}) + log(case_gr_4wk_lag{n})',
                                 Tt = 11,
                                 t_col = sah_mod$week,
                                 n_nested = 7,
                                 inside_family = binomial,
                                 pt_link_fun = NULL,
                                 binomial_n = sah_mod$pop,
                                 models = TRUE)

iptw_estimate = iptw_pipeline_long(df_obs = sah_mod,
                                   df_interv = intervene(sah_mod),
                                   den_formula = 'open ~ ns(week, df=3) + log(case_gr_4wk)*open_lag1',
                                   num_formula = 'open ~ ns(week, df=3) + open_lag1',
                                   family = binomial,
                                   yvar = 'mort',
                                   ylagvar = 'mort_lag',
                                   idvar = 'state',
                                   timevar = 'week',
                                   pt_link_fun = NULL,
                                   binomial_n = sah_mod$pop)

tmle_estimate = ice_pipeline_long(df_obs = sah_mod,
                                  df_interv = intervene(sah_mod),
                                  inside_formula_t = 'mort_bin ~ factor(open)/ns(week, df=3) + log(case_gr_4wk)',
                                  inside_formula_tmin1 = 'mort_bin_lag ~ factor(open)/ns(week, df=3) + lag(case_gr_4wk)',
                                  outside_formula = 'preds ~ factor(open_lag{n})/ns(week, df={min(3, pos(Tt-n-5)+1)}) + log(case_gr_4wk_lag{n})',
                                  Tt = 11,
                                  t_col = sah_mod$week,
                                  n_nested = 7,
                                  inside_family = binomial,
                                  pt_link_fun = NULL,
                                  binomial_n = sah_mod$pop,
                                  tmle = TRUE,
                                  weights = 'w{n}')

ice_estimate
iptw_estimate
tmle_estimate
```
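As an aside on the `pt_link_fun` option mentioned earlier: the calls above use `pt_link_fun = NULL`, i.e., additive-scale parallel trends. The commented sketch below is hypothetical; it assumes that passing a transformation such as `log` requests parallel trends on that transformed (multiplicative) scale, which is not shown in this vignette, so check the package documentation for the link functions actually supported before using it.

```r
## Hypothetical sketch (not run): same IPTW analysis, but assuming parallel
## trends on the log scale by passing a function to pt_link_fun. Whether `log`
## is a supported choice is an assumption here, not something demonstrated above.
# iptw_log_scale = iptw_pipeline_long(df_obs = sah_mod,
#                                     df_interv = intervene(sah_mod),
#                                     den_formula = 'open ~ ns(week, df=3) + log(case_gr_4wk)*open_lag1',
#                                     num_formula = 'open ~ ns(week, df=3) + open_lag1',
#                                     family = binomial,
#                                     yvar = 'mort', ylagvar = 'mort_lag',
#                                     idvar = 'state', timevar = 'week',
#                                     pt_link_fun = log,   # assumed usage; NULL above means additive scale
#                                     binomial_n = sah_mod$pop)
```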
Under the stated assumptions, the estimator pipelines return estimates of

$$
\mathbb{E}[Y_t(\bar a)-Y_{t-1}(\bar a)]
$$

Since the method only helps identify counterfactual trends (i.e., differences over time), it is our responsibility to combine these with quantities that identify counterfactual levels to arrive at estimates of the parameter of interest. In the current setting, the whole population represented in the dataset (i.e., the 43 states that ever implemented stay-at-home orders) began follow-up with exposure at the intervened status (i.e., under stay-at-home order), which means that, by causal consistency,
$$ \mathbb E[Y_0(\bar a)] = \mathbb E[Y_0] $$
Thus we can write the target parameter $\mathbb E[Y_t(\bar a)]$ as

$$
\mathbb E[Y_t(\bar a)] = \mathbb E[Y_0] + \sum_{k=1}^t \mathbb{E}[Y_k(\bar a)-Y_{k-1}(\bar a)]
$$

We substitute estimates (under the stated assumptions) of the right-hand-side quantities to estimate the target parameter:
```r
EY0 = sah %>% filter(week == 0) %>% summarise(EY0 = sum(mort)/sum(pop)) %>% pull()

make_rates = function(estimates, EY0) {
  estimates %>%
    tibble::add_row(t = 0, estimate = 0, .before = 1) %>%
    mutate(estimate = EY0 + cumsum(estimate))
}

combine_estimates(iptw = iptw_estimate %>% make_rates(EY0),
                  ice = ice_estimate %>% make_rates(EY0),
                  Tt = 11) %>%
  left_join(make_rates(tmle_estimate, EY0) %>% rename(tmle = estimate))
```
Rate estimates are reasonably similar across the three estimators, though the ICE and TMLE estimates are clearly smoother owing to the parametric (spline) form of the outcome models. To better interpret these counterfactual rate estimates, we convert them to death counts and compare them with the observed death counts:
```r
make_death_counts = function(estimates, EY0, pop) {
  estimates %>%
    make_rates(EY0) %>%
    mutate(deaths = estimate * pop) %>%
    select(t, deaths)
}

make_lives_saved = function(estimates, data) {
  EY0 = data %>% filter(week == 0) %>% summarise(EY0 = sum(mort)/sum(pop)) %>% pull()
  counterfactual_deaths = make_death_counts(estimates, EY0, pop = sum(data$pop[data$week == 0])) %>%
    pull(deaths) %>%
    sum()
  factual_deaths = sum(data$mort)
  factual_deaths - counterfactual_deaths
}

make_lives_saved(ice_estimate, sah)
make_lives_saved(iptw_estimate, sah)
make_lives_saved(tmle_estimate, sah)
```
Analytic uncertainty estimators are not yet available for `didgformula`. However, since the estimators implemented are asymptotically normal, we can use Wald confidence intervals based on the nonparametric bootstrap. Here we assume that the observed population represents an iid sample from a superpopulation.
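Concretely, for a point estimate $\hat\theta$ (a counterfactual rate or a count of lives saved) with bootstrap standard error $\widehat{\mathrm{se}}_{\text{boot}}$, the intervals computed below are

$$
\hat\theta \pm 1.96 \times \widehat{\mathrm{se}}_{\text{boot}}
$$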
We perform 500 bootstrap resamples, exploiting the group structure of the data. That is, the usual bootstrap places probability $1/N$ on each observation; when the data fall into $J$ mutually exclusive groups of sizes $n_j$, $j=1,2,\ldots,J$, within which the data are constant, it is equivalent to drawing group counts from a multinomial distribution that places probability $n_j/N$ on group $j$. Here, we group observations by state and week of mortality ($0, 1, \ldots, 11$, or later).
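To make this equivalence concrete, here is a small self-contained check (not part of the analysis; the group sizes are invented): resampling $N$ individual records with replacement and tabulating by group matches a multinomial draw over groups with probabilities $n_j/N$.

```r
# Toy check only; group sizes are invented for illustration.
set.seed(1)
n_j = c(500, 300, 200)                  # hypothetical group sizes n_1, n_2, n_3
N   = sum(n_j)
ids = rep(seq_along(n_j), times = n_j)  # one record per individual, labeled by group

# distribution of the bootstrap count for group 1 under the two schemes
iid_counts   = replicate(2000, sum(sample(ids, N, replace = TRUE) == 1))
multi_counts = rmultinom(2000, size = N, prob = n_j / N)[1, ]

# both should have mean near n_1 = 500 and matching spread
c(mean(iid_counts), sd(iid_counts))
c(mean(multi_counts), sd(multi_counts))
```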
```r
set.seed(40)
R = 500 #number of resamples

#an equivalent function to *_pipeline()s to estimate rates under the observed treatments
obs_pipeline <- function(data) {
  data %>%
    filter(week > 0) %>%
    group_by(t = week) %>%
    summarise(estimate = sum(mort)/sum(pop) - sum(mort_lag)/sum(pop))
}

# bootstrap based on iid sampling -----------------------------------------

#add 'censored' observations - this is used in bootstrapping the aggregate data
boot_template = sah %>%
  mutate(A = open) %>%
  group_by(state) %>%
  append_lags(11, 'A', default = NA) %>%
  ungroup() %>%
  bind_rows(sah %>%
              group_by(state, pop) %>%
              summarise(mort = max(pop) - sum(mort)) %>%
              mutate(week = 12)) %>% #just have to remember that 12=never
  arrange(state, week)

boot_append_weights = function(df) {
  df$w0 = 1/(1 - predict(glm(open ~ ns(week, df=3) + log(case_gr_4wk),
                             data = df[df$open_lag1 == 0, ],
                             family = binomial),
                         newdata = df, type = 'response'))
  df$w0 = df %>% group_by(state) %>% mutate(w0 = cumprod(w0)) %>% pull(w0)
  for (n in 1:11) {
    df[[glue('w{n}')]] = df %>%
      group_by(state) %>%
      mutate(w = lag(.data[[glue('w{n-1}')]])) %>%
      pull(w)
  }
  df
}

boot_agg = tibble(rep = 1:R) %>%
  mutate(data = map(rep, ~boot_template %>%
                      #bootstrap the data with the 'censored' row
                      mutate(mort = rmultinom(1, sum(mort), prob = mort/sum(mort))[, 1]) %>%
                      group_by(state) %>%
                      mutate(pop = sum(mort),
                             mort_lag = dplyr::lag(mort),
                             mort_bin = cbind(mort, pop),
                             mort_bin_lag = cbind(mort_lag, pop)) %>%
                      ungroup() %>%
                      filter(week < 12)),
         mod_data = map(data, filter, week > 0) %>% map(boot_append_weights),
         iptw = map(mod_data, ~iptw_pipeline_long(df_obs = .,
                                                  df_interv = intervene(.),
                                                  den_formula = 'open ~ ns(week, df=3) + log(case_gr_4wk)*open_lag1',
                                                  num_formula = 'open ~ ns(week, df=3) + open_lag1',
                                                  family = binomial,
                                                  yvar = 'mort', ylagvar = 'mort_lag',
                                                  idvar = 'state', timevar = 'week',
                                                  pt_link_fun = NULL,
                                                  binomial_n = .$pop)),
         ice = map(mod_data, ~ice_pipeline_long(df_obs = .,
                                                df_interv = intervene(.),
                                                inside_formula_t = 'mort_bin ~ factor(open)/ns(week, df=3) + log(case_gr_4wk)',
                                                inside_formula_tmin1 = 'mort_bin_lag ~ factor(open)/ns(week, df=3) + lag(case_gr_4wk)',
                                                outside_formula = 'preds ~ factor(open_lag{n})/ns(week, df={min(3, pos(Tt-n-5)+1)}) + log(case_gr_4wk_lag{n})',
                                                Tt = 11, t_col = .$week, n_nested = 7,
                                                inside_family = binomial,
                                                pt_link_fun = NULL,
                                                binomial_n = .$pop)),
         tmle = map(mod_data, ~ice_pipeline_long(df_obs = .,
                                                 df_interv = intervene(.),
                                                 inside_formula_t = 'mort_bin ~ factor(open)/ns(week, df=3) + log(case_gr_4wk)',
                                                 inside_formula_tmin1 = 'mort_bin_lag ~ factor(open)/ns(week, df=3) + lag(case_gr_4wk)',
                                                 outside_formula = 'preds ~ factor(open_lag{n})/ns(week, df={min(3, pos(Tt-n-5)+1)}) + log(case_gr_4wk_lag{n})',
                                                 Tt = 11, t_col = .$week, n_nested = 7,
                                                 inside_family = binomial,
                                                 pt_link_fun = NULL,
                                                 binomial_n = .$pop,
                                                 tmle = TRUE, weights = 'w{n}')),
         obs = map(mod_data, obs_pipeline))

#calculate standard errors for the relevant quantities
boot_ses = boot_agg %>%
  pivot_longer(iptw:obs, names_to = 'method') %>%
  mutate(EY0 = map(data, ~with(.[.$week == 0, ], sum(mort)/sum(pop))),
         rates = map2(value, EY0, make_rates),
         lives = map2_dbl(value, data, make_lives_saved)) %>%
  unnest(rates) %>%
  group_by(t, method) %>%
  summarise(rate_mean = mean(estimate),
            rate_se = sd(estimate),
            lives_mean = mean(lives),
            lives_se = sd(lives))
```
Estimate the number of lives saved with 95% confidence intervals:
```r
lives_estimates_se = tibble(method = c('iptw', 'ice', 'tmle'),
                            lives = sapply(list(iptw_estimate, ice_estimate, tmle_estimate),
                                           make_lives_saved, sah)) %>%
  left_join(boot_ses %>%
              ungroup() %>%
              filter(method != 'obs') %>%
              select(method, se = lives_se) %>%
              unique()) %>%
  mutate(lwr = lives - 1.96*se,
         upr = lives + 1.96*se)

lives_estimates_se
```
Clearly the IPTW estimator is less precise than the other two estimators in this example. We can visualize the estimated rates as follows:
```r
df_figure = list(ice = ice_estimate,
                 iptw = iptw_estimate,
                 tmle = tmle_estimate,
                 obs = obs_pipeline(sah)) %>%
  map(make_rates, EY0) %>%
  bind_rows(.id = 'method') %>%
  left_join(boot_ses %>% select(method, t, se = rate_se)) %>%
  mutate(lwr = estimate - 1.96*se,
         upr = estimate + 1.96*se)

df_figure %>%
  mutate(across(estimate:upr, ~.*1e5),
         Estimator = case_when(method == 'ice' ~ 'Iterated conditional',
                               method == 'iptw' ~ 'Inverse probability weighted',
                               method == 'tmle' ~ 'Targeted maximum likelihood',
                               method == 'obs' ~ 'Observed rates')) %>%
  mutate(Estimator = fct_reorder(Estimator, .x = 1*(Estimator != 'Observed rates'))) %>%
  ggplot(aes(x = t, y = estimate, ymin = lwr, ymax = upr, color = Estimator, fill = Estimator)) +
  geom_line() +
  geom_ribbon(alpha = 0.2) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = 'Weeks since April 6, 2020',
       y = 'U.S. all-cause mortality rate per 100k (95% CI)')
```