ATbounds: An R Vignette

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

ATbounds is an R package that provides estimation and inference methods for bounding average treatment effects (on the treated) that are valid under an unconfoundedness assumption. The bounds are designed to be robust in challenging situations, for example, when the the conditioning variables take on a large number of different values in the observed sample, or when the overlap condition is violated. This robustness is achieved by only using limited "pooling" of information across observations. Namely, the bounds are constructed as sample averages over functions of the observed outcomes such that the contribution of each outcome only depends on the treatment status of a limited number of observations. No information pooling across observations leads to so-called "Manski bounds" [@manski1989anatomy; @manski1990nonparametric], while unlimited information pooling leads to standard inverse propensity score weighting. The ATbounds package provides inference methods for exploring the intermediate range between these two extremes.

The methodology used in the ATbounds package is described in detail in Lee and Weidner (2021), "Bounding Treatment Effects by Pooling Limited Information across Observations," "Bounding Treatment Effects by Pooling Limited Information across Observations," available at https://arxiv.org/abs/2111.05243.

We begin by calling the ATbounds package.

library(ATbounds)

Case Study 1: Bounding the Effects of a Job Training Program

To illustrate the usefulness of the package, we first use the well-known @LaLonde-AER dataset available on Rajeev Dehejia's web page at http://users.nber.org/~rdehejia/nswdata2.html.

LaLonde's Experimental Sample

We fist look at LaLonde's original experimental sample.

  nsw_treated <- read.table("http://users.nber.org/~rdehejia/data/nsw_treated.txt")
  colnames(nsw_treated) <- c("treat","age","edu","black","hispanic",
                           "married","nodegree","RE75","RE78")

  nsw_control <- read.table("http://users.nber.org/~rdehejia/data/nsw_control.txt")
  colnames(nsw_control) <- c("treat","age","edu","black","hispanic",
                           "married","nodegree","RE75","RE78")

The outcome variable is RE78 (earnings in 1978). The binary treatment indicator is treat (1 if treated, 0 if not treated). We now combine the treatment and control samples and define the variables.

  nsw <- rbind(nsw_treated,nsw_control)
  attach(nsw)
  D <- treat  
  Y <- (RE78 > 0) 

In this vignette, we define the outcome to be whether employed in 1978 (that is, earnings in 1978 are positive).

The LaLonde dataset is from the National Supported Work Demonstration (NSW), which is a randomized controlled temporary employment program. In view of that, we set the reference propensity score to be independent of covariates.

  rps <- rep(mean(D),length(D))  

The average treatment effect is obtained by

  ate_nsw <- mean(D*Y)/mean(D)-mean((1-D)*Y)/mean(1-D)
  print(ate_nsw)

Alternatively, we run simple regression

  model <- lm(Y ~ D)
  summary(model)
  confint(model)

The 95% confidence interval $[0.01,0.14]$ is rather wide but excludes zero.

Dehejia-Wahba Sample

@DehejiaWahba-JASA and @DehejiaWahba-RESTAT extract a further subset of LaLonde's NSW experimental data to obtain a subset containing information on RE74 (earnings in 1974).

  detach(nsw)
  nswre_treated <- read.table("http://users.nber.org/~rdehejia/data/nswre74_treated.txt")
  colnames(nswre_treated) <- c("treat","age","edu","black","hispanic",
                           "married","nodegree","RE74","RE75","RE78")

  nswre_control <- read.table("http://users.nber.org/~rdehejia/data/nswre74_control.txt")
  colnames(nswre_control) <- c("treat","age","edu","black","hispanic",
                           "married","nodegree","RE74","RE75","RE78")
  nswre <- rbind(nswre_treated,nswre_control)
  attach(nswre)
  D <- treat  
  Y <- (RE78 > 0) 
  X <- cbind(age,edu,black,hispanic,married,nodegree,RE74/1000,RE75/1000)

The covariates are as follows:

If we assume that the Dehejia-Wahba sample still preserves initial randomization, we can set the reference propensity score to be independent of covariates. However, it may not be the case and therefore, our approach can provide a robust method to check whether the Dehejia-Wahba sample can be viewed as as a random sample from a randomized controlled experiment.

We first define the the reference propensity score.

  rps <- rep(mean(D),length(D))  

Using this reference propensity score, the average treatment effect is obtained by

  ate_nswre <- mean(D*Y)/mean(D)-mean((1-D)*Y)/mean(1-D)
  print(ate_nswre)

Alternatively, we run simple regression

  model <- lm(Y ~ D)
  summary(model)
  confint(model)

The resulting 95% confidence interval $[0.02,0.20]$ is wide but again excludes zero.

Bounds on ATE

We now introduce our bounds on the average treatment effect (ATE).

  bns_nsw <- atebounds(Y, D, X, rps)

In implementing atebounds, there are several options:

The clusters are constructed via hierarchical, agglomerative clustering with complete linkage, where the distance is measured by the Euclidean distance after studentizing each of the covariates. As mentioned above, the number $m$ of clusters is set by $$m = \left\lceil \frac{n}{L} \right\rceil$$ for $L = 10$.

We show the summary results saved in bns_nsw.

  summary(bns_nsw)

Note that the estimate of the lower bound is larger than that of the upper bound. This crossing problem can occur in finite samples due to random sampling errors. While statistical theory guarantees that this problem cannot occur asymptotically, it is desirable to have a non-empty confidence interval in applications. We therefore use the method in @stoye2020 to obtain a valid confidence interval that is never empty.

The 95% confidence interval $[-0.01,0.19]$ obtained here is similar to the previous interval $[0.02,0.20]$, which was obtained under the assumption that the Dehejia-Wahba sample is a random sample from NSW. This suggests that first, there is no evidence against violation of the random sampling assumption in the Dehejia-Wahba sample and second, our inference method does not suffer from unduly enlargement of the confidence interval to achieve robustness, although a null effect is now included in our confidence interval.

With $Q=2$, the bounds for ATE are

summary(atebounds(Y, D, X, rps, Q = 2))

With $Q=4$, the bounds for ATE are

summary(atebounds(Y, D, X, rps, Q = 4))

Recall that the point estimate of ATE under the random sampling assumption was

print(ate_nswre)

Thus, the ATE estimate based on the simple mean difference well within the 95% confidence intervals across $Q=2, 3, 4$.

Recall that covariates $X$ include non-discrete variables: RE74 and RE75. As a default option, atebounds chooses the number of hierarchical clusters to be n_hc = ceiling(length(Y)/10). To check sensitivity to n_hc, we now run the following:

summary(atebounds(Y, D, X, rps))
summary(atebounds(Y, D, X, rps, n_hc = ceiling(length(Y)/5)))
summary(atebounds(Y, D, X, rps, n_hc = ceiling(length(Y)/20)))

It can be seen that the alternative estimation result is more similar to the default one with n_hc = ceiling(length(Y)/20) than with n_hc = ceiling(length(Y)/5). Overall, the results are qualitatively similar across the three different specifications of n_hc.

Finally, to see what is saved in bns_nsw, we now print it out.

  print(bns_nsw)

The output list contains:

Bounds on ATT

We now look at bounds on the average treatment effect on the treated (ATT).

  bns_nsw_att <- attbounds(Y, D, X, rps)
  summary(bns_nsw_att)

We experiment with Q.

  summary(attbounds(Y, D, X, rps, Q = 2))
  summary(attbounds(Y, D, X, rps, Q = 4))

We also experiment with n_hc.

summary(attbounds(Y, D, X, rps))
summary(attbounds(Y, D, X, rps, n_hc = ceiling(length(Y)/5)))
summary(attbounds(Y, D, X, rps, n_hc = ceiling(length(Y)/20)))

Bound estimates cross (that is, the lower bound is larger than the upper bound) and furthermore sensitive to the choice of Q and n_hc. This is likely to be driven by the relatively small sample size. However, once we factor into sampling uncertainty and look at the confidence intervals, all the results are more or less similar.

NSW treated and PSID control

The Dehejia-Wahba sample can be regarded as a data scenario where the propensity score is known and satisfies the overlap condition. We now turn to a different data scenario where it is likely that the propensity score is unknown and may not satisfy the overlap condition.

  psid2_control <- read.table("http://users.nber.org/~rdehejia/data/psid2_controls.txt")
  colnames(psid2_control) <- c("treat","age","edu","black","hispanic",
                           "married","nodegree","RE74","RE75","RE78")
  psid <- rbind(nswre_treated,psid2_control)
  detach(nswre)
  attach(psid)
  D <- treat  
  Y <- (RE78 > 0) 
  X <- cbind(age,edu,black,hispanic,married,nodegree,RE74/1000,RE75/1000)

Here, we use one of non-experimental comparison groups constructed by LaLonde from the Population Survey of Income Dynamics, namely PSID2 controls. We now estimate the reference propensity score using the sample proportion and obtain the new bound estimates:

  rps_sp <- rep(mean(D),length(D))  
  bns_psid <- atebounds(Y, D, X, rps_sp)
  summary(bns_psid)

The confidence interval $[-0.35,0.31]$ here is much larger than the confidence interval $[-0.01,0.19]$ with the Dehejia-Wahba sample. It is worth noting that the sample proportion is unlikely to be correctly specified in the NSW-treated/PSID-control sample. Thus, it seems that our inference method produces a wider confidence interval in order to be robust against misspecification of the propensity scores.

We now consider the Manski bounds, which can be obtained by setting $Q = 1$.

  summary(atebounds(Y, D, X, rps_sp, Q=1))

We can see that the Manski bounds are even larger and the same regardless of the specification of the reference propensity scores. Recall the Manski bounds do not impose the unconfoundedness assumption and do not rely on any pooling information (hence, it does not matter how to specify the reference propensity score).

Finally, we obtain the bounds for the average treatment effect on the treated (ATT).

  summary(attbounds(Y, D, X, rps_sp))
  detach(psid)

We find that the bounds on ATT are wide, as in ATE.

Case Study 2: Bounding the Effect of Right Heart Catheterization

As a second empirical example, we revisit the well-known Right Heart Catheterization Dataset. In particular, we apply our methods to @Connors1996's study of the efficacy of right heart catheterization (RHC), which is a diagnostic procedure for directly measuring cardiac function in critically ill patients. This dataset has been subsequently used in the context of limited overlap by @crump2009dealing, @Rothe:2017, and @Li-et-al:2018 among others. The dataset is publicly available on the Vanderbilt Biostatistics website at https://hbiostat.org/data.

In this example, the dependent variable is 1 if a patient survived after 30 days of admission, and 0 if a patient died within 30 days. The binary treatment variable is 1 if RHC was applied within 24 hours of admission, and 0 otherwise. The sample size was $n = 5735$, and 2184 patients were treated with RHC. There are a large number of covariates: @hirano2001estimation constructed 72 variables from the dataset and the same number of covariates were considered in both @crump2009dealing and @Li-et-al:2018 and 50 covariates were used in @Rothe:2017. In our exercise, we constructed the same 72 covariates. A cleaned version of the dataset is available in the package.

  Y <- RHC[,"survival"]
  D <- RHC[,"RHC"]
  X <- as.matrix(RHC[,-c(1,2)])

As in the aforementioned papers, we estimated the propensity scores by a logit model with all 72 covariates being added linearly.

  # Logit estimation of propensity score
  glm_ps <- stats::glm(D~X,family=binomial("logit"))
  ps <- glm_ps$fitted.values
  ps_treated <- ps[D==1]
  ps_control <- ps[D==0]
  # Plotting histograms of propensity scores
  df <- data.frame(cbind(D,ps))
  colnames(df)<-c("RHC","PS")
  df$RHC <- as.factor(df$RHC)
  levels(df$RHC) <- c("No RHC (Control)", "RHC (Treated)")

  ggplot2::ggplot(df, ggplot2::aes(x=PS, color=RHC, fill=RHC)) +
    ggplot2::geom_histogram(breaks=seq(0,1,0.1),alpha=0.5,position="identity")

The figure above shows the histograms of estimated propensity scores for treated and control groups. It is very similar to Fig.1 in @crump2009dealing and to Figure 2 in @Rothe:2017. The support of the estimated propensity scores are almost on the unit interval for both treated and control units, although there is some visual evidence on limited overlap (that is, control units have much fewer propensity scores close to 1).

Bounding the Average Treatment Effect on the Treated

In this section, we focus on ATT. We first estimate ATT by the normalized inverse probability weighted estimator: $$\begin{aligned} \widehat{\text{ATT}}{\text{PS}} := \frac{\sum{i=1}^n D_i Y_i}{\sum_{i=1}^n D_i} - \frac{\sum_{i=1}^n (1-D_i) W_i Y_i}{\sum_{i=1}^n (1-D_i) W_i}, \end{aligned}$$ where $W_i := \widehat{p}(X_i)/[1-\widehat{p}(X_i)]$ and $\widehat{p}(X_i)$ is the estimated propensity score for observation $i$ based on the logit model described above. See, e.g., equation (3) and discussions in @busso2014new for details of the normalized inverse probability weighted ATT estimator. The estimator $\widehat{\text{ATT}}{\text{PS}}$ requires that the assumed propensity score model is correctly specified and the overlap condition is satisfied. The resulting estimate is $\widehat{\text{ATT}}{\text{PS}} = -0.0639$.

  # ATT normalized estimation
  y1_att <- mean(D*Y)/mean(D)
  att_wgt <- ps/(1-ps)
  y0_att_num <- mean((1-D)*att_wgt*Y)
  y0_att_den <- mean((1-D)*att_wgt)
  y0_att <- y0_att_num/y0_att_den
  att_ps <- y1_att - y0_att
  print(att_ps)

We now turn to our methods. As before, we take the reference propensity score to be $\widehat{p}{\text{RPS}}(X_i) = n^{-1} \sum{i=1}^n D_i$ for each observation $i$. That is, we assign the sample proportion of the treated to the reference propensity scores uniformly for all observations. Of course, this is likely to be misspecified; however, it has the advantage that $1/\widehat{p}_{\text{RPS}}(X_i)$ is never close to 0 or 1.

  rps <- rep(mean(D),length(D))  

The resulting inverse reference-propensity-score weighted ATT estimator is $$\begin{aligned} \widehat{\text{ATT}}{\text{RPS}} := \frac{\sum{i=1}^n D_i Y_i}{\sum_{i=1}^n D_i} - \frac{\sum_{i=1}^n (1-D_i) Y_i}{\sum_{i=1}^n (1-D_i)} = -0.0507. \end{aligned}$$

  att_rps <- mean(D*Y)/mean(D) - mean((1-D)*Y)/mean(1-D)
  print(att_rps)

When the sample proportion is used as the propensity score estimator, there is no difference between unnormalized and normalized versions of ATT estimates. In fact, it is simply the mean difference between treatment and control groups.

   Xunique <- mgcv::uniquecombs(X) # A matrix of unique rows from X
   print(c("no. of unique rows:", nrow(Xunique))) 
   print(c("sample size       :", nrow(X)))   

Note that none of the covariates in the observed sample are identical. We therefore implement hierarchical, agglomerative clustering with complete linkage.

We show empirical results using the default option.

  summary(attbounds(Y, D, X, rps))

to check sensitivity with respect to tuning parameters, we vary $L$ and $Q$.

  # Bounding  ATT: sensitivity analysis
  # not run to save time
  nhc_set <- c(5, 10, 20)
  results_att <- {}

  for (hc in nhc_set){
    nhc <- ceiling(length(Y)/hc)

    for (q in c(1,2,3,4)){
      res <- attbounds(Y, D, X, rps, Q = q, n_hc = nhc)
      results_att <- rbind(results_att,c(hc,q,res$est_lb,res$est_ub,res$ci_lb,res$ci_ub))
    }
  }
  colnames(results_att) = c("L","Q","LB","UB","CI-LB","CI-UB")
  print(results_att, digits = 3)

The table below reports estimation results of ATT bounds for extended values of $L$ and $Q$. When $Q=1$, our estimated bounds correspond to Manski bounds, which includes zero and is wide with the interval length of almost one in all cases of $L$. Our bounds with $Q=1$ are different across $L$ because we apply hierarchical clustering before obtaining Manski bounds. With $Q=2$, the bounds shrink so that the estimated upper bound is zero for all cases of $L$; with $Q = 3$, they shrink even further so that the upper end point of the 95% confidence interval excludes zero. Among three different values of $L$, the case of $L=5$ gives the tightest confidence interval but in this case, the lower bound is larger than the upper bound, indicating that the estimates might be biased. In view of that, we take the bound estimates with $L=10$ as our preferred estimates $[-0.077, -0.039]$ with the 95% confidence interval $[-0.117,0.006]$. When $Q=4$, the lower bound estimates exceed the upper bound estimates with $L = 5, 10$. However, the estimates with $L = 20$ give almost identical results to our preferred estimates. It seems that the pairs of $(L, Q) = (10, 3)$ or $(L, Q) = (20, 4)$ provide reasonable estimates.

 L   Q       LB       UB    CI-LB    CI-UB

 5   1   -0.638    0.282   -0.700    0.330
     2   -0.131   -0.000   -0.174    0.033
     3   -0.034   -0.048   -0.076   -0.007
     4   -0.006   -0.073   -0.079   -0.006
10   1   -0.664    0.307   -0.766    0.376
     2   -0.169    0.004   -0.216    0.039
     3   -0.077   -0.039   -0.117   -0.006
     4   -0.049   -0.057   -0.090   -0.016
20   1   -0.675    0.316   -0.843    0.430
     2   -0.178   -0.005   -0.238    0.034
     3   -0.099   -0.046   -0.149   -0.007
     4   -0.065   -0.060   -0.112   -0.017

Table: ATT Bounds: Right Heart Catheterization Study

The study of @Connors1996 offered a conclusion that RHC could cause an increase in patient mortality. Based on our preferred estimates, we can exclude positive effects with confidence. This conclusion is based solely on the unconfoundedness condition, but not on the overlap condition, nor on the correct specification of the logit model. Overall, our estimates seem to be consistent with the qualitative findings in @Connors1996.

Bounding the Average Treatment Effect

We now turn to bounds on ATE. Using again the sample proportion of the treated as the reference propensity score, we bound the ATE.

We start with $Q=1$.

  summary(atebounds(Y, D, X, rps, Q = 1))

We now consider $Q=2$

  summary(atebounds(Y, D, X, rps, Q = 2))

and $Q=3$.

  summary(atebounds(Y, D, X, rps, Q = 3))

Finally, we take $Q = 4$.

  summary(atebounds(Y, D, X, rps, Q = 4))

Overall, the results are similar to ATT.

Case Study 3: EFM

The electronic fetal monitoring (EFM) and cesarean section (CS) dataset from @EFMdata consists of observations on 14,484 women who delivered at Beth Israel Hospital, Boston from January 1970 to December 1975. The purpose of the study is to evaluate the impact of EFM on cesarean section (CS) rates. @EFMdata report that relevant confounding factors are: nulliparity (nullipar), arrest of labor progression (arrest), malpresentation (breech), and year of study (year). The dataset included in the R package is from the supplementary materials of @RRW-JASA, who used this dataset to illustrate their proposed methods for modeling and estimating relative risk and risk difference. In this dataset, all covariates are discrete.

  Y <- EFM[,"cesarean"]
  D <- EFM[,"monitor"]
  X <- as.matrix(EFM[,c("arrest", "breech", "nullipar", "year")])
  year <- EFM[,"year"] 
  ate_rps <- mean(D*Y)/mean(D) - mean((1-D)*Y)/mean(1-D)
  print(ate_rps)

We take the reference propensity score to be the sample proportion of the treatment.

  rps <- rep(mean(D),length(D))  
  print(rps[1])

Bounding the Average Treatment Effect

Using again the sample proportion of the treated as the reference propensity score, we bound the ATE.

We start with $Q=1$.

  summary(atebounds(Y, D, X, rps, Q = 1, x_discrete = TRUE))

We now consider $Q=2$

  summary(atebounds(Y, D, X, rps, Q = 2, x_discrete = TRUE))

and $Q=3$.

  summary(atebounds(Y, D, X, rps, Q = 3, x_discrete = TRUE))

In this example, the reference propensity score is close to 0.5, thus implying that the results will be robust even if we take a very large $Q$. In view of that, we take $Q = 5, 10, 20, 50, 100$.

  summary(atebounds(Y, D, X, rps, Q = 5, x_discrete = TRUE))
  summary(atebounds(Y, D, X, rps, Q = 10, x_discrete = TRUE))
  summary(atebounds(Y, D, X, rps, Q = 20, x_discrete = TRUE))
  summary(atebounds(Y, D, X, rps, Q = 50, x_discrete = TRUE))
  summary(atebounds(Y, D, X, rps, Q = 100, x_discrete = TRUE))

Overall, the empirical results suggest that there is no significant effect of EFM on cesarean section rates.

References



Try the ATbounds package in your browser

Any scripts or data that you put into this service are public.

ATbounds documentation built on Nov. 25, 2021, 1:06 a.m.