In seschaub/robomit: Robustness Checks for Omitted Variable Bias

# install and load packages
if(!require(tidyverse)){
  install.packages("tidyverse")
  require(tidyverse)}


if(!require(Ecdat)){
  install.packages("Ecdat")
  require(Ecdat)}

if(!require(robomit)){
  install.packages("robomit")
  require(robomit)}

if(!require(foreign)){
  install.packages("foreign")
  require(foreign)}

Summary

The recently developed methodological framework by @oster:2019 (hereafter Oster framework) helps to understand if inferences based on estimation results are likely to hold despite omitted variables bias. robomit [@robomit] implements the Oster framework in R and offers features for sensitivity analyses of the estimates parameters of the Oster framework and visualization of those sensitivity analyses.

Statement of need

Researchers frequently encounter omitted variables bias in their estimations in nonexperimental work, which can lead to flawed inferences. Recent methodological developments help understand whether inferences of these estimations are likely to hold despite this bias [@imbens2003; @harada2013; @oster:2019; @cinelli2020] (see @oster:2019 and @cinelli2020 for a recent overview).

The Oster framework offers an option to compute i) the bias-adjusted treatment effect or correlation, $\beta^{}$, and ii) the degree of selection on unobservables relative to observables (with respect to the treatment variable) that would be necessary to eliminate the result, $\delta^{}$, using standard regression output. Thus, researchers can assess the potential severity of the omitted variables bias for their inferences. The two variables, $\beta^{}$ and $\delta^{}$, can be estimated by only specifying two parameters.

Here, we present the R-package robomit [@robomit] and its features. robomit implements the Oster framework, i.e., the estimation of $\beta^{}$ and $\delta^{}$, for linear cross-sectional and panel models. Additionally, robomit offers features for sensitivity analyses of $\beta^{}$ and $\delta^{}$ (concerning the sample and external parameter specification) and their visualization.^[The Oster framework is available in Stata under the command psacalc.]

The remaining sections introduce a) briefly omitted variable bias and the central intuition of the Oster framework, and b) the functions of robomit.

Framework

Omitted variable bias occurs when we omit one or more (unobserved) variables in our estimation correlated with our independent variable and our treatment variable, i.e., variable of interest. This omission might be because we do not observe these omitted variables, which are so-called unobservables. Let's assume the simple case and that we want to estimate [e.g., @stock2007; @wooldridge2016]:

\begin{align} Y = \alpha + \beta T + \varphi X + \omega Z + u. \end{align}

Where $Y$ is the outcome variable, $T$ the treatment variable, $X$ a vector of observed control variables (i.e., observables), $Z$ an unobserved variable (i.e., unobservables), and $u$ the error term. If $Z$ can be represented as a function of $T$, let's say $Z=\psi+\eta T+e$ (where $e$ is the error term), and $Z$ is not part of our estimation, we estimate:

\begin{align} Y = (\alpha + \omega \psi) + (\beta + \omega \eta) T + \varphi X + (u + \omega e). \end{align}

Thus, we estimate a biased coefficient for $T$ when we omit $Z$.

@oster:2019 developed a framework based on @altonji2005 to assess the potential severity of selection on unobservables (i.e., omitted variable bias). The central intuition of the Oster framework is that the omitted variable bias is proportional to the coefficient movements scaled by the movement of the $R^{2}$. Using this intuition, @oster:2019 presents a framework to approximate $\beta^{*}$, which is the bias-adjusted treatment effect (if the estimation is causal) or the bias-adjusted treatment correlation. The approximation is:

\begin{align} \beta^{*} \approx \widetilde{\beta} - \delta (\dot{\beta} - \widetilde{\beta}) \frac{R^{2}_{max} - \widetilde{R}^{2}}{\widetilde{R}^{2} - \dot{R}^{2}}. \end{align}

Where $\widetilde\beta$ is the coefficient of the controlled model (i.e., the intermediate regression of $Y$ on $T$ and $X$), $\delta$ the value of relative importance of the selection of the observed variables compared to the unobserved variables, $\dot{\beta}$ the coefficient of the uncontrolled model (i.e., the auxiliary regression of $Y$ on $T$), $R_{max}^{2}$ the $R^{2}$ of the hypothetical model (i.e., the hypothetical regression of $Y$ on $T$, $X$, and $Z$), $\widetilde R^{2}$ the $R^{2}$ of the controlled model, and $\dot{R}^{2}$ the $R^{2}$ of the uncontrolled model. Hence, we only need to define the values of $\delta$ and $R_{max}^{2}$ to estimate $\beta^{*}$ while all other values are automatically derived from the regression output. As a default, it is often assumed that $\delta=1$ and $R_{max}^{2}=1.3\widetilde\beta$ [@altonji2005; @oster:2019]. $R_{max}^{2}=1.3\widetilde R^{2}$ is based on the 90%-survival rate of results of randomized studies [@oster:2019]. Researchers should also examine other values for $R_{max}^{2}$ to understand the sensitivity of the results to the specified value, especially when $\widetilde R^{2}$ is low (robomit also implements a sensitivity analysis of $R_{max}^{2}$).

Next to estimating $\beta^{}$ we can estimate $\delta^{}$, which is the degree of selection on unobservables relative to observables (with respect to the treatment variable) that would be necessary to produce $\beta=\hat\beta$. $\delta^{*}$ is defined as:

\begin{align} \delta^{*} = \frac{ \begin{gathered} (\widetilde{\beta} - \hat{\beta}) (\widetilde{R}^{2} - \dot{R}^{2}) \hat{\sigma}^{2}{Y} \hat{\tau}{X} + (\widetilde{\beta} - \hat{\beta}) \hat{\sigma}^{2}{X} \hat{\tau}{X} (\dot{\beta} - \widetilde{\beta})^{2} + \ 2 (\widetilde{\beta} - \hat{\beta})^{2} (\hat{\tau}{X}(\dot{\beta} - \widetilde{\beta}) \hat{\sigma}^{2}{X}) + (\widetilde{\beta} - \hat{\beta})^{3} (\hat{\tau}{X} \hat{\sigma}^{2}{X} - \hat{\tau}^{2}{X}) \end{gathered}} { \begin{gathered} (R^{2}{max} - \widetilde{R}^{2}) \hat{\sigma}^{2}{Y} (\dot{\beta} - \widetilde{\beta}) \hat{\sigma}^{2}{X} + (\widetilde{\beta} - \hat{\beta}) (R^{2}{max} - \widetilde{R}^{2}) \hat{\sigma}^{2}{Y} (\hat{\sigma}^{2}{X})-\hat{\tau}{X}) + \ (\widetilde{\beta} - \hat{\beta})^{2} (\hat{\tau}{X}(\dot{\beta} - \widetilde{\beta}) \hat{\sigma}^{2}{X}) + (\widetilde{\beta} - \hat{\beta})^{3} (\hat{\tau}{X} \hat{\sigma}^{2}{X}-\hat{\tau}^{2}_{X}) \end{gathered}}. \end{align}

Where $\sigma_{Y}^{2}$ is the variance of $Y$, $\sigma_{X}^{2}$ the variance of $X$, and $\hat\tau_{X}$ the variance of this residual in the sample. To estimate $\delta^{*}$ researchers need only to specify $\hat\beta$ and $R_{max}^{2}$. $\hat\beta$ is commonly defined as $\hat\beta=0$ [@oster:2019].

Demonstration of the robomit package

The demonstration uses a cross-sectional dataset including sales prices for houses in the city of Windsor (Canada) in 1987, which is taken from @anglin1996 using Ecdat [@Ecdat]). The dataset contains information about house sales prices (Canadian dollars), the lot size of the property (in square feet), and other control variables. In our demonstration, we are interested in the correlation of house sales prices (dependent variable; log-transformed) and the lot size of the property (treatment variable; log-transformed) and how robust this correlation is to the potential inclusion of unobservables (i.e., omitted variable bias). This analysis aims to illustrate the functions of robomit and not to build a causal model.

Estimation of $\beta^{}$ and $\delta^{}$

First, we estimate $\beta^{}$ and $\delta^{}$ using o_beta and o_delta, respectively, of robomit:

# load dataset
data("Housing")    # cross sectional dataset

# rename variables easy to understand and add the log of the price to Housing
## cross sectional dataset
Housing <- Housing %>% 
  mutate(price_ln = log(price),
         lot_size_ln = log(lotsize)) %>% 
  dplyr::rename(bathrooms = bathrms, 
                driveway_dummy = driveway) 

# show the first rows of the dataset 
head(Housing)      # cross sectional dataset

# save dataset to use later in Stata
#write.dta(Housing,    "Housing.dta")    # cross sectional dataset

\vspace{12pt}

# estimate beta* for the lot size variable
o_beta(y = "price_ln",                  # dependent variable
       x = "lot_size_ln",               # independent treatment variable
       con = "bedrooms + bathrooms +
              factor(driveway_dummy)",  # other control variables
       delta = 1,                       # delta (usually set to one)
       R2max = 0.5316*1.3,              # maximum R-square (often assumed 
                                        # to be the 1.3 times the R-square
                                        # of the controlled model)
       type = "lm",                     # model type
       data = Housing)                  # dataset

\vspace{12pt}

# estimate delta* for the lot size variable
o_delta(y = "price_ln",                  # dependent variable
        x = "lot_size_ln",               # independent treatment variable
        con = "bedrooms + bathrooms +
               factor(driveway_dummy)",  # other control variables
        beta = 0,                        # beta (usually set to zero)
        R2max = 0.5316*1.3,              # maximum R-square 
        type = "lm",                     # model type
        data = Housing)                  # dataset

\vspace{12pt} The results show that $\beta^{}$, i.e., the bias-adjusted coefficient of lot size, is 0.27. Moreover, we estimated a $\delta^{}$ of 1.99. Thus, the unobservables need to be 1.99 times more important than the observables (with respect to the treatment variable) to obtain a correlation of zero (as we defined: beta = 0). The results are equivalent to those of the Stata command psacalc (Fig. \ref{fig:unmodel_csd}). All functions of robomit also offers the option to include unrelated control variables (by specifying $m$ [@oster:2019]) and weights (by specifying $weights$).

\vspace{12pt}

\begin{figure}[!ht] \includegraphics[width=1.0\textwidth]{stata_oster_variables_cross_data.PNG} \caption{Stata results of $\beta^{}$ (panel a) and $\delta^{}$ (panel b).} \label{fig:unmodel_csd} \end{figure}

Features for sensitivity analyses and their visualization

robomit includes a set of functions for sensitivity analyses of $\beta^{}$ and $\delta^{}$: o_beta_boot, o_delta_boot, o_beta_boot_inf, o_delta_boot_inf, o_beta_boot_viz, o_delta_boot_viz, o_beta_rsq, o_delta_rsq, o_beta_rsq_viz, o_delta_rsq_viz. Here, we present the visualization of bootstrapped $\delta^{}$ using o_delta_boot_viz and $\delta^{}$ over a range of $R_{max}^{2}$ using o_delta_rsq_viz (the other functions follow the same logic). First, using o_delta_boot_viz:

# visualization of bootstrapped delta*s
o_delta_boot_viz(y = "price_ln",             # dependent variable
            x = "lot_size_ln",               # independent treatment 
                                             # variable
            con = "bedrooms + bathrooms +
                   factor(driveway_dummy)",  # other control variables
            beta = 0,                        # beta
            R2max = 0.5316*1.3,              # maximum R-square
            sim = 500,                       # number of simulations
            obs = 350,                       # draws per simulation
            rep = FALSE,                     # without replacement
            CI = c(90,95,99),                # confidence intervals
            type = "lm",                     # model type
            norm = TRUE,                     # normal distribution
            bin = 200,                       # number of bins
            useed = 123,                     # seed
            data = Housing)                  # dataset

The figure show that $\delta^{}$ is not sensitive to selecting different sub-samples, thus, sample selection. Second, using o_delta_boot_viz*:

# estimate delta*s over a range of maximum R-squares
o_delta_rsq_viz(y = "price_ln",                 # dependent variable
                x = "lot_size_ln",              # independent treatment 
                                                # variable
                con = "bedrooms + bathrooms +
                       factor(driveway_dummy)", # other control variables
                beta = 0,                       # beta
                type = "lm",                    # model type
                data = Housing)                 # dataset

$\delta^{}$ decrease from 3.83 to 1.005 with increasing $R_{max}^{2}$, from 0.54 to 1. Moreover, $\delta^{}$ remains above one when $R_{max}^{2}=1$. $\delta^{*} \geq 1$ is suggested as a reasonable heuristic threshold for indicating robustness [@altonji2005; @oster:2019].