# install and load packages if(!require(tidyverse)){ install.packages("tidyverse") require(tidyverse)} if(!require(Ecdat)){ install.packages("Ecdat") require(Ecdat)} if(!require(robomit)){ install.packages("robomit") require(robomit)} if(!require(foreign)){ install.packages("foreign") require(foreign)}
The recently developed methodological framework by @oster:2019 (hereafter Oster framework) helps to understand if inferences based on estimation results are likely to hold despite omitted variables bias. robomit
[@robomit] implements the Oster framework in R and offers features for sensitivity analyses of the estimates parameters of the Oster framework and visualization of those sensitivity analyses.
Researchers frequently encounter omitted variables bias in their estimations in nonexperimental work, which can lead to flawed inferences. Recent methodological developments help understand whether inferences of these estimations are likely to hold despite this bias [@imbens2003; @harada2013; @oster:2019; @cinelli2020] (see @oster:2019 and @cinelli2020 for a recent overview).
The Oster framework offers an option to compute i) the bias-adjusted treatment effect or correlation, $\beta^{}$, and ii) the degree of selection on unobservables relative to observables (with respect to the treatment variable) that would be necessary to eliminate the result, $\delta^{}$, using standard regression output. Thus, researchers can assess the potential severity of the omitted variables bias for their inferences. The two variables, $\beta^{}$ and $\delta^{}$, can be estimated by only specifying two parameters.
Here, we present the R-package robomit
[@robomit] and its features. robomit
implements the Oster framework, i.e., the estimation of $\beta^{}$ and $\delta^{}$, for linear cross-sectional and panel models. Additionally, robomit
offers features for sensitivity analyses of $\beta^{}$ and $\delta^{}$ (concerning the sample and external parameter specification) and their visualization.^[The Oster framework is available in Stata under the command psacalc
.]
The remaining sections introduce a) briefly omitted variable bias and the central intuition of the Oster framework, and b) the functions of robomit
.
Omitted variable bias occurs when we omit one or more (unobserved) variables in our estimation correlated with our independent variable and our treatment variable, i.e., variable of interest. This omission might be because we do not observe these omitted variables, which are so-called unobservables. Let's assume the simple case and that we want to estimate [e.g., @stock2007; @wooldridge2016]:
\begin{align} Y = \alpha + \beta T + \varphi X + \omega Z + u. \end{align}
Where $Y$ is the outcome variable, $T$ the treatment variable, $X$ a vector of observed control variables (i.e., observables), $Z$ an unobserved variable (i.e., unobservables), and $u$ the error term. If $Z$ can be represented as a function of $T$, let's say $Z=\psi+\eta T+e$ (where $e$ is the error term), and $Z$ is not part of our estimation, we estimate:
\begin{align} Y = (\alpha + \omega \psi) + (\beta + \omega \eta) T + \varphi X + (u + \omega e). \end{align}
Thus, we estimate a biased coefficient for $T$ when we omit $Z$.
@oster:2019 developed a framework based on @altonji2005 to assess the potential severity of selection on unobservables (i.e., omitted variable bias). The central intuition of the Oster framework is that the omitted variable bias is proportional to the coefficient movements scaled by the movement of the $R^{2}$. Using this intuition, @oster:2019 presents a framework to approximate $\beta^{*}$, which is the bias-adjusted treatment effect (if the estimation is causal) or the bias-adjusted treatment correlation. The approximation is:
\begin{align} \beta^{*} \approx \widetilde{\beta} - \delta (\dot{\beta} - \widetilde{\beta}) \frac{R^{2}_{max} - \widetilde{R}^{2}}{\widetilde{R}^{2} - \dot{R}^{2}}. \end{align}
Where $\widetilde\beta$ is the coefficient of the controlled model (i.e., the intermediate regression of $Y$ on $T$ and $X$), $\delta$ the value of relative importance of the selection of the observed variables compared to the unobserved variables, $\dot{\beta}$ the coefficient of the uncontrolled model (i.e., the auxiliary regression of $Y$ on $T$), $R_{max}^{2}$ the $R^{2}$ of the hypothetical model (i.e., the hypothetical regression of $Y$ on $T$, $X$, and $Z$), $\widetilde R^{2}$ the $R^{2}$ of the controlled model, and $\dot{R}^{2}$ the $R^{2}$ of the uncontrolled model. Hence, we only need to define the values of $\delta$ and $R_{max}^{2}$ to estimate $\beta^{*}$ while all other values are automatically derived from the regression output. As a default, it is often assumed that $\delta=1$ and $R_{max}^{2}=1.3\widetilde\beta$ [@altonji2005; @oster:2019]. $R_{max}^{2}=1.3\widetilde R^{2}$ is based on the 90%-survival rate of results of randomized studies [@oster:2019]. Researchers should also examine other values for $R_{max}^{2}$ to understand the sensitivity of the results to the specified value, especially when $\widetilde R^{2}$ is low (robomit
also implements a sensitivity analysis of $R_{max}^{2}$).
Next to estimating $\beta^{}$ we can estimate $\delta^{}$, which is the degree of selection on unobservables relative to observables (with respect to the treatment variable) that would be necessary to produce $\beta=\hat\beta$. $\delta^{*}$ is defined as:
\begin{align} \delta^{*} = \frac{ \begin{gathered} (\widetilde{\beta} - \hat{\beta}) (\widetilde{R}^{2} - \dot{R}^{2}) \hat{\sigma}^{2}{Y} \hat{\tau}{X} + (\widetilde{\beta} - \hat{\beta}) \hat{\sigma}^{2}{X} \hat{\tau}{X} (\dot{\beta} - \widetilde{\beta})^{2} + \ 2 (\widetilde{\beta} - \hat{\beta})^{2} (\hat{\tau}{X}(\dot{\beta} - \widetilde{\beta}) \hat{\sigma}^{2}{X}) + (\widetilde{\beta} - \hat{\beta})^{3} (\hat{\tau}{X} \hat{\sigma}^{2}{X} - \hat{\tau}^{2}{X}) \end{gathered}} { \begin{gathered} (R^{2}{max} - \widetilde{R}^{2}) \hat{\sigma}^{2}{Y} (\dot{\beta} - \widetilde{\beta}) \hat{\sigma}^{2}{X} + (\widetilde{\beta} - \hat{\beta}) (R^{2}{max} - \widetilde{R}^{2}) \hat{\sigma}^{2}{Y} (\hat{\sigma}^{2}{X})-\hat{\tau}{X}) + \ (\widetilde{\beta} - \hat{\beta})^{2} (\hat{\tau}{X}(\dot{\beta} - \widetilde{\beta}) \hat{\sigma}^{2}{X}) + (\widetilde{\beta} - \hat{\beta})^{3} (\hat{\tau}{X} \hat{\sigma}^{2}{X}-\hat{\tau}^{2}_{X}) \end{gathered}}. \end{align}
Where $\sigma_{Y}^{2}$ is the variance of $Y$, $\sigma_{X}^{2}$ the variance of $X$, and $\hat\tau_{X}$ the variance of this residual in the sample. To estimate $\delta^{*}$ researchers need only to specify $\hat\beta$ and $R_{max}^{2}$. $\hat\beta$ is commonly defined as $\hat\beta=0$ [@oster:2019].
The demonstration uses a cross-sectional dataset including sales prices for houses in the city of Windsor (Canada) in 1987, which is taken from @anglin1996 using Ecdat
[@Ecdat]). The dataset contains information about house sales prices (Canadian dollars), the lot size of the property (in square feet), and other control variables. In our demonstration, we are interested in the correlation of house sales prices (dependent variable; log-transformed) and the lot size of the property (treatment variable; log-transformed) and how robust this correlation is to the potential inclusion of unobservables (i.e., omitted variable bias). This analysis aims to illustrate the functions of robomit and not to build a causal model.
First, we estimate $\beta^{}$ and $\delta^{}$ using o_beta and o_delta, respectively, of robomit
:
# load dataset data("Housing") # cross sectional dataset # rename variables easy to understand and add the log of the price to Housing ## cross sectional dataset Housing <- Housing %>% mutate(price_ln = log(price), lot_size_ln = log(lotsize)) %>% dplyr::rename(bathrooms = bathrms, driveway_dummy = driveway) # show the first rows of the dataset head(Housing) # cross sectional dataset # save dataset to use later in Stata #write.dta(Housing, "Housing.dta") # cross sectional dataset
\vspace{12pt}
# estimate beta* for the lot size variable o_beta(y = "price_ln", # dependent variable x = "lot_size_ln", # independent treatment variable con = "bedrooms + bathrooms + factor(driveway_dummy)", # other control variables delta = 1, # delta (usually set to one) R2max = 0.5316*1.3, # maximum R-square (often assumed # to be the 1.3 times the R-square # of the controlled model) type = "lm", # model type data = Housing) # dataset
\vspace{12pt}
# estimate delta* for the lot size variable o_delta(y = "price_ln", # dependent variable x = "lot_size_ln", # independent treatment variable con = "bedrooms + bathrooms + factor(driveway_dummy)", # other control variables beta = 0, # beta (usually set to zero) R2max = 0.5316*1.3, # maximum R-square type = "lm", # model type data = Housing) # dataset
\vspace{12pt}
The results show that $\beta^{}$, i.e., the bias-adjusted coefficient of lot size, is 0.27. Moreover, we estimated a $\delta^{}$ of 1.99. Thus, the unobservables need to be 1.99 times more important than the observables (with respect to the treatment variable) to obtain a correlation of zero (as we defined: beta = 0). The results are equivalent to those of the Stata command psacalc
(Fig. \ref{fig:unmodel_csd}). All functions of robomit also offers the option to include unrelated control variables (by specifying $m$ [@oster:2019]) and weights (by specifying $weights$).
\vspace{12pt}
\begin{figure}[!ht] \includegraphics[width=1.0\textwidth]{stata_oster_variables_cross_data.PNG} \caption{Stata results of $\beta^{}$ (panel a) and $\delta^{}$ (panel b).} \label{fig:unmodel_csd} \end{figure}
robomit
includes a set of functions for sensitivity analyses of $\beta^{}$ and $\delta^{}$: o_beta_boot, o_delta_boot, o_beta_boot_inf, o_delta_boot_inf, o_beta_boot_viz, o_delta_boot_viz, o_beta_rsq, o_delta_rsq, o_beta_rsq_viz, o_delta_rsq_viz. Here, we present the visualization of bootstrapped $\delta^{}$ using o_delta_boot_viz and $\delta^{}$ over a range of $R_{max}^{2}$ using o_delta_rsq_viz (the other functions follow the same logic). First, using o_delta_boot_viz:
# visualization of bootstrapped delta*s o_delta_boot_viz(y = "price_ln", # dependent variable x = "lot_size_ln", # independent treatment # variable con = "bedrooms + bathrooms + factor(driveway_dummy)", # other control variables beta = 0, # beta R2max = 0.5316*1.3, # maximum R-square sim = 500, # number of simulations obs = 350, # draws per simulation rep = FALSE, # without replacement CI = c(90,95,99), # confidence intervals type = "lm", # model type norm = TRUE, # normal distribution bin = 200, # number of bins useed = 123, # seed data = Housing) # dataset
The figure show that $\delta^{}$ is not sensitive to selecting different sub-samples, thus, sample selection. Second, using o_delta_boot_viz*:
# estimate delta*s over a range of maximum R-squares o_delta_rsq_viz(y = "price_ln", # dependent variable x = "lot_size_ln", # independent treatment # variable con = "bedrooms + bathrooms + factor(driveway_dummy)", # other control variables beta = 0, # beta type = "lm", # model type data = Housing) # dataset
$\delta^{}$ decrease from 3.83 to 1.005 with increasing $R_{max}^{2}$, from 0.54 to 1. Moreover, $\delta^{}$ remains above one when $R_{max}^{2}=1$. $\delta^{*} \geq 1$ is suggested as a reasonable heuristic threshold for indicating robustness [@altonji2005; @oster:2019].
The R package was supported by the Mercator Foundation Switzerland within a Zürich-Basel Plant Science Center PhD Fellowship program.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.