Introduction

Despite recent advances in semi- and non-parametric techniques, most quantitative social scientists are still using regression models to substantiate their theoretical claims. In these models, there are generally two types of explanatory variables - those that are theoretically important and the subject of hypothesis testing (let's call these $X$) and those that are controls (let's call these $Z$). As any regression/research design text (e.g., [cite]) tells us, the control variables are meant to capture all alternative explanations of the dependent variable. One question that almost always remains unanswered is whether we are using the appropriate functional forms for these control variables. We propose a method that allows users to be confident that the control variables have the optimal relationship with both $y$ and $X$, without making functional form assumptions and without interfering with inference on the variables of interest.

The importance of these relationships cannot be overstated. Often control variables are considered to be "less important" than other more theoretically interesting variables and as such are often not treated with similar care in their specification. We argue that there are several problems that can arise from this. Most prominently are:

  1. If the relationship between the control variables ($Z$) and $y$ is wrong, then there is extra variance in $y$ left to be explained by $X$. This means that we are likely overstating the effects of $X$ on $y$. This is the classic omitted variable bias finding.
  2. Less well-known is a similar problem where the relationship between $X$ and $Z$ cannot be appropriately estimated with the design matrix for the regression of $y$ on $(X,Z)$. This would be the case if, for instance $z_{1}$ had a linear relationship with $y$ and a quadratic relationship with $x_{1}$. A conventional model would not remove enough of the variability in $x_{1}$, leaving more of its variance to explain variance in $y$, thus likely inducing some bias.

The first problem above can be easily handled by careful model specification of $y$ on $(X,Z)$. The solution to the second problem is a bit less clear.

We propose a solution based on Multivariate Adaptive Regression Splines (MARS) [cite]. The general idea is that we can use a flexible model to identify the appropriate design matrix for the regression of $y$ on $Z$, call this $m(Z|y)$. We can also generate the same result for the relationship between $X$ and $Z$, generating $m(Z|X)$. We can combine these two design matrices $(m(Z|y),m(Z|X))$ and use those to residualize both $y$ and $X$, leaving $e^{(y)}$ and $e^{(X)}$. By regressing the former on the latter, we can get clean estimates of the effect of $X$ on $y$ controlling for $Z$. We show through simulation that this estimator is unbiased and consistent while producing confidence intervals with approximately correct coverage in a wide range of circumstances.

The Model

The model we propose is based on the Multivariate Adaptive Regression Splines (MARS™). This is a model that uses piecewise linear "hinge functions" to model arbitrary non-linearity.

$$ \begin{aligned} (x-t)_{+} &= \begin{cases} x-t & \text{ if } x > t\ 0 & \text{ otherwise.} \end{cases} \end{aligned} $$

$$ \begin{aligned} (t-x)_{+} &= \begin{cases} t-x & \text{ if } x < t\ 0 & \text{ otherwise.} \end{cases} \end{aligned} $$



dmurdoch/venus documentation built on July 15, 2019, 4:30 a.m.