Despite recent advances in semi- and non-parametric techniques, most quantitative social scientists still use regression models to substantiate their theoretical claims. These models generally contain two types of explanatory variables: those that are theoretically important and the subject of hypothesis testing (call these $X$) and those that are controls (call these $Z$). As any regression/research design text (e.g., [cite]) tells us, the control variables are meant to capture all alternative explanations of the dependent variable. One question that almost always remains unanswered is whether we are using the appropriate functional forms for these control variables. We propose a method that lets users be confident that the relationships of the control variables with both $y$ and $X$ are modeled appropriately, without making functional form assumptions and without interfering with inference on the variables of interest.
The importance of these relationships cannot be overstated. Control variables are often considered "less important" than the more theoretically interesting variables and, as a result, are specified with less care. We argue that several problems can arise from this. The most prominent are:
The first problem above can be easily handled by careful model specification of $y$ on $(X,Z)$. The solution to the second problem is a bit less clear.
We propose a solution based on Multivariate Adaptive Regression Splines (MARS) [cite]. The general idea is that we can use a flexible model to identify the appropriate design matrix for the regression of $y$ on $Z$, call this $m(Z|y)$. We can also generate the same result for the relationship between $X$ and $Z$, generating $m(Z|X)$. We can combine these two design matrices $(m(Z|y),m(Z|X))$ and use those to residualize both $y$ and $X$, leaving $e^{(y)}$ and $e^{(X)}$. By regressing the former on the latter, we can get clean estimates of the effect of $X$ on $y$ controlling for $Z$. We show through simulation that this estimator is unbiased and consistent while producing confidence intervals with approximately correct coverage in a wide range of circumstances.
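The residualization steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a fixed piecewise-linear basis with knots at quantiles of $Z$ stands in for the adaptively selected MARS terms, so a single basis `B` plays the role of both $m(Z|y)$ and $m(Z|X)$. The helper names (`hinge_basis`, `residualize`) and the simulated data are hypothetical.

```python
import numpy as np

def hinge_basis(z, knots):
    """Intercept plus the piecewise-linear pair max(z - t, 0), max(t - z, 0) per knot.

    Assumption: knots are supplied by the caller; MARS would choose them adaptively.
    """
    z = np.asarray(z, dtype=float)
    cols = [np.ones_like(z)]
    for t in knots:
        cols.append(np.maximum(z - t, 0.0))
        cols.append(np.maximum(t - z, 0.0))
    return np.column_stack(cols)

def residualize(target, basis):
    """Residuals from the least-squares projection of target onto the basis columns."""
    # lstsq tolerates the rank deficiency caused by the mirrored pairs;
    # the fitted values (and hence the residuals) are still well defined.
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return target - basis @ coef

# Simulated data (hypothetical): Z affects both X and y non-linearly.
rng = np.random.default_rng(0)
n = 2000
z = rng.uniform(-2, 2, n)
x = np.abs(z) + rng.normal(size=n)
y = 1.5 * x + z**2 + rng.normal(size=n)   # true effect of X on y is 1.5

knots = np.quantile(z, [0.2, 0.4, 0.6, 0.8])
B = hinge_basis(z, knots)

e_y = residualize(y, B)                   # plays the role of e^(y)
e_x = residualize(x, B)                   # plays the role of e^(X)

# Regress the residualized outcome on the residualized variable of interest.
beta_hat = float(e_x @ e_y / (e_x @ e_x))
print(f"estimated effect of X on y: {beta_hat:.3f}")
```

With a single variable of interest the final regression reduces to a ratio of inner products; with several columns in $X$, the same step would be a least-squares fit of $e^{(y)}$ on the matrix $e^{(X)}$.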
The model we propose is based on Multivariate Adaptive Regression Splines (MARS™), which uses piecewise linear "hinge functions" to model arbitrary non-linearity. For a knot $t$, the hinge functions come in mirrored pairs:
$$ \begin{aligned} (x-t)_{+} &= \begin{cases} x-t & \text{if } x > t \\ 0 & \text{otherwise.} \end{cases} \end{aligned} $$
$$ \begin{aligned} (t-x)_{+} &= \begin{cases} t-x & \text{if } x < t \\ 0 & \text{otherwise.} \end{cases} \end{aligned} $$
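As a small illustration of the two definitions, the hinge functions can be written directly in Python. The function names and the example coefficients below are hypothetical and chosen only to show how a weighted sum of hinges produces a piecewise linear curve with a kink at the knot.

```python
def hinge_pos(x, t):
    """(x - t)_+ : equals x - t when x > t, and 0 otherwise."""
    return max(x - t, 0.0)

def hinge_neg(x, t):
    """(t - x)_+ : equals t - x when x < t, and 0 otherwise."""
    return max(t - x, 0.0)

def example_fit(x):
    """Hypothetical fitted function: slope 2 above the knot at 3, slope 0.5 below it."""
    return 1.0 + 2.0 * hinge_pos(x, 3.0) - 0.5 * hinge_neg(x, 3.0)

for value in (1.0, 3.0, 5.0):
    print(value, example_fit(value))
```

Each hinge is zero on one side of the knot, so the two coefficients set the slopes on either side of $t$ independently.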