The threat of endogeneity is ubiquitous within applied empirical research. Regressor-error dependencies are a common inferential concern, not least because they may arise from any combination of omitted variables, systematic measurement error, self-selection, systematically missing data, reciprocal causation, or interference between units. Conventional statistical methods reflect no source of uncertainty other than random error and so cannot express these additional doubts. The package implements a "near-Bayesian" method of sensitivity analysis which samples uniformly from the set of valid control functions under ignorance to build a distribution of possible causal effects. It also provides graphical tools for assessing the sensitivity of one's results to the threat of hidden biases.
Regardless of its source, a regressor-error dependency is a regressor-error dependency and can be dealt with in the same manner; namely, via control functions. Unlike a number of existing forms of sensitivity analysis, which treat omitted variables as one of many sources of endogeneity, the control function approach treats all endogeneity as an omitted variables problem.
A simple example aids intuition. Suppose that we are interested in the effect of some treatment(s) $\mathbf{X}$ on outcome $\mathbf{Y}$. Assuming constant effects, linearity, and including the intercept in the design matrix, we can use the common linear specification

$$\mathbf{Y} = \mathbf{X}\beta + \mathbf{\epsilon}^*$$

where $\mathbf{\epsilon}^*$ is a latent random variable and $\beta$ is the ATE. Regardless of the source, regressor-error dependencies manifest such that $E(\mathbf{\epsilon}^*|\mathbf{X}) \neq 0$ and so

$$E(\hat{\beta}|\mathbf{X}) = \beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E(\mathbf{\epsilon}^*|\mathbf{X}) \neq \beta.$$
The basic idea of a control function is to construct, usually using instrumental variables, some $\mathbf{V}$ for which $E(\mathbf{\epsilon}^*|\mathbf{X},\mathbf{V})=0$ and include that term as a regressor to purge the dependency. Usually, however, we are in the position where no such control function can be constructed. Regardless, we know that with the correct control function the regression of $\mathbf{Y}$ on $\mathbf{Z}=(\mathbf{X},\mathbf{V})$ yields coefficients

$$\hat{\beta}_{\mathbf{Z}} = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{Y}$$

which carry a causal interpretation.
So what we will do is sample uniformly from the set of valid control functions to build out the distribution of causal effects given our ignorance. An algorithm to generate one such randomly drawn control function (Algorithm 1) effectively uses the Cholesky decomposition, applied repeatedly, to build up a valid augmented covariance matrix from which quantities of interest can be calculated.
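As a rough sketch of what one augmentation step might look like (the helper name `draw_cf_column` is ours for illustration, not a package function): a unit-variance control function must have covariances that keep the augmented matrix positive definite, and such a vector can be drawn uniformly via the Cholesky factor of the observed covariance matrix.

```r
# Sketch of one augmentation step (illustrative, not the package's internals).
# Draw covariances for a new unit-variance variable V so that the augmented
# covariance matrix stays positive definite; validity requires
# t(s_new) %*% solve(S) %*% s_new < 1, which the Cholesky factor handles.
draw_cf_column <- function(S) {
  p <- nrow(S)
  L <- t(chol(S))                      # lower-triangular factor, S = L %*% t(L)
  u <- rnorm(p)
  u <- u / sqrt(sum(u^2))              # random direction on the unit sphere
  r <- runif(1)^(1 / p)                # radius for a uniform draw in the unit ball
  s_new <- as.vector(L %*% (r * u))    # uniform over the valid covariance vectors
  rbind(cbind(S, s_new), c(s_new, 1))  # augmented covariance matrix
}
```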
To apply this to regression, note that we may rearrange the augmented covariance matrix as

$$\mathbf{\Sigma} = \begin{pmatrix} \mathbf{\Sigma}_{\tilde{\mathbf{Z}}\tilde{\mathbf{Z}}} & \mathbf{\Sigma}_{\tilde{\mathbf{Z}}\mathbf{Y}} \\ \mathbf{\Sigma}_{\mathbf{Y}\tilde{\mathbf{Z}}} & \sigma^2_{\mathbf{Y}} \end{pmatrix}$$

with $\tilde{\mathbf{Z}} = (\mathbf{X},\tilde{\mathbf{V}})$, where $\tilde{\mathbf{V}}$ is the control function defined by one run of Algorithm 1. The coefficients from the regression of $\mathbf{Y}$ on $\mathbf{X}$ and $\tilde{\mathbf{V}}$ are given by

$$\hat{\beta}_{\tilde{\mathbf{Z}}} = \mathbf{\Sigma}_{\tilde{\mathbf{Z}}\tilde{\mathbf{Z}}}^{-1}\mathbf{\Sigma}_{\tilde{\mathbf{Z}}\mathbf{Y}},$$

so one never actually has to generate a realization of $\tilde{\mathbf{V}}$ once one has the covariances by which it is defined. We may similarly derive the R-squared from this information alone:

$$R^2 = \frac{\mathbf{\Sigma}_{\mathbf{Y}\tilde{\mathbf{Z}}}\,\mathbf{\Sigma}_{\tilde{\mathbf{Z}}\tilde{\mathbf{Z}}}^{-1}\,\mathbf{\Sigma}_{\tilde{\mathbf{Z}}\mathbf{Y}}}{\sigma^2_{\mathbf{Y}}}.$$
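In code, these two formulas amount to a few lines. A minimal sketch, assuming the augmented covariance matrix stores $\mathbf{Y}$ in its first row and column and $\tilde{\mathbf{Z}} = (\mathbf{X}, \tilde{\mathbf{V}})$ in the remainder (again illustrative, not the package's internal functions):

```r
# Sketch: slopes and R-squared from an augmented covariance matrix alone.
# Y is assumed to sit in the first row/column; the intercept is omitted since
# the slopes depend only on the covariances.
coefs_from_cov <- function(S_aug) {
  S_zy <- S_aug[-1, 1, drop = FALSE]    # Cov(Z_tilde, Y)
  S_zz <- S_aug[-1, -1]                 # Var(Z_tilde)
  beta <- solve(S_zz, S_zy)             # slopes from the regression of Y on Z_tilde
  r2   <- drop(crossprod(S_zy, beta)) / S_aug[1, 1]
  list(beta = drop(beta), r_squared = r2)
}
```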
By drawing thousands of these control functions we get thousands of coefficient vectors and R-squared values, which together represent the additional uncertainty in our results due to the threat of hidden biases.
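Stringing the two sketches above together gives a rough end-to-end loop; this mirrors the logic described here but is not the package's internal implementation:

```r
# Rough end-to-end loop over randomly drawn control functions (illustration).
set.seed(1)
toy   <- data.frame(x = rnorm(500))
toy$y <- 2 * toy$x + rnorm(500)
S_obs <- cov(toy[, c("y", "x")])        # observed covariance matrix, y first

draws <- replicate(1000,
                   coefs_from_cov(draw_cf_column(S_obs)),
                   simplify = FALSE)

possible_effects <- sapply(draws, function(d) d$beta[1])   # coefficient on x
summary(possible_effects)
```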
Until the package is released on CRAN, you can install the development version with the following lines of code:
```r
Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = TRUE)
devtools::install_github("christophercschwarz/DOPE", dependencies = TRUE)
library(DOPE)
```
The DOPE algorithm is built upon linear regression targeting uncertainty in the ATE. The approach can be extended with relative ease to semi-parametric distributional regression models, possibly including random effects, and to various other estimands, utilizing whitening and augmentation tricks to put the models in OLS form.
To illustrate a number of functions from the package, let's look at a simulated example. First, let's generate a random correlation matrix, generate some data, and run a linear model.
```r
set.seed(8675309)

x_vars <- 5
n_obs <- 1000

corm <- RandomCormCPP(nvars = x_vars)
X_mat <- MASS::mvrnorm(n_obs, rep(0, x_vars), Sigma = corm)

betas <- 1:x_vars
y <- X_mat %*% betas + rnorm(n_obs, 0, 5)

dat <- data.frame(y, X_mat)
cov(dat)
```
The lower V1-V5 sub-matrix is a draw from the set of valid correlation matrices, generated by repeatedly applying the DOPE augmentation algorithm. Within the context of regression, what we will do is take the observed covariance matrix as given, draw an additional row and column from the set of valid covariance matrices, calculate quantities of interest, and repeat. To begin, let's estimate the linear model:
```r
mod <- lm(y ~ ., data = dat)
```
Since we have control over the process by which the data were generated, we know that the model is correct and that the coefficients are unbiased/consistent estimates of the treatment effect of interest. Suppose, however, that we did not. We can take draws from the set of valid control functions to build out a distribution of possible effects reflecting ignorance of the regressor-error dependency plaguing our analysis. The number of draws can be set with `nsims` and the number of cores for parallel computation with `n.cores`.
```r
set.seed(6161918)
dope <- DOPE(mod, nsims = 5000, n.cores = parallel::detectCores())
```
```r
tail(dplyr::tbl_df(dope))
```
The result is a dataframe of `nsims` + 1 observations with columns for each of the estimated coefficients, the control function coefficient, and the model R-squared. The last observation simply re-states the results from the naive model for use in the plotting functions. The most basic plot shows the simulated distribution of possible effects and gives a number of useful summaries. Because these are `ggplot` objects, they can be easily modified by adding additional layers.
plot_DOPE(dope,"V2") + ggtitle("Distribution of Possible Effects: V2")
In this example, based upon 5,000 draws, around 60% of the estimated effects are greater than zero, with a 95% credible interval given by the lower and upper 95 bounds. The naive estimate is indicated in red. Since the distribution is simulated, setting a seed and using a large number of draws is highly recommended; these quantities usually become stable with over 20,000 draws.
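These summaries can also be computed directly from the draws. A small sketch, assuming the coefficient column is named `V2` as in the plotting calls and recalling that the last row restates the naive estimate:

```r
# Summaries computed directly from the simulated draws (last row = naive model).
draws_v2 <- dope$V2[-nrow(dope)]

mean(draws_v2 > 0)                      # share of possible effects above zero
quantile(draws_v2, c(0.025, 0.975))     # 95% interval of possible effects
```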
We can get a little bit more information from the plot by shading based upon the R-squared from each regression and turning off the naive result.
plot_DOPE(dope,"V2",shade=T,include_naive = F) + ggtitle("Distribution of Possible Effects: V2")
Lighter shades indicate lower R-squared than darker shades, giving a sense of how fits are distributed across the effect values. This is important because the distribution of possible effects changes as a function of the allowable R-squared. If we want to change the color away from greyscale, we can add another layer like so:
plot_DOPE(dope,"V2",shade=T,include_naive = F) + ggtitle("Distribution of Possible Effects: V2") + scale_fill_discrete()
If there is a particular color palette you are looking for, you can use it with something like the following.
```r
library(wesanderson)

p <- plot_DOPE(dope, "V2", shade = TRUE, include_naive = FALSE) +
  ggtitle("Distribution of Possible Effects: V2")

# Determine how many fill levels the plot uses by deliberately supplying too
# few manual fill values and parsing the number needed out of the error message.
file <- tempfile()
st <- try(ggsave(file, device = "pdf", p + scale_fill_manual(values = "black")), silent = TRUE)
unlink(file)
n_fills <- as.numeric(regmatches(st, gregexpr("[[:digit:]]+", st))[[1]][1])

p + scale_fill_manual(values = colorRampPalette(wes_palette("Zissou1"))(n_fills))
```
We can get a sense of how the distribution of possible effects changes with restrictions on the R-squared with a `sensitivity_plot`.
```r
require(magrittr)

sensitivity_plot(dope, "V2")
```
The solid dots indicate how the certainty in the effect changes as you say "the world is not that deterministic," reducing the maximum allowable R-squared. This reduces uncertainty relative to ignorance, indicated by the dashed line, at the cost of restrictiveness, indicated by the hollow points. In the other direction, one might say that "the world is more deterministic than reflected in my model." This is represented by the crossed points and increases uncertainty until the lower pessimistic bound on effect certainty. The proportion of draws rejected by this lower thresholding is given by the filled diamonds.
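The same thresholding can be approximated by hand from the draws. A rough sketch, assuming the model R-squared column in `dope` is named `r_squared` (check `names(dope)` for the actual name) and again dropping the naive-model row:

```r
# Rough sketch: cap the allowable R-squared and recompute the summaries.
draws <- dope[-nrow(dope), ]                   # drop the naive-model row
cap   <- 0.5                                   # illustrative maximum R-squared
kept  <- draws[draws$r_squared <= cap, ]       # column name assumed, see above

mean(kept$V2 > 0)                              # certainty in the sign under the cap
1 - nrow(kept) / nrow(draws)                   # proportion of draws rejected by the cap
```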
In this particular example, there is no regressor-error dependency actually biasing the result. The true R-squared is near the estimated R-squared, and accordingly when we restrict attention to those estimates close to this level we become certain that our estimates are appropriately signed. In actual empirical practice, however, such information is rarely known with any degree of confidence and so this extra uncertainty should be reported.
There are a number of additional features that will be added to the package over time. Of particular interest is adding facilities for converting regression results into a format which may be easily used with the package. Currently only linear models and GLMs are supported, the latter needing to be fit with the `DOPE_irls` function, which retains the IRLS weights and subsequently "whitens" the data so that it can be fit with `lm`.
By staying firmly within a regression framework, the package can be extended to conduct sensitivity analysis on a wide array of semiparametric distributional regression models, as well as to account for sampling variability. For example, a bootstrap step could be implemented by noting that $n$ samples from $n$ observations can be written as an $n \times n$ matrix $\mathbf{Q}$. For observed data $\mathbf{R}=(\mathbf{Y},\mathbf{X})$ a bootstrap replicate is simply $\mathbf{R}^*=\mathbf{QR}$, the least squares estimate being

$$\hat{\beta}^* = (\mathbf{X}^T\mathbf{Q}^T\mathbf{Q}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Q}^T\mathbf{Q}\mathbf{Y},$$

that is, a weighted least squares estimate with weight matrix $\mathbf{Q}^T\mathbf{Q}$. We can leverage the notion of whitening from Aitken estimators like generalized least squares,

$$\hat{\beta}_{GLS} = (\mathbf{X}^T\mathbf{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{\Omega}^{-1}\mathbf{Y},$$

and linearly transform the data: $\mathbf{\Omega}$ is Cholesky decomposed into $\mathbf{CC}^T$ and one regresses $\mathbf{Y}^* = \mathbf{C}^{-1} \mathbf{Y}$ on $\mathbf{X}^* = \mathbf{C}^{-1} \mathbf{X}$ as an equivalent estimator. A similar approach can be taken to account for heterogeneity utilizing Aronow and Samii's multiple regression weights, or to move the estimand from the ATE to the ATT or ATC.
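As an illustration of this equivalence (not package functionality), here is a bootstrap replicate expressed as weighted least squares and then refit after whitening, reusing the simulated `y`, `X_mat`, and `n_obs` from the example above:

```r
# Sketch: a bootstrap replicate as weighted least squares, then whitened to OLS.
# Q stacks n sampled-with-replacement indicator rows, so t(Q) %*% Q is a
# diagonal matrix of resampling counts.
counts <- as.vector(rmultinom(1, n_obs, rep(1 / n_obs, n_obs)))

fit_wls <- lm(y ~ X_mat, weights = counts)     # weighted least squares form

C_inv <- diag(sqrt(counts))                    # whitening with the square root of the weights
y_star <- C_inv %*% y
X_star <- C_inv %*% cbind(1, X_mat)
fit_ols <- lm(y_star ~ X_star - 1)             # same coefficients as fit_wls
```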
Other approaches to heterogeneous effects are straightforward to apply. Splines, kernels, and random effects with quadratic penalties have the form

$$\hat{\beta}_\lambda = \underset{\beta}{\arg\min} \; \|\mathbf{Y} - \mathbf{X}\beta\|^2 + \lambda \|\mathbf{D}\beta\|^2$$

for some smoothing parameter $\lambda$ and penalty matrix $\mathbf{D}$. With the data augmentation trick this can be shown to be identical to the regression of

$$\mathbf{Y}_{aug} = \begin{pmatrix} \mathbf{Y} \\ \mathbf{0} \end{pmatrix} \quad \text{on} \quad \mathbf{X}_{aug} = \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda}\,\mathbf{D} \end{pmatrix}.$$

Depending on the goal, one could estimate $\lambda$ as a mixed model, choose it through cross-validation, or select a small value merely for computational convenience, resulting in a continuous analogue to model saturation.
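A quick numerical check of the augmentation identity, using a ridge-type penalty ($\mathbf{D} = \mathbf{I}$) on the simulated data above; the values of `lambda` and `D` are purely illustrative:

```r
# Sketch: quadratic penalty via data augmentation (ridge-type penalty, D = I).
lambda <- 2
D <- diag(ncol(X_mat))

y_aug <- c(y, rep(0, nrow(D)))                 # append zeros to the response
X_aug <- rbind(X_mat, sqrt(lambda) * D)        # append sqrt(lambda) * D to the design

coef(lm(y_aug ~ X_aug - 1))
# identical to the penalized least squares solution:
drop(solve(t(X_mat) %*% X_mat + lambda * t(D) %*% D, t(X_mat) %*% y))
```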
More generally, semiparametric regression models with non-normal responses often have the penalized weighted least squares form

$$\hat{\beta}_\lambda = \underset{\beta}{\arg\min} \; (\mathbf{z} - \mathbf{X}\beta)^T \mathbf{W} (\mathbf{z} - \mathbf{X}\beta) + \lambda \|\mathbf{D}\beta\|^2$$

where $\mathbf{z}$ is the working variable and $\mathbf{W}$ contains the IRLS weights. These may be put into the desired format by first applying the data augmentation trick and then whitening.
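A schematic of the whiten-and-augment step for a GLM working model; everything here (the toy logistic fit, `D`, `lambda`) is illustrative rather than package code:

```r
# Sketch: put one penalized IRLS working model into ordinary least squares form.
set.seed(1)
x  <- rnorm(200)
yb <- rbinom(200, 1, plogis(0.5 * x))
glm_fit <- glm(yb ~ x, family = binomial)

X   <- model.matrix(glm_fit)
eta <- glm_fit$linear.predictors
w   <- glm_fit$weights                                        # final IRLS weights
z   <- eta + (yb - fitted(glm_fit)) / binomial()$mu.eta(eta)  # working variable

lambda <- 1
D <- diag(ncol(X))                                            # toy penalty matrix

y_ols <- c(sqrt(w) * z, rep(0, nrow(D)))                      # whitened data rows plus penalty zeros
X_ols <- rbind(sqrt(w) * X, sqrt(lambda) * D)                 # whitened design plus penalty rows
lm(y_ols ~ X_ols - 1)                                         # OLS form of the penalized fit
```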
Another line for extensions would be towards inference beyond the mean. A convenient parametric framework is the family of Generalized Additive Models for Location, Scale, and Shape (GAMLSS). Unlike GLMs, which are constrained to the single-parameter exponential family of distributions, GAMLSS currently allows for well over 100 continuous, discrete, and mixed distributions as well as transformed, truncated, or censored versions of those distributions. In all cases, all parameters of the response variable may be modeled using explanatory variables and additive smoothers, including those mentioned in the previous section. Currently limited to distributions with four or fewer parameters, the general form is usually characterized as
$$\mathbf{Y} \overset{ind}{\sim} \mathcal{D}(\mathbf{\mu},\mathbf{\sigma},\mathbf{\nu},\mathbf{\tau})$$
where
$$\eta_k = g_k(\theta_k) = \mathbf{X}_k \beta_k + \mathbf{Z}_k \gamma_k$$
for $\theta_k \in \{\mathbf{\mu},\mathbf{\sigma},\mathbf{\nu},\mathbf{\tau}\}$ corresponding to $k \in \{1,2,3,4\}$, i.e. location, scale, skew, and kurtosis, although the exact number and definition of parameters depends on the distribution being fit. In the above, $\mathbf{X}$ represents fixed effects and $\mathbf{Z}$ represents random effects -- including splines or kernels -- with $g(\cdot)$ as the link function; GLMs, GAMs, GLMMs, and GAMMs are all special cases. Such models are currently fit with a combination of IRLS and backfitting; relying on quadratic smoothers means that they may be "whitened" and "augmented" in exactly the same way as the models raised above, allowing sensitivity analysis to be conducted for hypotheses beyond mean effects (i.e. treatment effects on variance, skew, or kurtosis). If one is unwilling to make parametric assumptions, expectile smoothing is a viable alternative to quantile regression which gives an explicit solution as a least squares problem.