In alexpavlakis/instruments: Estimate Regression Models with Instrumental Variables

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)

instruments

Instruments makes it easy to fit instrumental variable regression models. It has two core functions:

iv.lm for estimating linear regression models with instrumental variables;
iv.glm for estimating generalized linear models with instrumental variables.

Each function takes arguments just like R's lm and glm functions, but allows you to express one variable as instrumented by one or more instrumental variables. Say you have an independent variable outcome, an endogenous dependent variable endo, and an instrument violin. You can express a linear instrumental variable regression naturally:

iv.lm(outcome ~ endo, endo ~ violin)

If outcome is binary, you may want to use a probit model.

iv.glm(outcome ~ endo, endo ~ violin, family = binomial, link = 'probit')

Linear models

First we'll create some fake data to demonstrate how instrumental variable regressions work and when they might be appropriate. Suppose you want to estimate the impact of a subscription program on customers' spending, but you know that customers who tend to spend more also tend to sign up for subscriptions. How can you disentangle those effects?

Suppose that you also offer incentives (free samples, buy-one-get-one-free, etc) to encourage customers to enroll in subscriptions, and that these incentives are offered at random. You may be able to use incentives as an instrumental variable to isolate the impact of subscriptions on spending.

# Fake data
set.seed(56)
N <- 10e4
baseline_spending <- rpois(N, 10)
incentive <- sample(c(0, 1), size = N, replace = TRUE)
subscription <- rbinom(N, 1, p = 1/(1 + exp(-(baseline_spending + incentive - 10))))
total_spending <- baseline_spending + 2*subscription + rnorm(N, 0, 1)

The true impact of a subscription on total spending is 2. Since the unobserved variable baseline speding propensity also impacts whether someone gets a subscription, we cannot identify the impact of a subscription on spending with an ordinary linear regression. The ordinary linear regression will tend to overestimate the impact of a subscription on spending.

lm_ols <- lm(formula = total_spending ~ subscription)
summary(lm_ols)

Instead we can use an instrumental variable regression, with incentive as the instrument.

library(instruments)

lm_2sls <- iv.lm(model_formula = total_spending ~ subscription,
                 instrument_formula = subscription ~ incentive)

summary(lm_2sls)

Generalized Linear Models

Suppose instead that we want to estimate the impact of subscriptions on customer churn. Continuing our fake data example from above, we define churn as a function of both subscription and baseline spending.

churn <- rbinom(N, 1, 1/(1 + exp(-(-10 + baseline_spending + 2*subscription + rnorm(N, 0, 1)))))

Since churn is a binary variable (0, 1), a linear regression is not appropriate. Instead we will estimate this model with a logistic regression using the glm function in base R and iv.glm function in instruments. As before, the simple glm function will overestimate the impact of subscription on churn due to the correlation between baseline spending rates and having a subscription.

glm_logit <- glm(formula = churn ~ subscription, family = binomial(link = 'logit'))
glm_iv <- iv.glm(model_formula = churn ~ subscription, 
                 instrument_formula = subscription ~ incentive, 
                 family = binomial, link = 'logit')

summary(glm_logit)
summary(glm_iv)

Diagnostics

You've probably noticed that iv.lm and iv.glm print the correlation between the independent variable and the instrument and the correlation betweent the instrument and the regression error term. These two metrics help understand whether the model is valid.

Instrumental variable regression models rely on two assumptions:

Exclusion restriction: the instrument is not correlated with the error term;
Validity: the instrument is correlated with the endogenous explanatory variable (subscription, in our example).

We can also use the diagnose funcion to automatically evaluate a fit iv model. If it returns nothing, that means that your instrumental variable model seems valid.