Details

The repository for this project can be accessed here. Check out my website to learn more about me and my research.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 3)
library(tidyverse)
results <- read_csv(paste0(getwd(),"/results.csv"))

Introduction

Random Forest Adjustment (RFA) is a procedure for estimating the relationship between some predictor and a response after partialing out variation in each given a set of covariates via random forests. Below I provide a brief explanation of the technique and demonstrate how it is implemented in R.

RFA in a Nutshell

RFA is a flexible approach to regression adjustment for controlling for covariates in observational settings. By leveraging a nonparametric machine learning algorithm such as random forests, RFA sidesteps the incidental functional form assumptions imposed by the standard multiple linear regression approach to covariate adjustment. This makes RFA robust to more complex (nonlinear or nonadditive) forms of confounding relationships in observational data that a parametric approach may fail to account for. The latest version of the package can accommodate fixed and random effects as well.

RFA works by partialing out variation in an outcome $y_i$ and some causal variable of interest $z_i$ as a function of covariates $x_{ip} \in X_i$ where $y_i,z_i \not!\perp!!!\perp X_i$. The variables $y_i$ and $z_i$ denote vectors of response and predictor values for the $i^\text{th}$ observation where $i \in I = {1,2,...,n }$.

Because $z_i$ and $y_i$ are not independent of $X_i$, the estimated slope coefficient ($\alpha_1$) from the following linear model may not reflect the correct relationship between $z_i$ on $y_i$: $$y_i = \alpha_0 + \alpha_1z_i + \epsilon_i. \tag{1}$$ RFA's solution to this problem is to partial out the variation in $y_i$ and $z_i$ explained by $X_i$ prior to estimating the relationship between the two using random forests to pre-process the explanatory variable and response. This is done via the following steps:

  1. Fit $y_i$ as a function of $X_i$ via random forests (the random forest fit for $y_i$ is denoted $\hat{f}_y(X_i)$), and then demean, or residualize, $y_i$ by its predicted conditional mean:

$$ \begin{aligned} \hat{y}_i & = \hat{f}_y(X_i), \ \hat{y}_i^\varepsilon & = y_i - \hat{y}_i. \end{aligned} \tag{2} $$

  1. Repeat this procedure for the explanatory variable of interest, $z_i$:

$$ \begin{aligned} \hat{z}_i & = \hat{f}_z(X_i), \ \hat{z}_i^\varepsilon & = z_i - \hat{z}_i. \end{aligned} \tag{3} $$

  1. Finally, the marginal relationship, adjusting for the confounding influence of $X_i$, is obtained by estimating the following linear model using the transformed data:

$$\hat{y}_i^\varepsilon = \beta_0 + \beta_1\hat{z}_i^\varepsilon + \mu_i.\tag{4}$$

If desired, fixed or random effects are incorporated by conducting an additional pre-processing step where the response, the causal variable, and each of the covariates are demeaned via fixed effect, random effect, or mixed effect regressions prior to random forest adjustment.

What is the Benefit of This Approach?

The advantage of RFA is that it sidesteps incidental functional form assumptions that the more conventional multiple regression adjustment strategy imposes. This can be easily demonstrated with a simple simulation.

For the simulation:

  1. I set $n = 500$ and set the the parameter for the true relationship between a response and explanatory variable of interest ($\beta_1$) to 5.
  2. The data-generating process for the outcome variable $y_i$ is given as

$$y_i = 1 + 5z_i + 0.5x_i + x_i^2 + e_i:e_i \sim \mathcal{N}(0, \sigma = 10).$$

  1. The data generating process for the causal variable $z_i$ is given as

$$z_i = -1 + 0.05\space \text{stand}(x_i) - 0.1 \space \text{stand}(x_i)^2 + u_i: u_i \sim \mathcal{N}(0, \sigma = 2).$$

  1. The confounding variable $x_i \sim \mathcal{N}(50, \sigma = 10)$.

For each iteration of the simulated data-generating process, I recover estimates of the effect of $z_i$ on $y_i$ via:

  1. A naive linear regression model:

$$y_i = \beta_0 + \beta_1z_i + \epsilon_i.$$

  1. A a multiple regression model with controls:

$$y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i.$$

  1. An interactive regression model with $x_i$ mean centered (otherwise known as the Lin estimator):

$$y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \beta_3z_i\cdot(x_i - \bar{x}) + \epsilon_i.$$

  1. Adjustment via a Support Vector Machine (SVM)---an alternative machine learner to random forests.

  2. Adjustment via random forests (RFA).

Because both SVM and random forests take a while to run, for time's sake I restricted the analysis to 100 iterations.

A summary of the performance of each of these approaches is given in the below table. Metrics I use to gauge performance include root mean squared error (MSE), average bias, and coverage of the 95 percent confidence intervals.

results %>%
  group_by(model) %>%
  summarize(
    RMSE = sqrt(sum((estimate - ATE)^2)),
    "Ave. Bias" = mean(estimate - ATE),
    Coverage = mean((((estimate - 1.96*std.error)<=ATE) + 
                          ((estimate + 1.96*std.error)>=ATE))==2) 
  ) %>%
  mutate_if(is.numeric, function(x) round(x, 2)) %>%
  kableExtra::kable(format = "html")

RFA clearly performs the best out of all of the adjustment strategies. This is also clear from the following figure showing the distribution of estimates after 100 runs of the simulation. The estimates provided by RFA are the least prone to error and least biased. However, in terms of coverage, naive OLS, RFA, and SVA are relatively similar.

ATE <- results$ATE[1]
results %>%
  select(-ATE) %>%
  dotwhisker::small_multiple(
    dot_args = list(alpha = 0.15)
  ) + 
  geom_hline(yintercept = c(0, ATE), linetype = c(1,2)) +
  labs(
    caption = "SVA = Support Vector Adjustment;\nRFA = Random Forest Adjustment\n(Dashed line denotes true effect)",
    y = "Estimate with 95% CI",
    title = "Comparison of Different Adjustment Strategies"
  ) +
  ggridges::theme_ridges(
    font_size = 10
  ) +
  theme(
    strip.background = element_blank(),
    strip.text = element_blank(),
    legend.position = "none"
  )

How to Implement in R

To perform RFA analysis in R I created a package called {RFA}, which can be installed by writing:

## install.packages("devtools")
devtools::install_github("milesdwilliams15/RFA")

Here, I demonstrate how the package functions can used to estimate causal effects with observational data. I'll do this with the GerberGreenImai dataset included in the {Matching} package. This dataset was used in Imai (2005) to replicate and extend Gerber and Green's (2000) get-out-the-vote (GOT) field experiment. The dataset was used to assess the causal effect of telephone calls on turnout. Note that this data is used for demonstration purposes only, and so results here should not be taken as confirmation or refutation of the findings of the original authors.

First, let's get the data into the working environment:

library(tidyverse) # grammar
library(Matching)  # for the data
data(GerberGreenImai)
ggi <- GerberGreenImai # give it a shorter name

Next, to implement RFA, we need to open the RFA package:

library(RFA)

The workhorse function in the package is rfa(). The help file for the function can be accessed by entering ?rfa. The function accepts the following arguments:

Below is an example of how to use rfa with the ggi dataset. In the formula object below, the left-hand side variable is a binary response that equals 1 when an individual citizen turned out to vote in the 1998 congressional election in New Haven, CT. The treatment variable (whether the individual received a GOT phone call) is the right-hand side variable in the formula. The covariates to control for are given in the covariates command in a right-hand side only formula. These covariates are: an individual's age, whether they voted in the previous election (in 1996), the number of persons residing in their household, whether they are a new voter, and whether they support the majority party.

rfa_fit <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  data = ggi
)

The object rfa_fit consists of a list containing the fitted regression results:

rfa_fit$fit

the random forest regression for the response as a function of covariates:

rfa_fit$yrf

the random forest regression for the causal variable of interest as a function of covariates:

rfa_fit$xrf

and an updated dataframe that adds the processed response and explanatory variable to the data used in estimation:

head(rfa_fit$data)

From the above, we see that the RFA routine estimates that the effect of a GOT phone call on the probability of turning out to vote is approximately r round(tidy(rfa_fit$fit)[2,2],3). A summary for this estimate is provided by using summary_rfa. The output also includes the OOB (out of bag) $R^2$ for the random forest regressions for the response and the explanatory variable.

summary_rfa(rfa_fit) %>%
  mutate_if(is.numeric, function(x) round(x, 3)) 

Sure enough, the ATE is statistically significant, with $p < 0.01$.

We can also use the plot_rfa function to quickly make a coefficient plot for the treatment effect:

plot_rfa(rfa_fit)

Since plot_rfa is a wrapper for ggplot2, we can modify the RFA plot however we see fit:

plot_rfa(rfa_fit, varname = "Phone Call") +
  labs(
    x = "Estimate",
    y = NULL,
    title = "Effect of GOT phone calls on voter turnout"
  ) +
  geom_vline(xintercept = 0, lty = 2) +
  theme_test()

Adding Fixed or Random Effects

Fixed or random effects can be specified as well. We could do so using the WARD column in the data.

If we wanted ward fixed effects we would write:

fe_rfa <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  fes_and_res = ~ WARD, # FEs
  data = ggi
)

If we wanted ward random effects we would write:

re_rfa <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  fes_and_res = ~ (1 | WARD), # FEs
  data = ggi
)

To compare the results, we can easily create a regression table using the {texreg} package:

library(texreg)
htmlreg(
  list(
    "(1)" = rfa_fit$fit,
    "(2)" = fe_rfa$fit,
    "(3)" = re_rfa$fit
  ),
  custom.coef.map = list(
    "xres" = "Phone Call"
  ),
  custom.header = list("Voter Turnout" = 1:3),
  include.ci = F,
  digits = 3,
  caption = "RFA estimates",
  caption.above = T
)

Diagnosing RFA Estimates

The point estimate reported via random forest adjustment may not always be the only, or even most interesting, quantity of importance. Researchers may also have an interest in assessing which covariates in the data sample were the most substantial source of confounding. They may also wish to characterize the effective sample used to identify the partial relationship between the explanatory variable and response. As Aronow and Samii (2016) caution, the process of adjusting for covariates often distorts the data sample used to identify some causal relationship of interest. Often, reported estimates reflect a local effect that cannot be generalized beyond a more circumscribed data sample than that used at the outset of the analysis.

Because it is so important to contextualize how adjustment for covariates changes the sample used to estimate treatment effects, the {RFA} package includes two additional functions that help in performing some diagnostics. These are get_cod() and get_importance().

The latter returns a data frame that reports variable importance metrics for each of the covariates used in the analysis, with respect to the response and explanatory variable of interest.

Usage for get_importance is quite simple. First, if you'd like to obtain importance metrics, when you perform the RFA estimation, use the importance argument to specify either "impurity" or "permutation." These are the two standard approaches to quantifying the relative predictive importance of covariates in random forests. The first sums the number of splits that include a given variable across all of the regression trees grown in the random forest (this is the Gini importance or Mean Decrease in Impurity). The second is the average decrease in the accuracy of model predictions if a given variable's relationship with an outcome is "broken" via the permutation process (this is also known as Mean Decrease in Accuracy).

For example, if we wanted to get the Gini importance for covariates we would simply specify:

rfa_fit <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  data = ggi,
  importance = "impurity" # Specify impurity importance
)

Then, we can use get_importance to return the importance metrics of the covariates with respect to both the response and explanatory variable of interest:

get_importance(rfa_fit)

These importance metrics can be easily visualized:

imp <- get_importance(rfa_fit)
imp %>%
  gather(
    key = "outcome",
    value = "value",
    -term
  ) %>%
  ggplot() +
  aes(
    x = value,
    y = reorder(term, value),
    color = outcome
  ) +
  geom_point() +
  scale_x_log10() +
  labs(
    x = "Gini Importance (log-10 scale)",
    y = NULL,
    title = "Variable Importance"
  )

They can also be analyzed to assess the extent to which confoundingness is correlated between the response and explanatory variable of interest:

imp %>%
  summarise(
    correlation = cor(response, predictor)
  )

Outside of variable importance, we can also examine the "coefficient of distortion" (COD), which summarizes the difference, in standard deviation units, between the effective sample generated through random forest adjustment, and the nominal sample---the original data sample collected. The COD can help summarize how (un)representative the sample was that was used to identify the point estimate returned via random forest adjustment.

To summarize, the COD is calculated for a given covariate $p$ by first scaling it to standard deviation units and then taking its weighted average: $$\text{COD}p = \sum{i = 1}^N w_i\left[ \frac{x_{ip} - \bar{x}p}{\text{SD}(x{ip})}\right] / \sum_{i=1}^N w_i.$$ The vector of weights $w_i$ used is simply the square of the residualized explanatory variable of interest ($z_i$): $$w_i = (z_i^\varepsilon)^2 = [z_i - \hat{f}_z(x_i)]^2.$$

The COD is calculated for each covariate used in estimation with the get_cod() function. The function accepts three arguments:

Inference for each of the COD estimates is done via bootstrapping.

Applying get_cod to the fitted rfa object, we get:

cods <- 
  get_cod(rfa_fit, bootsims = 100) # set to 100 to speed up
cods # view

Like the importance metrics, these results can be easily visualized:

ggplot(cods) +
  aes(
    x = estimate,
    y = reorder(term, estimate^2),
    xmin = conf.low,
    xmax = conf.high
  ) +
  geom_point() +
  geom_errorbarh(height = 0) +
  geom_vline(
    xintercept = 0,
    lty = 2
  ) +
  labs(
    x = "Distortion with 95% confidence intervals",
    y = NULL,
    title = "Effective vs. Nominal Data Samples"
  )

These tools, combined, help to demystify random forest adjusted estimates. For instance, the importance metrics above reveal a moderately strong correlation between the importance of covariates in predicting the explanatory variable of interest (a get-out-the-vote phone call) and the response (voter turnout). This is suggestive of a nontrivial degree of confounding in the relationship between the explanatory variable and response.

Further, the COD estimates suggest that the process of adjusting for this confounding produces an effective sample that differs markedly from the nominal sample. Notably, the effective sample skews significantly older, is more likely to have voted in the previous congressional election two years prior, and is less likely to be a new voter, among other differences. This suggests a more constrained set of cases to which the identified relationship between GOT phone calls and turnout can be generalized.

Conclusion

Taken together, the tools in the RFA package are powerful allies in the analysis of observational data. Random forest adjustment has the potential to be robust to various forms of confounding in observational data, while it also is amenable to helpful diagnostics that contextualize the reported results.



milesdwilliams15/RFA documentation built on Sept. 26, 2023, 4:31 a.m.