The repository for this project can be accessed here. Check out my website to learn more about me and my research.
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 3)
library(tidyverse)
results <- read_csv(paste0(getwd(), "/results.csv"))
Random Forest Adjustment (RFA) is a procedure for estimating the relationship between some predictor and a response after partialing out variation in each given a set of covariates via random forests. Below I provide a brief explanation of the technique and demonstrate how it is implemented in R.
RFA is a flexible approach to regression adjustment for controlling for covariates in observational settings. By leveraging a nonparametric machine learning algorithm such as random forests, RFA sidesteps the incidental functional form assumptions imposed by the standard multiple linear regression approach to covariate adjustment. This makes RFA robust to more complex (nonlinear or nonadditive) forms of confounding relationships in observational data that a parametric approach may fail to account for. The latest version of the package can accommodate fixed and random effects as well.
RFA works by partialing out variation in an outcome $y_i$ and some causal variable of interest $z_i$ as a function of covariates $x_{ip} \in X_i$, where $y_i, z_i \not\!\perp\!\!\!\perp X_i$. The variables $y_i$ and $z_i$ denote the response and predictor values for the $i^\text{th}$ observation, where $i \in I = \{1, 2, ..., n\}$.
Because $z_i$ and $y_i$ are not independent of $X_i$, the estimated slope coefficient ($\alpha_1$) from the following linear model may not reflect the true effect of $z_i$ on $y_i$: $$y_i = \alpha_0 + \alpha_1z_i + \epsilon_i. \tag{1}$$ RFA's solution to this problem is to use random forests to pre-process the explanatory variable and response, partialing out the variation in $y_i$ and $z_i$ explained by $X_i$ prior to estimating the relationship between the two. This is done via the following steps:
$$\begin{aligned} \hat{y}_i & = \hat{f}_y(X_i), \\ \hat{y}_i^\varepsilon & = y_i - \hat{y}_i. \end{aligned} \tag{2}$$

$$\begin{aligned} \hat{z}_i & = \hat{f}_z(X_i), \\ \hat{z}_i^\varepsilon & = z_i - \hat{z}_i. \end{aligned} \tag{3}$$
$$\hat{y}_i^\varepsilon = \beta_0 + \beta_1\hat{z}_i^\varepsilon + \mu_i.\tag{4}$$
If desired, fixed or random effects are incorporated by conducting an additional pre-processing step where the response, the causal variable, and each of the covariates are demeaned via fixed effect, random effect, or mixed effect regressions prior to random forest adjustment.
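To make steps (2)-(4) concrete, here is a minimal sketch of the partialing-out routine using {ranger} and lm() directly. The function and variable names are mine, and the use of out-of-bag predictions is an assumption; the {RFA} package wraps this logic for you, so treat this as an illustration rather than the package's exact implementation.

```r
# Minimal sketch of random forest adjustment, assuming {ranger} is installed.
# Names (rfa_by_hand, y, z, X) are hypothetical and not part of the {RFA} API.
library(ranger)

rfa_by_hand <- function(y, z, X) {
  # Step (2): predict the response from the covariates, keep the residuals
  f_y <- ranger(y ~ ., data = data.frame(y = y, X))
  y_res <- y - f_y$predictions # out-of-bag predictions

  # Step (3): predict the causal variable from the covariates, keep the residuals
  f_z <- ranger(z ~ ., data = data.frame(z = z, X))
  z_res <- z - f_z$predictions

  # Step (4): regress the residualized response on the residualized causal variable
  lm(y_res ~ z_res)
}
```

The coefficient on z_res from the final regression is the RFA estimate of the relationship of interest.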
The advantage of RFA is that it sidesteps incidental functional form assumptions that the more conventional multiple regression adjustment strategy imposes. This can be easily demonstrated with a simple simulation.
For the simulation:
$$y_i = 1 + 5z_i + 0.5x_i + x_i^2 + e_i, \quad e_i \sim \mathcal{N}(0, \sigma = 10),$$

$$z_i = -1 + 0.05\,\text{stand}(x_i) - 0.1\,\text{stand}(x_i)^2 + u_i, \quad u_i \sim \mathcal{N}(0, \sigma = 2),$$

where $\text{stand}(\cdot)$ denotes standardization to mean zero and unit variance.
For each iteration of the simulated data-generating process, I recover estimates of the effect of $z_i$ on $y_i$ via:

1. Naive OLS with no adjustment: $$y_i = \beta_0 + \beta_1z_i + \epsilon_i.$$
2. Multiple regression adjustment: $$y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i.$$
3. Multiple regression adjustment with a centered treatment-covariate interaction: $$y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \beta_3z_i\cdot(x_i - \bar{x}) + \epsilon_i.$$
4. Adjustment via a support vector machine (SVM), an alternative machine learner to random forests.
5. Adjustment via random forests (RFA).
Because both the SVM and random forest models take a while to run, I restricted the analysis to 100 iterations.
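For concreteness, here is a rough sketch of what a single iteration of this simulation might look like. The helper name one_iteration, the sample size, and the use of e1071::svm() for the SVM step are my own assumptions, not the exact code used to produce the results summarized below.

```r
# A sketch of one simulation iteration, assuming {ranger} and {e1071} are installed.
library(ranger)
library(e1071)

one_iteration <- function(n = 500) {
  stand <- function(v) (v - mean(v)) / sd(v)
  x <- rnorm(n)
  z <- -1 + 0.05 * stand(x) - 0.1 * stand(x)^2 + rnorm(n, sd = 2)
  y <- 1 + 5 * z + 0.5 * x + x^2 + rnorm(n, sd = 10)
  dat <- data.frame(y, z, x)

  # (1)-(3): naive OLS, linear adjustment, and a centered interaction
  naive  <- coef(lm(y ~ z, dat))[["z"]]
  linear <- coef(lm(y ~ z + x, dat))[["z"]]
  inter  <- coef(lm(y ~ z + x + z:I(x - mean(x)), dat))[["z"]]

  # (4): SVM adjustment -- residualize y and z on x, then regress
  y_svm <- y - fitted(svm(y ~ x, dat))
  z_svm <- z - fitted(svm(z ~ x, dat))
  sva   <- coef(lm(y_svm ~ z_svm))[[2]]

  # (5): random forest adjustment, using out-of-bag predictions
  y_rf <- y - ranger(y ~ x, dat)$predictions
  z_rf <- z - ranger(z ~ x, dat)$predictions
  rfa  <- coef(lm(y_rf ~ z_rf))[[2]]

  c(naive = naive, linear = linear, interaction = inter, SVA = sva, RFA = rfa)
}
```

Repeating one_iteration() 100 times and comparing each estimate to the true effect of 5 reproduces the kind of comparison summarized in the table and figure below.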
A summary of the performance of each of these approaches is given in the table below. Metrics I use to gauge performance include root mean squared error (RMSE), average bias, and coverage of the 95 percent confidence intervals.
results %>%
  group_by(model) %>%
  summarize(
    RMSE = sqrt(mean((estimate - ATE)^2)),
    "Ave. Bias" = mean(estimate - ATE),
    Coverage = mean(
      (((estimate - 1.96 * std.error) <= ATE) +
         ((estimate + 1.96 * std.error) >= ATE)) == 2
    )
  ) %>%
  mutate_if(is.numeric, function(x) round(x, 2)) %>%
  kableExtra::kable(format = "html")
RFA clearly performs the best out of all of the adjustment strategies. This is also clear from the following figure showing the distribution of estimates after 100 runs of the simulation. The estimates provided by RFA are the least prone to error and least biased. However, in terms of coverage, naive OLS, RFA, and SVA are relatively similar.
ATE <- results$ATE[1]
results %>%
  select(-ATE) %>%
  dotwhisker::small_multiple(
    dot_args = list(alpha = 0.15)
  ) +
  geom_hline(yintercept = c(0, ATE), linetype = c(1, 2)) +
  labs(
    caption = "SVA = Support Vector Adjustment;\nRFA = Random Forest Adjustment\n(Dashed line denotes true effect)",
    y = "Estimate with 95% CI",
    title = "Comparison of Different Adjustment Strategies"
  ) +
  ggridges::theme_ridges(
    font_size = 10
  ) +
  theme(
    strip.background = element_blank(),
    strip.text = element_blank(),
    legend.position = "none"
  )
To perform RFA analysis in R, I created a package called {RFA}, which can be installed by writing:
## install.packages("devtools")
devtools::install_github("milesdwilliams15/RFA")
Here, I demonstrate how the package functions can be used to estimate causal effects with observational data. I'll do this with the GerberGreenImai dataset included in the {Matching} package. This dataset was used in Imai (2005) to replicate and extend Gerber and Green's (2000) get-out-the-vote (GOTV) field experiment, and here it is used to assess the causal effect of telephone calls on turnout. Note that this data is used for demonstration purposes only, and so results here should not be taken as confirmation or refutation of the findings of the original authors.
First, let's get the data into the working environment:
library(tidyverse) # grammar
library(Matching)  # for the data
data(GerberGreenImai)
ggi <- GerberGreenImai # give it a shorter name
Next, to implement RFA, we need to load the {RFA} package:
library(RFA)
The workhorse function in the package is rfa(). The help file for the function can be accessed by entering ?rfa. The function accepts the following arguments:

- formula = a formula object where the left-hand side variable is the outcome and the right-hand side variable is the explanatory variable of interest.
- covariates = a right-hand side only formula object specifying the covariates to be used in the random forest regressions.
- fes_and_res = a formula object only containing the right-hand side, specifying any fixed effects or random effects. For random effects, you should use the notation '~ (1 | id)' as in the 'lme4' package.
- data = an optional data frame containing the variables used to implement the RFA routine.
- se_type = specifies the standard errors to be returned. If 'clusters' is not specified, the user can specify "classical", "HC0", "stata" (equivalent to "HC1"), "HC2", or "HC3". If 'clusters' is specified, the options are "CR0", "stata" (CR1), and "CR2". "stata" is the default.
- clusters = optional name (quoted) of the variable that corresponds to clusters in the data.
- ... = additional commands to override the default settings for implementing random forests via 'ranger'. See the ranger package for more details.

Below is an example of how to use rfa() with the ggi dataset. In the formula object below, the left-hand side variable is a binary response that equals 1 when an individual citizen turned out to vote in the 1998 congressional election in New Haven, CT. The treatment variable (whether the individual received a GOTV phone call) is the right-hand side variable in the formula. The covariates to control for are given to the covariates argument in a right-hand side only formula. These covariates are: an individual's age, whether they voted in the previous election (in 1996), the number of persons residing in their household, whether they are a new voter, and whether they support the majority party.
rfa_fit <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  data = ggi
)
The object rfa_fit consists of a list containing the fitted regression results:
rfa_fit$fit
the random forest regression for the response as a function of covariates:
rfa_fit$yrf
the random forest regression for the causal variable of interest as a function of covariates:
rfa_fit$xrf
and an updated dataframe that adds the processed response and explanatory variable to the data used in estimation:
head(rfa_fit$data)
From the above, we see that the RFA routine estimates that the effect of a GOTV phone call on the probability of turning out to vote is approximately `r round(tidy(rfa_fit$fit)[2,2],3)`. A summary for this estimate is provided by using summary_rfa(). The output also includes the OOB (out-of-bag) $R^2$ for the random forest regressions for the response and the explanatory variable.
summary_rfa(rfa_fit) %>%
  mutate_if(is.numeric, function(x) round(x, 3))
Sure enough, the ATE is statistically significant, with $p < 0.01$.
We can also use the plot_rfa() function to quickly make a coefficient plot for the treatment effect:
plot_rfa(rfa_fit)
Since plot_rfa() is a wrapper for ggplot2, we can modify the RFA plot however we see fit:
plot_rfa(rfa_fit, varname = "Phone Call") +
  labs(
    x = "Estimate",
    y = NULL,
    title = "Effect of GOTV phone calls on voter turnout"
  ) +
  geom_vline(xintercept = 0, lty = 2) +
  theme_test()
Fixed or random effects can be specified as well. We could do so using the WARD column in the data.
If we wanted ward fixed effects we would write:
fe_rfa <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  fes_and_res = ~ WARD, # FEs
  data = ggi
)
If we wanted ward random effects we would write:
re_rfa <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  fes_and_res = ~ (1 | WARD), # REs
  data = ggi
)
To compare the results, we can easily create a regression table using the {texreg} package:
library(texreg)
htmlreg(
  list(
    "(1)" = rfa_fit$fit,
    "(2)" = fe_rfa$fit,
    "(3)" = re_rfa$fit
  ),
  custom.coef.map = list(
    "xres" = "Phone Call"
  ),
  custom.header = list("Voter Turnout" = 1:3),
  include.ci = F,
  digits = 3,
  caption = "RFA estimates",
  caption.above = T
)
The point estimate reported via random forest adjustment may not always be the only, or even the most interesting, quantity of interest. Researchers may also want to assess which covariates in the data sample were the most substantial sources of confounding, and they may wish to characterize the effective sample used to identify the partial relationship between the explanatory variable and the response. As Aronow and Samii (2016) caution, the process of adjusting for covariates often distorts the data sample used to identify a causal relationship of interest. Reported estimates often reflect a local effect that generalizes only to a more circumscribed sample than the one the analysis began with.
Because it is so important to contextualize how adjustment for covariates changes the sample used to estimate treatment effects, the {RFA} package includes two additional functions that help in performing some diagnostics. These are get_cod() and get_importance().
The latter returns a data frame that reports variable importance metrics for each of the covariates used in the analysis, with respect to the response and explanatory variable of interest.
Usage for get_importance() is quite simple. First, if you'd like to obtain importance metrics, use the importance argument when you perform the RFA estimation to specify either "impurity" or "permutation". These are the two standard approaches to quantifying the relative predictive importance of covariates in random forests. The first sums the decrease in node impurity from splits on a given variable across all of the regression trees grown in the random forest (this is the Gini importance, or Mean Decrease in Impurity). The second is the average decrease in the accuracy of model predictions when a given variable's relationship with the outcome is "broken" via permutation (this is also known as Mean Decrease in Accuracy).
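As a point of reference, these are the same importance measures implemented by {ranger}, which rfa() uses for its random forest regressions. A minimal sketch of requesting them from ranger directly, assuming VOTED98 is stored as a numeric 0/1 variable:

```r
# Illustrative only: requesting each importance measure directly from {ranger}.
library(ranger)

rf_gini <- ranger(
  VOTED98 ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY,
  data = ggi, importance = "impurity"
)
rf_perm <- ranger(
  VOTED98 ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY,
  data = ggi, importance = "permutation"
)

rf_gini$variable.importance # Mean Decrease in Impurity (Gini importance)
rf_perm$variable.importance # Mean Decrease in Accuracy (permutation importance)
```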
For example, if we wanted to get the Gini importance for covariates we would simply specify:
rfa_fit <- rfa(
  formula = VOTED98 ~ PHN.C1, # treatment variable
  covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
  data = ggi,
  importance = "impurity" # Specify impurity importance
)
Then, we can use get_importance() to return the importance metrics of the covariates with respect to both the response and the explanatory variable of interest:
get_importance(rfa_fit)
These importance metrics can be easily visualized:
imp <- get_importance(rfa_fit)
imp %>%
  gather(
    key = "outcome",
    value = "value",
    -term
  ) %>%
  ggplot() +
  aes(
    x = value,
    y = reorder(term, value),
    color = outcome
  ) +
  geom_point() +
  scale_x_log10() +
  labs(
    x = "Gini Importance (log-10 scale)",
    y = NULL,
    title = "Variable Importance"
  )
They can also be analyzed to assess the extent to which covariate importance is correlated between the response and the explanatory variable of interest:
imp %>% summarise( correlation = cor(response, predictor) )
Outside of variable importance, we can also examine the "coefficient of distortion" (COD), which summarizes the difference, in standard deviation units, between the effective sample generated through random forest adjustment and the nominal sample (the original data sample collected). The COD helps summarize how (un)representative the sample used to identify the point estimate returned via random forest adjustment is of the data the analysis started with.
To summarize, the COD is calculated for a given covariate $p$ by first scaling it to standard deviation units and then taking its weighted average: $$\text{COD}_p = \sum_{i = 1}^N w_i\left[ \frac{x_{ip} - \bar{x}_p}{\text{SD}(x_{ip})}\right] \bigg/ \sum_{i=1}^N w_i.$$ The vector of weights $w_i$ used is simply the square of the residualized explanatory variable of interest ($\hat{z}_i^\varepsilon$): $$w_i = (\hat{z}_i^\varepsilon)^2 = [z_i - \hat{f}_z(X_i)]^2.$$
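To see how this formula maps onto the fitted object, here is a hand calculation of the COD for a single covariate (AGE). I'm assuming the residualized treatment is stored in rfa_fit$data under the name xres, the same name mapped to "Phone Call" in the texreg table above; get_cod() handles all of this for you.

```r
# Hand calculation of the COD for AGE; assumes rfa_fit$data contains the
# residualized treatment as a column named "xres" (an assumption on my part).
d <- rfa_fit$data
w <- d$xres^2                                # weights: squared residualized treatment
age_sd <- (d$AGE - mean(d$AGE)) / sd(d$AGE)  # AGE scaled to standard deviation units
sum(w * age_sd) / sum(w)                     # weighted average = COD for AGE
```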
The COD is calculated for each covariate used in estimation with the get_cod() function. The function accepts three arguments:

- rfa = the fitted rfa object.
- include_se = TRUE by default. This specifies whether you'd like to report standard errors and related summary statistics alongside the COD estimates per covariate.
- bootsims = the number of bootstrap iterations to perform to compute the standard errors. By default, this is set to 1,000.

Inference for each of the COD estimates is done via bootstrapping.
Applying get_cod() to the fitted rfa object, we get:
cods <- get_cod(rfa_fit, bootsims = 100) # set to 100 to speed up
cods # view
Like the importance metrics, these results can be easily visualized:
ggplot(cods) +
  aes(
    x = estimate,
    y = reorder(term, estimate^2),
    xmin = conf.low,
    xmax = conf.high
  ) +
  geom_point() +
  geom_errorbarh(height = 0) +
  geom_vline(
    xintercept = 0,
    lty = 2
  ) +
  labs(
    x = "Distortion with 95% confidence intervals",
    y = NULL,
    title = "Effective vs. Nominal Data Samples"
  )
These tools, combined, help to demystify random forest adjusted estimates. For instance, the importance metrics above reveal a moderately strong correlation between the importance of covariates in predicting the explanatory variable of interest (a get-out-the-vote phone call) and the response (voter turnout). This is suggestive of a nontrivial degree of confounding in the relationship between the explanatory variable and response.
Further, the COD estimates suggest that the process of adjusting for this confounding produces an effective sample that differs markedly from the nominal sample. Notably, the effective sample skews significantly older, is more likely to have voted in the previous congressional election two years prior, and is less likely to be a new voter, among other differences. This suggests a more constrained set of cases to which the identified relationship between GOTV phone calls and turnout can be generalized.
Taken together, the tools in the RFA package are powerful allies in the analysis of observational data. Random forest adjustment has the potential to be robust to various forms of confounding in observational data, while also being amenable to helpful diagnostics that contextualize the reported results.