knitr::opts_chunk$set(echo = TRUE)

Getting Started

Introductory Example: The GOTV data set.

The gotv data set is part of the causalToolbox library and it was taken from Gerber AS, Green DP, Larimer CW (2008). It contains 229,461 observations. For each of them, we observe seven covariates: sex, age, and whether or not they have voted in the general election in 2000 and 2002 or the primary election in 2000 and 2002. The goal of this study was to increase voter turnout (voted) in the 2006 primary election. The authors used a mailer that is indicated in the treatment column. For treated units the treatment variable is equal to 1 and for units in the control group, the treatment indicator is equal to 0.

library(causalToolbox)
head(gotv)

To estimate the CATE, we first have to train a CATE estimator. The package implements eight different estimators: MO_RF, S_RF, T_RF, and X_RF are based on the forestry package and MO_BART, S_BART, T_BART, and X_BART are based on the dbarts package.

set.seed(30019455)
train_ids <- sample.int(nrow(gotv), 1000)
test_ids <- sample((1:nrow(gotv))[-train_ids], 5)

feat <- gotv[train_ids , -(8:9)]
tr <- gotv$treatment[train_ids]
yobs <- gotv$voted[train_ids]

# Create a T_RF learner with given features, treatments, and observations
t_rf <- T_RF(feat = feat, tr = tr, yobs = yobs)
# Estimate CATE with a given learner and test data
cate_t_rf <- EstimateCate(t_rf, gotv[test_ids, ])
# Calculate the estimated confidence interval for the CATE
cateCI <- CateCI(t_rf, gotv[test_ids, ], B = 2, verbose = FALSE)
cateCI

The other Cate estimators are constructed similarly to this one. Next, we train our CATE estimators based on S_RF, X_BART, and T_BART respectively.

set.seed(50245181)

s_rf <- S_RF(feat = feat, tr = tr, yobs = yobs)
cate_s_rf <- EstimateCate(s_rf, gotv[test_ids, ])
x_bart <- X_BART(feat = feat, tr = tr, yobs = yobs)
cate_x_bart <- EstimateCate(x_bart, gotv[test_ids, ])
t_bart <- T_BART(feat = feat, tr = tr, yobs = yobs)
cate_t_bart <- EstimateCate(t_bart, gotv[test_ids, ])

cbind(cate_t_rf, cate_s_rf, cate_x_bart, cate_t_bart) 

Package strucutre and detailed example

The package goal of the package is to make heterogeneous treatment effect estimation simple and useful. The package is created in such a way that it encourages high scientific standards.

The core of the package are the CATE estimators. In total there are eight different CATE estimators implemented: S-RF, T-RF, X-RF, MO-RF, S-BART, T-BART, X-BART, and MO-BART.

Each of these functions

The causalToolbox package also implements a method to simulate many data sets:

simulated_experiment <- simulate_causal_experiment(
  ntrain = 1000,
  ntest = 1000,
  dim = 10,
  #setup = "complexTau",
  testseed = 293901,
  trainseed = 307017
)

This method can be used to compare different estimators in terms of MSE.

feat <- simulated_experiment$feat_tr
tr <- simulated_experiment$W_tr
yobs <- simulated_experiment$Yobs_tr
feature_test <- simulated_experiment$feat_te

# Create the hte object using honest Random Forests (RF)
xl_rf <- X_RF(feat = feat, tr = tr, yobs = yobs, verbose = FALSE)
tl_rf <- T_RF(feat = feat, tr = tr, yobs = yobs)
sl_rf <- S_RF(feat = feat, tr = tr, yobs = yobs)

xl_bart <- X_BART(feat = feat, tr = tr, yobs = yobs)
tl_bart <- T_BART(feat = feat, tr = tr, yobs = yobs)
sl_bart <- S_BART(feat = feat, tr = tr, yobs = yobs)

cate_esti_xrf <- EstimateCate(xl_rf, feature_test)
cate_esti_trf <- EstimateCate(tl_rf, feature_test)
cate_esti_srf <- EstimateCate(sl_rf, feature_test)

cate_esti_xbart <- EstimateCate(xl_bart, feature_test)
cate_esti_tbart <- EstimateCate(tl_bart, feature_test)
cate_esti_sbart <- EstimateCate(sl_bart, feature_test)
# Evaluate their performances
cate_true <- simulated_experiment$tau_te
c("mse_xrf" = mean((cate_esti_xrf - cate_true) ^ 2), 
  "mse_trf" = mean((cate_esti_trf - cate_true) ^ 2), 
  "mse_srf" = mean((cate_esti_srf - cate_true) ^ 2),
  "mse_xbart" = mean((cate_esti_xbart - cate_true) ^ 2),
  "mse_tbart" = mean((cate_esti_tbart - cate_true) ^ 2),
  "mse_sbart" = mean((cate_esti_sbart - cate_true) ^ 2))

One can also access the coverage of confidence intervals.

ci_srf <- CateCI(tl_rf, feature_test, B = 2, verbose = FALSE)
mean(ci_srf$X5. < cate_true & ci_srf$X95. > cate_true)


soerenkuenzel/causalToolbox documentation built on April 28, 2021, 5:19 a.m.