knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE ) # save built-in output hook hook_output = knitr::knit_hooks$get("output") # set a new output hook to cut long output texts knitr::knit_hooks$set(output = function(x, options) { if (!is.null(n <- options$out.lines)) { x = xfun::split_lines(x) if (length(x) > n) { # cut x = c(head(x, n), '....\n') } x = paste(x, collapse = '\n') } hook_output(x, options) })
This package uses a two-step procedure to estimate the conditional average treatment effects (CATE) with potentially high-dimensional covariate(s). In the first stage, the nuisance functions necessary for identifying CATE can be estimated by machine learning methods, allowing the number of covariates to be comparable to or larger than the sample size. The second stage consists of a low-dimensional local linear regression, reducing CATE to a function of the covariate(s) of interest.
The CATE estimator implemented in this package not only allows for high-dimensional data, but also has the “double robustness” property: either the model for the propensity score or the models for the conditional means of the potential outcomes are allowed to be misspecified (but not both).
The algorithm used in this package is described in the paper by Fan et al., "Estimation of Conditional Average Treatment Effects With High-Dimensional Data" (2022), Journal of Business & Economic Statistics. (doi:10.1080/07350015.2020.1811102)
Install the package from github:
install.packages('devtools') devtools::install_github('microfan1/hdcate')
Load package:
library(hdcate)
Define the empirical data (which is a usual data.frame
), here we use a built-in simulated data:
set.seed(1) # set an arbitrary random seed to ensure the reproducibility data <- HDCATE.get_sim_data(n_obs=500, n_var=100, n_rel_var=4, intercept=10) # generate data
The simulated data contains 500 observations, each observation has 100 covariates, only the first 4 of those covariates (X1
, X2
, X3
and X4
) are actually present in the expectation function of potential outcome, and only the last 4 covariates (X97
, X98
, X99
and X100
) are present in the propensity score function.
Let's see what the data
looks like:
data
In the data, we know that the dependent variable is Y
, treatment variable is D
, and we use a model with the formula of independent variable as:
x_formula <- "X1+X2+X3+X4+X5+X6" x_formula
Note that the propensity score function is misspecified in this example. We shall see the "double robust" shortly.
Then, we can use the data
to create a model:
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
Pass this model
to an operating function HDCATE.set_condition_var()
to set the required conditional variable, here we want to know the conditional average treatment effects (CATE) for X2
, from X2=-1
to X2=1
, step by 0.01
:
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01)
Then we are ready to fit the model, pass the model
to the operating function HDCATE.fit()
to fit the model:
HDCATE.fit(model)
After the model fitting, we obtain the resulting average treatment effects (via accesing the ate
attribute of model
) and conditional average treatment effects (via accesing the hdcate
attribute of model
):
# ATE model$ate
# CATE model$hdcate
After fitting, we may want to know the uniform confidence bands, to achieve this, pass the fitted model
to the operating function HDCATE.inference()
:
HDCATE.inference(model)
Then we obtain the uniform confidence bands for 2-sided test, left-sided test and right-sided test by accessing the i_two_side
, i_left_side
and i_right_side
attributes of the model
, respectively:
# two-sided bands model$i_two_side
# left-sided bands model$i_left_side
# right-sided bands model$i_right_side
Finally, we may want a plot for the estimated results above:
HDCATE.plot(model) # Alternatively, if the model is not inferenced, use the code below: # HDCATE.plot(model, include_band=FALSE) # Alternatively, we may want to save the full plot as a high-definition PDF file: # HDCATE.plot(model, output_pdf=TRUE, pdf_name='hdcate_plot.pdf', # include_band=TRUE, test_side='both', y_axis_min='auto', y_axis_max='auto', # display.hdcate='HDCATEF', display.ate='ATE', display.siglevel='sig_level' # )
From the graph above, although the model we use is misspecified for the propensity score function (we used the formula Y~X1+X2+X3+X4+X5+X6
), we still obtain a robust result, which is very close to the actual CATE function (CATE(X2)=10+X2
). If we use a model that is misspecified for the expectation function of potential outcome, the results are also robust:
x_formula = paste(paste0('X', c(90:100)), collapse ='+') # x_formula="X90+...+X100" model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula) HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01) HDCATE.fit(model) HDCATE.inference(model) HDCATE.plot(model)
The results above show the "double robust" property of the estimating approach in this package, see Fan et al. (2022) for more theoretical results.
By default, when creating an HDCATE model, this package uses full-sample approach, but you are also very welcome to use the cross-fitting approach described in Fan et al. (2022), by simply pass model
into the operator HDCATE.use_cross_fitting()
right after the HDCATE model
has been created, that is:
# create the model first, or use an existing `model` and skip this line model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula) # suppose we want to use the 5-fold cross-fitting approach: HDCATE.use_cross_fitting(model, k_fold=5)
You may also want to set the folds manually, in this case, pass a list of index vector to the third argument of the operator:
# In this case, the param k_fold is auto detected, you can pass any value to it. HDCATE.use_cross_fitting(model, k_fold=1, folds=list(c(1:250), c(251:500)))
If we wants to use full-sample approach again, then we can use anothor operator HDCATE.use_full_sample()
.
HDCATE.use_full_sample(model)
By default, the package uses LASSO on the first stage (i.e. the estimation of the treated/untreated expectation functions and propensity score functions), but it is highly (and easily) extendable.
Researchers are welcome to define their own preferred methods (such as other ML methods) to run the first-stage estimation.
To define your own approach, you should define several functions that describe how you fit and predict the input data, and then pass those functions (and the model
object) to the operator HDCATE.set_first_stage()
. Then the package will use your method on the first stage.
The template of those functions are given as follows:
my_method_to_fit_expectation <- function(df) { # fit the input data.frame (df), where the first column is outcome value (Y) # and the rest are covariates in x_formula # ... # return a fitted object } my_method_to_predict_expectation <- function(fitted_model, df) { # use the fitted object (returned by your function above) to predict the conditional expectation # fucntion on the input data.frame (df) # ... # return a vector of predicted values } my_method_to_fit_propensity <- function(df) { # fit the input data.frame (df), where the first column is dummy treatment # value (D) and the rest are covariates in x_formula # ... # return a fitted object } my_method_to_predict_propensity <- function(fitted_model, df) { # use the fitted object (returned by your function above) to predict the propensity score # fucntion on the input data.frame (df) # ... # return a vector of predicted values }
Example 1 (LASSO): the default ML method in HDCATE.fit
is LASSO, users can manually define a ML method using the above template, and we again use LASSO as an example, similar practice applies for other ML methods:
Note: In practice, LASSO is the built-in method in the package, if you just want to use LASSO, there's no need to do the following.
x_formula <- "X1+X2+X3+X4+X5+X6" # propensity function is misspecified # create a model model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula) # using the template above, we can write: my_method_to_fit_expectation <- function(df) { # fit the input data.frame (df), where the first column is outcome value (Y) # and the rest are covariates in x_formula hdm::rlasso(as.formula(paste0('Y', "~", x_formula)), df) # return a fitted object } my_method_to_predict_expectation <- function(fitted_model, df) { # use the fitted object (fitted_model) to predict the conditional expectation # fucntion on the input data.frame (df) predict(fitted_model, df) # return a vector of predicted values } my_method_to_fit_propensity <- function(df) { # fit the input data.frame (df), where the first column is dummy treatment # variable (D) and the rest are covariates in x_formula hdm::rlassologit(as.formula(paste0('D', "~", x_formula)), df) # return a fitted object } my_method_to_predict_propensity <- function(fitted_model, df) { # use the fitted object (fitted_model) to predict the propensity score # fucntion on the input data.frame (df) predict(fitted_model, df, type="response") # return a vector of predicted values }
After defining these functions, we can easily plug them into the package using operator HDCATE.set_first_stage()
:
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
HDCATE.set_first_stage( model, fit.treated=my_method_to_fit_expectation, fit.untreated=my_method_to_fit_expectation, fit.propensity=my_method_to_fit_propensity, predict.treated=my_method_to_predict_expectation, predict.untreated=my_method_to_predict_expectation, predict.propensity=my_method_to_predict_propensity )
And then we can use the user-defined first-stage ML methods to estimate CATE:
HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01) HDCATE.fit(model) HDCATE.inference(model) HDCATE.plot(model)
We shall see the similar result, since the user-defined method in this example is the same as the built-in method (LASSO).
To clear the user-defined first-stage method and return to the default built-in method, simply call:
HDCATE.unset_first_stage(model)
Example 2 (Random Forest): Here we provide an additional example for another popular ML method: Random Forest, to further show how users can easily plug their own ML methods into this package:
model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula)
# propensity function is misspecified, for example x_formula <- "X1+X2+X3+X4+X5+X6" # create a model model <- HDCATE(data=data, y_name='Y', d_name='D', x_formula=x_formula) # =========== User-defined ML approach start =========== # using the template above, we can write: my_method_to_fit_expectation <- function(df) { # fit the input data.frame (df), where the first column is outcome value (Y) # and the rest are covariates in x_formula randomForest::randomForest(as.formula(paste0('Y', "~", x_formula)), data = df) # return a fitted object } my_method_to_predict_expectation <- function(fitted_model, df) { # use the fitted object (fitted_model) to predict the conditional expectation # fucntion on the input data.frame (df) predict(fitted_model, df) # return a vector of predicted values } my_method_to_fit_propensity <- function(df) { # fit the input data.frame (df), where the first column is dummy treatment # variable (D) and the rest are covariates in x_formula randomForest::randomForest(as.formula(paste0('D', "~", x_formula)), data = df) # return a fitted object } my_method_to_predict_propensity <- function(fitted_model, df) { # use the fitted object (fitted_model) to predict the propensity score # fucntion on the input data.frame (df) predict(fitted_model, df) # return a vector of predicted values } HDCATE.set_first_stage( model, fit.treated=my_method_to_fit_expectation, fit.untreated=my_method_to_fit_expectation, fit.propensity=my_method_to_fit_propensity, predict.treated=my_method_to_predict_expectation, predict.untreated=my_method_to_predict_expectation, predict.propensity=my_method_to_predict_propensity ) # =========== User-defined ML approach end =========== # set conditional variable in CATE, same as above HDCATE.set_condition_var(model, 'X2', min=-1, max=1, step=0.01) # fit, inference and plot HDCATE.fit(model) HDCATE.inference(model) HDCATE.plot(model)
As shown above, after applying user-defined Random Forest approach into this package, the CATE estimation result seems even better than that using LASSO in respect of confidence bands.
Futhermore, users may want to extract the fitted ML model for analysis, this can be achieved by accessing the propensity_score_model_list
, untreated_cond_exp_model_list
and treated_cond_exp_model_list
attributes of the model
object for the fitted propensity score model, fitted conditional expectation model (for untreated) and fitted conditional expectation model (for treated), respectively. The value of these attributes are "List of K
", where K
is the number of folds, K=1
when using the default full-sample estimator.
For example, we may want to calculate the variable importance in fitting treated conditional expectations using the fitted random forest model:
# the list have only one element since we use full-sample estimator rf_model <- model$treated_cond_exp_model_list[[1]] # then, we can use the fitted object `rf_model` for futher analysis. # the rest of codes are omitted since they are unconcerned for this package. # ...
# the code below uses randomForest and ggplot2 to plot out variable importance library('ggplot2') # visualization library('ggthemes') # visualization library('dplyr') # data manipulation importance <- randomForest::importance(rf_model) # varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,'MeanDecreaseGini'],2)) # for classification problem varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,'IncNodePurity'],2)) # for regression problem rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance)))) ggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) + geom_bar(stat='identity') + # geom_text(aes(x = Variables, y = 0.5, label = Rank), # hjust=0, vjust=0.55, size = 4, colour = 'red') + labs(x = 'Variables', y= 'Importance (measured in squared OOB residuals)') + coord_flip() + theme_few()
We can see that in this example, the powerful random forest model captures the actual relevant variable of conditional expectation functions of potential outcome (i.e. X1
, X2
, X3
and X4
) correctly.
Similar practice applies for other ML methods.
Users can define their own inference procedure by passing the following arguments into HDCATE.inference()
:
sig_level
: set significant level of the confidence bands, the default is 0.01, passing a vector of significant levels is also allowed.n_rep_boot
: set the number of bootstrap replications, the default is 1000.For example, we may want to construct uniform confidence bands with a larger number of bootstrap replications=3000, and use another significant level=5%:
HDCATE.inference(model, sig_level=0.05, n_rep_boot=3000)
On the second stage, we use rule-of-thumb to decide the bandwidth in the local linear regression, the bandwidth can also be manually set before calling HDCATE.fit()
:
HDCATE.set_bw(model, 0.15) # for example, set bandwidth=0.15
To reset to default rule-of-thumb approach, call:
HDCATE.set_bw(model)
Other details for using this package can be found on the detailed package documentation or the help files below:
help(HDCATE) help(HDCATE.set_condition_var) help(HDCATE.fit) help(HDCATE.inference) help(HDCATE.plot) help(HDCATE.use_cross_fitting) help(HDCATE.use_full_sample) help(HDCATE.set_first_stage) help(HDCATE.unset_first_stage) help(HDCATE.set_bw)
Qingliang Fan, Yu-Chin Hsu, Robert P. Lieli & Yichong Zhang (2022) Estimation of Conditional Average Treatment Effects With High-Dimensional Data, Journal of Business & Economic Statistics, 40:1, 313-327, DOI: 10.1080/07350015.2020.1811102
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.