knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
set.seed(42) library(bartcs)
The bartcs package finds confounders and treatment effect with Bayesian Additive Regression Trees (BART).
This tutorial will use The Infant Health and Development Program (IHDP) dataset. IHDP is a randomized experiment from 1985 to 1988 which studied the effect of home visits on cognitive test scores for infants. Our version is a synthetic dataset generated by @louizos:2017 which provides true values to compare with. The dataset consists of 6 continuous and 19 binary covariates with simulated outcome which is a cognitive test score.
data(ihdp, package = "bartcs") fit <- single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 10, num_chain = 4, num_post_sample = 100, num_burn_in = 100, verbose = FALSE ) fit
You can get mean and 95% credible interval of average treatment effect (ATE) and possible outcome Y1 and Y0.
ATE <- mean(ihdp$mu1 - ihdp$mu0) ATE mu1 <- mean(ihdp$mu1) mu1 mu0 <- mean(ihdp$mu0) mu0 mse <- mean((unlist(fit$mcmc_list[, "ATE"]) - ATE)^2) mse
True values of ATE, mu1 and mu0 all lies in 95% credible interval. Also, MSE between predicted and true values is 0.013.
Both separate_bart()
and single_bart()
fits multiple MCMC chains.
summary()
provides result for each and aggregated chain.
summary(fit)
You can get posterior inclusion probability for each variables.
plot(fit, method = "pip")
Since inclusion_plot()
is a wrapper function of ggcharts::bar_chart()
, you can use its arguments for better plot.
plot(fit, method = "pip", top_n = 10) plot(fit, method = "pip", threshold = 0.5)
With trace_plot()
, you can visually check trace of effects or other parameters.
plot(fit, method = "trace")
You can also use functions from coda
package for components of bartcs
object.
Components with mcmc_
prefix are mcmc.list
object from coda
package.
coda::gelman.diag(fit$mcmc_list[, "ATE"])
count_omp_thread()
Check whether OpenMP is supported. You need more than 1 thread for multi-threading. Due to overhead of multi-threading, using parallelization will NOT be effective with small and moderate size datasets.
For comparison purpose, I will create dataset with 40,000 rows by bootstrapping from IHDP dataset. Then, for fast computation, I will fit the model with most parameters set to 1.
idx <- sample(nrow(ihdp), 4e4, TRUE) ihdp <- ihdp[idx, ] microbenchmark::microbenchmark( simple = single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 1, num_chain = 1, num_post_sample = 10, num_burn_in = 0, verbose = FALSE, parallel = FALSE ), parallel = single_bart( Y = ihdp$y_factual, trt = ihdp$treatment, X = ihdp[, 6:30], num_tree = 1, num_chain = 1, num_post_sample = 10, num_burn_in = 0, verbose = FALSE, parallel = TRUE ), times = 50 )
Result show that parallelization reduces computation time.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.