Introduction

This document is intended to provide background and documentation for the semibart package in R. Let $Y$ be the outcome, which is continuous. Let $X = (L, A)$ represent covariates, which may be predictors of $Y$, where $A$ represents the exposure or treatment of interest and $L$ all other covariates. A general regression model is $Y = f(X) + \epsilon = f(A, L) + \epsilon$. Fully parametric linear regression sets $f(X) = X\beta$ and requires normal error terms with mean zero. The model used in this package is $Y = \omega(L) + A \beta + \epsilon$, where $\omega(L)$ is modelled nonparametrically using Bayesian Additive Regression Trees (BART). That is, many of the predictors of $Y$ are allowed to have unspecified functional form, while a smaller subset of covariates are modelled linearly.

BART is a Bayesian sum-of-trees model which can detect nonlinearities and interactions in the predictors with respect to an outcome using minimal tuning. That is, $\omega(L) = \sum_{i=1}^m g_i(L)$, where each $g_i(L)$ is a regression tree. Each individual tree is restricted to be small so that no tree yields undue influence on the overall fit. The

In addition to the typical prior assumptions and distributions on the BART part (see the BayesTree package), we also specify a multivariate normal prior distribution for $\psi$. That is, $\psi \sim N(0,\sigma^2_{\psi} I).$

When the outcome is binary, we use the probit link. That is $P(Y = 1) = \Phi(f(X))$, where $\Phi$ is the CDF for a standard normal variable.

Important Notes

As in the BayesTree package and as recommended in Chipman et al (2010), the outcome $Y$ is shifted and rescaled to be between $[-0.5, 0.5]$. This eases computation and prior choices for the BART part of the model. This is done automatically. Upon completion of the algorithm, all parameters are shifted and rescaled to the original scale of $Y$.

Function and Arguments

The following are the arguments for the semibart function.

## Not run
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); 
x.train <- cbind(x1, x2, x3) #possible
x.train <- cbind(x1, x2, x3, x1*x3) ## DO NOT DO THIS. no interactions or higher order terms.
## Not run
n <- 100
a <- rbinom(n, 1, 0.5)
x1 <- rnorm(n)

## the followiing are possible for the linear part of the model
a.train <- cbind(a)  # for one dimensional linear model
a.train <- cbind(a, a*x1) #effect modification with x1. main effect of x1 included in x.train
a.train <- cbind(a, a*x1, x1) #both main effects and interaction included in linear model
a.train <- cbind(a, x1) #both main effects, no interaction
a.train <- cbind(a, a*a) #higher order terms

Function Output

The semibart function outputs a list in R. The first element is a matrix with ndpost rows and ncol(a.train) columns, corresponding to the posterior draws for $\beta$. This can be retrieved with the \$ operator with the name "beta". For continuous outcomes, we also output the draws for $\sigma^2$ with the name "sigma."

#not run, no data
fake.res <- semibart(x,a,y,...)

#get draws for beta
fake.res$beta
fake.res$beta[1,] #first iteration
fake.res$beta[ndpost,] #last iteration
colMeans(fake.res$beta) #only works if beta is multidimensional
mean(fake.res$beta) # only meaningful if beta is 1-dimensional

#get draws for sigma
fake.res$sigma

Examples - Continuous Y

set.seed(11)
n <- 1000
p <- 5
x <- matrix(data = rnorm(n*p), nrow = n, ncol = p)
a <- rbinom(n, size = 1, prob = 0.5)
y <- 3 + 2 * x[ ,1] - 1 * x[ ,2] + 1.5 * x[ ,3] - 0.5 * x[ ,3] * x[ ,4] + 3 * x[ ,5] + 1 * a + rnorm(n)  

cont.res <- semibart(x.train = x, a.train = a, y.train = y)

Examples - Continuous Y - Two Variables of Interest

set.seed(111)
n <- 1000
p <- 5
x <- matrix(data = rnorm(n*p), nrow = n, ncol = p)
a1 <- rbinom(n, size = 1, prob = 0.5)
a2 <- rbinom(n, size = 1, prob = 0.35)
y <- 3 + 2 * x[ ,1] - 1 * x[ ,2] + 1.5 * x[ ,3] - 0.5 * x[ ,3] * x[ ,4] + 3 * x[ ,5] + 1 * a_1 - 2 * a_2 + rnorm(n)  

cont.res <- semibart(x.train = x, a.train = cbind(a_1, a_2), y.train = y)

Examples - Binary Y

set.seed(1143)
n <- 1000
p <- 5
x <- matrix(data = rnorm(n*p), nrow = n, ncol = p)
a <- rbinom(n, size = 1, prob = 0.5)
y <- rbinom(n, size = 1, prob = pnorm(0.3 + 0.2 * x[ ,1] - 0.15 * x[ ,2] + 0.5 * x[ ,3] - 0.25 x[ ,3] * x[ ,4] + 0.5 * x[ ,5] + 0.3 * a))

bin.res <- semibart(x.train = x, a.train = a, y.train = y)


zeldow/semibart documentation built on May 4, 2019, 10:15 p.m.