In zeldow/semibart: Semiparametric Bayesian regression and structural mean models

Introduction

This document is intended to provide background and documentation for the semibart package in R. Let $Y$ be the outcome, which is continuous. Let $X = (L, A)$ represent covariates, which may be predictors of $Y$, where $A$ represents the exposure or treatment of interest and $L$ all other covariates. A general regression model is $Y = f(X) + \epsilon = f(A, L) + \epsilon$. Fully parametric linear regression sets $f(X) = X\beta$ and requires normal error terms with mean zero. The model used in this package is $Y = \omega(L) + A \beta + \epsilon$, where $\omega(L)$ is modelled nonparametrically using Bayesian Additive Regression Trees (BART). That is, many of the predictors of $Y$ are allowed to have unspecified functional form, while a smaller subset of covariates are modelled linearly.

BART is a Bayesian sum-of-trees model which can detect nonlinearities and interactions in the predictors with respect to an outcome using minimal tuning. That is, $\omega(L) = \sum_{i=1}^m g_i(L)$, where each $g_i(L)$ is a regression tree. Each individual tree is restricted to be small so that no tree yields undue influence on the overall fit. The

In addition to the typical prior assumptions and distributions on the BART part (see the BayesTree package), we also specify a multivariate normal prior distribution for $\psi$. That is, $\psi \sim N(0,\sigma^2_{\psi} I).$

When the outcome is binary, we use the probit link. That is $P(Y = 1) = \Phi(f(X))$, where $\Phi$ is the CDF for a standard normal variable.

Important Notes

As in the BayesTree package and as recommended in Chipman et al (2010), the outcome $Y$ is shifted and rescaled to be between $[-0.5, 0.5]$. This eases computation and prior choices for the BART part of the model. This is done automatically. Upon completion of the algorithm, all parameters are shifted and rescaled to the original scale of $Y$.

Function and Arguments

The following are the arguments for the semibart function.

x.train -- Design matrix of values to be modeled nonparametrically with BART. Number of rows equals the total number of observations and the number of columns is equal to number of covariates to be modelled in BART. Do not include interactions or higher order terms.

## Not run
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); 
x.train <- cbind(x1, x2, x3) #possible
x.train <- cbind(x1, x2, x3, x1*x3) ## DO NOT DO THIS. no interactions or higher order terms.

a.train -- This is a design matrix for the linear part of the model. At a minimum, must contain one column. This matrix can also include interactions or higher order terms. For example,

## Not run
n <- 100
a <- rbinom(n, 1, 0.5)
x1 <- rnorm(n)

## the followiing are possible for the linear part of the model
a.train <- cbind(a)  # for one dimensional linear model
a.train <- cbind(a, a*x1) #effect modification with x1. main effect of x1 included in x.train
a.train <- cbind(a, a*x1, x1) #both main effects and interaction included in linear model
a.train <- cbind(a, x1) #both main effects, no interaction
a.train <- cbind(a, a*a) #higher order terms

y.train -- Vector of outcomes. Can be binary or numeric. If binary everything must be either 0 or 1. The function will automatically detect whether the outcome is binary or continuous.
sigest -- Intial guess of the standard deviation of the regression error. If NA (the default), the function will use an estimate from least squares. This is set to 1 automatically if outcome is binary.
sigdf -- The number of degrees of freedom on the prior distribution for the error variance. See BayesTree package and Chipman et al (2010) for more details.
sigquant -- The quantile of the prior that the rough estimate (see sigest) is placed at. The closer the quantile is to 1, the more aggresive the fit will be as you are putting more prior weight on error standard deviations $\sigma^2$ less than the rough estimate. Not used if y.train is binary. See BayesTree package and Chipman et al (2010) for more details.
k - For numeric y, k is the number of prior standard deviations $E(Y|x) = f(x)$ is away from +/-.5. The response (y.train) is internally scaled to range from -0.5 to 0.5. For binary y, k is the number of prior standard deviations \eqn{f(x)} is away from +/-3. In both cases, the bigger k is, the more conservative the fitting will be. See BayesTree package and Chipman et al (2010) for more details.
power - Prior on tree depth. Must be in the interval $[0, \infty)$.
base - Prior on tree depth. Must be in the interval $(0, 1)$. Power and base are used as follows: \ \ P(end node is terminal) = base * (1 + d)^{-power}, where d is the current depth at a given node. \ \ See BayesTree package and Chipman et al (2010) for more details.
meanb - Prior mean for $\beta$, the parameters on the linear part of the model. Strongly recommended to use mean 0.
sigb - Prior standard deviation on $\beta$, the regression coefficients. The prior is $\beta \sim N(\text{meanb}, \text{sigb} \cdot I),$ where $I$ is the identity matrix of appropriate dimension (number of rows = length(meanb); number of columns = length(meanb)). Since the outcome y.train is shifted and rescaled to be between $[-0.5, 0.5]$, this doesn't have to be very large to be a diffuse prior. The default prior is $N(0, 4^2)$.
ntree - The number of trees to use for BART.
ndpost - Total number of MCMC draws requested.
numcut - Number of cutpoints for each variable in BART. Must be of length 1 or have length ncol(x.train).
usequants - Boolean to indicate whether to use observed quantiles for cutpoints or evenly spaced cutpoints based on min and max for each column in x.train.
offset - Offset for regression -- used only when outcome is binary.
binarylink - Indicates whether to use probit or logit link for binary data. Currently only the probit link is supported. Default is "probit." Will be updated once "logit" is supported.
verbose - Boolean value indicating whether or not to print output while running.
printevery - Positive number specifying how often to print updates if verbose = TRUE. Default is to print every 100 MCMC iterations.

Function Output

The semibart function outputs a list in R. The first element is a matrix with ndpost rows and ncol(a.train) columns, corresponding to the posterior draws for $\beta$. This can be retrieved with the \$ operator with the name "beta". For continuous outcomes, we also output the draws for $\sigma^2$ with the name "sigma."

#not run, no data
fake.res <- semibart(x,a,y,...)

#get draws for beta
fake.res$beta
fake.res$beta[1,] #first iteration
fake.res$beta[ndpost,] #last iteration
colMeans(fake.res$beta) #only works if beta is multidimensional
mean(fake.res$beta) # only meaningful if beta is 1-dimensional

#get draws for sigma
fake.res$sigma

Examples - Continuous Y

set.seed(11)
n <- 1000
p <- 5
x <- matrix(data = rnorm(n*p), nrow = n, ncol = p)
a <- rbinom(n, size = 1, prob = 0.5)
y <- 3 + 2 * x[ ,1] - 1 * x[ ,2] + 1.5 * x[ ,3] - 0.5 * x[ ,3] * x[ ,4] + 3 * x[ ,5] + 1 * a + rnorm(n)  

cont.res <- semibart(x.train = x, a.train = a, y.train = y)

Examples - Continuous Y - Two Variables of Interest

set.seed(111)
n <- 1000
p <- 5
x <- matrix(data = rnorm(n*p), nrow = n, ncol = p)
a1 <- rbinom(n, size = 1, prob = 0.5)
a2 <- rbinom(n, size = 1, prob = 0.35)
y <- 3 + 2 * x[ ,1] - 1 * x[ ,2] + 1.5 * x[ ,3] - 0.5 * x[ ,3] * x[ ,4] + 3 * x[ ,5] + 1 * a_1 - 2 * a_2 + rnorm(n)  

cont.res <- semibart(x.train = x, a.train = cbind(a_1, a_2), y.train = y)

Examples - Binary Y

set.seed(1143)
n <- 1000
p <- 5
x <- matrix(data = rnorm(n*p), nrow = n, ncol = p)
a <- rbinom(n, size = 1, prob = 0.5)
y <- rbinom(n, size = 1, prob = pnorm(0.3 + 0.2 * x[ ,1] - 0.15 * x[ ,2] + 0.5 * x[ ,3] - 0.25 x[ ,3] * x[ ,4] + 0.5 * x[ ,5] + 0.3 * a))

bin.res <- semibart(x.train = x, a.train = a, y.train = y)