semibart: Bayesian Semiparametric Regression with BART


View source: R/semibart.R

Description

Semiparametric regression using BART. For continuous outcomes y, the model is y = ω(x) + aβ + ε, where ε ~ N(0, σ^2), x is a set of covariates, and a is a smaller set of covariates (and possibly interactions) of immediate scientific interest. The covariates a form the design matrix for the variables and interactions that are modeled parametrically. The functional form of ω(x) is left unspecified and is modeled using Bayesian Additive Regression Trees (BART) (Chipman et al., 2010). To complete the model, we use a normal prior on β and an inverse chi-square prior on σ^2.

For binary y, the model is P(Y=1 | x, a) = F(ω(x) + aβ), where F denotes the standard normal cdf (probit link) and BART is used to model the nonparametric ω(x).
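As a rough illustration of the binary model above, the following sketch simulates data from P(Y=1 | x, a) = F(ω(x) + aβ) in base R. The form of omega and the value of beta are hypothetical choices made only for this example; they are not taken from the package.

```r
set.seed(1)
n <- 500
x <- matrix(rnorm(n * 2), nrow = n)   # covariates for the nonparametric part
a <- rbinom(n, 1, 0.5)                # exposure modeled parametrically

# Hypothetical omega(x) and beta, chosen only for illustration.
omega <- sin(x[, 1]) + 0.5 * x[, 2]^2 - 0.5
beta <- 1

p <- pnorm(omega + a * beta)          # F is the standard normal cdf (probit link)
y <- rbinom(n, 1, p)                  # binary outcome coded 0/1, as y.train requires
```

Data generated this way could then be passed to semibart as x.train = x, a.train = as.matrix(a), y.train = y.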

The covariates in the parametric and nonparametric parts may overlap. That is, a covariate included in x may also be included in the parametric part as an interaction with the exposure variable if effect modification is of scientific interest. In this case, special care is recommended as a larger sample size or larger number of trees may be needed.
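One way to set up such an overlap is sketched below, assuming effect modification of the exposure a by the first column of x is of interest. The column names a and a_x1 are hypothetical labels introduced for this example.

```r
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), nrow = n)
a <- rbinom(n, 1, 0.5)

# Parametric design: main effect of a plus its interaction with x[, 1].
a.train <- cbind(a = a, a_x1 = a * x[, 1])

# x[, 1] stays in x.train, so it also enters the nonparametric part.
x.train <- x

# The prior mean must have one entry per column of a.train.
meanb <- rep(0, ncol(a.train))

# With an outcome vector y of length n, the call would look like:
# sb <- semibart(x.train, a.train, y, meanb = meanb)
```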

For information on causal interpretations as a structural mean model, see Vansteelandt et al. (2014) and Zeldow et al. (2016).

Usage

semibart(x.train, a.train, y.train, sigest = NA, sigdf = 3,
  sigquant = 0.9, k = 2, power = 2, base = 0.95, meanb = rep(0,
  ncol(a.train)), sigb = 4, ntree = 200, ndpost = 1000, numcut = 100,
  usequants = FALSE, offset = 0, binarylink = "probit", verbose = TRUE,
  printevery = 100)

Arguments

x.train

Design matrix of values to be modeled with BART.

a.train

Design matrix of values to be modeled linearly.

y.train

Vector of outcomes (continuous or binary). When binary, elements must be either 0 or 1.

sigest

Rough estimate of the regression error standard deviation. If sigest = NA (the default), the least squares estimate is used. If supplied, it must be a positive number. Ignored if y.train is binary.

sigdf

Degrees of freedom on prior for error variance.

sigquant

The quantile of the prior at which the rough estimate (see sigest) is placed. The closer the quantile is to 1, the more aggressive the fit will be, because more prior weight is put on error standard deviations (sigma) smaller than the rough estimate. Not used if y.train is binary.
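The sketch below shows how sigest, sigdf and sigquant would jointly fix the prior under the calibration described in Chipman et al. (2010), where σ^2 ~ ν λ / χ^2_ν with ν = sigdf and λ chosen so that P(σ < sigest) = sigquant. That semibart uses exactly this calibration is an assumption here, and the value of sigest is hypothetical.

```r
set.seed(1)
sigest <- 1.5     # hypothetical rough estimate of sigma
sigdf <- 3
sigquant <- 0.9

# Scale lambda that places sigest at the sigquant quantile of the prior on sigma.
lambda <- sigest^2 * qchisq(1 - sigquant, df = sigdf) / sigdf

# Monte Carlo check: draw sigma^2 from the scaled inverse chi-square prior.
sigma2_draws <- sigdf * lambda / rchisq(1e5, df = sigdf)
prior_prob <- mean(sqrt(sigma2_draws) < sigest)   # should be close to sigquant
```

Raising sigquant toward 1 shrinks lambda, which pushes prior mass toward smaller error standard deviations.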

k

For numeric y, k is the number of prior standard deviations that E(Y|x) = f(x) is away from +/-0.5. The response (y.train) is internally scaled to range from -0.5 to 0.5. For binary y, k is the number of prior standard deviations that f(x) is away from +/-3. In both cases, the larger k is, the more conservative the fit will be.
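To make the role of k concrete, the sketch below computes the implied prior standard deviation of a single leaf value, assuming the usual BART convention from Chipman et al. (2010) in which each of the ntree leaf values has prior N(0, σ_μ^2) with σ_μ = 0.5 / (k sqrt(ntree)), or 3 / (k sqrt(ntree)) for binary y. The helper leaf_prior_sd is hypothetical and not part of the package.

```r
# Hypothetical helper: prior SD of one leaf value under the BART convention.
leaf_prior_sd <- function(k, ntree, half_range = 0.5) {
  half_range / (k * sqrt(ntree))
}

sd_cont <- leaf_prior_sd(k = 2, ntree = 200)                  # continuous y
sd_bin  <- leaf_prior_sd(k = 2, ntree = 200, half_range = 3)  # binary y

# The sum of ntree independent leaf values has prior SD half_range / k,
# so +/-0.5 sits k standard deviations from the prior mean of 0.
sum_sd <- sqrt(200) * sd_cont
```

Doubling k halves these standard deviations, shrinking the fit more strongly toward zero.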

power

Power parameter for prior on tree depth.

base

Base parameter on prior on tree depth.
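In the tree prior of Chipman et al. (2010), a node at depth d (root at d = 0) splits with probability base * (1 + d)^(-power). Assuming semibart follows that form, the sketch below evaluates the split probabilities under the default values. The function split_prob is a hypothetical helper written for this example.

```r
# Hypothetical helper: prior probability that a node at a given depth splits.
split_prob <- function(depth, base = 0.95, power = 2) {
  base * (1 + depth)^(-power)
}

depths <- 0:3
probs <- split_prob(depths)
names(probs) <- paste0("depth_", depths)
```

With the defaults the probability falls quickly with depth, which keeps individual trees shallow.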

meanb

Prior mean on regression coefficients. Its length must equal the number of columns in a.train, that is, length(meanb) == ncol(a.train).

sigb

Prior standard deviation on regression coefficients. The prior is β ~ N(meanb, sigb^2 I), where I is the identity matrix of appropriate dimension.

ntree

Number of trees to use for BART.

ndpost

Number of MCMC iterations, including burn-in.

numcut

Number of cutpoints for each variable in BART. Must be of length 1 or have length ncol(x.train).

usequants

Indicates whether to use observed quantiles for cutpoints or evenly spaced cutpoints based on min and max for each column in x.train.
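The difference between the two cutpoint schemes can be sketched for a single skewed covariate as follows. The exact grid the package builds internally is not stated here, so both constructions are illustrative assumptions, and numcut is set to a small hypothetical value for readability.

```r
set.seed(1)
xj <- rexp(200)   # one skewed column of x.train
numcut <- 10      # hypothetical small value

# Evenly spaced: numcut interior points between the observed min and max.
grid <- seq(min(xj), max(xj), length.out = numcut + 2)
even_cuts <- grid[-c(1, numcut + 2)]

# Quantile-based: numcut interior quantiles of the observed values.
quant_cuts <- quantile(xj, probs = seq_len(numcut) / (numcut + 1), names = FALSE)
```

For skewed covariates, quantile-based cutpoints concentrate where the data are dense, while evenly spaced cutpoints may leave many candidate splits in sparse regions.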

offset

Offset for regression – used only when outcome is binary.

binarylink

Indicates whether to use probit or logit link for binary data. Currently only the probit link is supported.

verbose

Indicates whether or not user wants printed output to check progress of MCMC algorithm.

printevery

Indicates how often to print an update on completion of algorithm. Default is to print a message every 100 iterations. Ignored if verbose = FALSE.

Value

Returns a list containing a matrix of MCMC draws for the regression parameters (of dimension ndpost x ncol(a.train)). When y.train is continuous, the list also contains a vector of draws of the error variance. The regression parameter draws are available as $beta and the variance draws as $sigma.
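Because ndpost counts burn-in iterations, the returned draws typically need the initial iterations dropped before summarizing. The sketch below shows one way to do this. The matrix beta_draws is a simulated stand-in for sb$beta, not real output, and the burn-in length burn is a hypothetical choice.

```r
set.seed(1)
ndpost <- 1000

# Simulated stand-in for sb$beta: ndpost rows, one column per column of a.train.
beta_draws <- matrix(rnorm(ndpost * 2, mean = 2, sd = 0.2), nrow = ndpost, ncol = 2)

burn <- 200   # hypothetical burn-in length
kept <- beta_draws[-seq_len(burn), , drop = FALSE]

post_mean <- colMeans(kept)                                   # posterior means
post_ci <- apply(kept, 2, quantile, probs = c(0.025, 0.975))  # 95% credible intervals
```

With a fitted object, replacing beta_draws by sb$beta gives the corresponding summaries for the regression coefficients.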

References

Chipman, H., George, E., and McCulloch, R. (2010) Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4(1), 266-298.

Vansteelandt, S., and Joffe, M. (2014) Structural nested models and g-estimation: The partially realized promise. Statistical Science, 707-731.

Zeldow, B., Lo Re, V., and Roy, J. (2016) Bayesian semiparametric regression and structural mean models with BART.

Examples

set.seed(1)
n <- 200; nc <- 5
x <- matrix(rnorm(n * nc), nrow = n, ncol = nc)
a <- rbinom(n, 1, 0.5)
y <- 2 + 3 * x[ ,1] + 0.5 * x[ ,2] - 2 * x[ ,3] + 5 * x[ ,5] + 2 * a + rnorm(n)
## Not run: 
sb <- semibart(x, as.matrix(a), y)
## End(Not run)

zeldow/semibart documentation built on May 4, 2019, 10:15 p.m.