semibart: Bayesian Semiparametric Regression with BART
In zeldow/semibart: Semiparametric Bayesian regression and structural mean models

Semiparametric regression using BART. For continuous outcomes y, the model is y = ω(x) + aβ + ε, where ε ~ N(0, σ^2), x is a set of covariates, and a is a smaller set of covariates (and possibly interactions) that may be of immediate scientific interest. The covariates a form the design matrix for the variables and interactions that are modeled parametrically. The functional form of ω(x) is left unspecified and is modeled using Bayesian Additive Regression Trees (BART) (Chipman et al., 2010). To complete the model, we use a normal prior on β and an inverse chi-square prior on σ^2.

For binary y, the model is P(Y=1 | x, a) = F(ω(x) + aβ), where F denotes the standard normal cdf (probit link) and BART is used to model the nonparametric ω(x).
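As a minimal sketch of the binary model, the snippet below simulates data from P(Y=1 | x, a) = F(ω(x) + aβ) using base R. The form of ω(x), the value of β, and the variable names are illustrative assumptions, not taken from the package. The semibart() call is guarded so the snippet still runs when the package is not installed, and it assumes the function recognizes a 0/1 outcome from y.train, as the argument descriptions suggest.

```r
set.seed(2)
n <- 500
x <- matrix(rnorm(n * 3), nrow = n, ncol = 3)  # covariates for the BART part
a <- rbinom(n, 1, 0.5)                         # binary exposure, modeled linearly

# Hypothetical nuisance function omega(x) and coefficient beta (assumptions)
omega <- sin(x[, 1]) + 0.5 * x[, 2]^2 - 0.5
beta  <- 0.8

# Probit link: F is the standard normal cdf, which is pnorm() in R
p <- pnorm(omega + a * beta)
y <- rbinom(n, 1, p)                           # outcomes are coded 0 or 1

# Fit only if the semibart package is available
if (requireNamespace("semibart", quietly = TRUE)) {
  sb <- semibart::semibart(x, as.matrix(a), y)
}
```

Here a enters the model as a one-column design matrix, so the draws of β describe the exposure effect on the probit scale.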

The covariates in the parametric and nonparametric parts may overlap. That is, a covariate included in x may also be included in the parametric part as an interaction with the exposure variable if effect modification is of scientific interest. In this case, special care is recommended as a larger sample size or larger number of trees may be needed.
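The overlap described above can be sketched as follows: the first column of x stays in the nonparametric part and also appears in the parametric design matrix as an interaction with the exposure. The data-generating values and the column names a and a_x1 are hypothetical, chosen only for illustration; the fit is guarded so the snippet runs without the package installed.

```r
set.seed(3)
n <- 300
x <- matrix(rnorm(n * 4), nrow = n, ncol = 4)
a <- rbinom(n, 1, 0.5)

# Parametric design matrix: main effect of a plus its interaction with x[, 1].
# x[, 1] remains a column of x, so it also enters the BART part.
a.train <- cbind(a = a, a_x1 = a * x[, 1])

# Illustrative outcome with effect modification by x[, 1]
y <- x[, 1]^2 + x[, 2] + 1.5 * a + 0.5 * a * x[, 1] + rnorm(n)

# The prior mean needs one entry per column of a.train
meanb <- rep(0, ncol(a.train))

if (requireNamespace("semibart", quietly = TRUE)) {
  sb <- semibart::semibart(x, a.train, y, meanb = meanb, ntree = 200)
}
```

With this design the second coefficient measures how the exposure effect changes per unit of x[, 1].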

For information on causal interpretations as a structural mean model, see Vansteelandt et al. (2014) and Zeldow et al. (2016).

semibart(x.train, a.train, y.train, sigest = NA, sigdf = 3,
  sigquant = 0.9, k = 2, power = 2, base = 0.95, meanb = rep(0,
  ncol(a.train)), sigb = 4, ntree = 200, ndpost = 1000, numcut = 100,
  usequants = FALSE, offset = 0, binarylink = "probit", verbose = TRUE,
  printevery = 100)

`x.train`	Design matrix of values to be modeled with BART.
`a.train`	Design matrix of values to be modeled linearly.
`y.train`	Vector of outcomes (continuous or binary). When binary, elements must be either 0 or 1.
`sigest`	Estimate of the regression error standard deviation. If sigest = NA (the default), the least squares estimate is used. If supplied, it must be a positive number. Ignored if y.train is binary.
`sigdf`	Degrees of freedom on prior for error variance.
`sigquant`	The quantile of the prior at which the rough estimate (see sigest) is placed. The closer the quantile is to 1, the more aggressive the fit will be, because more prior weight is put on error standard deviations (sigma) smaller than the rough estimate. Not used if y.train is binary.
`k`	For numeric y, k is the number of prior standard deviations that E(Y|x) = f(x) is away from +/-0.5. The response (y.train) is internally scaled to range from -0.5 to 0.5. For binary y, k is the number of prior standard deviations that f(x) is away from +/-3. In both cases, the bigger k is, the more conservative the fit will be.
`power`	Power parameter for prior on tree depth.
`base`	Base parameter on prior on tree depth.
`meanb`	Prior mean of the regression coefficients. Its length must equal the number of columns in a.train, that is, length(meanb) == ncol(a.train).
`sigb`	Prior standard deviation of the regression coefficients. The prior is β ~ N(meanb, sigb^2 I), where I is the identity matrix of appropriate dimension.
`ntree`	Number of trees to use for BART.
`ndpost`	Number of MCMC iterations, including burn-in.
`numcut`	Number of cutpoints for each variable in BART. Must be of length 1 or have length ncol(x.train).
`usequants`	Indicates whether to use observed quantiles for cutpoints or evenly spaced cutpoints based on min and max for each column in x.train.
`offset`	Offset for regression – used only when outcome is binary.
`binarylink`	Indicates whether to use probit or logit link for binary data. Currently only the probit link is supported.
`verbose`	Indicates whether or not user wants printed output to check progress of MCMC algorithm.
`printevery`	Indicates how often to print an update on completion of algorithm. Default is to print a message every 100 iterations. Ignored if verbose = FALSE.
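The arithmetic behind the description of k for a continuous outcome can be sketched in base R. This follows the standard BART prior of Chipman et al. (2010), in which each leaf value has an independent zero-mean normal prior; whether semibart uses exactly this form internally is an assumption, and the names y_scaled, sd_leaf, and sd_sum are illustrative.

```r
set.seed(4)
y <- rnorm(50, mean = 10, sd = 3)

# Rescale the response so that it ranges from -0.5 to 0.5
y_scaled <- (y - min(y)) / (max(y) - min(y)) - 0.5

k <- 2
ntree <- 200

# Prior sd of a single leaf value (standard BART choice, assumed here)
sd_leaf <- 0.5 / (k * sqrt(ntree))

# f(x) is a sum of ntree independent leaf values, so its prior sd is 0.5 / k
# and the bounds +/-0.5 lie k prior standard deviations from zero
sd_sum <- sd_leaf * sqrt(ntree)
```

Raising k shrinks sd_sum, which pulls the fitted function toward the center of the scaled response range and gives a more conservative fit.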

Returns a list containing a matrix of MCMC draws of the regression parameters (of dimension ndpost x ncol(a.train)). When y.train is continuous, the list also contains a vector of draws of the error variance. Retrieve the regression parameter draws with $beta and the error variance draws with $sigma.
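Because ndpost counts burn-in iterations, the early draws are usually discarded before summarizing. The sketch below shows one way to do that in base R. The object sb here is a simulated stand-in with the same shape as the described return value, not real semibart output, and the burn-in length of 200 is an arbitrary assumption.

```r
set.seed(5)
ndpost <- 1000

# Stand-in for the returned list: simulated draws, NOT real semibart output
sb <- list(
  beta  = matrix(rnorm(ndpost * 2, mean = c(2, 0.5), sd = 0.1),
                 ncol = 2, byrow = TRUE),
  sigma = 1 / rgamma(ndpost, shape = 50, rate = 50)
)

burn <- 200                          # hypothetical burn-in length
keep <- (burn + 1):ndpost
beta_post <- sb$beta[keep, , drop = FALSE]

# Posterior means and 95% credible intervals, one column per coefficient
post_mean  <- colMeans(beta_post)
cred_int   <- apply(beta_post, 2, quantile, probs = c(0.025, 0.975))
sigma_mean <- mean(sb$sigma[keep])
```

With a real fit, replace the stand-in list with the object returned by semibart(); the summary lines stay the same.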

Chipman, H., George, E., and McCulloch, R. (2010) Bayesian Additive Regression Trees. The Annals of Applied Statistics, 4, 1, 266-298.

Vansteelandt, S., and Joffe, M. (2014) Structural nested models and g-estimation: The partially realized promise. Statistical Science, 29, 4, 707-731.

Zeldow, B., Lo Re, V., and Roy, J. (2016) Bayesian semiparametric regression and structural mean models with BART.

set.seed(1)
n <- 200; nc <- 5
x <- matrix(rnorm(n * nc), nrow = n, ncol = nc)
a <- rbinom(n, 1, 0.5)
y <- 2 + 3 * x[ ,1] + 0.5 * x[ ,2] - 2 * x[ ,3] + 5 * x[ ,5] + 2 * a + rnorm(n)
## Not run: 
sb <- semibart(x, as.matrix(a), y)
## End(Not run)