# bart: Bayesian Additive Regression Trees In dbarts: Discrete Bayesian Additive Regression Trees Sampler

## Description

BART is a Bayesian “sum-of-trees” model in which each tree is constrained by a prior to be a weak learner.

• For numeric response y = f(x) + ε, where ε ~ N(0, σ^2).

• For binary response y, P(Y = 1 | x) = Φ(f(x)), where Φ denotes the standard normal cdf (probit link).

## Usage

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 bart(x.train, y.train, x.test = matrix(0.0, 0, 0), sigest = NA, sigdf = 3, sigquant = 0.90, k = 2.0, power = 2.0, base = 0.95, binaryOffset = 0.0, weights = NULL, ntree = 200, ndpost = 1000, nskip = 100, printevery = 100, keepevery = 1, keeptrainfits = TRUE, usequants = FALSE, numcut = 100, printcutoffs = 0, verbose = TRUE, nchain = 1, nthread = 1, combinechains = TRUE, keeptrees = FALSE, keepcall = TRUE, sampleronly = FALSE) bart2(formula, data, test, subset, weights, offset, offset.test = offset, sigest = NA_real_, sigdf = 3.0, sigquant = 0.90, k = NULL, power = 2.0, base = 0.95, n.trees = 75L, n.samples = 500L, n.burn = 500L, n.chains = 4L, n.threads = min(guessNumCores(), n.chains), combineChains = FALSE, n.cuts = 100L, useQuantiles = FALSE, n.thin = 1L, keepTrainingFits = TRUE, printEvery = 100L, printCutoffs = 0L, verbose = TRUE, keepTrees = FALSE, keepCall = TRUE, samplerOnly = FALSE, ...) ## S3 method for class 'bart' plot(x, plquants = c(0.05, 0.95), cols = c('blue', 'black'), ...) ## S3 method for class 'bart' predict(object, newdata, offset, type = c("ev", "ppd", "bart"), combineChains = TRUE, ...) extract(object, ...) ## S3 method for class 'bart' extract(object, type = c("ev", "ppd", "bart"), sample = c("train", "test"), combineChains = TRUE, ...) ## S3 method for class 'bart' fitted(object, type = c("ev", "ppd", "bart"), sample = c("train", "test"), ...) 

## Arguments

 x.train Explanatory variables for training (in sample) data. May be a matrix or a data frame, with rows corresponding to observations and columns to variables. If a variable is a factor in a data frame, it is replaced with dummies. Note that q dummies are created if q > 2 and one dummy is created if q = 2, where q is the number of levels of the factor. y.train Dependent variable for training (in sample) data. If y.train is numeric a continous response model is fit (normal errors). If y.train is a binary factor or has only values 0 and 1, then a binary response model with a probit link is fit. x.test Explanatory variables for test (out of sample) data. Should have same column structure as x.train. bart will generate draws of f(x) for each x which is a row of x.test. sigest For continuous response models, an estimate of the error variance, σ^2, used to calibrate an inverse-chi-squared prior used on that parameter. If not supplied, the least-squares estimate is derived instead. See sigquant for more information. Not applicable when y is binary. sigdf Degrees of freedom for error variance prior. Not applicable when y is binary. sigquant The quantile of the error variance prior that the rough estimate (sigest) is placed at. The closer the quantile is to 1, the more aggresive the fit will be as you are putting more prior weight on error standard deviations (σ) less than the rough estimate. Not applicable when y is binary. k For numeric y, k is the number of prior standard deviations E(Y|x) = f(x) is away from +/- 0.5. The response (y.train) is internally scaled to range from -0.5 to 0.5. For binary y, k is the number of prior standard deviations f(x) is away from +/- 3. In both cases, the bigger k is, the more conservative the fitting will be. The value can be either a fixed number, or the a hyperprior of the form chi(degreesOfFreedom = 1.25, scale = Inf). For bart2, the default of NULL uses the value 2 for continuous reponses and a chi hyperprior for binary ones. The default chi hyperprior is improper, and slightly penalizes small values of k. power Power parameter for tree prior. base Base parameter for tree prior. binaryOffset Used for binary y. When present, the model is P(Y = 1 | x) = Φ(f(x) + binaryOffset), allowing fits with probabilities shrunk towards values other than 0.5. weights An optional vector of weights to be used in the fitting process. When present, BART fits a model with observations y | x ~ N(f(x), σ^2 / w), where f(x) is the unknown function. ntree, n.trees The number of trees in the sum-of-trees formulation. ndpost, n.samples The number of posterior draws after burn in, ndpost / keepevery will actually be returned. nskip, n.burn Number of MCMC iterations to be treated as burn in. printevery, printEvery As the MCMC runs, a message is printed every printevery draws. keepevery, n.thin Every keepevery draw is kept to be returned to the user. Useful for “thinning” samples. keeptrainfits, keepTrainingFits If TRUE the draws of f(x) for x corresponding to the rows of x.train are returned. usequants, useQuantiles When TRUE, determine tree decision rules using estimated quantiles derived from the x.train variables. When FALSE, splits are determined using values equally spaced across the range of a variable. See details for more information. numcut, n.cuts The maximum number of possible values used in decision rules (see usequants, details). If a single number, it is recycled for all variables; otherwise must be a vector of length equal to ncol(x.train). Fewer rules may be used if a covariate lacks enough unique values. printcutoffs, printCutoffs The number of cutoff rules to printed to screen before the MCMC is run. Given a single integer, the same value will be used for all variables. If 0, nothing is printed. verbose Logical; if FALSE supress printing. nchain, n.chains Integer specifying how many independent tree sets and fits should be calculated. nthread, n.threads Integer specifying how many threads to use. Depending on the CPU architecture, using more than the number of chains can degrade performance for small/medium data sets. As such some calculations may be executed single threaded regardless. combinechains, combineChains Logical; if TRUE, samples will be returned in arrays of dimensions equal to nchain \times ndpost \times number of observations. keeptrees, keepTrees Logical; must be TRUE in order to use predict with the result of a bart fit. Note that for models with a large number of observations or a large number of trees, keeping the trees can be very memory intensive. keepcall, keepCall Logical; if FALSE, returned object will have call set to call("NULL"), otherwise the call used to instantiate BART. formula The same as x.train, the name reflecting that a formula object can be used instead. data The same as y.train, the name reflecting that a data frame can be specified when a formula is given instead. test The same as x.train. Can be missing. subset A vector of logicals or indicies used to subset of the data. Can be missing. offset The same as binaryOffset. Can be missing. offset.test A vector of offsets to be used with test data, in case it is different than the training offset. If offest is missing, defaults to NULL. object An object of class bart, returned from either the function bart or bart2. newdata Test data for prediction. Obeys all the same rules as x.train but cannot be missing. sampleronly, samplerOnly Builds the sampler from its arguments and returns it without running it. Useful to use the bart2 interface in more complicated models. x Object of class bart, returned by function bart, which contains the information to be plotted. plquants In the plots, beliefs about f(x) are indicated by plotting the posterior median and a lower and upper quantile. plquants is a double vector of length two giving the lower and upper quantiles. cols Vector of two colors. First color is used to plot the median of f(x) and the second color is used to plot the lower and upper quantiles. type The quantity to be returned by generic functions. Options are "ev" - samples from the posterior of the individual level expected value, "bart" - the sum of trees component; same as "ev" for linear models but on the probit scale for binary ones, and "ppd" - samples from the posterior predictive distribution. To synergize with predict.glm, "response" can be used as a synonym for "value" and "link" can be used as a synonym for "bart". sample Either "train" or "test". ... Additional arguments passed on to plot or dbartsControl, respectively. Not used in predict.

## Details

BART is an Bayesian MCMC method. At each MCMC interation, we produce a draw from the joint posterior (f, σ) | (x, y) in the numeric y case and just f in the binary y case.

Thus, unlike a lot of other modeling methods in R, bart does not produce a single model object from which fits and summaries may be extracted. The output consists of values f*(x) (and σ* in the numeric case) where * denotes a particular draw. The x is either a row from the training data (x.train) or the test data (x.test).

#### Decision Rules

Decision rules for any tree are of the form x ≤ c vs. x > c for each ‘x’ corresponding to a column of x.train. usequants determines the means by which the set of possible c is determined. If usequants is TRUE, then the c are a subset of the values interpolated half-way between the unique, sorted values obtained from the corresponding column of x.train. If usequants is FALSE, the cutoffs are equally spaced across the range of values taken on by the corresponding column of x.train.

The number of possible values of c is determined by numcut. If usequants is FALSE, numcut equally spaced cutoffs are used covering the range of values in the corresponding column of x.train. If usequants is TRUE, then for a variable the minimum of numcut and one less than the number of unique elements for that variable are used.

#### k

The amount of shrinkage of the node parameters is controlled by k. k can be given as either a fixed, positive number, or as any value that can be used to build a supported hyperprior. At present, only χ_ν s priors are supported, where ν is a degrees of freedom and s is a scale. Both values must be positive, however the scale can be infinite which yields an improper prior, interpretted as just the polynomial part of the distribution. If nu is 1 and s is , the prior is “flat”.

For BART on binary outcomes, the degree of overfitting can be highly sensitive to k so it is encouraged to consider a number of values. The default hyperprior for binary BART, chi(1.25, Inf), has been shown to work well in a large number of datasets, however crossvalidation may be helpful. Running for a short time with a flat prior may be helpful to see the range of values of k that are consistent with the data.

#### Generics

bart and rbart_vi support fitted to return the posterior mean of a predicted quantity, as well as predict to return a set of posterior samples for a different sample. In addition, the extract generic can be used to obtain the posterior samples for the training data or test data supplied during the initial fit.

Using predict with a bart object requires that it be fitted with the option keeptrees/keepTrees as TRUE. Keeping the trees for a fit can require a sizeable amount of memory and is off by default.

All generics return values on the scale of expected value of the response by default. This means that predict, extract, and fitted for binary outcomes return probabilities unless specifically the sum-of-trees component is requested (type = "bart"). This is in contrast to yhat.train/yhat.test that are returned with the fitted model.

#### Saving

saveing and loading fitted BART objects for use with predict requires that R's serialization mechanism be able to access the underlying trees, in addition to being fit with keeptrees/keepTrees as TRUE. For memory purposes, the trees are not stored as R objects unless specifically requested. To do this, one must “touch” the sampler's state object before saving, e.g. for a fitted object bartFit, execute invisible(bartFit$fit$state).

## Value

bart returns a list assigned class bart. For applicable quantities, ndpost / keepevery samples are returned. In the numeric y case, the list has components:

 yhat.train A array/matrix of posterior samples. The (i, j, k) value is the jth draw of the posterior of f evaluated at the kth row of x.train (i.e. f(x_k)) corresponding to chain i. When nchain is one or combinechains is TRUE, the result is a collapsed down to a matrix. yhat.test Same as yhat.train but now the xs are the rows of the test data. yhat.train.mean Vector of means of yhat.train across columns and chains, with length equal to the number of training observations. yhat.test.mean Vector of means of yhat.test across columns and chains. sigma Matrix of posterior samples of sigma, the residual/error standard deviation. Dimensions are equal to the number of chains times the numbers of samples unless nchain is one or combinechains is TRUE. first.sigma Burn-in draws of sigma. varcount A matrix with number of rows equal to the number of kept draws and each column corresponding to a training variable. Contains the total count of the number of times that variable is used in a tree decision rule (over all trees). sigest The rough error standard deviation (σ) used in the prior. y The input dependent vector of values for the dependent variable. This is used in plot.bart. fit Optional sampler object which stores the values of the tree splits. Required for using predict and only stored if keeptrees is TRUE. n.chains Information that can be lost if combinechains is TRUE is tracked here. k Optional matrix of posterior samples of k. Only present when k is modeled, i.e. there is a hyperprior.

In the binary y case, the returned list has the components yhat.train, yhat.test, and varcount as above. In addition the list has a binaryOffset component giving the value used.

Note that in the binary y, case yhat.train and yhat.test are f(x) + binaryOffset. For draws of the probability P(Y = 1 | x), apply the normal cdf (pnorm) to these values.

The plot method sets mfrow to c(1,2) and makes two plots. The first plot is the sequence of kept draws of σ including the burn-in draws. Initially these draws will decline as BART finds fit and then level off when the MCMC has burnt in. The second plot has y on the horizontal axis and posterior intervals for the corresponding f(x) on the vertical axis.

## Author(s)

Hugh Chipman: [email protected],
Robert McCulloch: [email protected],
Vincent Dorie: [email protected].

## References

Chipman, H., George, E., and McCulloch, R. (2009) BART: Bayesian Additive Regression Trees.

Chipman, H., George, E., and McCulloch R. (2006) Bayesian Ensemble Learning. Advances in Neural Information Processing Systems 19, Scholkopf, Platt and Hoffman, Eds., MIT Press, Cambridge, MA, 265-272.

both of the above at: http://www.rob-mcculloch.org

Friedman, J.H. (1991) Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.

pdbart

## Examples

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ## simulate data (example from Friedman MARS paper) ## y = f(x) + epsilon , epsilon ~ N(0, sigma) ## x consists of 10 variables, only first 5 matter f <- function(x) { 10 * sin(pi * x[,1] * x[,2]) + 20 * (x[,3] - 0.5)^2 + 10 * x[,4] + 5 * x[,5] } set.seed(99) sigma <- 1.0 n <- 100 x <- matrix(runif(n * 10), n, 10) Ey <- f(x) y <- rnorm(n, Ey, sigma) ## run BART set.seed(99) bartFit <- bart(x, y) plot(bartFit) ## compare BART fit to linear matter and truth = Ey lmFit <- lm(y ~ ., data.frame(x, y)) fitmat <- cbind(y, Ey, lmFit$fitted, bartFit$yhat.train.mean) colnames(fitmat) <- c('y', 'Ey', 'lm', 'bart') print(cor(fitmat)) 

### Example output Running BART with numeric y

number of trees: 200
Prior:
k: 2.000000
degrees of freedom in sigma prior: 3.000000
quantile in sigma prior: 0.900000
scale in sigma prior: 0.002181
power and base for tree prior: 2.000000 0.950000
use quantiles for rule cut points: false
data:
number of training observations: 100
number of test observations: 0
number of explanatory variables: 10
init sigma: 2.756573, curr sigma: 2.756573

Cutoff rules c in x<=c vs x>c
Number of cutoffs: (var: number of possible c):
(1: 100) (2: 100) (3: 100) (4: 100) (5: 100)
(6: 100) (7: 100) (8: 100) (9: 100) (10: 100)

Running mcmc loop:
iteration: 100 (of 1000)
iteration: 200 (of 1000)
iteration: 300 (of 1000)
iteration: 400 (of 1000)
iteration: 500 (of 1000)
iteration: 600 (of 1000)
iteration: 700 (of 1000)
iteration: 800 (of 1000)
iteration: 900 (of 1000)
iteration: 1000 (of 1000)
total seconds in loop: 0.766282

Tree sizes, last iteration:
2 2 2 2 3 2 3 1 2 3 2 3 2 2 3 3 2 3 2 2
2 3 3 2 2 1 3 3 2 3 2 2 5 5 2 2 3 2 2 2
2 2 3 2 3 2 3 2 1 2 2 3 3 3 3 2 3 2 2 2
3 3 1 2 2 3 3 2 2 3 3 4 2 2 2 3 2 1 2 1
4 2 3 2 2 2 2 2 2 5 3 4 2 2 2 2 2 2 4 2
3 2 2 2 2 3 3 2 2 2 3 2 2 3 1 3 2 2 2 2
2 2 2 3 3 2 2 3 2 2 2 3 2 2 3 2 4 2 2 2
3 2 2 2 2 1 5 3 2 2 3 2 1 2 3 2 2 2 2 2
2 2 2 3 3 4 4 2 3 3 2 2 3 2 3 2 2 3 2 2
4 1 2 2 2 2 2 2 2 2 2 2 3 2 2 2 1 2 3 2

Variable Usage, last iteration (var:count):
(1: 37) (2: 23) (3: 24) (4: 37) (5: 27)
(6: 23) (7: 34) (8: 20) (9: 28) (10: 18)

DONE BART

y        Ey        lm      bart
y    1.0000000 0.9847984 0.8841787 0.9982682
Ey   0.9847984 1.0000000 0.9009389 0.9887574
lm   0.8841787 0.9009389 1.0000000 0.8985059
bart 0.9982682 0.9887574 0.8985059 1.0000000


dbarts documentation built on March 20, 2020, 5:08 p.m.