mc.lbart | R Documentation |
BART is a Bayesian “sum-of-trees” model.
For a numeric response y, we have y = f(x) + \epsilon, where \epsilon \sim Log(0, 1).
For a binary response y, P(Y=1 | x) = F(f(x)), where F denotes the standard logistic CDF (logit link).
In both cases, f is the sum of many tree models.
The goal is to have very flexible inference for the unknown function f.
In the spirit of “ensemble models”, each tree is constrained by a prior to be a weak learner so that it contributes a small amount to the overall fit.
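The "sum of weak learners" idea can be sketched in base R. This is a toy illustration only: the helper `stump()` and the three hand-picked step functions are hypothetical stand-ins for trees and do not reflect how BART builds or samples them.

```r
# Toy illustration only: three hand-written "stumps" stand in for trees.
# This is not how BART builds or samples its trees.
stump <- function(x, cut, left, right) ifelse(x < cut, left, right)

x <- seq(-2, 2, length.out = 9)

# Each stump contributes only a small step to the total.
g1 <- stump(x, cut = -1, left = -0.3, right = 0.1)
g2 <- stump(x, cut =  0, left = -0.2, right = 0.2)
g3 <- stump(x, cut =  1, left = -0.1, right = 0.3)

f_hat <- g1 + g2 + g3      # the "sum of trees"
p_hat <- plogis(f_hat)     # logit link: P(Y = 1 | x) = F(f(x))
```

No single stump explains much on its own; the fit comes from adding many small contributions and passing the sum through the logistic CDF.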
mc.lbart(
x.train, y.train, x.test=matrix(0.0,0,0),
sparse=FALSE, a=0.5, b=1, augment=FALSE, rho=NULL,
xinfo=matrix(0.0,0,0), usequants=FALSE,
cont=FALSE, rm.const=TRUE, tau.interval=0.95,
k=2.0, power=2.0, base=.95,
binaryOffset=NULL,
ntree=50L, numcut=100L,
ndpost=1000L, nskip=100L,
keepevery=1L, printevery=100,
keeptrainfits=TRUE, transposed=FALSE,
mc.cores = 2L, nice = 19L,
seed = 99L
)
x.train |
Explanatory variables for training (in sample) data. |
y.train |
Dependent variable for training (in sample) data. |
x.test |
Explanatory variables for test (out of sample) data. |
sparse |
Whether to perform variable selection based on a sparse Dirichlet prior rather than simply uniform; see Linero 2016. |
a |
Sparse parameter for |
b |
Sparse parameter for |
rho |
Sparse parameter: typically |
augment |
Whether data augmentation is to be performed in sparse variable selection. |
xinfo |
You can provide the cutpoints to BART or let BART
choose them for you. To provide them, use the |
usequants |
If |
cont |
Whether or not to assume all variables are continuous. |
rm.const |
Whether or not to remove constant variables. |
tau.interval |
The width of the interval to scale the variance for the terminal leaf values. |
k |
For numeric y,
k is the number of prior standard deviations |
power |
Power parameter for tree prior. |
base |
Base parameter for tree prior. |
binaryOffset |
Used for binary |
ntree |
The number of trees in the sum. |
numcut |
The number of possible values of c (see usequants).
If a single number is given, this is used for all variables.
Otherwise a vector with length equal to ncol(x.train) is required,
where the |
ndpost |
The number of posterior draws returned. |
nskip |
Number of MCMC iterations to be treated as burn in. |
keepevery |
Every keepevery-th draw is kept and returned to the user. |
printevery |
As the MCMC runs, a message is printed every printevery draws. |
keeptrainfits |
Whether to keep |
transposed |
When running |
seed |
Setting the seed required for reproducible MCMC. |
mc.cores |
Number of cores to employ in parallel. |
nice |
Set the job niceness. The default niceness is 19: niceness goes from 0 (highest) to 19 (lowest). |
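The interplay of nskip, ndpost and keepevery can be sketched with simple bookkeeping. The sketch assumes a single chain that discards nskip iterations and then stores every keepevery-th iteration until ndpost draws are held; how mc.lbart divides the work across mc.cores is not modelled here, and the value keepevery = 5 is chosen only for illustration.

```r
# Assumption: a single chain discards nskip iterations, then keeps every
# keepevery-th iteration until ndpost draws have been stored.
nskip <- 100L; ndpost <- 1000L; keepevery <- 5L

total_iter <- nskip + ndpost * keepevery
kept_iter  <- nskip + seq_len(ndpost) * keepevery  # iteration index of each kept draw

length(kept_iter)   # number of kept draws
range(kept_iter)    # first and last kept iteration
```

Under this assumption, raising keepevery lengthens the run without changing the number of draws returned.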
BART is a Bayesian MCMC method.
At each MCMC iteration, we produce a draw from the joint posterior (f,\sigma) | (x,y) in the numeric y case and just f in the binary y case.
Thus, unlike many other modelling methods in R, we do not produce a single model object from which fits and summaries may be extracted. The output consists of values f^*(x) (and \sigma^* in the numeric case), where * denotes a particular draw.
The x is either a row from the training data (x.train) or the test data (x.test).
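Because the output is a set of draws rather than a single fit, summaries are computed across draws. A minimal sketch, using a simulated matrix `fdraws` as a hypothetical stand-in for real draws of f^*(x) (one row per draw, one column per x):

```r
set.seed(1)
ndraw <- 200; nx <- 4

# Hypothetical stand-in for a matrix of draws f*(x): one row per draw,
# one column per x. Real draws would come from the fitted object.
fdraws <- matrix(rnorm(ndraw * nx, mean = rep(c(-1, 0, 1, 2), each = ndraw)),
                 nrow = ndraw, ncol = nx)

post_mean <- colMeans(fdraws)                                  # point estimate per x
post_int  <- apply(fdraws, 2, quantile, probs = c(.025, .975)) # 95% interval per x
```

The column mean serves as a point estimate for each x, and the column quantiles give a pointwise posterior interval.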
mc.lbart returns an object of type lbart, which is essentially a list.
yhat.train |
A matrix with ndpost rows and nrow(x.train) columns.
Each row corresponds to a draw |
yhat.test |
Same as yhat.train but now the x's are the rows of the test data. |
yhat.train.mean |
train data fits = mean of yhat.train columns. |
yhat.test.mean |
test data fits = mean of yhat.test columns. |
varcount |
A matrix with ndpost rows and ncol(x.train) columns. Each row is for a draw. For each variable (corresponding to the columns), the total count of the number of times that variable is used in a tree decision rule (over all trees) is given. |
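One way to read varcount is to average the counts over draws and compare variables. The matrix `vc` below is simulated and purely hypothetical; with a real fit, the varcount component would take its place.

```r
set.seed(2)
# Hypothetical stand-in for varcount: 100 draws (rows) by 3 predictors (columns).
vc <- matrix(rpois(300, lambda = rep(c(20, 5, 1), each = 100)),
             nrow = 100, ncol = 3,
             dimnames = list(NULL, c("x1", "x2", "x3")))

avg_use  <- colMeans(vc)            # average number of splitting rules per draw
use_prop <- avg_use / sum(avg_use)  # share of all splitting rules
sort(use_prop, decreasing = TRUE)
```

A variable that is used in many splitting rules across draws is one the model relies on heavily, though usage counts are only a rough guide to importance.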
In addition, the list has a binaryOffset component giving the value used.
Note that in the binary y case, yhat.train and yhat.test are f(x) + binaryOffset. If you want draws of the probability P(Y=1 | x), you need to apply the logistic CDF (plogis) to these values.
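The conversion to the probability scale can be sketched as follows. The matrix `yhat_draws` is simulated and stands in, hypothetically, for yhat.test from a fitted object (values of f(x) + binaryOffset, one row per draw).

```r
set.seed(3)
# Hypothetical stand-in for fit$yhat.test: 500 draws (rows) at 3 test points.
yhat_draws <- matrix(rnorm(1500, mean = rep(c(-2, 0, 2), each = 500), sd = 0.5),
                     nrow = 500, ncol = 3)

prob_draws <- plogis(yhat_draws)    # draws of P(Y = 1 | x), same shape
prob_mean  <- colMeans(prob_draws)  # posterior mean probability per x
prob_int   <- apply(prob_draws, 2, quantile, probs = c(.025, .975))

# Averaging after the transform differs from transforming the average.
rbind(mean_of_probs = prob_mean, prob_of_mean = plogis(colMeans(yhat_draws)))
```

Applying plogis to each draw before averaging gives the posterior mean of the probability; applying it to an already-averaged f(x) is generally a different quantity because the transform is nonlinear.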
lbart
set.seed(99)
n=5000
x = sort(-2+4*runif(n))
X=matrix(x,ncol=1)
f = function(x) {return((1/2)*x^3)}
FL = function(x) {return(exp(x)/(1+exp(x)))}
pv = FL(f(x))
y = rbinom(n,1,pv)
np=100
xp=-2+4*(1:np)/np
Xp=matrix(xp,ncol=1)
## parallel::mcparallel/mccollect do not exist on windows
## if(.Platform$OS.type=='unix') {
## ##test BART with token run to ensure installation works
## mf = mc.lbart(X, y, nskip=5, ndpost=5, mc.cores=1, seed=99)
## }
## Not run:
set.seed(99)
pf = lbart(X,y,Xp)
plot(f(Xp), pf$yhat.test.mean, xlim=c(-4, 4), ylim=c(-4, 4),
xlab='True f(x)', ylab='BART f(x)')
lines(c(-4, 4), c(-4, 4))
mf = mc.lbart(X,y,Xp, mc.cores=4, seed=99)
plot(f(Xp), mf$yhat.test.mean, xlim=c(-4, 4), ylim=c(-4, 4),
xlab='True f(x)', ylab='BART f(x)')
lines(c(-4, 4), c(-4, 4))
par(mfrow=c(2,2))
plot(range(xp),range(pf$yhat.test),xlab='x',ylab='f(x)',type='n')
lines(x,f(x),col='blue',lwd=2)
lines(xp,apply(pf$yhat.test,2,mean),col='red')
qpl = apply(pf$yhat.test,2,quantile,probs=c(.025,.975))
lines(xp,qpl[1,],col='green',lty=1)
lines(xp,qpl[2,],col='green',lty=1)
title(main='BART::lbart f(x) with 0.95 intervals')
plot(range(xp),range(mf$yhat.test),xlab='x',ylab='f(x)',type='n')
lines(x,f(x),col='blue',lwd=2)
lines(xp,apply(mf$yhat.test,2,mean),col='red')
qpl = apply(mf$yhat.test,2,quantile,probs=c(.025,.975))
lines(xp,qpl[1,],col='green',lty=1)
lines(xp,qpl[2,],col='green',lty=1)
title(main='BART::mc.lbart f(x) with 0.95 intervals')
plot(pf$yhat.test.mean,apply(mf$yhat.test,2,mean),xlab='BART::lbart',ylab='BART::mc.lbart')
abline(0,1,col='red')
title(main="BART::lbart f(x) vs. BART::mc.lbart f(x)")
## End(Not run)