mc.abc.vs: Variable selection with ABC Bayesian forest (using parallel computation)

View source: R/mc.abc.vs.R

mc.abc.vs    R Documentation

Variable selection with ABC Bayesian forest (using parallel computation)

Description

This function implements the variable selection approach proposed in Liu, Rockova and Wang (2021) with parallel computation. Rockova and van der Pas (2020) introduce a spike-and-forest prior which wraps the BART prior with a spike-and-slab prior on the model space. Because the marginal likelihood is intractable, Liu, Rockova and Wang (2021) propose an approximate Bayesian computation (ABC) sampling method based on data-splitting that samples from the model space with a higher ABC acceptance rate.
Unlike abc.vs(), which evaluates the ABC models sequentially, this function evaluates them in parallel.

Usage

mc.abc.vs(
  x,
  y,
  nabc = 1000,
  tolerance = 0.1,
  threshold = 0.25,
  beta.params = c(1, 1),
  split.ratio = 0.5,
  probit = FALSE,
  true.idx = NULL,
  sparse = FALSE,
  xinfo = matrix(0, 0, 0),
  numcut = 100L,
  usequants = FALSE,
  cont = FALSE,
  rm.const = TRUE,
  k = 2,
  power = 2,
  base = 0.95,
  split.prob = "polynomial",
  ntree = 10L,
  ndpost = 1,
  nskip = 200,
  keepevery = 1L,
  printevery = 100L,
  verbose = FALSE,
  mc.cores = 2L,
  nice = 19L,
  seed = 99L
)

Arguments

x

A matrix or a data frame of predictor values, with each row corresponding to an observation and each column corresponding to a predictor. If a predictor is a factor with q levels in a data frame, it is replaced with q dummy variables.

y

A vector of response (continuous or binary) values.

nabc

The number of ABC samples, i.e., the number of subsets sampled from the model space.

tolerance

A number between 0 and 1; the nabc subsets are ranked by MSE in ascending order if the response variable is continuous (or by mean log loss (MLL) if the response variable is binary), and the top tolerance*100% of the subsets are accepted by ABC for selection.

threshold

A number between 0 and 1; within the ABC accepted subsets, predictors with MPVIP exceeding threshold are selected.

beta.params

A vector of two positive numbers; the spike-and-slab prior on the model space is assumed to be a beta-binomial prior, i.e., θ ~ Beta(beta.params[1], beta.params[2]) and each predictor is included in a model independently with probability θ.
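
For intuition, a single draw from this prior can be sketched as follows, where x and beta.params are as above (a minimal illustration, not package code):

  theta <- rbeta(1, beta.params[1], beta.params[2])  # inclusion probability
  model <- rbinom(ncol(x), 1, theta)                 # 0/1 inclusion indicators for one subset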

split.ratio

A number between 0 and 1; the data set (x, y) is split into a training set and a testing set according to the split.ratio.

probit

A Boolean argument indicating whether the response variable is binary or continuous; probit=FALSE (by default) means that the response variable is continuous.

true.idx

(Optional) A vector of indices of the true relevant predictors; if true.idx is provided, metrics including precision, recall and F1 score are returned.

sparse

A Boolean argument indicating whether to perform DART or BART.

xinfo

A matrix of cut-points with each row corresponding to a predictor and each column corresponding to a cut-point. xinfo=matrix(0.0,0,0) indicates the cut-points are specified by BART.

numcut

The number of possible cut-points; if a single number is given, it is used for all predictors; otherwise, a vector of length ncol(x) is required, where the i-th element gives the number of cut-points for the i-th predictor in x. If usequants=FALSE, numcut equally spaced cut-points are used to cover the range of values in the corresponding column of x. If usequants=TRUE, then min(numcut, number of unique values in the corresponding column of x - 1) cut-point values are used.

usequants

A Boolean argument indicating how the cut-points in xinfo are generated; if usequants=TRUE, uniform quantiles are used for the cut-points; otherwise, equally spaced cut-points are used.

cont

A Boolean argument indicating whether to assume all predictors are continuous.

rm.const

A Boolean argument indicating whether to remove constant predictors.

k

The number of prior standard deviations that E(Y|x) = f(x) is away from +/- 0.5. The response (y) is internally scaled to the range from -0.5 to 0.5. The bigger k is, the more conservative the fit will be.

power

The power parameter of the polynomial splitting probability for the tree prior. Only used if split.prob="polynomial".

base

The base parameter of the polynomial splitting probability for the tree prior if split.prob="polynomial"; if split.prob="exponential", the probability of splitting a node at depth d is base^d.

split.prob

A string indicating which splitting probability is used for the tree prior. If split.prob="polynomial", the splitting probability in Chipman et al. (2010) is used; if split.prob="exponential", the splitting probability in Rockova and Saha (2019) is used.
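
For reference, the two forms of the node-splitting probability can be sketched as follows; the polynomial form base/(1 + d)^power is from Chipman et al. (2010), and the function name here is illustrative:

  ## Probability that a tree node at depth d splits.
  split_prob <- function(d, base = 0.95, power = 2, type = "polynomial") {
    if (type == "polynomial") base / (1 + d)^power  # Chipman et al. (2010)
    else base^d                                     # Rockova and Saha (2019)
  }
  round(split_prob(0:3), 3)  # depths 0-3: 0.950 0.238 0.106 0.059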

ntree

The number of trees in the ensemble.

ndpost

The number of posterior samples returned.

nskip

The number of posterior samples burned in.

keepevery

Every keepevery-th posterior sample is kept and returned to the user.

printevery

As the MCMC runs, a message is printed every printevery iterations.

verbose

A Boolean argument indicating whether any messages are printed out.

mc.cores

The number of cores to employ in parallel.

nice

Set the job niceness. The default niceness is 19; niceness ranges from 0 (highest priority) to 19 (lowest priority).

seed

Seed required for reproducible MCMC.

Details

At each iteration of the algorithm, the data set is randomly split into a training set and a testing set according to the split ratio. The algorithm proceeds by sampling a subset of predictors from the spike-and-slab prior on the model space, fitting a BART model on the training set using only the predictors in the subset, and computing the error on the testing set (MSE for a continuous response, or mean log loss for a binary response) based on a posterior sample from the fitted BART model. Only those subsets that result in a low testing-set error are kept for selection. ABC Bayesian forest selects predictors based on their marginal posterior variable inclusion probabilities (MPVIPs), which are estimated as the proportion of ABC-accepted BART posterior samples that use the predictor at least once. Given the MPVIPs, predictors with MPVIP exceeding a pre-specified threshold are selected.
See Liu, Rockova and Wang (2021) or Section 2.2.4 in Luo and Daniels (2021) for details.
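
To make the mechanics concrete, below is a minimal sketch of one ABC iteration for a continuous response. It uses wbart() from the BART package as a stand-in for the internal BART fit, and one_abc_iteration() is a hypothetical helper, not part of this package:

  library(BART)  # wbart() used here as a stand-in for the internal fit

  one_abc_iteration <- function(x, y, theta, split.ratio = 0.5) {
    n <- nrow(x)
    train <- sample(n, floor(split.ratio * n))           # random data split
    model <- rbinom(ncol(x), 1, theta)                   # subset from the prior
    if (sum(model) == 0) model[sample(ncol(x), 1)] <- 1  # avoid an empty subset
    fit <- wbart(x.train = x[train, model == 1, drop = FALSE],
                 y.train = y[train],
                 x.test  = x[-train, model == 1, drop = FALSE],
                 ntree = 10L, ndpost = 1L, nskip = 200L)
    mse <- mean((y[-train] - fit$yhat.test.mean)^2)      # testing-set error
    list(model = model, error = mse)
  }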

Value

The function mc.abc.vs() returns a list with the following components.

theta

The probability that a predictor is included in a model.

models

A matrix with nabc rows and ncol(x) columns; each row corresponds to an ABC model (or subset); the (i, j)-th element is 1 if the j-th predictor is included in the i-th ABC model, and 0 otherwise.

actual.models

A matrix with nabc rows and ncol(x) columns; each row corresponds to an ABC BART posterior sample; the (i, j)-th element is 1 if the j-th predictor is used as a split variable at least once in the BART posterior sample of the i-th ABC model, and 0 otherwise.

model.errors

The vector of MSEs (or MLLs if the response variable is binary) for the nabc ABC models.

idx

The vector of indices (in terms of the row numbers of models) of the ABC accepted models which are the top tolerance*100% of the nabc ABC models when ranked by MSE or MLL in ascending order.

top.models

A matrix with length(idx) rows and ncol(x) columns, representing the ABC accepted models; top.models=models[idx, ].

top.actual.models

A matrix with length(idx) rows and ncol(x) columns, representing the ABC accepted BART posterior samples; top.actual.models=actual.models[idx, ].

mip

The vector of marginal posterior variable inclusion probabilities.

best.model

The vector of predictors selected by ABC Bayesian forest.

precision

The precision score for the ABC Bayesian forest; only returned when true.idx is provided.

recall

The recall score for the ABC Bayesian forest; only returned when true.idx is provided.

f1

The F1 score for the ABC Bayesian forest; only returned when true.idx is provided.
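
In terms of these components, the selection rule can be paraphrased as follows (a sketch of how the returned values relate, not the package's internal code; the exact rounding of the acceptance count may differ):

  ## idx        <- order(model.errors)[1:ceiling(tolerance * nabc)]
  ## mip        <- colMeans(actual.models[idx, ])  # MPVIP per predictor
  ## best.model <- which(mip > threshold)          # selected predictors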

Author(s)

Chuji Luo: cjluo@ufl.edu and Michael J. Daniels: daniels@ufl.edu.

References

Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). "BART: Bayesian additive regression trees." Ann. Appl. Stat. 4 266–298.

Linero, A. R. (2018). "Bayesian regression trees for high-dimensional prediction and variable selection." J. Amer. Statist. Assoc. 113 626–636.

Liu, Y., Rockova, V. and Wang, Y. (2021). "Variable selection with ABC Bayesian forests." J. R. Stat. Soc. Ser. B. Stat. Methodol. 83 453–481.

Luo, C. and Daniels, M. J. (2021) "Variable Selection Using Bayesian Additive Regression Trees." arXiv preprint arXiv:2112.13998.

Rockova, V. and van der Pas, S. (2020). "Posterior concentration for Bayesian regression trees and forests." Ann. Statist. 48 2108–2131.

See Also

abc.vs.

Examples

## simulate data (Scenario C.M.1. in Luo and Daniels (2021))
set.seed(123)
data = mixone(100, 10, 1, FALSE)
## parallel::mcparallel/mccollect do not exist on windows
if(.Platform$OS.type=='unix') {
  ## test mc.abc.vs() function
  res = mc.abc.vs(data$X, data$Y, nabc=100, tolerance=0.1, threshold=0.25,
                  beta.params=c(1.0, 1.0), split.ratio=0.5, probit=FALSE,
                  true.idx=c(1,2,6:8), ntree=10, ndpost=1, nskip=200, mc.cores=2)
}
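
## Inspect the results (illustrative; res exists only after the chunk above runs)
## res$best.model  # indices of the selected predictors
## res$mip         # marginal posterior variable inclusion probabilities
## res$precision; res$recall; res$f1  # returned because true.idx was given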
