medianInclusion.vs: Variable selection with DART
In BartMixVs: Variable Selection Using Bayesian Additive Regression Trees

medianInclusion.vs

R Documentation

Variable selection with DART

Description

This function implements the variable selection approach proposed in Linero (2018). Linero (2018) proposes DART, a variant of BART that replaces the discrete uniform distribution for selecting a split variable with a categorical distribution whose event probabilities follow a Dirichlet distribution. DART estimates the marginal posterior variable inclusion probability (MPVIP) of a predictor by the proportion of posterior samples of the tree structures in which the predictor is used as a split variable at least once. It selects predictors with MPVIP of at least 0.5, yielding a median probability model.
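
The MPVIP calculation described above can be sketched in base R. The matrix `varcount` below is invented illustration data, not output from the package: each row stands for one posterior draw, and each entry counts how often a predictor is used as a split variable in that draw's trees.

```r
# Hypothetical split counts: 4 posterior draws (rows) x 3 predictors (columns).
# These numbers are invented for illustration only.
varcount <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    3, 1, 0,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("x1", "x2", "x3"))
)

# MPVIP: proportion of draws in which a predictor is used at least once.
mpvip <- colMeans(varcount > 0)

# Median probability model: keep predictors with MPVIP of at least 0.5.
selected <- names(mpvip)[mpvip >= 0.5]
```

In this toy matrix only `x1` is used in at least half of the draws, so it is the only predictor kept.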

Usage

medianInclusion.vs(
  x.train,
  y.train,
  probit = FALSE,
  vip.selection = TRUE,
  true.idx = NULL,
  plot = FALSE,
  num.var.plot = Inf,
  theta = 0,
  omega = 1,
  a = 0.5,
  b = 1,
  augment = FALSE,
  rho = NULL,
  xinfo = matrix(0, 0, 0),
  numcut = 100L,
  usequants = FALSE,
  cont = FALSE,
  rm.const = TRUE,
  power = 2,
  base = 0.95,
  split.prob = "polynomial",
  k = 2,
  ntree = 20L,
  ndpost = 1000L,
  nskip = 1000L,
  keepevery = 1L,
  printevery = 100L,
  verbose = FALSE
)

Arguments

`x.train`	A matrix or a data frame of predictor values with each row corresponding to an observation and each column corresponding to a predictor. If a predictor is a factor with q levels in a data frame, it is replaced with q dummy variables.
`y.train`	A vector of response (continuous or binary) values.
`probit`	A Boolean argument indicating whether the response variable is binary or continuous; `probit=FALSE` (by default) means that the response variable is continuous.
`vip.selection`	A Boolean argument indicating whether to select predictors using BART VIPs.
`true.idx`	(Optional) A vector of indices of the true relevant predictors; if provided, metrics including precision, recall and F1 score are returned.
`plot`	(Optional) A Boolean argument indicating whether plots are returned or not.
`num.var.plot`	The number of variables to be plotted.
`theta`	Set `theta` parameter; zero means random.
`omega`	Set `omega` parameter; zero means random.
`a`	A sparsity parameter of the Beta(a, b) hyper-prior where 0.5<=a<=1; a lower value induces more sparsity.
`b`	A sparsity parameter of the Beta(a, b) hyper-prior; typically, b=1.
`augment`	A Boolean argument indicating whether data augmentation is performed in the variable selection procedure of Linero (2018).
`rho`	A sparsity parameter; typically ρ = p where p is the number of predictors.
`xinfo`	A matrix of cut-points with each row corresponding to a predictor and each column corresponding to a cut-point. `xinfo=matrix(0.0,0,0)` indicates the cut-points are specified by BART.
`numcut`	The number of possible cut-points. If a single number is given, it is used for all predictors; otherwise a vector with length equal to `ncol(x.train)` is required, where the i-th element gives the number of cut-points for the i-th predictor in `x.train`. If `usequants=FALSE`, `numcut` equally spaced cut-points are used to cover the range of values in the corresponding column of `x.train`. If `usequants=TRUE`, then min(`numcut`, the number of unique values in the corresponding column of `x.train` - 1) cut-point values are used.
`usequants`	A Boolean argument indicating how the cut-points in `xinfo` are generated. If `usequants=TRUE`, uniform quantiles are used for the cut-points; otherwise, the cut-points are equally spaced.
`cont`	A Boolean argument indicating whether to assume all predictors are continuous.
`rm.const`	A Boolean argument indicating whether to remove constant predictors.
`power`	The power parameter of the polynomial splitting probability for the tree prior. Only used if `split.prob="polynomial"`.
`base`	The base parameter of the polynomial splitting probability for the tree prior if `split.prob="polynomial"`; if `split.prob="exponential"`, the probability of splitting a node at depth d is `base`^d.
`split.prob`	A string indicating what kind of splitting probability is used for the tree prior. If `split.prob="polynomial"`, the splitting probability in Chipman et al. (2010) is used; if `split.prob="exponential"`, the splitting probability in Rockova and Saha (2019) is used.
`k`	The number of prior standard deviations that E(Y|x) = f(x) is away from +/-.5. The response (`y.train`) is internally scaled to the range from -.5 to .5. The bigger `k` is, the more conservative the fitting will be.
`ntree`	The number of trees in the ensemble.
`ndpost`	The number of posterior samples returned.
`nskip`	The number of posterior samples discarded as burn-in.
`keepevery`	Every `keepevery`-th posterior sample is kept and returned to the user.
`printevery`	As the MCMC runs, a message is printed every `printevery` iterations.
`verbose`	A Boolean argument indicating whether any messages are printed out.
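
The two tree-prior options controlled by `split.prob`, `base` and `power` can be compared numerically. The following is a minimal sketch, not package code: `split_prob()` is a hypothetical helper, the polynomial form `base * (1 + d)^(-power)` is the one from Chipman et al. (2010), and the exponential form `base^d` is the one stated in the `base` argument above. Counting depth from 0 at the root follows Chipman et al. (2010); whether the exponential prior uses the same indexing is an assumption here.

```r
# Probability that a node at depth d is split.
# split_prob() is a hypothetical helper, not a BartMixVs function.
# Assumption: the root node has depth d = 0.
split_prob <- function(d, base = 0.95, power = 2,
                       split.prob = c("polynomial", "exponential")) {
  split.prob <- match.arg(split.prob)
  if (split.prob == "polynomial") {
    base * (1 + d)^(-power)  # polynomial decay: base / (1 + d)^power
  } else {
    base^d                   # exponential decay in the depth
  }
}

depths <- 0:3
poly <- split_prob(depths, base = 0.95, power = 2)
expo <- split_prob(depths, base = 0.5, split.prob = "exponential")
```

Printing `poly` and `expo` side by side shows how quickly each prior discourages deep trees for a given choice of `base` and `power`.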

Details

See Linero (2018) or Section 2.2.3 in Luo and Daniels (2021) for details.
If `vip.selection=TRUE`, this function also performs variable selection by selecting variables whose BART VIP exceeds `1/ncol(x.train)`.
If `true.idx` is provided, the precision, recall and F1 scores are returned.
If `plot=TRUE`, plots showing which predictors are selected are generated.
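
The VIP threshold rule and the three metrics mentioned above can be illustrated with a small base R sketch. The VIP values and the set of truly relevant predictors are invented, and the metrics use the standard definitions of precision, recall and F1, which are assumed (not confirmed from the package source) to match what the function reports.

```r
# Hypothetical BART VIPs for 5 predictors (invented numbers that sum to 1).
vip <- c(x1 = 0.40, x2 = 0.30, x3 = 0.10, x4 = 0.15, x5 = 0.05)

# Select predictors whose VIP exceeds 1 / (number of predictors).
selected.idx <- unname(which(vip > 1 / length(vip)))

# Compare with an assumed set of truly relevant predictors.
true.idx <- c(1, 2, 3)
tp <- length(intersect(selected.idx, true.idx))  # true positives

precision <- tp / length(selected.idx)  # share of selected that are relevant
recall <- tp / length(true.idx)         # share of relevant that are selected
f1 <- 2 * precision * recall / (precision + recall)
```

The sketch does not guard against an empty selection, in which case precision is undefined.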

Value

If `plot=TRUE`, the function `medianInclusion.vs()` produces two plots (or one if `vip.selection=FALSE`). It returns a list with the following components.

`dart.pvip`	The vector of DART MPVIPs.
`dart.pvip.imp.names`	The vector of column names of the predictors with DART MPVIP at least 0.5.
`dart.pvip.imp.cols`	The vector of column indices of the predictors with DART MPVIP at least 0.5.
`dart.precision`	The precision score for the DART approach; only returned if `true.idx` is provided.
`dart.recall`	The recall score for the DART approach; only returned if `true.idx` is provided.
`dart.f1`	The F1 score for the DART approach; only returned if `true.idx` is provided.
`bart.vip`	The vector of BART VIPs; only returned if `vip.selection=TRUE`.
`bart.vip.imp.names`	The vector of column names of the predictors with BART VIP exceeding `1/ncol(x.train)`; only returned if `vip.selection=TRUE`.
`bart.vip.imp.cols`	The vector of column indices of the predictors with BART VIP exceeding `1/ncol(x.train)`; only returned if `vip.selection=TRUE`.
`bart.precision`	The precision score for the BART approach; only returned if `vip.selection=TRUE` and `true.idx` is provided.
`bart.recall`	The recall score for the BART approach; only returned if `vip.selection=TRUE` and `true.idx` is provided.
`bart.f1`	The F1 score for the BART approach; only returned if `vip.selection=TRUE` and `true.idx` is provided.

Author(s)

Chuji Luo: cjluo@ufl.edu and Michael J. Daniels: daniels@ufl.edu.

References

Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). "BART: Bayesian additive regression trees." Ann. Appl. Stat. 4 266–298.

Linero, A. R. (2018). "Bayesian regression trees for high-dimensional prediction and variable selection." J. Amer. Statist. Assoc. 113 626–636.

Luo, C. and Daniels, M. J. (2021). "Variable Selection Using Bayesian Additive Regression Trees." arXiv preprint arXiv:2112.13998.

Rockova, V. and Saha, E. (2019). "On theory for BART." In The 22nd International Conference on Artificial Intelligence and Statistics 2839–2848. PMLR.

Examples

## load the package that provides mixone() and medianInclusion.vs()
library(BartMixVs)
## simulate data (Scenario C.M.1. in Luo and Daniels (2021))
set.seed(123)
data = mixone(100, 10, 1, FALSE)
## test medianInclusion.vs() function
res = medianInclusion.vs(data$X, data$Y, probit=FALSE, vip.selection=TRUE,
                         true.idx=c(1, 2, 6:8), plot=FALSE, ntree=10,
                         ndpost=100, nskip=100, verbose=FALSE)

BartMixVs documentation built on May 5, 2022, 9:05 a.m.