Distributed gradient boosting based on the mboost package.

Description

The parboost package implements distributed gradient boosting based on the mboost package. When should you use parboost instead of mboost? There are two use cases:

1. The data takes too long to fit as a whole.
2. You want to bag and postprocess your boosting models to get a more robust ensemble.

parboost is designed to scale up component-wise functional gradient boosting in a distributed memory environment by splitting the observations into disjoint subsets. Alternatively, parboost can generate and use bootstrap samples of the original data. Each cluster node then fits a boosting model to its subset of the data. These boosting models are combined in an ensemble, either with equal weights, or by fitting a (penalized) regression model on the predictions of the individual models on the complete data. All other functionality of mboost is left untouched for the moment.

Gaussian, Binomial and Poisson families are currently supported.

Usage

parboost(cluster_object = NULL, mc.cores = NULL, data = NULL,
  path_to_data = "", data_import_function = NULL,
  split_data = c("disjoint", "bagging"), nsplits, preprocessing = NULL,
  seed = NULL, formula, baselearner = c("bbs", "bols", "btree", "bss",
  "bns"), family = c("gaussian", "binomial", "poisson"),
  control = boost_control(), tree_controls = NULL, cv = TRUE,
  cores_cv = detectCores(), folds = 8, stepsize_mstop = 1,
  postprocessing = c("none", "glm", "lasso", "ridge", "elasticnet"))

Arguments

cluster_object

Cluster object from the parallel package to carry out distributed computations.

mc.cores

If not NULL, parboost uses mclapply for shared memory parallelism.

data

A data frame containing the variables in the model. For I/O efficiency, it is recommended to use path_to_data instead. Defaults to NULL.

path_to_data

A string pointing to the location of the data. parboost assumes that the data is located at the same location on every cluster node. This parameter is ignored if you pass a data frame to the data argument.

data_import_function

Function used to import data. Defaults to read.csv. This parameter is ignored if you pass a data frame to the data argument.

split_data

String determining how the data should be split. "disjoint" splits the data into disjoint subsets. "bagging" draws bootstrap samples instead.

nsplits

Integer determining the number of disjoint sets the data should be split into. If split_data is set to "bagging", nsplits determines the number of bootstrap samples.

preprocessing

Optional preprocessing function to apply to the data. This is useful if you cannot modify the data on the cluster nodes.

seed

Integer determining the random seed value for reproducible results.

formula

Formula to be passed to mboost.

baselearner

Character string to determine the type of baselearner to be used for boosting. See mboost for details.

family

A string determining the family. Currently gaussian, binomial and poisson are implemented.

control

An object of type boost_control for controlling mboost. See boost_control in the mboost documentation for details.

tree_controls

Optional object of type TreeControl. See ctree_control in the party documentation for details. Used to set hyperparameters for tree base learners.

cv

Logical to activate cross-validation to determine m_{stop}. Defaults to TRUE.

cores_cv

Integer determining the number of CPU cores used for cross-validation on each node (or locally). Defaults to maximum available using detectCores.

folds

Integer determining the number of folds used during cross-validation on each cluster node. Defaults to 8. It is computationally more efficient to set folds to a multiple of the number of cores on each cluster node.

stepsize_mstop

Integer denoting the stepsize used during cross-validation for tuning the value of m_{stop}.

postprocessing

String to set the type of postprocessing. Defaults to "none" for a simple average of the ensemble components.
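The two settings of split_data can be pictured with plain index vectors. The following is a minimal base R sketch, not parboost's internal code: the sample size n and the object names disjoint_idx and bagging_idx are made up for illustration, and set.seed() only plays the role of the seed argument.

```r
# Illustrative only: base R index generation, not parboost's implementation.
set.seed(42)      # stands in for the seed argument
n <- 100          # number of observations (made up)
nsplits <- 4

# "disjoint": shuffle the row indices and deal them into nsplits groups
shuffled <- sample(n)
disjoint_idx <- split(shuffled, rep_len(seq_len(nsplits), n))

# "bagging": draw nsplits bootstrap samples of size n with replacement
bagging_idx <- lapply(seq_len(nsplits),
                      function(i) sample(n, n, replace = TRUE))

# Every row lands in exactly one disjoint subset
stopifnot(identical(sort(unlist(disjoint_idx, use.names = FALSE)),
                    seq_len(n)))

# Subset sizes: n / nsplits rows each for "disjoint", n rows each for "bagging"
lengths(disjoint_idx)
lengths(bagging_idx)
```

With disjoint splitting each node works on roughly n / nsplits rows, which is what reduces fitting time. With bagging every node still sees n rows, so the benefit is a more robust ensemble rather than smaller subproblems.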

Details

Generally gradient boosting offers more flexibility and better predictive performance than random forests, but is usually not used for large data sets because of its iterative nature. parboost is designed to scale up component-wise functional gradient boosting in a distributed memory environment by splitting the observations into disjoint subsets, or alternatively by bootstrapping the original data. Each cluster node then fits a boosting model to its subset of the data. These boosting models are combined in an ensemble, either with equal weights, or by fitting a (penalized) regression model on the predictions of the individual models on the complete data. The motivation behind parboost is to offer a boosting framework that is as easy to parallelize and thus scalable as random forests.

If you want to modify the boosting parameters, please take a look at the mboost package documentation and pass the appropriate parameters to the tree_controls and control arguments (the latter via boost_control).
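The ensemble step described above can be sketched in base R. This is an illustration under stated assumptions, not parboost's implementation: loess() stands in for the boosting model fitted on each node, the simulated data and all object names are hypothetical, and only the equal-weight and unpenalized regression variants are shown.

```r
# Illustrative only: loess() is a stand-in for the per-node boosting model.
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.2)

# Fit one submodel per disjoint subset of the observations
nsplits <- 4
subsets <- split(dat, rep_len(seq_len(nsplits), n))
submodels <- lapply(subsets, function(d) {
  # surface = "direct" lets the fit predict outside the subset's x range
  loess(y ~ x, data = d, control = loess.control(surface = "direct"))
})

# Predictions of every submodel on the complete data (n x nsplits matrix)
preds <- sapply(submodels, predict, newdata = dat)

# Equal weights: a simple average of the submodel predictions
pred_average <- rowMeans(preds)

# Regression postprocessing: regress the response on the submodel predictions
stack_dat <- data.frame(y = dat$y, preds)
stack_fit <- glm(y ~ ., data = stack_dat, family = gaussian())
pred_stacked <- fitted(stack_fit)

# In-sample mean squared error of both combinations
c(average = mean((dat$y - pred_average)^2),
  stacked = mean((dat$y - pred_stacked)^2))
```

Because the equal-weight average is one particular linear combination of the submodel predictions, the least-squares fit cannot do worse in-sample. The submodel predictions are strongly correlated, which is one reason a penalized regression can be preferable for this step.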

Value

An object of type parboost with print, summary, predict, coef and selected methods.

Author(s)

Ronert Obst

References

Peter Buehlmann and Bin Yu (2003), Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98, 324–339.

Peter Buehlmann and Torsten Hothorn (2007), Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

Torsten Hothorn, Peter Buehlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner (2010), Model-based Boosting 2.0. Journal of Machine Learning Research, 11, 2109–2113.

Yoav Freund and Robert E. Schapire (1996), Experiments with a new boosting algorithm. In Machine Learning: Proc. Thirteenth International Conference, 148–156.

Jerome H. Friedman (2001), Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.

Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov and Matthias Schmid (2012). Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost. Department of Statistics, Technical Report No. 120.
http://epub.ub.uni-muenchen.de/12754/

T. Hothorn, P. Buehlmann, T. Kneib, M. Schmid, and B. Hofner (2013). mboost: Model-Based Boosting, R package version 2.2-3, http://CRAN.R-project.org/package=mboost.

Examples

## Run parboost on a cluster (example not run)
# library(parboost)
# library(parallel)
# data(friedman2)
# cl <- makeCluster(2)
# parboost_model <- parboost(cluster_object = cl, data = friedman2,
#                            nsplits = 2, formula = y ~ .,
#                            baselearner="bbs", postprocessing = "glm",
#                            control = boost_control(mstop=10))
# stopCluster(cl)
# print(parboost_model)
# summary(parboost_model)
# head(predict(parboost_model))
#
# ## Run parboost serially for testing/debugging purposes
# parboost_model <- parboost(data = friedman2, nsplits = 2,
#                            formula = y ~ ., baselearner = "bbs",
#                            postprocessing = "glm",
#                            control = boost_control(mstop = 10))
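
The path_to_data, data_import_function and preprocessing arguments can be tried out locally before involving a cluster. The sketch below rests on assumptions the text above does not spell out: that the import function is called with the path as its only argument, and that the preprocessing function receives the imported data frame and returns a data frame. The helper names import_fun and prep_fun, the column names and the path "/shared/data.csv" are hypothetical.

```r
# Hypothetical helpers; the calling conventions are assumptions (see above).
import_fun <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

prep_fun <- function(d) {
  d <- d[complete.cases(d), , drop = FALSE]   # drop incomplete rows
  d$x1 <- as.numeric(scale(d$x1))             # centre and scale a covariate
  d
}

# Check the helpers on a small temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(y = c(1, 2, NA, 4), x1 = c(10, 20, 30, 40)),
          tmp, row.names = FALSE)
checked <- prep_fun(import_fun(tmp))
str(checked)

# The corresponding call might then look like this (not run):
# parboost(cluster_object = cl, path_to_data = "/shared/data.csv",
#          data_import_function = import_fun, preprocessing = prep_fun,
#          nsplits = 2, formula = y ~ ., baselearner = "bols")
```

Testing the helpers on a small file first makes it easier to separate data handling problems from problems with the cluster setup.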