Distributed gradient boosting based on the mboost package.
Description
The parboost package implements distributed gradient boosting based on the mboost package. When should you use parboost instead of mboost? There are two use cases:
1. The data takes too long to fit as a whole.
2. You want to bag and postprocess your boosting models to get a more robust ensemble.
parboost is designed to scale up componentwise functional gradient boosting in a distributed memory environment by splitting the observations into disjoint subsets. Alternatively, parboost can generate and use bootstrap samples of the original data. Each cluster node then fits a boosting model to its subset of the data. These boosting models are combined in an ensemble, either with equal weights, or by fitting a (penalized) regression model on the predictions of the individual models on the complete data. All other functionality of mboost is left untouched for the moment.
Gaussian, Binomial and Poisson families are currently supported.
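The two ways of dividing the observations described above can be sketched in base R. This is only an illustration of the idea, not parboost's internal code; the variable names (`disjoint_idx`, `bagging_idx`) and the toy sizes are assumptions made for the example.

```r
# Illustrative sketch only: base R, not parboost's internal code.
set.seed(1)      # for reproducible splits
n <- 10          # toy number of observations
nsplits <- 3     # number of subsets (one per cluster node)

# "disjoint": shuffle the row indices, then deal them out to nsplits groups.
shuffled <- sample(n)
disjoint_idx <- split(shuffled, rep_len(seq_len(nsplits), n))

# "bagging": each subset is a bootstrap sample (drawn with replacement)
# of the full data, so subsets may overlap and may repeat rows.
bagging_idx <- lapply(seq_len(nsplits), function(i) sample(n, replace = TRUE))

# Every row appears exactly once across the disjoint subsets.
stopifnot(identical(sort(unlist(disjoint_idx, use.names = FALSE)), seq_len(n)))
```

With disjoint splitting each node sees roughly n / nsplits rows, while with bagging each node sees a resampled data set of the full size.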
Usage
parboost(cluster_object = NULL, mc.cores = NULL, data = NULL,
         path_to_data = "", data_import_function = NULL,
         split_data = c("disjoint", "bagging"), nsplits, preprocessing = NULL,
         seed = NULL, formula, baselearner = c("bbs", "bols", "btree", "bss",
         "bns"), family = c("gaussian", "binomial", "poisson"),
         control = boost_control(), tree_controls = NULL, cv = TRUE,
         cores_cv = detectCores(), folds = 8, stepsize_mstop = 1,
         postprocessing = c("none", "glm", "lasso", "ridge", "elasticnet"))

Arguments
cluster_object 
Cluster object from the parallel package to carry out distributed computations. 
mc.cores 
If not 
data 
A data frame containing the variables in the model. It is recommended to use path_to_data instead for IO efficiency. Defaults to NULL 
path_to_data 
A string pointing to the location of the data. 
data_import_function 
Function used to import data. Defaults
to 
split_data 
String determining the way the data should be split.
nsplits 
Integer determining the number of disjoint sets the
data should be split into. If 
preprocessing 
Optional preprocessing function to apply to the data. This is useful if you cannot modify the data on the cluster nodes. 
seed 
Integer determining the random seed value for reproducible results. 
formula 
Formula to be passed to mboost. 
baselearner 
Character string to determine the type of
baselearner to be used for boosting.
See 
family 
A string determining the family. Currently gaussian, binomial and poisson are implemented. 
control 
An object of type 
tree_controls 
Optional object of type 
cv 
Logical to activate cross-validation to determine m_{stop}. Defaults to TRUE.
cores_cv 
Integer determining the number of CPU cores used for cross-validation on each node (or locally). Defaults to the maximum available using detectCores().
folds 
Integer determining the number of folds used during cross-validation on each cluster node. Defaults to 8. It is computationally more efficient to set the number of folds to a multiple of the number of cores on each cluster node.
stepsize_mstop 
Integer denoting the step size used during cross-validation for tuning the value of m_{stop}.
postprocessing 
String to set the type of postprocessing. Defaults to "none".
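The advice on choosing folds relative to the number of cores can be illustrated with a little arithmetic. The sketch below assumes that each fold is one task and that each core runs one task at a time; the helper names `cv_rounds` and `idle_slots` are hypothetical, and parboost's actual scheduling may differ. The last lines show one plausible reading of stepsize_mstop as a thinning of the candidate values for m_{stop}, which is also an assumption.

```r
# Illustrative arithmetic only; parboost's actual scheduling may differ.
# Assumption: each fold is one task, and each core runs one task at a time.
cv_rounds  <- function(folds, cores) ceiling(folds / cores)
idle_slots <- function(folds, cores) cv_rounds(folds, cores) * cores - folds

cv_rounds(8, 4)    # 2 rounds, every core busy in both
idle_slots(8, 4)   # 0
cv_rounds(8, 3)    # 3 rounds
idle_slots(8, 3)   # 1 core sits idle in the last round

# Hypothetical candidate grid for m_stop: with a step size of 5 and a
# maximum of 100 iterations, only every 5th iteration is evaluated.
mstop_grid <- seq(from = 5, to = 100, by = 5)
length(mstop_grid) # 20 candidates instead of 100
```

When folds is a multiple of the core count, no core waits for the others in the final round, which is the efficiency gain the folds argument refers to.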
Details
Generally gradient boosting offers more flexibility and better predictive performance than random forests, but is usually not used for large data sets because of its iterative nature. parboost is designed to scale up componentwise functional gradient boosting in a distributed memory environment by splitting the observations into disjoint subsets, or alternatively by bootstrapping the original data. Each cluster node then fits a boosting model to its subset of the data. These boosting models are combined in an ensemble, either with equal weights, or by fitting a (penalized) regression model on the predictions of the individual models on the complete data. The motivation behind parboost is to offer a boosting framework that is as easy to parallelize and thus scalable as random forests.
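The two ways of combining the submodels can be sketched with base R alone. In this illustration lm() stands in for the boosting model that each node would fit, lapply() stands in for the distributed step, and the simulated data and all variable names are assumptions; it is not parboost's implementation.

```r
# Illustrative sketch only: lm() stands in for the boosting model that
# each cluster node would fit; this is not parboost's implementation.
set.seed(42)
n <- 300
p <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
dat <- data.frame(X)
dat$y <- drop(X %*% c(2, -1, 0.5, 0, 0)) + rnorm(n)

# Split the rows into disjoint subsets, one per (hypothetical) node.
nsplits <- 3
idx <- split(sample(n), rep_len(seq_len(nsplits), n))

# Fit one model per subset. On a real cluster, parallel::parLapply()
# could take the place of lapply() here.
models <- lapply(idx, function(i) lm(y ~ ., data = dat[i, ]))

# Predictions of every submodel on the complete data (n x nsplits matrix).
preds <- sapply(models, predict, newdata = dat)
colnames(preds) <- paste0("model", seq_len(nsplits))

# Option 1: equal weights, i.e. a plain average of the submodel predictions.
ensemble_equal <- rowMeans(preds)

# Option 2: learn the weights by regressing y on the submodel predictions.
stack_df <- data.frame(y = dat$y, preds)
stack_fit <- glm(y ~ ., data = stack_df, family = gaussian())
ensemble_glm <- unname(fitted(stack_fit))

# In-sample mean squared error of both ensembles.
mse <- function(pred) mean((dat$y - pred)^2)
c(equal = mse(ensemble_equal), glm = mse(ensemble_glm))
```

Because the regression weights are estimated on the same data they are evaluated on, the second ensemble cannot have a larger in-sample error than the equally weighted one; whether it generalizes better depends on the data, which is where a penalized fit may help.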
If you want to modify the boosting parameters, please take a look at the mboost package documentation and pass the appropriate parameters to tree_control and boost_control.
Value
An object of type parboost with print, summary, predict, coef and selected methods.
Author(s)
Ronert Obst
References
Peter Buehlmann and Bin Yu (2003), Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98, 324–339.
Peter Buehlmann and Torsten Hothorn (2007), Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.
Torsten Hothorn, Peter Buehlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner (2010), Model-based Boosting 2.0. Journal of Machine Learning Research, 11, 2109–2113.
Yoav Freund and Robert E. Schapire (1996), Experiments with a new boosting algorithm. In Machine Learning: Proc. Thirteenth International Conference, 148–156.
Jerome H. Friedman (2001), Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.
Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov and Matthias Schmid (2012), Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost. Department of Statistics, Technical Report No. 120. http://epub.ub.uni-muenchen.de/12754/
T. Hothorn, P. Buehlmann, T. Kneib, M. Schmid, and B. Hofner (2013), mboost: Model-Based Boosting. R package version 2.2-3, http://CRAN.R-project.org/package=mboost.
Examples
## Run parboost on a cluster (example not run)
# library(parboost)
# library(parallel)
# data(friedman2)
# cl <- makeCluster(2)
# parboost_model <- parboost(cluster_object = cl, data = friedman2,
#                            nsplits = 2, formula = y ~ .,
#                            baselearner = "bbs", postprocessing = "glm",
#                            control = boost_control(mstop = 10))
# stopCluster(cl)
# print(parboost_model)
# summary(parboost_model)
# head(predict(parboost_model))
#
# ## Run parboost serially for testing/debugging purposes
# parboost_model <- parboost(data = friedman2, nsplits = 2,
#                            formula = y ~ ., baselearner = "bbs",
#                            postprocessing = "glm",
#                            control = boost_control(mstop = 10))
