The parboost package implements distributed gradient boosting based on the mboost package. When should you use parboost instead of mboost? There are two use cases:

1. The data takes too long to fit as a whole
2. You want to bag and postprocess your boosting models to get a more robust ensemble

parboost is designed to scale up componentwise functional gradient boosting in a distributed memory environment by splitting the observations into disjoint subsets. Alternatively, parboost can generate and use bootstrap samples of the original data. Each cluster node then fits a boosting model to its subset of the data. These boosting models are combined into an ensemble, either with equal weights or by fitting a (penalized) regression model on the predictions of the individual models on the complete data. All other functionality of mboost is left untouched for the moment.
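The two ways the node-level models are combined can be sketched in a few lines of base R. This is an illustrative sketch only, not parboost's implementation; the simulated response and the `preds` matrix stand in for predictions from the fitted submodels:

```r
## Sketch of the two ensemble-combination strategies described above.
## `y` and `preds` are simulated placeholders for the real response and
## the submodel predictions on the complete data.
set.seed(1)
y <- rnorm(100)
preds <- replicate(3, y + rnorm(100, sd = 0.5))  # 100 x 3 prediction matrix

## Equal weights: a simple average of the submodel predictions
pred_equal <- rowMeans(preds)

## "glm"-style postprocessing: learn combination weights by regressing
## the response on the submodel predictions
fit <- lm(y ~ preds)
pred_glm <- fitted(fit)
```

The penalized variants ("lasso", "ridge", "elasticnet") follow the same idea, but with a penalty on the combination weights.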
Distributed gradient boosting based on the mboost package. Gaussian, Binomial and Poisson families are currently supported.
parboost(cluster_object = NULL, mc.cores = NULL, data = NULL,
  path_to_data = "", data_import_function = NULL,
  split_data = c("disjoint", "bagging"), nsplits, preprocessing = NULL,
  seed = NULL, formula, baselearner = c("bbs", "bols", "btree", "bss",
  "bns"), family = c("gaussian", "binomial", "poisson"),
  control = boost_control(), tree_controls = NULL, cv = TRUE,
  cores_cv = detectCores(), folds = 8, stepsize_mstop = 1,
  postprocessing = c("none", "glm", "lasso", "ridge", "elasticnet"))

cluster_object 
Cluster object from the parallel package to carry out distributed computations. 
mc.cores 
If not NULL, the models are fit in parallel on the local machine using mc.cores cores instead of on a cluster. Defaults to NULL. 
data 
A data frame containing the variables in the model. It is recommended to use path_to_data instead for IO efficiency. Defaults to NULL 
path_to_data 
A string pointing to the location of the data. 
data_import_function 
Function used to import the data from path_to_data. Defaults to NULL. 
split_data 
String determining how the data should be split: into "disjoint" subsets or "bagging" (bootstrap) samples. 
nsplits 
Integer determining the number of disjoint sets the data should be split into. If split_data is set to "bagging", the number of bootstrap samples to draw. 
preprocessing 
Optional preprocessing function to apply to the data. This is useful if you cannot modify the data on the cluster nodes. 
seed 
Integer determining the random seed value for reproducible results. 
formula 
Formula to be passed to mboost. 
baselearner 
Character string determining the type of baselearner to use for boosting. See the mboost documentation for details. 
family 
A string determining the family. Currently gaussian, binomial and poisson are implemented. 
control 
An object of type boost_control to tune the boosting algorithm; see boost_control in the mboost documentation. 
tree_controls 
Optional control object for tree baselearners ("btree"); see the mboost documentation. Defaults to NULL. 
cv 
Logical to activate cross-validation to determine mstop. Defaults to TRUE. 
cores_cv 
Integer determining the number of CPU cores used for cross-validation on each node (or locally). Defaults to the maximum available as determined by detectCores(). 
folds 
Integer determining the number of folds used during cross-validation on each cluster node. Defaults to 8. It is computationally more efficient to set the value of folds to a multiple of the number of cores on each cluster node. 
stepsize_mstop 
Integer denoting the stepsize used during cross-validation for tuning the value of mstop. Defaults to 1. 
postprocessing 
String to set the type of postprocessing. Defaults to "none". 
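The difference between the two split_data modes can be illustrated with base R indexing. The variables below are hypothetical, chosen purely to show how row indices are assigned under each mode:

```r
## Sketch of the two split_data modes (not parboost internals).
n <- 10        # number of observations
nsplits <- 2   # number of subsets / bootstrap samples

## "disjoint": partition the shuffled row indices into nsplits
## non-overlapping sets, one per cluster node
disjoint <- split(sample(seq_len(n)), rep(seq_len(nsplits), length.out = n))

## "bagging": draw nsplits bootstrap samples with replacement,
## each the size of the full data
bagging <- replicate(nsplits, sample(seq_len(n), n, replace = TRUE),
                     simplify = FALSE)
```

Under "disjoint" every row is seen by exactly one node; under "bagging" each node sees a resampled copy of the full data, so rows may repeat across nodes.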
Generally gradient boosting offers more flexibility and better predictive performance than random forests, but is usually not used for large data sets because of its iterative nature. parboost is designed to scale up componentwise functional gradient boosting in a distributed memory environment by splitting the observations into disjoint subsets, or alternatively by bootstrapping the original data. Each cluster node then fits a boosting model to its subset of the data. These boosting models are combined in an ensemble, either with equal weights, or by fitting a (penalized) regression model on the predictions of the individual models on the complete data. The motivation behind parboost is to offer a boosting framework that is as easy to parallelize, and thus as scalable, as random forests.
If you want to modify the boosting parameters, please take a look at the mboost package documentation and pass the appropriate parameters to tree_controls and boost_control.
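As a hedged sketch of that pattern (the parameter values below are illustrative, not recommendations), the control objects are built with mboost's boost_control and, for tree baselearners, party's ctree_control, then passed straight through to parboost:

```r
library(mboost)  # provides boost_control()
library(party)   # provides ctree_control(), used with "btree" baselearners

## Illustrative settings: 200 boosting iterations, step size 0.1,
## and shallow trees when baselearner = "btree"
ctrl  <- boost_control(mstop = 200, nu = 0.1)
tctrl <- ctree_control(maxdepth = 4)

## parboost_model <- parboost(..., control = ctrl, tree_controls = tctrl)
```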
An object of type parboost with print, summary, predict, coef and selected methods.
Ronert Obst
Peter Buehlmann and Bin Yu (2003), Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98, 324–339.
Peter Buehlmann and Torsten Hothorn (2007), Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.
Torsten Hothorn, Peter Buehlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner (2010), Model-based Boosting 2.0. Journal of Machine Learning Research, 11, 2109–2113.
Yoav Freund and Robert E. Schapire (1996), Experiments with a new boosting algorithm. In Machine Learning: Proc. Thirteenth International Conference, 148–156.
Jerome H. Friedman (2001), Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.
Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov and Matthias Schmid (2012). Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost. Department of Statistics, Technical Report No. 120. http://epub.ub.uni-muenchen.de/12754/
T. Hothorn, P. Buehlmann, T. Kneib, M. Schmid, and B. Hofner (2013). mboost: Model-Based Boosting, R package version 2.2-3, http://CRAN.R-project.org/package=mboost.
## Run parboost on a cluster (example not run)
# data(friedman2)
# library(parallel)
# cl <- makeCluster(2)
# parboost_model <- parboost(cluster_object = cl, data = friedman2,
#                            nsplits = 2, formula = y ~ .,
#                            baselearner = "bbs", postprocessing = "glm",
#                            control = boost_control(mstop = 10))
# stopCluster(cl)
# print(parboost_model)
# summary(parboost_model)
# head(predict(parboost_model))
#
# ## Run parboost serially for testing/debugging purposes
# parboost_model <- parboost(data = friedman2, nsplits = 2,
#                            formula = y ~ ., baselearner = "bbs",
#                            postprocessing = "glm",
#                            control = boost_control(mstop = 10))
