Bagging Classification, Regression and Survival Trees
Description
Bagging for classification, regression and survival trees.
Usage
## S3 method for class 'factor'
ipredbagg(y, X=NULL, nbagg=25, control=
rpart.control(minsplit=2, cp=0, xval=0),
comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
## S3 method for class 'numeric'
ipredbagg(y, X=NULL, nbagg=25, control=rpart.control(xval=0),
comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
## S3 method for class 'Surv'
ipredbagg(y, X=NULL, nbagg=25, control=rpart.control(xval=0),
comb=NULL, coob=FALSE, ns=dim(y)[1], keepX = TRUE, ...)
## S3 method for class 'data.frame'
bagging(formula, data, subset, na.action=na.rpart, ...)

Arguments
y 
the response variable: either a factor vector of class labels
(bagging classification trees), a vector of numerical values
(bagging regression trees) or an object of class Surv
(bagging survival trees).

X 
a data frame of predictor variables. 
nbagg 
an integer giving the number of bootstrap replications. 
coob 
a logical indicating whether an out-of-bag estimate of the
error rate (misclassification error, root mean squared error
or Brier score) should be computed.
See 
control 
options that control details of the 
comb 
a list of additional models for model combination, see below
for some examples. Note that argument 
ns 
number of samples to draw from the learning sample. By default,
the usual bootstrap (n out of n with replacement) is performed.
If 
keepX 
a logical indicating whether the data frame of predictors
should be returned. Note that the computation of the
out-of-bag estimator requires 
formula 
a formula of the form 
data 
optional data frame containing the variables in the model formula. 
subset 
optional vector specifying a subset of observations to be used. 
na.action 
function which indicates what should happen when
the data contain 
... 
additional parameters passed to 
Details
The random forest implementations randomForest and cforest are more flexible and reliable for computing bootstrap-aggregated trees than this function and should be used instead.
Bagging for classification and regression trees was suggested by Breiman (1996a, 1998) in order to stabilise trees.
The trees in this function are computed using the implementation in the rpart package. The generic function ipredbagg implements methods for different responses. If y is a factor, classification trees are constructed. For numerical vectors y, regression trees are aggregated, and if y is a survival object, bagging survival trees (Hothorn et al., 2004) is performed. The function bagging offers a formula-based interface to ipredbagg.
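As a rough illustration of the dispatch described above, the following sketch calls ipredbagg directly on a factor response and on a numeric response. The built-in iris data, the choice of predictors, the seed and nbagg = 10 are assumptions made for this example only.

```r
library("ipred")

set.seed(290875)  # arbitrary seed, only to make the sketch reproducible

# Factor response: the 'factor' method bags classification trees
mod_class <- ipredbagg(iris$Species, X = iris[, 1:4], nbagg = 10)

# Numeric response: the 'numeric' method bags regression trees
mod_reg <- ipredbagg(iris$Sepal.Length, X = iris[, 2:4], nbagg = 10)

# The class of the result follows the class of the response
class(mod_class)
class(mod_reg)
```

The formula interface, bagging(Species ~ ., data = iris), should lead to the same kind of object for the factor case.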
nbagg bootstrap samples are drawn and a tree is constructed for each of them. There is no general rule for when to stop growing the trees. The size of the trees can be controlled by the control argument or by prune.classbagg. By default, classification trees are as large as possible, whereas regression trees and survival trees are built with the standard options of rpart.control. If nbagg=1, one single tree is computed for the whole learning sample without bootstrapping.
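The procedure in the preceding paragraph can be sketched by hand with rpart. This is a minimal illustration, not the code used inside ipredbagg: the iris data, the seed, the variable names and the majority-vote aggregation are assumptions chosen for the example.

```r
library("rpart")

set.seed(1)       # arbitrary seed for reproducibility
n <- nrow(iris)
nbagg <- 25

# One classification tree per bootstrap sample (n out of n, with replacement),
# grown as large as possible via minsplit = 2 and cp = 0
trees <- lapply(seq_len(nbagg), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ],
        control = rpart.control(minsplit = 2, cp = 0, xval = 0))
})

# Predicted class of every tree for every observation (n x nbagg matrix)
votes <- sapply(trees, function(tr)
  as.character(predict(tr, iris, type = "class")))

# Aggregate the trees by majority vote
bagged <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                 levels = levels(iris$Species))

# Resubstitution accuracy of the bagged predictor (optimistic by construction)
mean(bagged == iris$Species)
```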
If coob is TRUE, the out-of-bag sample (Breiman, 1996b) is used to estimate the prediction error corresponding to class(y). Alternatively, the out-of-bag sample can be used for model combination; an out-of-bag error rate estimator is not available in this case. Double-bagging (Hothorn and Lausen, 2003) computes an LDA on the out-of-bag sample and uses the discriminant variables as additional predictors for the classification trees. comb is an optional list of lists with two elements, model and predict. model is a function with arguments formula and data. predict is a function with arguments object and newdata only. If the estimation of the covariance matrix in lda fails due to a limited out-of-bag sample size, one can use slda instead. See the example section for an example of double-bagging. The methodology is not limited to a combination with LDA: bundling (Hothorn and Lausen, 2005) can be used with arbitrary classifiers.
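To make the idea of an out-of-bag error estimate concrete, the following sketch computes one by hand for classification trees. It is an illustration under stated assumptions (iris data, arbitrary seed, majority vote over the trees for which an observation was out-of-bag) and is not taken from the ipred sources.

```r
library("rpart")

set.seed(2)       # arbitrary seed for reproducibility
n <- nrow(iris)
nbagg <- 25
lev <- levels(iris$Species)

# votes[i, k] counts how often observation i was predicted as class k
# by trees whose bootstrap sample did not contain observation i
votes <- matrix(0, nrow = n, ncol = length(lev), dimnames = list(NULL, lev))

for (b in seq_len(nbagg)) {
  idx <- sample(n, n, replace = TRUE)
  oob <- setdiff(seq_len(n), idx)          # observations not drawn
  tree <- rpart(Species ~ ., data = iris[idx, ],
                control = rpart.control(minsplit = 2, cp = 0, xval = 0))
  pred <- as.integer(predict(tree, iris[oob, ], type = "class"))
  votes[cbind(oob, pred)] <- votes[cbind(oob, pred)] + 1
}

# Only observations that were out-of-bag at least once can be assessed
seen <- rowSums(votes) > 0
oob_class <- lev[max.col(votes[seen, , drop = FALSE], ties.method = "first")]

# Out-of-bag misclassification error
oob_err <- mean(oob_class != as.character(iris$Species[seen]))
oob_err
```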
NOTE: Up to ipred version 0.9-0, bagging was performed using a modified version of the original rpart function. Due to interface changes in rpart 3.1-55, the bagging function had to be rewritten. Results of previous versions are not exactly reproducible.
Value
The class of the object returned depends on class(y): classbagg, regbagg and survbagg. Each is a list with elements
y 
the vector of responses. 
X 
the data frame of predictors. 
mtrees 
multiple trees: a list of length 
OOB 
logical whether the out-of-bag estimate should be computed. 
err 
if 
comb 
logical whether a combination of models was requested. 
For each class, methods for the generics prune.rpart, print, summary and predict are available for inspection of the results and prediction, for example: print.classbagg, summary.classbagg, predict.classbagg and prune.classbagg for classification problems.
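A short sketch of how the returned list and its methods might be inspected is given here. The iris data, the seed and the object name fit are assumptions for illustration; the err element is assumed to hold the out-of-bag estimate when coob=TRUE.

```r
library("ipred")

set.seed(3)  # arbitrary seed for reproducibility

# Classification problem, so the result is expected to be of class classbagg
fit <- bagging(Species ~ ., data = iris, nbagg = 25, coob = TRUE)

fit$OOB              # TRUE, since an out-of-bag estimate was requested
fit$err              # out-of-bag misclassification error (assumed location)
length(fit$mtrees)   # one entry per bootstrap replication

print(fit)

# predict.classbagg is dispatched; class labels are returned by default
pred <- predict(fit, newdata = iris[1:5, ])
pred
```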
References
Leo Breiman (1996a), Bagging Predictors. Machine Learning 24(2), 123–140.
Leo Breiman (1996b), Out-Of-Bag Estimation. Technical Report http://www.stat.berkeley.edu/~breiman/OOBestimation.pdf.
Leo Breiman (1998), Arcing Classifiers. The Annals of Statistics 26(3), 801–824.
Peter Buehlmann and Bin Yu (2002), Analyzing Bagging. The Annals of Statistics 30(4), 927–961.
Torsten Hothorn and Berthold Lausen (2003), Double-Bagging: Combining classifiers by bootstrap aggregation. Pattern Recognition, 36(6), 1303–1309.
Torsten Hothorn and Berthold Lausen (2005), Bundling Classifiers by Bagging Trees. Computational Statistics & Data Analysis, 49, 1068–1078.
Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Troeger (2004), Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.
Examples
library("ipred")
library("MASS")
library("survival")
# Classification: Breast Cancer data
data("BreastCancer", package = "mlbench")
# Test set error bagging (nbagg = 50): 3.7% (Breiman, 1998, Table 5)
mod <- bagging(Class ~ Cl.thickness + Cell.size
               + Cell.shape + Marg.adhesion
               + Epith.c.size + Bare.nuclei
               + Bl.cromatin + Normal.nucleoli
               + Mitoses, data=BreastCancer, coob=TRUE)
print(mod)
# Test set error bagging (nbagg=50): 7.9% (Breiman, 1996a, Table 2)
data("Ionosphere", package = "mlbench")
Ionosphere$V2 <- NULL # constant within groups
bagging(Class ~ ., data=Ionosphere, coob=TRUE)
# Double-Bagging: combine LDA and classification trees
# predict returns the linear discriminant values, i.e. linear combinations
# of the original predictors
comb.lda <- list(list(model=lda, predict=function(obj, newdata)
                      predict(obj, newdata)$x))
# Note: out-of-bag estimator is not available in this situation, use
# errorest
mod <- bagging(Class ~ ., data=Ionosphere, comb=comb.lda)
predict(mod, Ionosphere[1:10,])
# Regression:
data("BostonHousing", package = "mlbench")
# Test set error (nbagg=25, trees pruned): 3.41 (Breiman, 1996a, Table 8)
mod <- bagging(medv ~ ., data=BostonHousing, coob=TRUE)
print(mod)
library("mlbench")
learn <- as.data.frame(mlbench.friedman1(200))
# Test set error (nbagg=25, trees pruned): 2.47 (Breiman, 1996a, Table 8)
mod <- bagging(y ~ ., data=learn, coob=TRUE)
print(mod)
# Survival data
# Brier score for censored data estimated by
# 10 times 10-fold cross-validation: 0.2 (Hothorn et al.,
# 2004)
data("DLBCL", package = "ipred")
mod <- bagging(Surv(time,cens) ~ MGEc.1 + MGEc.2 + MGEc.3 + MGEc.4 + MGEc.5 +
               MGEc.6 + MGEc.7 + MGEc.8 + MGEc.9 +
               MGEc.10 + IPI, data=DLBCL, coob=TRUE)
print(mod)
