erboost: ER-Boost Expectile Regression Modeling

Description

Fits ER-Boost Expectile Regression models.

Usage

erboost(formula = formula(data),
    distribution = list(name="expectile",alpha=0.5),
    data = list(),
    weights,
    var.monotone = NULL,
    n.trees = 3000,
    interaction.depth = 3,
    n.minobsinnode = 10,
    shrinkage = 0.001,
    bag.fraction = 0.5,
    train.fraction = 1.0,
    cv.folds=0,
    keep.data = TRUE,
    verbose = TRUE)

erboost.fit(x,y,
        offset = NULL,
        misc = NULL,
        distribution = list(name="expectile",alpha=0.5),
        w = NULL,
        var.monotone = NULL,
        n.trees = 3000,
        interaction.depth = 3,
        n.minobsinnode = 10,
        shrinkage = 0.001,
        bag.fraction = 0.5,
        train.fraction = 1.0,
        keep.data = TRUE,
        verbose = TRUE,
        var.names = NULL,
        response.name = NULL)

erboost.more(object,
         n.new.trees = 3000,
         data = NULL,
         weights = NULL,
         offset = NULL,
         verbose = NULL)

Arguments

formula

a symbolic description of the model to be fit. The formula may include an offset term (e.g. y~offset(n)+x). If keep.data=FALSE in the initial call to erboost then it is the user's responsibility to resupply the offset to erboost.more.

distribution

a list with a component name specifying the distribution and any additional parameters needed. Expectile regression is available, and distribution must be a list of the form list(name="expectile",alpha=0.25), where alpha is the expectile to estimate. The current version's expectile regression methods do not handle non-constant weights and will stop if given them.

data

an optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which erboost is called. If keep.data=TRUE in the initial call to erboost then erboost stores a copy with the object. If keep.data=FALSE then it is the user's responsibility to resupply the same dataset in subsequent calls to erboost.more.

weights

an optional vector of weights to be used in the fitting process. Must be positive but do not need to be normalized. If keep.data=FALSE in the initial call to erboost then it is the user's responsibility to resupply the weights to erboost.more.

var.monotone

an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.

n.trees

the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. The default is 3000. Users should not rely on the default but choose n.trees to suit their data. See the Details section below.

cv.folds

Number of cross-validation folds to perform. If cv.folds>1 then erboost, in addition to the usual fit, performs cross-validation and returns an estimate of generalization error in cv.error.

interaction.depth

The maximum depth of variable interactions. 1 implies an additive model, 2 implies a model with up to 2-way interactions, etc. The default is 3. Users should not rely on the default but choose interaction.depth to suit their data. See the Details section below.

n.minobsinnode

minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight.

shrinkage

a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction.

bag.fraction

the fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction<1 then running the same model twice will result in similar but different fits. erboost uses the R random number generator, so calling set.seed beforehand ensures that the model can be reconstructed. Preferably, the user can save the returned erboost.object using save.
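The effect of set.seed on subsampling can be illustrated in base R. This is only a sketch: the sample() calls stand in for the subsampling that erboost performs internally, and the list `fit` is a hypothetical stand-in for a fitted erboost.object.

```r
n <- 100
bag.fraction <- 0.5

# Resetting the seed reproduces the same random subsample.
set.seed(123)
bag1 <- sample(n, floor(bag.fraction * n))
set.seed(123)
bag2 <- sample(n, floor(bag.fraction * n))
same.bag <- identical(bag1, bag2)

# Saving the object avoids having to refit at all.
fit <- list(bag = bag1, note = "stand-in for an erboost.object")  # hypothetical
path <- tempfile(fileext = ".RData")
save(fit, file = path)

restored <- new.env()
load(path, envir = restored)
same.fit <- identical(fit, restored$fit)
```

Saving is the more robust option because a seed reproduces the fit only if the data, arguments, and R's random number generator settings are also unchanged.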

train.fraction

The first train.fraction * nrow(data) observations are used to fit the erboost model and the remainder are used for computing out-of-sample estimates of the loss function.
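Because the split takes the first rows rather than a random subset, data that is sorted in some meaningful way may need shuffling before the fit. A minimal base R sketch of the idea follows; the data frame `df` is hypothetical, and the use of floor() for rounding is an assumption, not something the documentation states.

```r
set.seed(42)
df <- data.frame(id = 1:10, y = rnorm(10))  # hypothetical data

train.fraction <- 0.5

# Shuffle the rows so that the leading block is a random subset.
df <- df[sample(nrow(df)), ]

# Rows that would be used for fitting versus out-of-sample evaluation.
n.train <- floor(train.fraction * nrow(df))  # rounding rule is assumed
train <- df[seq_len(n.train), ]
test  <- df[-seq_len(n.train), ]
```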

keep.data

a logical variable indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to erboost.more faster at the cost of storing an extra copy of the dataset.

object

an erboost object created from an initial call to erboost.

n.new.trees

the number of additional trees to add to object. The default number is 3000.

verbose

If TRUE, erboost will print out progress and performance indicators. If this option is left unspecified for erboost.more then it uses verbose from object.

x, y

For erboost.fit: x is a data frame or data matrix containing the predictor variables and y is the vector of outcomes. The number of rows in x must be the same as the length of y.

offset

a vector of values for the offset.

misc

For erboost.fit: misc is an R object that is simply passed on to the erboost engine.

w

For erboost.fit: w is a vector of weights of the same length as y.

var.names

For erboost.fit: A vector of strings of length equal to the number of columns of x containing the names of the predictor variables.

response.name

For erboost.fit: A character string label for the response variable.

Details

Expectile regression (Newey & Powell 1987) estimates the conditional expectiles of a response variable given a set of covariates. This package implements a regression-tree-based gradient boosting estimator for nonparametric multiple expectile regression. The code is a modified version of the gbm library (https://cran.r-project.org/package=gbm) originally written by Greg Ridgeway.
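To make the notion of an expectile concrete, the following base R sketch computes the unconditional sample expectile, which minimizes the asymmetric squared loss with weight alpha on observations at or above the estimate and 1 - alpha on those below it. The function name sample_expectile is hypothetical and is not part of erboost.

```r
# Sample expectile by iteratively reweighted averaging.
sample_expectile <- function(y, alpha, tol = 1e-10, max.iter = 1000) {
  e <- mean(y)
  for (i in seq_len(max.iter)) {
    # Observations at or above the current estimate get weight alpha,
    # those below get weight 1 - alpha.
    w <- ifelse(y >= e, alpha, 1 - alpha)
    e.new <- sum(w * y) / sum(w)
    if (abs(e.new - e) < tol) break
    e <- e.new
  }
  e.new
}

set.seed(1)
y <- rnorm(500)
e50 <- sample_expectile(y, 0.5)  # coincides with mean(y)
e90 <- sample_expectile(y, 0.9)  # lies above the mean
e10 <- sample_expectile(y, 0.1)  # lies below the mean
```

With alpha = 0.5 the loss is the ordinary squared error, so the expectile reduces to the mean; other values of alpha shift the estimate toward the upper or lower tail.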

Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001).
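The greedy procedure can be sketched in base R with single-split trees (stumps) on one predictor. This is an illustration under simplifying assumptions, not the package's C++ engine: the function names are hypothetical, there is no subsampling, and the leaf values use a Newton-style weighted average of residuals, which may differ from what erboost does.

```r
# Asymmetric squared (expectile) loss.
expectile_loss <- function(y, f, alpha) {
  w <- ifelse(y >= f, alpha, 1 - alpha)
  mean(w * (y - f)^2)
}

# Best single split on x for a least-squares fit to the target g.
fit_stump <- function(x, g) {
  u <- sort(unique(x))
  cuts <- (head(u, -1) + tail(u, -1)) / 2  # midpoints between sorted values
  sse <- sapply(cuts, function(s) {
    left <- x <= s
    sum((g[left] - mean(g[left]))^2) + sum((g[!left] - mean(g[!left]))^2)
  })
  cuts[which.min(sse)]
}

boost_expectile <- function(x, y, alpha = 0.5, n.trees = 200, shrinkage = 0.1) {
  f <- rep(mean(y), length(y))  # start from a constant fit
  loss <- numeric(n.trees)
  for (m in seq_len(n.trees)) {
    w <- ifelse(y >= f, alpha, 1 - alpha)
    g <- 2 * w * (y - f)        # negative gradient of the loss
    s <- fit_stump(x, g)        # basis function: a stump fitted to g
    left <- x <= s
    for (idx in list(left, !left)) {
      # Shrunken Newton-style update within each leaf.
      step <- sum(w[idx] * (y[idx] - f[idx])) / sum(w[idx])
      f[idx] <- f[idx] + shrinkage * step
    }
    loss[m] <- expectile_loss(y, f, alpha)
  }
  list(fitted = f, loss = loss)
}

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
fit <- boost_expectile(x, y, alpha = 0.8)
```

Each iteration adds one shrunken stump, so the training loss stored in fit$loss does not increase from one iteration to the next.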

In addition to many of the features documented in the Gradient Boosting Machine, erboost offers additional features, including the out-of-bag estimator for the optimal number of iterations and the ability to store and manipulate the resulting erboost object.

interaction.depth and n.trees are two of the most important tuning parameters in erboost. Rather than relying on the defaults, users should choose values appropriate to their data. For example, n.trees is the maximum number of trees to fit; if it is set too small, the optimal number of trees for a particular dataset (best.iter, as selected by erboost.perf in the Examples section) may exceed it, resulting in a sub-optimal model. Users should therefore fit the model with n.trees large enough to exceed the likely optimal number of trees. The same principle applies to interaction.depth.
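One informal way to act on this advice is to check whether the selected best.iter sits close to the n.trees cap. The helper below is a hypothetical sketch; the name near_cap and the 90% margin are assumptions, not part of erboost.

```r
# TRUE when the selected iteration is within `margin` of the cap, which
# suggests that the optimum may lie beyond the trees fitted so far.
near_cap <- function(best.iter, n.trees, margin = 0.9) {
  best.iter >= margin * n.trees
}

hit  <- near_cap(2950, 3000)  # close to the cap: consider fitting more trees
safe <- near_cap(1200, 3000)  # well inside the cap
```

With the objects from the Examples section, the check would compare best.iter against the n.trees value passed to erboost; if it returns TRUE, erboost.more can add further trees before best.iter is re-estimated.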

erboost.fit provides the link between R and the C++ erboost engine. erboost is a front-end to erboost.fit that uses the familiar R modeling formulas. However, model.frame is very slow if there are many predictor variables. Power users with many variables should use erboost.fit; for general practice erboost is preferable.
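A minimal sketch of the erboost.fit route, which passes the predictors and outcome directly instead of a formula. The data frame `df` is hypothetical, the argument names are taken from the Usage section, and the call runs only if the erboost package is installed.

```r
set.seed(1)
df <- data.frame(Y = rnorm(100), X1 = runif(100), X2 = runif(100))  # hypothetical data

x <- df[, c("X1", "X2")]  # predictors as a data frame
y <- df$Y                 # vector of outcomes, one per row of x

if (requireNamespace("erboost", quietly = TRUE)) {
  fit <- erboost::erboost.fit(x, y,
      distribution = list(name = "expectile", alpha = 0.5),
      n.trees = 100,       # kept small so the sketch runs quickly
      verbose = FALSE)
}
```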

Value

erboost, erboost.fit, and erboost.more return an erboost.object.

Author(s)

Yi Yang yiyang@umn.edu and Hui Zou hzou@stat.umn.edu

References

Yang, Y. and Zou, H. (2015), “Nonparametric Multiple Expectile Regression via ER-Boost,” Journal of Statistical Computation and Simulation, 84(1), 84-95.

G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.

https://cran.r-project.org/package=gbm

J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.

J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.

See Also

erboost.object, erboost.perf, plot.erboost, predict.erboost, summary.erboost

Examples

N <- 200
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=50)] <- NA
X4[sample(1:N,size=30)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
erboost1 <- erboost(Y~X1+X2+X3+X4+X5+X6,         # formula
    data=data,                   # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 #  0: no monotone restrictions
    distribution=list(name="expectile",alpha=0.5),
                                 # expectile
    n.trees=3000,                # number of trees
    shrinkage=0.005,             # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum number of observations in each terminal node
    cv.folds = 5,                # do 5-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=TRUE)                # print out progress


# check performance using a 50% heldout test set
best.iter <- erboost.perf(erboost1,method="test")
print(best.iter)

# check performance using 5-fold cross-validation
best.iter <- erboost.perf(erboost1,method="cv")
print(best.iter)

# plot variable influence
summary(erboost1,n.trees=1)         # based on the first tree
summary(erboost1,n.trees=best.iter) # based on the estimated best number of trees

# make some new data
N <- 20
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1]) # same level order as the training data
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)

data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale
f.predict <- predict.erboost(erboost1,data2,best.iter)

# least squares error
print(sum((data2$Y-f.predict)^2))

# create marginal plots
# plot variable X1 after "best" iterations
plot.erboost(erboost1,1,best.iter)
# contour plot of variables 1 and 3 after "best" iterations
plot.erboost(erboost1,c(1,3),best.iter)

# do another 20 iterations
erboost2 <- erboost.more(erboost1,20,
                 verbose=FALSE) # stop printing detailed progress

erboost documentation built on May 1, 2019, 9:22 p.m.
