mboostLSS: Fitting GAMLSS by Boosting

Description Usage Arguments Details Value References See Also Examples

Description

Functions for fitting GAMLSS (generalized additive models for location, scale and shape) using boosting techniques. The algorithm iteratively rotates between the distribution parameters, updating one while using the current fits of the others as offsets (for details see Mayr et al., 2012).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
mboostLSS(formula, data = list(), families = GaussianLSS(),
          control = boost_control(), weights = NULL, ...)
glmboostLSS(formula, data = list(), families = GaussianLSS(),
            control = boost_control(), weights = NULL, ...)
gamboostLSS(formula, data = list(), families = GaussianLSS(),
            control = boost_control(), weights = NULL, ...)
blackboostLSS(formula, data = list(), families = GaussianLSS(),
              control = boost_control(), weights = NULL, ...)

## fit function:
mboostLSS_fit(formula, data = list(), families = GaussianLSS(),
              control = boost_control(), weights = NULL,
              fun = mboost, funchar = "mboost", call = NULL, ...)

Arguments

formula

a symbolic description of the model to be fit. See mboost for details. If formula is a single formula, the same formula is used for all distribution parameters. formula can also be a (named) list, where each list element corresponds to one distribution parameter of the GAMLSS distribution. The names must be the same as in the family (see example for details).

data

a data frame containing the variables in the model.

families

an object of class families. It can be either one of the pre-defined distributions that come along with the package or a new distribution specified by the user (see Families for details). Per default, we use the two-parametric GaussianLSS family.

control

a list of parameters controlling the algorithm. For more details see boost_control.

weights

a numeric vector of weights (optional).

fun

fit function. Either mboost, glmboost, gamboost or blackboost. Specified directly via the corresponding LSS function. E.g. gamboostLSS() calls mboostLSS_fit(..., fun = gamboost).

funchar

character representation of fit function. Either "mboost", "glmboost", "gamboost" or "blackboost". Specified directly via the corresponding LSS function.

call

used to forward the call from mboostLSS, glmboostLSS, gamboostLSS and blackboostLSS. This argument should not be directly specified by users!

...

Further arguments to be passed to mboostLSS_fit. In mboostLSS_fit, ... represent further arguments to be passed to mboost and mboost_fit. So ... can be all arguments of mboostLSS_fit and mboost_fit.

Details

For information on GAMLSS theory see Rigby and Stasinopoulos (2005) or the information provided at http://gamlss.org. For a tutorial on gamboostLSS see Hofner et al. (2014).

glmboostLSS uses glmboost to fit the distribution parameters of a GAMLSS – a linear boosting model is fitted for each parameter.

gamboostLSS uses gamboost to fit the distribution parameters of a GAMLSS – an additive boosting model (by default with smooth effects) is fitted for each parameter. With the formula argument, a wide range of different base-learners can be specified (see baselearners). The base-learners inply the type of effect each covariate has on the corresponding distribution parameter.

mboostLSS uses mboost to fit the distribution parameters of a GAMLSS. The type of model (linear, tree-based or smooth) is specified by fun.

blackboostLSS uses blackboost to fit the distribution parameters of a GAMLSS – a tree-based boosting model is fitted for each parameter.

mboostLSS, glmboostLSS, gamboostLSS and blackboostLSS all call mboostLSS_fit while fun is the corresponding mboost function, i.e., the same function without LSS. For further possible arguments see these functions as well as mboost_fit.

In all four fitting functions it is possible to specify one or multiple mstop and nu values via boost_control. In the case of one single value, this value is used for all distribution parameters of the GAMLSS model. Alternatively, a (named) vector or a (named) list with separate values for each component can be used to specify a seperate value for each parameter of the GAMLSS model. The names of the list must correspond to the names of the distribution parameters of the GAMLSS family. If no names are given, the order of the mstop or nu values is assumed to be the same as the order of the components in the families. For one-dimensional stopping, the user therefore can specify, e.g., mstop = 100 via boost_control. For more-dimensional stopping, one can specify, e.g., mstop = list(mu = 100, sigma = 200) (see examples).

To (potentially) stabilize the model estimation by standardizing the negative gradients one can use the argument stabilization of the families. See Families for details.

Value

An object of class mboostLSS with corresponding methods to extract information.

References

B. Hofner, A. Mayr, M. Schmid (2014). gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework. Technical Report, arXiv:1407.1774.

Mayr, A., Fenske, N., Hofner, B., Kneib, T. and Schmid, M. (2012): Generalized additive models for location, scale and shape for high-dimensional data - a flexible approach based on boosting. Journal of the Royal Statistical Society, Series C (Applied Statistics) 61(3): 403-427.

M. Schmid, S. Potapov, A. Pfahlberg, and T. Hothorn. Estimation and regularization techniques for regression models with multidimensional prediction functions. Statistics and Computing, 20(2):139-150, 2010.

Rigby, R. A. and D. M. Stasinopoulos (2005). Generalized additive models for location, scale and shape (with discussion). Journal of the Royal Statistical Society, Series C (Applied Statistics), 54, 507-554.

Buehlmann, P. and Hothorn, T. (2007), Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

See Also

Families for a documentation of available GAMLSS distributions.

The underlying boosting functions mboost, gamboost, glmboost, blackboost are contained in the mboost package.

See for example risk or coef for methods that can be used to extract information from mboostLSS objects.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
### Data generating process:
set.seed(1907)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for( i in 1:1000)
    y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

### linear model with y ~ . for both components: 400 boosting iterations
model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 400),
                     center = TRUE)
coef(model, off2int = TRUE)


### estimate model with different formulas for mu and sigma:
names(NBinomialLSS())      # names of the family

# Note: Multiple formulas must be specified via a _named list_
#       where the names correspond to the names of the distribution parameters
#       in the family (see above)
model2 <- glmboostLSS(formula = list(mu = y ~ x1 + x2 + x3 + x4,
                                    sigma = y ~ x3 + x4 + x5 + x6),
                     families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 400, trace = TRUE),
                     center = TRUE)
coef(model2, off2int = TRUE)



### Offset needs to be specified via the arguments of families object:
model <- glmboostLSS(y ~ ., data = dat,
                     families = NBinomialLSS(mu = mean(mu),
                                             sigma = mean(sigma)),
                     control = boost_control(mstop = 10),
                     center = TRUE)
# Note: mu-offset = log(mean(mu)) and sigma-offset = log(mean(sigma))
#       as we use a log-link in both families
coef(model)
log(mean(mu))
log(mean(sigma))

### use different mstop values for the two distribution parameters
### (two-dimensional early stopping)
### the number of iterations is passed to boost_control via a named list
model3 <- glmboostLSS(formula = list(mu = y ~ x1 + x2 + x3 + x4,
                                    sigma = y ~ x3 + x4 + x5 + x6),
                     families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = list(mu = 400,
                                                          sigma = 300),
                                             trace  = TRUE),
                     center = TRUE)
coef(model3, off2int = TRUE)

### Alternatively we can subset model2:
# here it is assumed that the first element in the vector corresponds to
# the first distribution parameter of model2 etc.
model2[c(400, 300)]
par(mfrow = c(1,2))
plot(model2, xlim = c(0, max(mstop(model2))))
## all.equal(coef(model2), coef(model3)) # same!


### WARNING: Subsetting via model[mstopnew] changes the model directly!
### For the original fit one has to subset again: model[mstop]

gamboostLSS documentation built on May 2, 2019, 4:57 p.m.