Family: Gradient Boosting Families
In mboost: Model-Based Boosting


boost_family objects provide a convenient way to specify loss functions and corresponding risk functions to be optimized by one of the boosting algorithms implemented in this package.

Family(ngradient, loss = NULL, risk = NULL,
       offset = function(y, w)
           optimize(risk, interval = range(y),
                    y = y, w = w)$minimum,
       check_y = function(y) y,
       weights = c("any", "none", "zeroone", "case"),
       nuisance = function() return(NA),
       name = "user-specified", fW = NULL,
       response = function(f) NA,
       rclass = function(f) NA)
AdaExp()
AUC()
Binomial(link = c("logit", "probit"), ...)
GaussClass()
GaussReg()
Gaussian()
Huber(d = NULL)
Laplace()
Poisson()
GammaReg(nuirange = c(0, 100))
CoxPH()
QuantReg(tau = 0.5, qoffset = 0.5)
ExpectReg(tau = 0.5)
NBinomial(nuirange = c(0, 100))
PropOdds(nuirange = c(-0.5, -1), offrange = c(-5, 5))
Weibull(nuirange = c(0, 100))
Loglog(nuirange = c(0, 100))
Lognormal(nuirange = c(0, 100))
Gehan()
Hurdle(nuirange = c(0, 100))
Multinomial()

`ngradient`	a function with arguments `y`, `f` and `w` implementing the negative gradient of the `loss` function (which is to be minimized).
`loss`	an optional loss function with arguments `y` and `f`.
`risk`	an optional risk function with arguments `y`, `f` and `w` to be minimized (!), the weighted sum of the loss function by default.
`offset`	a function with arguments `y` and `w` (weights) for computing a scalar offset.
`fW`	transformation of the fit for the diagonal weights matrix for an approximation of the boosting hat matrix for loss functions other than squared error.
`response`	inverse link function of a GLM or any other transformation on the scale of the response.
`rclass`	function to derive class predictions from conditional class probabilities (for models with factor response variable).
`check_y`	a function for checking and transforming the class / mode of a response variable.
`nuisance`	a function for extracting nuisance parameters from the family.
`weights`	a character indicating what type of weights are allowed. These can be either arbitrary (non-negative) weights `"any"`, only zero and one weights `"zeroone"`, (non-negative) integer weights `"case"`, or no weights at all `"none"`.
`name`	a character giving the name of the loss function for pretty printing.
`link`	link function for binomial family. Alternatively, one may supply the name of a distribution (for example `link = "norm"`), parameters of which may be specified via the ... argument.
`d`	delta parameter for Huber loss function. If omitted, it is chosen adaptively.
`tau`	the quantile or expectile to be estimated, a number strictly between 0 and 1.
`qoffset`	quantile of response distribution to be used as offset.
`nuirange`	a vector containing the end-points of the interval to be searched for the minimum risk w.r.t. the nuisance parameter. In case of `PropOdds`, the starting values for the nuisance parameters.
`offrange`	interval to search for offset in.
`...`	additional arguments to link functions.

The boosting algorithm implemented in mboost minimizes the (weighted) empirical risk function risk(y, f, w) with respect to f. By default, the risk function is the weighted sum of the loss function loss(y, f) but can be chosen arbitrarily. The ngradient(y, f) function is the negative gradient of loss(y, f) with respect to f.
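The relationship between loss, risk and ngradient can be sketched in base R without calling mboost. The following is only an illustration under stated assumptions: the half squared-error loss is chosen so that y - f is its exact negative gradient, and the toy data, the single linear base-learner and the loop are hypothetical stand-ins, not the package's internal algorithm.

```r
## Half squared-error loss (an assumption made for this sketch).
loss <- function(y, f) 0.5 * (y - f)^2
ngradient <- function(y, f, w = 1) y - f
## Weighted empirical risk: weighted sum of the loss.
risk <- function(y, f, w = 1) sum(w * loss(y, f))

## Numerical check that ngradient equals -d loss / d f.
y0 <- 2; f0 <- 0.5; eps <- 1e-6
num_grad <- (loss(y0, f0 + eps) - loss(y0, f0 - eps)) / (2 * eps)
grad_ok <- abs(-num_grad - ngradient(y0, f0)) < 1e-6

## Toy data (illustrative only).
set.seed(1)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.1)
w <- rep(1, 50)

## Functional gradient descent with one linear base-learner.
nu <- 0.1                           # step size
f <- rep(weighted.mean(y, w), 50)   # start from a constant offset
risk_path <- risk(y, f, w)
for (m in 1:100) {
    u <- ngradient(y, f, w)             # negative gradient at the current fit
    fit <- lm.wfit(cbind(1, x), u, w)   # least-squares fit of the base-learner to u
    f <- f + nu * fit$fitted.values     # small step along the fitted gradient
    risk_path <- c(risk_path, risk(y, f, w))
}
range(risk_path)
```

In this sketch the empirical risk never increases from one iteration to the next, which is the behaviour the risk-minimization view suggests for a small step size.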

Pre-fabricated functions for the most commonly used loss functions are available as well. Buehlmann and Hothorn (2007) give a detailed overview of the available loss functions. The offset function returns the population minimizers evaluated at the response, i.e., 1/2 log(p / (1 - p)) for Binomial() or AdaExp(), (∑ w_i)^{-1} ∑ w_i y_i for Gaussian(), and the median for Huber() and Laplace(). A short summary of the available families is given in the following paragraphs:
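The population minimizers quoted above can be checked numerically with optimize(), mirroring the default offset argument shown in the usage. This is a minimal base-R sketch; the toy responses, the weights and the names risk_l2 and risk_exp are illustrative assumptions, not package code.

```r
## Squared-error risk: the minimizing constant is the weighted mean.
y <- c(1, 2, 4, 7)
w <- c(1, 2, 1, 1)
risk_l2 <- function(f, y, w) sum(w * (y - f)^2)
off_l2 <- optimize(risk_l2, interval = range(y), y = y, w = w)$minimum
c(numerical = off_l2, closed_form = weighted.mean(y, w))

## Exponential loss with y in {-1, +1}: the minimizing constant is
## 1/2 * log(p / (1 - p)), where p is the proportion of +1 responses.
yb <- c(1, 1, 1, -1)
p <- mean(yb == 1)
risk_exp <- function(f, y, w) sum(w * exp(-y * f))
off_exp <- optimize(risk_exp, interval = c(-5, 5),
                    y = yb, w = rep(1, length(yb)))$minimum
c(numerical = off_exp, closed_form = 0.5 * log(p / (1 - p)))

## Absolute-error loss: the minimizing constant is the median.
median(c(1, 2, 4, 7, 100))
```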

AdaExp(), Binomial() and AUC() implement families for binary classification. AdaExp() uses the exponential loss, which essentially leads to the AdaBoost algorithm of Freund and Schapire (1996). Binomial() implements the negative binomial log-likelihood of a logistic regression model as loss function. Thus, using the Binomial family closely corresponds to fitting a logistic model. Alternative link functions can be specified via the name of the corresponding distribution; for example, link = "cauchy" leads to pcauchy being used as the link function. This feature is still experimental and not well tested.

Note, however, that the coefficients resulting from boosting with family Binomial(link = "logit") are 1/2 of the coefficients of a logit model obtained via glm. This is due to the internal recoding of the response to -1 and +1 (see below). Nevertheless, Buehlmann and Hothorn (2007) argue that the family Binomial is the preferred choice for binary classification. For binary classification problems the response y has to be a factor. Internally y is re-coded to -1 and +1 (Buehlmann and Hothorn 2007).

AUC() uses 1-AUC(y, f) as the loss function. The area under the ROC curve (AUC) is defined as AUC = (n_{-1} n_1)^{-1} ∑_{i: y_i = 1} ∑_{j: y_j = -1} I(f_i > f_j). Since this is not differentiable in f, we approximate the jump function I((f_i - f_j) > 0) by the distribution function of the triangular distribution on [-1, 1] with mean 0, similar to the logistic distribution approximation used in Ma and Huang (2005).
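The AUC definition and its smooth approximation can be written out in a few lines of base R. This is only a sketch of the formulas stated above: the helper names auc_exact, ptri and auc_smooth are hypothetical, and the actual implementation in AUC() may differ in details such as scaling of the differences.

```r
## Exact AUC: proportion of (positive, negative) pairs with f_i > f_j.
auc_exact <- function(y, f) {
    d <- outer(f[y == 1], f[y == -1], "-")   # all differences f_i - f_j
    mean(d > 0)
}

## Distribution function of the symmetric triangular distribution on [-1, 1].
ptri <- function(x) {
    ifelse(x < -1, 0,
    ifelse(x < 0, (x + 1)^2 / 2,
    ifelse(x <= 1, 1 - (1 - x)^2 / 2, 1)))
}

## Smoothed AUC: the indicator is replaced by ptri().
auc_smooth <- function(y, f) {
    d <- outer(f[y == 1], f[y == -1], "-")
    mean(ptri(d))
}

y <- c(1, 1, -1, -1)
f <- c(2, 0.5, 1, -1)
auc_exact(y, f)    # 3 of the 4 pairs are correctly ordered
auc_smooth(y, f)
```

Unlike the indicator, ptri() changes continuously with the difference f_i - f_j, so the smoothed criterion has a usable gradient for pairs whose difference lies inside [-1, 1].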

Gaussian() is the default family in mboost. It implements L_2Boosting for continuous response. Note that families GaussReg() and GaussClass() (for regression and classification) are deprecated now. Huber() implements a robust version for boosting with continuous response, where the Huber-loss is used. Laplace() implements another strategy for continuous outcomes and uses the L_1-loss instead of the L_2-loss as used by Gaussian().
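The three losses for continuous responses can be compared side by side. The base-R definitions below are a sketch: the function names are hypothetical, the squared-error negative gradient is written as y - f (dropping the constant factor 2, as in the MyGaussian example of this page), and d is fixed at 1 for illustration although Huber() chooses it adaptively when omitted.

```r
## Squared error (L_2), absolute error (L_1) and Huber loss.
loss_l2 <- function(y, f) (y - f)^2
ngrad_l2 <- function(y, f) y - f

loss_l1 <- function(y, f) abs(y - f)
ngrad_l1 <- function(y, f) sign(y - f)

loss_huber <- function(y, f, d = 1) {
    r <- abs(y - f)
    ifelse(r <= d, r^2 / 2, d * (r - d / 2))   # quadratic near 0, linear in the tails
}
ngrad_huber <- function(y, f, d = 1) {
    r <- y - f
    ifelse(abs(r) <= d, r, d * sign(r))        # residuals are clipped at +/- d
}

res <- c(-10, -0.5, 0, 0.5, 10)
cbind(residual = res,
      l2 = loss_l2(res, 0),
      l1 = loss_l1(res, 0),
      huber = loss_huber(res, 0))
```

The table shows why the Huber and L_1 losses are more robust: a residual of 10 contributes 100 under squared error but only 10 or 9.5 under the other two.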

Poisson() implements a family for fitting count data with boosting methods. The implemented loss function is the negative Poisson log-likelihood. Note that the natural link function log(μ) = η is assumed. The default step-size nu = 0.1 is probably too large for this family (leading to infinite residuals) and smaller values are more appropriate.
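Under the log link the negative Poisson log-likelihood and its negative gradient have a simple form. The following base-R sketch uses hypothetical function names and is not taken from the package source; it checks the gradient numerically and shows how quickly it grows in magnitude with f.

```r
## Negative Poisson log-likelihood with mu = exp(f).
loss_pois <- function(y, f) -dpois(y, lambda = exp(f), log = TRUE)
## Negative gradient with respect to f.
ngrad_pois <- function(y, f) y - exp(f)

## Central-difference check of the gradient.
y0 <- 3; f0 <- 0.7; eps <- 1e-6
num_grad <- (loss_pois(y0, f0 + eps) - loss_pois(y0, f0 - eps)) / (2 * eps)
c(numerical = -num_grad, analytical = ngrad_pois(y0, f0))

## The gradient involves exp(f) and therefore grows exponentially in f.
ngrad_pois(3, c(0, 2, 5, 10))
```

Because the negative gradient contains exp(f), a fit that overshoots can produce very large residuals in the next iteration, which is one plausible reading of the caution about the step-size above.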

GammaReg() implements a family for fitting nonnegative response variables. The implemented loss function is the negative Gamma log-likelihood with logarithmic link function (instead of the natural link).
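A Gamma loss with log link and a shape parameter treated as nuisance can be sketched in base R as follows. The parameterization (shape sigma, mean exp(f)), the function names and the simulated data are assumptions made for this illustration; the search over the interval c(0, 100) mimics the role of the nuirange argument but is not the package's code.

```r
## Negative Gamma log-likelihood with mean exp(f) and shape sigma.
loss_gamma <- function(sigma, y, f)
    -dgamma(y, shape = sigma, rate = sigma / exp(f), log = TRUE)
## Negative gradient with respect to f.
ngrad_gamma <- function(sigma, y, f) sigma * y * exp(-f) - sigma
risk_gamma <- function(sigma, y, f, w = 1) sum(w * loss_gamma(sigma, y, f))

## Central-difference check of the gradient.
eps <- 1e-6
num_grad <- (loss_gamma(2, 1.5, 0.3 + eps) -
             loss_gamma(2, 1.5, 0.3 - eps)) / (2 * eps)
grad_ok <- abs(-num_grad - ngrad_gamma(2, 1.5, 0.3)) < 1e-5

## Simulated data with true shape 3 and a constant predictor f = 0.5.
set.seed(42)
f_true <- rep(0.5, 2000)
y <- rgamma(2000, shape = 3, rate = 3 / exp(f_true))

## Nuisance parameter: minimize the risk over the search interval.
sigma_hat <- optimize(risk_gamma, interval = c(0, 100),
                      y = y, f = f_true)$minimum
sigma_hat
```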

CoxPH() implements the negative partial log-likelihood for Cox models. Hence, survival models can be boosted using this family.

QuantReg() implements boosting for quantile regression, which was introduced in Fenske et al. (2011). ExpectReg() works analogously for expectiles, which were introduced to regression by Newey and Powell (1987).
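The losses behind quantile and expectile regression are the check function and its squared analogue. The base-R sketch below uses hypothetical function names and a simple grid search (not the package's offset computation) to show that the best constant under each loss is the corresponding quantile or expectile; constant factors in the package's gradients may differ.

```r
## Check (pinball) loss for quantiles and its negative gradient.
loss_quant <- function(y, f, tau = 0.5) {
    r <- y - f
    ifelse(r >= 0, tau * r, (tau - 1) * r)
}
ngrad_quant <- function(y, f, tau = 0.5) ifelse(y - f >= 0, tau, tau - 1)

## Asymmetric squared loss for expectiles and its negative gradient.
loss_expect <- function(y, f, tau = 0.5) {
    r <- y - f
    ifelse(r >= 0, tau * r^2, (1 - tau) * r^2)
}
ngrad_expect <- function(y, f, tau = 0.5) {
    r <- y - f
    2 * ifelse(r >= 0, tau, 1 - tau) * r
}

## Best constant fit under each loss, found on a fine grid.
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100)
grid <- seq(0, 100, by = 0.01)
best_const <- function(loss, tau)
    grid[which.min(sapply(grid, function(f) sum(loss(y, f, tau))))]

q50 <- best_const(loss_quant, 0.5)    # agrees with median(y)
e50 <- best_const(loss_expect, 0.5)   # agrees with mean(y) up to grid resolution
c(q50 = q50, median = median(y), e50 = e50, mean = mean(y))
```

For tau = 0.5 the two losses reduce to the absolute and squared error, so the outlier 100 pulls the expectile towards it while the quantile stays at 6.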

Families with an additional scale parameter can be used for fitting models as well: PropOdds() leads to proportional odds models for ordinal outcome variables (Schmid et al., 2011). When using this family, an ordered set of threshold parameters is re-estimated in each boosting iteration. An example is given below which also shows how to obtain the thresholds. NBinomial() leads to regression models with a negative binomial conditional distribution of the response. Weibull(), Loglog(), and Lognormal() implement the negative log-likelihood functions of accelerated failure time models with Weibull, log-logistic, and lognormal distributed outcomes, respectively. Hence, parametric survival models can be boosted using these families. For details see Schmid and Hothorn (2008) and Schmid et al. (2010).

Gehan() implements rank-based estimation of survival data in an accelerated failure time model. The loss function is defined as the sum of the pairwise absolute differences of residuals. The response needs to be defined as Surv(y, delta), where y is the observed survival time (subject to censoring) and delta is the non-censoring indicator (see Surv for details). For details on Gehan() see Johnson and Long (2011).

Hurdle models for zero-inflated count data can be fitted by using a combination of the Binomial() and Hurdle() families. While the Binomial() family allows for fitting the zero-generating process of the Hurdle model, Hurdle() fits a negative binomial regression model to the non-zero counts. Note that the specification of the Hurdle model allows for using Binomial() and Hurdle() independently of each other.

Linear or additive multinomial logit models can be fitted using Multinomial(), although this family requires some extra effort for model specification (see example). More specifically, the predictor must be in the form of a linear array model (see %O%). Note that this family does not work with tree-based base-learners at the moment. The class corresponding to the last level of the factor coding of the response is used as reference class.
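The role of the reference class in a multinomial logit model can be illustrated with a generic softmax computation. This base-R sketch is not the internal code of Multinomial(); the predictor values and class labels are hypothetical, and the only assumption carried over from the text is that the last class serves as reference, which is expressed here by fixing its predictor at 0.

```r
## Hypothetical predictors for the K - 1 = 2 non-reference classes
## (rows = observations, columns = classes).
f <- rbind(c(1.0, 0.5),
           c(-0.3, 2.0))

## The reference (last) class gets a predictor fixed at 0.
eta <- cbind(f, 0)

## Class probabilities via the multinomial logit (softmax) transformation.
prob <- exp(eta) / rowSums(exp(eta))
colnames(prob) <- c("class1", "class2", "reference")
round(prob, 3)
rowSums(prob)   # each row sums to one
```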

An object of class boost_family.

The coefficients resulting from boosting with family Binomial are 1/2 of the coefficients of a logit model obtained via glm. This is due to the internal recoding of the response to -1 and +1 (see above).
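The factor 1/2 can be reproduced in base R by minimizing the binomial loss written for a response coded as -1 and +1, where the success probability is exp(f) / (exp(f) + exp(-f)). This is a sketch using optim() on the mtcars data rather than a boosting fit; the variable and function names are illustrative assumptions.

```r
## Centered predictor, 0/1 response and its -1/+1 recoding.
x <- mtcars$wt - mean(mtcars$wt)
y01 <- mtcars$am
ypm <- 2 * y01 - 1

## Ordinary logistic regression.
fit_glm <- glm(y01 ~ x, family = binomial(link = "logit"))

## Binomial risk on the -1/+1 scale: sum of log(1 + exp(-2 * y * f)).
risk <- function(beta) sum(log1p(exp(-2 * ypm * (beta[1] + beta[2] * x))))
grad <- function(beta) {
    f <- beta[1] + beta[2] * x
    g <- -2 * ypm / (1 + exp(2 * ypm * f))   # derivative of the loss w.r.t. f
    c(sum(g), sum(g * x))
}
fit_pm <- optim(c(0, 0), risk, grad, method = "BFGS")

## The minimizer on the -1/+1 scale is half the glm coefficient vector.
rbind(glm_half = coef(fit_glm) / 2, pm_coding = fit_pm$par)
```

On this scale the predictor f corresponds to half the usual log-odds, so doubling the coefficients recovers the glm parameterization.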

For AUC(), variables should be centered and scaled and observations with weight > 0 must not contain missing values. The estimated coefficients for AUC() have no probabilistic interpretation.

ExpectReg() was donated by Fabian Sobotka. AUC() was donated by Fabian Scheipl.

Peter Buehlmann and Torsten Hothorn (2007), Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22(4), 477–505.

Nora Fenske, Thomas Kneib, and Torsten Hothorn (2011), Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Journal of the American Statistical Association, 106, 494–510.

Yoav Freund and Robert E. Schapire (1996), Experiments with a new boosting algorithm. In Machine Learning: Proc. Thirteenth International Conference, 148–156.

Shuangge Ma and Jian Huang (2005), Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics, 21(24), 4356–4362.

Whitney K. Newey and James L. Powell (1987), Asymmetric least squares estimation and testing. Econometrica, 55, 819–847.

Matthias Schmid and Torsten Hothorn (2008), Flexible boosting of accelerated failure time models. BMC Bioinformatics, 9(269).

Matthias Schmid, Sergej Potapov, Annette Pfahlberg, and Torsten Hothorn (2010), Estimation and regularization techniques for regression models with multidimensional prediction functions. Statistics and Computing, 20, 139–150.

M. Schmid, T. Hothorn, K. O. Maloney, D. E. Weller, and S. Potapov (2011), Geoadditive regression modeling of stream biological condition. Environmental and Ecological Statistics, 18(4), 709–733.

Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov, and Matthias Schmid (2014), Model-based Boosting in R: A Hands-on Tutorial Using the R Package mboost. Computational Statistics, 29, 3–35.
http://dx.doi.org/10.1007/s00180-012-0382-5

Available as vignette via: vignette(package = "mboost", "mboost_tutorial")

Brent A. Johnson and Qi Long (2011), Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies. Annals of Applied Statistics, 5, 1081–1101.

See mboost for the usage of Family objects and boost_family-class for objects resulting from a call to Family.

### Define a new family
MyGaussian <- function(){
       Family(ngradient = function(y, f, w = 1) y - f,
       loss = function(y, f) (y - f)^2,
       name = "My Gauss Variant")
}

############################################################
## Do not run and check these examples automatically as
## they take some time

### Proportional odds model
data(iris)
iris$Species <- factor(iris$Species, ordered = TRUE)
if (require("MASS")) {
    (mod.polr <- polr(Species  ~ Sepal.Length, data = iris))
}
mod.PropOdds <- glmboost(Species  ~ Sepal.Length, data = iris,
                         family = PropOdds(nuirange = c(-0.5, 3)))
mstop(mod.PropOdds) <- 1000
## thresholds are treated as nuisance parameters, to extract these use
nuisance(mod.PropOdds)
## effect estimate
coef(mod.PropOdds)["Sepal.Length"]
## make thresholds comparable to a model without intercept
nuisance(mod.PropOdds) - coef(mod.PropOdds)["(Intercept)"] -
    attr(coef(mod.PropOdds), "offset")

## End(Not run and test)


### Multinomial logit model via a linear array model
## One needs to convert the data to a list
myiris <- as.list(iris)
## ... and define a dummy vector with one factor level less
## than the outcome, which is used as reference category.
myiris$class <- factor(levels(iris$Species)[-nlevels(iris$Species)])
## Now fit the linear array model
mlm <- mboost(Species ~ bols(Sepal.Length, df = 2) %O%
                        bols(class, df = 2, contrasts.arg = "contr.dummy"),
              data = myiris,
              family = Multinomial())
coef(mlm) ## one should use more boosting iterations.
head(round(pred <- predict(mlm, type = "response"), 2))

## Prediction with new data:
newdata <- as.list(iris[1,])
## One always needs to keep the dummy vector class as above!
newdata$class <- factor(levels(iris$Species)[-nlevels(iris$Species)])
pred2 <- predict(mlm, type = "response", newdata = newdata)
## check results
pred[1, ]
pred2

############################################################
## Do not run and check these examples automatically as
## they take some time

## Compare results with nnet::multinom
if (require("nnet")) {
    mlmn <- multinom(Species ~ Sepal.Length, data = iris)
    max(abs(fitted(mlm[1000], type = "response") -
            fitted(mlmn, type = "prob")))

}

## End(Not run and test)

mboost documentation built on May 2, 2019, 6:10 p.m.
