distforest: Distributional Regression Forests


Description

Forests based on maximum-likelihood estimation of parameters for specified distribution families, for example from the GAMLSS family (for generalized additive models for location, scale, and shape).

Usage

distforest(formula, data, subset, na.action = na.pass, weights,
             offset, cluster, family = NO(), strata, 
             control = disttree_control(teststat = "quad", testtype = "Univ", 
             mincriterion = 0, saveinfo = FALSE, minsplit = 20, minbucket = 7, 
             splittry = 2, ...), 
             ntree = 500L, fit.par = FALSE, 
             perturb = list(replace = FALSE, fraction = 0.632), 
             mtry = ceiling(sqrt(nvar)), applyfun = NULL, cores = NULL, 
             trace = FALSE, ...)
## S3 method for class 'distforest'
predict(object, newdata = NULL,
        type = c("parameter", "response", "weights", "node"),
        OOB = FALSE, scale = TRUE, ...)

Arguments

formula

a symbolic description of the model to be fit. This should be of the form y ~ x1 + x2, where y is the response variable and x1 and x2 are used as partitioning variables.

data

a data frame containing the variables in the model.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

na.action

a function which indicates what should happen when the data contain missing values.

weights

an optional vector of weights to be used in the fitting process. Both non-negative integer-valued weights and non-negative real weights are allowed. Observations are sampled (with or without replacement) with probabilities proportional to weights / sum(weights). The fraction of observations to be sampled (without replacement) is computed from the sum of the weights if all weights are integer-valued, and from the number of weights greater than zero otherwise. Alternatively, weights can be a double matrix directly defining case weights for all ncol(weights) trees in the forest. This requires more storage but gives the user more control.

offset

an optional vector of offset values.

cluster

an optional factor indicating independent clusters. Highly experimental, use at your own risk.

family

specification of the response distribution. Either a gamlss.family object, a list-generating function, or a family list (for a gamlss.family example, see the sketch at the end of this argument list).

strata

an optional factor for stratified sampling.

control

a list with control parameters, see disttree_control. Default values that are not set within the call of distforest correspond to the defaults used by disttree from the disttree package. saveinfo = FALSE leads to less memory-hungry representations of trees. Note that the arguments mtry, cores, and applyfun in disttree_control are ignored for distforest because distforest sets them itself.

ntree

number of trees to grow for the forest.

fit.par

logical. If TRUE, fitted values, predicted values, and predicted parameters are computed for the learning data (together with the log-likelihood).

perturb

a list with arguments replace and fraction determining the type of resampling: replace = TRUE refers to the n-out-of-n bootstrap and replace = FALSE to sample splitting. fraction is the fraction of observations to draw without replacement.

mtry

number of input variables randomly sampled as candidates at each node for random-forest-like algorithms. Bagging, as a special case of a random forest without random input variable sampling, can be performed by setting mtry either to Inf or manually to the number of input variables.

applyfun

an optional lapply-style function with arguments function(X, FUN, ...). It is used for computing the variable selection criterion. The default is to use the basic lapply function unless the cores argument is specified (see below).

cores

numeric. If set to an integer, applyfun is set to mclapply with the desired number of cores.

trace

a logical indicating if a progress bar shall be printed while the forest grows.

object

an object as returned by distforest.

newdata

an optional data frame containing test data.

type

a character string denoting the type of predicted value returned. For "parameter" the predicted distributional parameters are returned and for "response" the expectation is returned. "weights" returns an integer vector of prediction weights. For type = "node", a list of terminal node ids for each of the trees in the forest is returned.

OOB

a logical defining out-of-bag predictions (only if newdata = NULL).

scale

a logical indicating scaling of the nearest-neighbor weights by the sum of weights in the corresponding terminal node of each tree. In a simple regression forest, predicting the conditional mean by nearest-neighbor weights is equivalent to, but slower than, the aggregation of means.

...

arguments to be used to form the default control argument if it is not supplied directly.
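
As an illustration of the family and cores arguments above, the following minimal sketch fits a forest with a gamma response distribution, computing the variable selection criterion in parallel. It assumes the gamlss.dist package is installed, which provides gamlss.family generators such as NO() and GA():

## sketch: gamma response distribution instead of the default NO()
library("disttree")
library("gamlss.dist")
df_gamma <- distforest(dist ~ speed, data = cars, family = GA(),
                       cores = 2)  # mclapply: fork-based, not on Windows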

Details

Distributional regression forests are an application of model-based recursive partitioning (implemented in mob, ctree and cforest) to parametric model fits based on the GAMLSS family of distributions.

Distributional regression trees, see disttree, are fitted to each of the ntree perturbed samples of the learning sample. Most of the hyperparameters in disttree_control regulate the construction of the distributional regression trees.

Hyperparameters you might want to change are listed below (a combined sketch follows the list):

1. The number of randomly preselected variables mtry, which by default is the square root (rounded up) of the number of input variables.

2. The number of trees ntree. Use more trees if you have more variables.

3. The depth of the trees, regulated by mincriterion. Usually unstopped and unpruned trees are used in random forests. To grow large trees, set mincriterion to a small value.
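
Combining these, a minimal sketch restricted to arguments documented on this page:

## sketch: a larger forest of deep trees
## (mincriterion = 0 grows unstopped trees; it is also the
## distforest default, shown here only to illustrate the knob)
df_deep <- distforest(dist ~ speed, data = cars, ntree = 1000L,
                      control = disttree_control(mincriterion = 0))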

The aggregation scheme works by averaging observation weights extracted from each of the ntree trees and NOT by averaging predictions directly as in randomForest. See Schlosser et al. (2019), Hothorn et al. (2004), and Meinshausen (2006) for a description.
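
These aggregation weights can be inspected directly via the predict method; a minimal sketch:

## sketch: extract the observation weights driving the aggregation
df <- distforest(dist ~ speed, data = cars)
w <- predict(df, newdata = data.frame(speed = 15), type = "weights")
## per the 'type' argument above, w holds integer prediction weights,
## one per learning observation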

Predictions can be computed using predict. For observations with zero weights, predictions are computed from the fitted tree when newdata = NULL.
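
For example, using only the predict arguments documented above, in-sample and out-of-bag predictions for the learning data can be obtained as follows (a sketch):

## sketch: in-sample vs. out-of-bag parameter predictions
df <- distforest(dist ~ speed, data = cars)
p_fit <- predict(df, type = "parameter")
p_oob <- predict(df, type = "parameter", OOB = TRUE)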

Value

An object of class distforest.

References

Breiman L (2001). Random Forests. Machine Learning, 45(1), 5–32.

Hothorn T, Lausen B, Benner A, Radespiel-Troeger M (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.

Hothorn T, Bühlmann P, Dudoit S, Molinaro A, Van der Laan MJ (2006a). Survival Ensembles. Biostatistics, 7(3), 355–373.

Hothorn T, Hornik K, Zeileis A (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

Hothorn T, Zeileis A (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.

Meinshausen N (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7, 983–999.

Schlosser L, Hothorn T, Stauffer R, Zeileis A (2019). Distributional Regression Forests for Probabilistic Precipitation Forecasting in Complex Terrain. arXiv 1804.02921, arXiv.org E-Print Archive. http://arxiv.org/abs/1804.02921v3

Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25

Strobl C, Malley J, Tutz G (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests. Psychological Methods, 14(4), 323–348.

Examples

## basic example: distributional regression forest for cars data
library("disttree")
df <- distforest(dist ~ speed, data = cars)

## prediction of fitted mean and visualization
nd <- data.frame(speed = 4:25)
nd$mean  <- predict(df, newdata = nd, type = "response")[["(fitted.response)"]]
plot(dist ~ speed, data = cars)
lines(mean ~ speed, data = nd)
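
## predicted distributional parameters on the same grid
## (a sketch: the returned columns depend on the chosen family,
## e.g. location and scale parameters for the default NO() family)
head(predict(df, newdata = nd, type = "parameter"))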
