distforest: Distributional Regression Forests


Description

Forests based on maximum-likelihood estimation of parameters for specified distribution families, for example from the GAMLSS family (for generalized additive models for location, scale, and shape).

Usage

distforest(formula, data, subset, na.action = na.pass, weights,
             offset, cluster, family = NO(), strata, 
             control = disttree_control(teststat = "quad", testtype = "Univ", 
             mincriterion = 0, saveinfo = FALSE, minsplit = 20, minbucket = 7, 
             splittry = 2, ...), 
             ntree = 500L, fit.par = FALSE, 
             perturb = list(replace = FALSE, fraction = 0.632), 
             mtry = ceiling(sqrt(nvar)), applyfun = NULL, cores = NULL, 
             trace = FALSE, ...)
## S3 method for class 'distforest'
predict(object, newdata = NULL,
        type = c("parameter", "response", "weights", "node"),
        OOB = FALSE, scale = TRUE, ...)

Arguments

formula

a symbolic description of the model to be fit. This should be of the form y ~ x1 + x2, where y is the response variable and x1 and x2 are used as partitioning variables.

data

a data frame containing the variables in the model.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

na.action

a function which indicates what should happen when the data contain missing values.

weights

an optional vector of weights to be used in the fitting process. Both non-negative integer-valued weights and non-negative real weights are allowed. Observations are sampled (with or without replacement) with probabilities proportional to weights / sum(weights). The fraction of observations to be sampled (without replacement) is computed from the sum of the weights if all weights are integer-valued, and from the number of weights greater than zero otherwise. Alternatively, weights can be a double matrix directly defining case weights for all ncol(weights) trees in the forest. This requires more storage but gives the user more control.

offset

an optional vector of offset values.

cluster

an optional factor indicating independent clusters. Highly experimental, use at your own risk.

family

specification of the response distribution. Either a gamlss.family object, a list-generating function, or a family list (for a gamlss.family example, see the sketch at the end of this argument list).

strata

an optional factor for stratified sampling.

control

a list with control parameters, see disttree_control. Default values that are not set within the call of distforest correspond to the defaults used by disttree from the disttree package. saveinfo = FALSE leads to less memory-hungry representations of trees. Note that the arguments mtry, cores, and applyfun in disttree_control are ignored for distforest because distforest sets them itself.

ntree

number of trees to grow for the forest.

fit.par

logical. If TRUE, fitted values, predicted values, and predicted parameters are computed for the learning data (together with the log-likelihood).

perturb

a list with arguments replace and fraction determining the type of resampling: replace = TRUE refers to the n-out-of-n bootstrap and replace = FALSE to sample splitting. fraction is the fraction of observations to draw without replacement.

mtry

number of input variables randomly sampled as candidates at each node for random-forest-like algorithms. Bagging, as a special case of a random forest without random input variable sampling, can be performed by setting mtry either to Inf or manually to the number of input variables.

applyfun

an optional lapply-style function with arguments function(X, FUN, ...). It is used for computing the variable selection criterion. The default is to use the basic lapply function unless the cores argument is specified (see below).

cores

numeric. If set to an integer, applyfun is set to mclapply with the desired number of cores.

trace

a logical indicating if a progress bar shall be printed while the forest grows.

object

an object as returned by distforest.

newdata

an optional data frame containing test data.

type

a character string denoting the type of predicted value returned. For "parameter" the predicted distributional parameters are returned and for "response" the expectation is returned. "weights" returns an integer vector of prediction weights. For type = "node", a list of terminal node ids for each of the trees in the forest is returned.

OOB

a logical defining out-of-bag predictions (only if newdata = NULL).

scale

a logical indicating scaling of the nearest-neighbor weights by the sum of weights in the corresponding terminal node of each tree. In a simple regression forest, predicting the conditional mean by nearest-neighbor weights is equivalent to, but slower than, the aggregation of means.

...

arguments to be used to form the default control argument if it is not supplied directly.
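
As an illustration of the family and cores arguments above, the following minimal sketch fits a forest with a gamma response distribution, computing the variable selection criterion in parallel. It assumes the gamlss.dist package is installed, which provides gamlss.family generators such as NO() and GA():

## sketch: gamma response distribution instead of the default NO()
library("disttree")
library("gamlss.dist")
df_gamma <- distforest(dist ~ speed, data = cars, family = GA(),
                       cores = 2)  # mclapply: fork-based, not on Windows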

Details

Distributional regression forests are an application of model-based recursive partitioning (implemented in mob, ctree and cforest) to parametric model fits based on the GAMLSS family of distributions.

Distributional regression trees, see disttree, are fitted to each of the ntree perturbed samples of the learning sample. Most of the hyperparameters in disttree_control regulate the construction of the distributional regression trees.

Hyperparameters you might want to change are listed below (a combined sketch follows the list):

1. The number of randomly preselected variables mtry, which by default is the square root (rounded up) of the number of input variables.

2. The number of trees ntree. Use more trees if you have more variables.

3. The depth of the trees, regulated by mincriterion. Usually unstopped and unpruned trees are used in random forests. To grow large trees, set mincriterion to a small value.
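
Combining these, a minimal sketch restricted to arguments documented on this page:

## sketch: a larger forest of deep trees
## (mincriterion = 0 grows unstopped trees; it is also the
## distforest default, shown here only to illustrate the knob)
df_deep <- distforest(dist ~ speed, data = cars, ntree = 1000L,
                      control = disttree_control(mincriterion = 0))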

The aggregation scheme works by averaging observation weights extracted from each of the ntree trees and NOT by averaging predictions directly as in randomForest. See Schlosser et al. (2019), Hothorn et al. (2004), and Meinshausen (2006) for a description.
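
These aggregation weights can be inspected directly via the predict method; a minimal sketch:

## sketch: extract the observation weights driving the aggregation
df <- distforest(dist ~ speed, data = cars)
w <- predict(df, newdata = data.frame(speed = 15), type = "weights")
## per the 'type' argument above, w holds integer prediction weights,
## one per learning observation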

Predictions can be computed using predict. For observations with zero weights, predictions are computed from the fitted tree when newdata = NULL.
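
For example, using only the predict arguments documented above, in-sample and out-of-bag predictions for the learning data can be obtained as follows (a sketch):

## sketch: in-sample vs. out-of-bag parameter predictions
df <- distforest(dist ~ speed, data = cars)
p_fit <- predict(df, type = "parameter")
p_oob <- predict(df, type = "parameter", OOB = TRUE)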

Value

An object of class distforest.

References

Breiman L (2001). Random Forests. Machine Learning, 45(1), 5–32.

Hothorn T, Lausen B, Benner A, Radespiel-Troeger M (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.

Hothorn T, Bühlmann P, Dudoit S, Molinaro A, Van der Laan MJ (2006a). Survival Ensembles. Biostatistics, 7(3), 355–373.

Hothorn T, Hornik K, Zeileis A (2006b). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

Hothorn T, Zeileis A (2015). partykit: A Modular Toolkit for Recursive Partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.

Meinshausen N (2006). Quantile Regression Forests. Journal of Machine Learning Research, 7, 983–999.

Schlosser L, Hothorn T, Stauffer R, Zeileis A (2019). Distributional Regression Forests for Probabilistic Precipitation Forecasting in Complex Terrain. arXiv 1804.02921, arXiv.org E-Print Archive. http://arxiv.org/abs/1804.02921v3

Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25

Strobl C, Malley J, Tutz G (2009). An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests. Psychological Methods, 14(4), 323–348.

Examples

## basic example: distributional regression forest for cars data
library("disttree")
df <- distforest(dist ~ speed, data = cars)

## prediction of fitted mean and visualization
nd <- data.frame(speed = 4:25)
nd$mean  <- predict(df, newdata = nd, type = "response")[["(fitted.response)"]]
plot(dist ~ speed, data = cars)
lines(mean ~ speed, data = nd)
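
## predicted distributional parameters on the same grid
## (a sketch: the returned columns depend on the chosen family,
## e.g. location and scale parameters for the default NO() family)
head(predict(df, newdata = nd, type = "parameter"))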
