orf: Ordered Forest Estimator

orf  R Documentation

Ordered Forest Estimator

Description

An implementation of the Ordered Forest estimator as developed in Lechner & Okasa (2019). The Ordered Forest flexibly estimates the conditional probabilities of models with ordered categorical outcomes (so-called ordered choice models). Beyond the predictions offered by common machine learning algorithms, the orf package provides functions for estimating marginal effects and for conducting statistical inference on them, and thus produces output similar to that of standard econometric models for ordered choice. The core forest algorithm relies on the fast C++ forest implementation from the ranger package (Wright & Ziegler, 2017).

Usage

orf(
  X,
  Y,
  num.trees = 1000,
  mtry = NULL,
  min.node.size = NULL,
  replace = FALSE,
  sample.fraction = NULL,
  honesty = TRUE,
  honesty.fraction = NULL,
  inference = FALSE,
  importance = FALSE
)

Arguments

X

numeric matrix of features

Y

numeric vector of outcomes

num.trees

scalar, number of trees in a forest, i.e. bootstrap replications (default is 1000 trees)

mtry

scalar, number of randomly selected features (default is the square root of the number of features, rounded up to the nearest integer)

min.node.size

scalar, minimum node size, i.e. leaf size of a tree (default is 5 observations)

replace

logical, if TRUE, trees are grown on bootstrap samples drawn with replacement; otherwise subsampling without replacement is used (default is FALSE)

sample.fraction

scalar, subsampling rate (default is 1 for bootstrap and 0.5 for subsampling)

honesty

logical, if TRUE, an honest forest is built using sample splitting (default is TRUE)

honesty.fraction

scalar, share of observations belonging to the honest sample, which is not used for growing the forest (default is 0.5)

inference

logical, if TRUE, weight-based inference is conducted (default is FALSE)

importance

logical, if TRUE, the permutation-based variable importance measure is computed (default is FALSE)
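
To illustrate how replace, sample.fraction and honesty.fraction relate to each other, the following base R sketch partitions row indices for a hypothetical sample of 1000 observations. It is only an illustration of sample splitting followed by subsampling under the default fractions, not the package's internal code; the variable names n, honest_idx, train_idx and tree_idx are assumptions made for this example.

```r
set.seed(42)
n <- 1000                  # hypothetical sample size
honesty.fraction <- 0.5    # share of observations set aside as honest sample
sample.fraction <- 0.5     # subsampling rate used when replace = FALSE

idx <- seq_len(n)

# honest sample: held out from tree growing
honest_idx <- sample(idx, size = floor(honesty.fraction * n))

# remaining observations are available for growing the forest
train_idx <- setdiff(idx, honest_idx)

# subsample for a single tree, drawn without replacement
tree_idx <- sample(train_idx,
                   size = floor(sample.fraction * length(train_idx)),
                   replace = FALSE)

c(honest = length(honest_idx), train = length(train_idx), tree = length(tree_idx))
```

With these settings, each tree in the sketch sees a quarter of the full sample, which is one reason why honest forests trade some prediction accuracy for valid inference.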

Details

The Ordered Forest function, orf, estimates the conditional ordered choice probabilities, i.e. P[Y=m|X=x]. Additionally, weight-based inference for the probability predictions can be conducted. If inference is desired, the Ordered Forest must be estimated with honesty and subsampling. If only prediction is desired, estimation without honesty and with bootstrapping is recommended for optimal prediction performance.
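
As a rough illustration of how class probabilities P[Y=m|X=x] can be obtained from cumulative probabilities P[Y<=m|X=x], in the spirit of the construction described in Lechner & Okasa (2019), consider the base R sketch below. The cumulative predictions are hypothetical numbers chosen for this example rather than forest output, and the code is not taken from the package.

```r
# hypothetical cumulative predictions P[Y <= m | X = x] for 3 observations
# and M = 4 classes (columns m = 1, 2, 3; P[Y <= 4] = 1 is implied)
cum_pred <- rbind(
  c(0.20, 0.50, 0.90),
  c(0.10, 0.40, 0.35),   # not monotone, which separate estimates can produce
  c(0.60, 0.80, 0.95)
)

# difference adjacent cumulative probabilities to get class probabilities
probs <- cbind(cum_pred, 1) - cbind(0, cum_pred)

# truncate negative values at zero and rescale each row to sum to one
probs[probs < 0] <- 0
probs <- probs / rowSums(probs)

round(probs, 3)
```

The second row shows why the truncation and rescaling step is needed: differencing non-monotone cumulative predictions would otherwise yield a negative probability.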

To estimate the Ordered Forest, the user must supply the data in the form of a matrix of covariates X and a vector of outcomes Y to the orf function. These are the only inputs without defaults and must therefore be specified by the user. Further optional arguments include the classical forest hyperparameters: the number of trees, num.trees, the number of randomly selected features, mtry, and the minimum leaf size, min.node.size. The forest building scheme is regulated by the replace argument: bootstrapping is used if replace = TRUE and subsampling if replace = FALSE. In the case of subsampling, the sample.fraction argument regulates the subsampling rate. Further, an honest forest is estimated if the honesty argument is set to TRUE, which is also the default. Similarly, the fraction of the sample used for the honest estimation is regulated by the honesty.fraction argument. The default setting conducts a 50:50 sample split, which is also generally advised for optimal performance. The inference procedure of the Ordered Forest is based on the forest weights and is controlled by the inference argument. Note that such weight-based inference is computationally demanding due to the estimation of the forest weights, so longer computation time is to be expected. Lastly, the importance argument turns the permutation-based variable importance on and off.
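
Since X must be a numeric matrix and Y a numeric vector, data stored in a data frame usually needs to be converted first. The following base R sketch shows one way to do so, assuming an ordered factor outcome and a factor covariate; the data frame df and its columns rating, age and group are hypothetical and serve only as an illustration.

```r
# hypothetical data frame: ordered factor outcome plus mixed covariates
df <- data.frame(
  rating = factor(rep(c("low", "mid", "high"), length.out = 20),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  age    = seq(25, 63, by = 2),
  group  = factor(rep(c("a", "b"), times = 10))
)

# outcome: numeric codes 1, 2, 3 follow the order of the factor levels
Y <- as.numeric(df$rating)

# covariates: model.matrix expands factors into dummy columns;
# the intercept column is dropped
X <- model.matrix(~ age + group, data = df)[, -1, drop = FALSE]

str(Y)
head(X)
```

The resulting X and Y can then be passed to orf in the same way as in the examples below. Converting the outcome via its factor levels keeps the class ordering explicit rather than relying on alphabetical order.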

orf is compatible with standard R commands such as predict, margins, plot, summary and print. For further details, see examples below.
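
A minimal sketch of how these commands might be applied to a fitted object is given here. It assumes that the orf package and its odata example data are installed, uses a hypothetical 80:20 split into estimation and prediction samples, reduces num.trees to 500 purely to shorten the run time, and assumes that the object returned by predict stores the probabilities in an element named predictions, mirroring the fitted object.

```r
library(orf)

# load example data and build the inputs
data(odata)
Y <- as.numeric(odata[, 1])
X <- as.matrix(odata[, -1])

# hypothetical split into estimation and prediction samples
set.seed(123)
train <- sample(seq_len(nrow(X)), size = 0.8 * nrow(X))

# estimate the Ordered Forest on the estimation sample
orf_fit <- orf(X[train, ], Y[train], num.trees = 500)

# inspect the fitted object
print(orf_fit)
summary(orf_fit)
plot(orf_fit)

# predict class probabilities for the held-out observations
orf_pred <- predict(orf_fit, newdata = X[-train, ])
head(orf_pred$predictions)

# estimate marginal effects with the default settings
orf_marg <- margins(orf_fit)
print(orf_marg)
```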

Value

object of type orf with the following elements

forests

saved forests trained for orf estimations (inherited from ranger)

info

info containing forest inputs and data used

predictions

predicted values for class probabilities

variances

variances of predicted values

importance

weighted measure of permutation based variable importance

accuracy

out-of-bag (oob) measures for mean squared error and ranked probability score
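
To give an idea of what the two accuracy measures capture, the base R sketch below computes a mean squared error and a ranked probability score from hypothetical predicted probabilities and outcomes. The scaling used here (squared errors summed over classes, and the ranked probability score divided by M - 1) is an assumption of this example and may differ from the normalization used inside the package.

```r
# hypothetical predicted class probabilities (rows sum to one) and outcomes
probs <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.1, 0.6, 0.3),
  c(0.2, 0.3, 0.5)
)
Y <- c(1, 2, 3)
M <- ncol(probs)

# indicator matrix of observed classes (one row per observation)
obs <- diag(M)[Y, , drop = FALSE]

# mean squared error: squared differences summed over classes,
# averaged over observations
mse <- mean(rowSums((probs - obs)^2))

# ranked probability score: same idea applied to cumulative distributions,
# scaled by 1 / (M - 1) in this sketch
cum_probs <- t(apply(probs, 1, cumsum))
cum_obs   <- t(apply(obs, 1, cumsum))
rps <- mean(rowSums((cum_probs - cum_obs)^2) / (M - 1))

c(mse = mse, rps = rps)
```

Lower values indicate better predictions for both measures. Unlike the mean squared error, the ranked probability score penalizes probability mass more heavily the further it lies from the observed class, which respects the ordering of the outcome.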

Author(s)

Gabriel Okasa

References

  • Lechner, M., & Okasa, G. (2019). Random Forest Estimation of the Ordered Choice Model. arXiv preprint arXiv:1907.02436. https://arxiv.org/abs/1907.02436

  • Goller, D., Knaus, M. C., Lechner, M., & Okasa, G. (2021). Predicting Match Outcomes in Football by an Ordered Forest Estimator. A Modern Guide to Sports Economics. Edward Elgar Publishing, 335-355. doi: 10.4337/9781789906530.00026

  • Wright, M. N., & Ziegler, A. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi: 10.18637/jss.v077.i01

See Also

summary.orf, plot.orf, predict.orf, margins.orf

Examples

## Ordered Forest
library(orf)

# load example data
data(odata)

# specify response and covariates
Y <- as.numeric(odata[, 1])
X <- as.matrix(odata[, -1])

# estimate Ordered Forest with default parameters
orf_fit <- orf(X, Y)

# estimate Ordered Forest with own tuning parameters
orf_fit <- orf(X, Y, num.trees = 2000, mtry = 3, min.node.size = 10)

# estimate Ordered Forest with bootstrapping and without honesty
orf_fit <- orf(X, Y, replace = TRUE, honesty = FALSE)

# estimate Ordered Forest with subsampling and with honesty
orf_fit <- orf(X, Y, replace = FALSE, honesty = TRUE)

# estimate Ordered Forest with subsampling and with honesty
# with own tuning for subsample fraction and honesty fraction
orf_fit <- orf(X, Y, replace = FALSE, sample.fraction = 0.5,
                     honesty = TRUE, honesty.fraction = 0.5)

# estimate Ordered Forest with subsampling and with honesty and with inference
# (for inference, subsampling and honesty are required)
orf_fit <- orf(X, Y, replace = FALSE, honesty = TRUE, inference = TRUE)

# estimate Ordered Forest with simple variable importance measure
orf_fit <- orf(X, Y, importance = TRUE)

# estimate Ordered Forest with all custom settings
orf_fit <- orf(X, Y, num.trees = 2000, mtry = 3, min.node.size = 10,
                     replace = TRUE, sample.fraction = 1,
                     honesty = FALSE, honesty.fraction = 0,
                     inference = FALSE, importance = FALSE)



orf documentation built on July 24, 2022, 1:05 a.m.