quantForestError: Quantify random forest prediction error
In forestError: A Unified Framework for Random Forest Prediction Error Estimation


View source: R/quantforesterror.R

Estimates the conditional misclassification rates, conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions of random forest predictions.

quantForestError(
  forest,
  X.train,
  X.test,
  Y.train = NULL,
  what = if (grepl("class", c(forest$type, forest$family, forest$treetype), TRUE))
    "mcr" else c("mspe", "bias", "interval", "p.error", "q.error"),
  alpha = 0.05,
  train_nodes = NULL,
  return_train_nodes = FALSE,
  n.cores = 1
)

`forest`	The random forest object being used for prediction.
`X.train`	A `matrix` or `data.frame` with the observations that were used to train `forest`. Each row should be an observation, and each column should be a predictor variable.
`X.test`	A `matrix` or `data.frame` with the observations to be predicted; each row should be an observation, and each column should be a predictor variable.
`Y.train`	A vector of the responses of the observations that were used to train `forest`. Required if `forest` was created using `ranger`, but not if `forest` was created using `randomForest`, `randomForestSRC`, or `quantregForest`.
`what`	A character vector indicating which estimates are desired. Possible options are conditional mean squared prediction errors (`"mspe"`), conditional biases (`"bias"`), conditional prediction intervals (`"interval"`), conditional error distribution functions (`"p.error"`), conditional error quantile functions (`"q.error"`), and conditional misclassification rates (`"mcr"`). Note that the conditional misclassification rate is available only for categorical outcomes, while the other estimates are available only for real-valued outcomes.
`alpha`	A vector of type-I error rates desired for the conditional prediction intervals; required if `"interval"` is included in `what`.
`train_nodes`	A `data.table` indicating what out-of-bag prediction errors each terminal node of each tree in `forest` contains. It should be formatted like the output of `findOOBErrors`. If not provided, it will be computed internally.
`return_train_nodes`	A logical value indicating whether to return the `train_nodes` computed and/or used.
`n.cores`	Number of cores to use (for parallel computation in `ranger`).
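The default for `what` depends on whether any of the fields `type`, `family`, or `treetype` of `forest` contains the string "class", ignoring case. The sketch below reproduces that check in base R on stand-in lists rather than fitted forests. The helper `default_what`, the mock objects, and their field values are illustrative assumptions, not part of the package; the `any()` wrapper is also an addition, so that the condition stays length one.

```r
# Hypothetical helper mirroring the default expression for `what`.
default_what <- function(forest) {
  # Fields that are absent are NULL and are dropped by c().
  fields <- c(forest$type, forest$family, forest$treetype)
  # any() is an addition here; it keeps the condition a single TRUE/FALSE.
  if (any(grepl("class", fields, ignore.case = TRUE))) {
    "mcr"
  } else {
    c("mspe", "bias", "interval", "p.error", "q.error")
  }
}

# Mock objects standing in for fitted forests (assumed field values).
mock_classification <- list(type = "classification")
mock_regression <- list(treetype = "Regression")

default_what(mock_classification)
default_what(mock_regression)
```

Passing `what` explicitly bypasses this check and is the safer choice when the forest object comes from a wrapper that may not set these fields.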

This function accepts classification or regression random forests built using the randomForest, ranger, randomForestSRC, and quantregForest packages. When training the random forest using randomForest, ranger, or quantregForest, keep.inbag must be set to TRUE. When training the random forest using randomForestSRC, membership must be set to TRUE.

When forest was built using ranger, its predictions can be computed in parallel by setting n.cores to a value greater than 1.
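The requirements above can be illustrated with a ranger forest. This is a minimal sketch, assuming the forestError and ranger packages are installed and that two cores are available; the object names (aq, train, out) and the 80/20 split are choices made for illustration only.

```r
library(forestError)

# keep complete cases; Ozone (column 1) is the response
aq <- airquality[complete.cases(airquality), ]
set.seed(1)
train <- sample(nrow(aq), round(0.8 * nrow(aq)))
Xtrain <- aq[train, -1]
Ytrain <- aq[train, 1]
Xtest <- aq[-train, -1]

# keep.inbag = TRUE must be set when training with ranger
rf <- ranger::ranger(Ozone ~ ., data = aq[train, ],
                     num.trees = 500, keep.inbag = TRUE)

# Y.train is required for ranger forests; n.cores = 2 requests
# parallel computation of the ranger predictions
out <- quantForestError(rf, Xtrain, Xtest, Y.train = Ytrain,
                        what = c("mspe", "interval"),
                        n.cores = 2)
head(out)
```

Because neither "p.error" nor "q.error" is requested and return_train_nodes is left at FALSE, the result here should be a plain data.frame.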

The random forest predictions are always returned as a data.frame. Additional columns are included depending on the values supplied in the argument what: "mspe" adds a column with the conditional mean squared prediction error of each test prediction; "bias" adds a column with the conditional bias of each test prediction; "interval" adds columns with the lower and upper bounds of the conditional prediction intervals for each test prediction; and "mcr" adds a column with the conditional misclassification rate of each test prediction. The conditional misclassification rate can be estimated only for classification random forests, while the other parameters can be estimated only for regression random forests.

If "p.error" or "q.error" is included in what, or if return_train_nodes is set to TRUE, then a list will be returned as output. The first element of the list, named "estimates", is the data.frame described in the above paragraph. The other elements of the list are the estimated cumulative distribution functions (perror) of the conditional error distributions, the estimated quantile functions (qerror) of the conditional error distributions, and/or a data.table indicating what out-of-bag prediction errors each terminal node of each tree in the random forest contains.
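Because the return type depends on the arguments, downstream code may need to handle both shapes. The following base R sketch uses a hypothetical helper, get_estimates, and mock return values with made-up numbers in place of real quantForestError output; the placeholder functions do not reflect the real perror and qerror.

```r
# Hypothetical helper: pull the estimates data.frame out of either return shape.
get_estimates <- function(out) {
  # A data.frame is itself a list, so test for data.frame first.
  if (is.data.frame(out)) out else out$estimates
}

# Mock return values (made-up numbers) standing in for quantForestError output.
df_only <- data.frame(pred = c(30.1, 42.7), mspe = c(120.5, 310.2))
as_list <- list(estimates = df_only,
                perror = function(...) NULL,  # placeholder only
                qerror = function(...) NULL)  # placeholder only

# Both shapes yield the same data.frame of estimates.
identical(get_estimates(df_only), get_estimates(as_list))
```

Checking is.data.frame before indexing by name avoids mistaking a data.frame without an estimates column for the list form.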

A data.frame with one or more of the following columns, as described in the details section:

`pred`	The random forest predictions of the test observations
`mspe`	The estimated conditional mean squared prediction errors of the random forest predictions
`bias`	The estimated conditional biases of the random forest predictions
`lower_alpha`	The estimated lower bounds of the conditional alpha-level prediction intervals for the test observations
`upper_alpha`	The estimated upper bounds of the conditional alpha-level prediction intervals for the test observations
`mcr`	The estimated conditional misclassification rate of the random forest predictions

In addition, one or both of the following functions, as described in the details section:

`perror`	The estimated cumulative distribution functions of the conditional error distributions associated with the test predictions
`qerror`	The estimated quantile functions of the conditional error distributions associated with the test predictions

In addition, if return_train_nodes is TRUE, a data.table called train_nodes is returned, indicating which out-of-bag prediction errors each terminal node of each tree in forest contains.
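Since alpha accepts a vector, several interval levels can be requested in one call. The sketch below assumes the forestError and randomForest packages are installed; the object names are illustrative. It prints the resulting column names rather than asserting them, on the expectation that the alpha in lower_alpha and upper_alpha is replaced by each requested value.

```r
library(forestError)

# keep complete cases; Ozone (column 1) is the response
aq <- airquality[complete.cases(airquality), ]
set.seed(1)
train <- sample(nrow(aq), round(0.8 * nrow(aq)))

# keep.inbag = TRUE must be set when training with randomForest
rf <- randomForest::randomForest(aq[train, -1], aq[train, 1],
                                 ntree = 500, keep.inbag = TRUE)

# request 95% and 90% prediction intervals in a single call
out <- quantForestError(rf, aq[train, -1], aq[-train, -1],
                        what = "interval", alpha = c(0.05, 0.1))

# inspect which interval columns were produced
names(out)
```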

Benjamin Lu <b.lu@berkeley.edu>; Johanna Hardin <jo.hardin@pomona.edu>

perror, qerror, findOOBErrors

# load the package and data
library(forestError)
data(airquality)

# remove observations with missing values (predictors or response)
airquality <- airquality[complete.cases(airquality), ]

# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1

# split data into training and test sets
# (the seed is set only to make the random split reproducible)
set.seed(1)
train.ind <- sample(c("A", "B", "C"), n,
                    replace = TRUE, prob = c(0.8, 0.1, 0.1))
Xtrain <- airquality[train.ind == "A", -response.col]
Ytrain <- airquality[train.ind == "A", response.col]
Xtest1 <- airquality[train.ind == "B", -response.col]
Xtest2 <- airquality[train.ind == "C", -response.col]

# fit regression random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500,
                                 keep.inbag = TRUE)

# estimate conditional mean squared prediction errors,
# biases, prediction intervals, and error distribution
# and quantile functions for the observations in Xtest1.
# return train_nodes to avoid recomputation in the next
# call.
output1 <- quantForestError(rf, Xtrain, Xtest1,
                            return_train_nodes = TRUE)

# estimate just the conditional mean squared prediction errors
# and prediction intervals for the observations in Xtest2.
# avoid recomputation by providing train_nodes from the
# previous line of code.
output2 <- quantForestError(rf, Xtrain, Xtest2,
                            what = c("mspe", "interval"),
                            train_nodes = output1$train_nodes)

# for illustrative purposes, convert response to categorical
Ytrain <- as.factor(Ytrain > 31.5)

# fit classification random forest to the training data
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 3,
                                 ntree = 500,
                                 keep.inbag = TRUE)

# estimate conditional misclassification rate of the
# predictions of Xtest1
output <- quantForestError(rf, Xtrain, Xtest1)