rfint: rfint()
In piRF: Prediction Intervals for Random Forests


Implements seven different random forest prediction interval methods.

rfint(
  formula = formula,
  train_data = NULL,
  test_data = NULL,
  method = "Zhang",
  alpha = 0.1,
  symmetry = TRUE,
  seed = NULL,
  m_try = 2,
  num_trees = 500,
  min_node_size = 5,
  num_threads = parallel::detectCores(),
  calibrate = FALSE,
  Roy_method = "quantile",
  featureBias = FALSE,
  predictionBias = TRUE,
  Tung_R = 5,
  Tung_num_trees = 75,
  variant = 1,
  Ghosal_num_stages = 2,
  prop = 0.618,
  concise = TRUE,
  interval_type = "two-sided"
)

`formula`	Object of class formula or character describing the model to fit. Interaction terms supported only for numerical variables.
`train_data`	Training data of class data.frame.
`test_data`	Test data of class data.frame. Prediction intervals for the test data are produced with the `predict()` method for ranger objects.
`method`	Method used to generate RF prediction intervals. Options are `method = c("Zhang", "quantile", "Romano", "Ghosal", "Roy", "Tung", "HDI")`. Defaults to `method = "Zhang"`.
`alpha`	Significance level for prediction intervals. Defaults to `alpha = 0.1`.
`symmetry`	`TRUE` if constructing symmetric out-of-bag prediction intervals, `FALSE` otherwise. Used only for `method = "Zhang"`. Defaults to `symmetry = TRUE`.
`seed`	Seed for random number generation. Currently not utilized.
`m_try`	Number of variables to randomly select from at each split.
`num_trees`	Number of trees used in the random forest.
`min_node_size`	Minimum number of observations before split at a node.
`num_threads`	The number of threads to use in parallel. Default is the current number of cores.
`calibrate`	If `calibrate = TRUE`, intervals are calibrated to achieve nominal coverage. Currently uses quantiles to calibrate. Only for `method = "Roy"`.
`Roy_method`	Interval method for `method = "Roy"`. Options are `Roy_method = c("quantile", "HDI", "CHDI")`.
`featureBias`	Remove feature bias. Only for `method = "Tung"`.
`predictionBias`	Remove prediction bias. Only for `method = "Tung"`.
`Tung_R`	Number of repetitions used in bias removal. Only for `method = "Tung"`.
`Tung_num_trees`	Number of trees used in bias removal. Only for `method = "Tung"`.
`variant`	Choose which variant to use. Options are `variant = c("1", "2")`. Only for `method = "Ghosal"`.
`Ghosal_num_stages`	Number of total stages. Only for `method = "Ghosal"`.
`prop`	Proportion of training data to sample for each tree. Only for `method = "Ghosal"`.
`concise`	If `concise = TRUE`, only prediction intervals are output. Defaults to `concise = TRUE`.
`interval_type`	Type of prediction interval to generate. Options are `interval_type = c("two-sided", "lower", "upper")`. Default is `interval_type = "two-sided"`.

The seven methods implemented are cited in the References section, and additional information can be found in those references. Each method is implemented using the ranger package. `method = "Zhang"` generates prediction intervals from out-of-bag residuals. `method = "Romano"` uses a split-conformal approach. `method = "Roy"` uses a bag-of-predictors approach. `method = "Ghosal"` performs boosting to reduce bias in the random forest and estimates the variance; the authors provide multiple variants of their methodology (see `variant`). `method = "Tung"` debiases feature selection and prediction, then generates prediction intervals using quantile regression forests. `method = "HDI"` delivers prediction intervals through highest-density interval regression forests. `method = "quantile"` uses quantile regression forests.
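To make the out-of-bag residual idea behind `method = "Zhang"` concrete, the following is a minimal base-R sketch, not the piRF implementation. It assumes bagged linear models as a stand-in for a random forest, and the helper `oob_interval()`, its argument `B`, and the simulated data are hypothetical names introduced only for illustration. The `symmetry` switch mirrors the argument of the same name: the symmetric form uses a quantile of the absolute residuals, the asymmetric form uses the lower and upper residual quantiles.

```r
# Sketch only: bagged lm() fits stand in for a random forest.
# oob_interval(), B, and the simulated data are hypothetical, not part of piRF.
set.seed(1)
n <- 200
x <- runif(n)
dat <- data.frame(x = x, y = 2 * x + rnorm(n, sd = 0.3))

oob_interval <- function(dat, newdata, alpha = 0.1, symmetry = TRUE, B = 100) {
  n <- nrow(dat)
  oob_sum <- numeric(n)              # running sum of out-of-bag predictions
  oob_cnt <- numeric(n)              # times each row was out of bag
  new_sum <- numeric(nrow(newdata))  # running sum of predictions for newdata
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)   # bootstrap sample
    fit <- lm(y ~ x, data = dat[idx, ])
    oob <- setdiff(seq_len(n), idx)           # rows not drawn in this sample
    oob_sum[oob] <- oob_sum[oob] + unname(predict(fit, newdata = dat[oob, ]))
    oob_cnt[oob] <- oob_cnt[oob] + 1
    new_sum <- new_sum + unname(predict(fit, newdata = newdata))
  }
  keep <- oob_cnt > 0
  # out-of-bag residual: response minus the average out-of-bag prediction
  resid <- dat$y[keep] - oob_sum[keep] / oob_cnt[keep]
  pred <- new_sum / B
  if (symmetry) {
    q <- unname(quantile(abs(resid), 1 - alpha))
    cbind(lower = pred - q, upper = pred + q)
  } else {
    q <- unname(quantile(resid, c(alpha / 2, 1 - alpha / 2)))
    cbind(lower = pred + q[1], upper = pred + q[2])
  }
}

new_x <- data.frame(x = c(0.25, 0.75))
pi_sym  <- oob_interval(dat, new_x, alpha = 0.1, symmetry = TRUE)
pi_asym <- oob_interval(dat, new_x, alpha = 0.1, symmetry = FALSE)
```

In this sketch every test point receives an interval of the same width, because the residual quantiles are computed once from the training data and then shifted by each point prediction.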

`int`	Default output. Includes prediction intervals for all methods in `method`.
`preds`	Predictions for test data for all methods in `method`. Output when `concise = FALSE`.

Chancellor Johnstone

Haozhe Zhang

\insertRef{breiman2001random}{piRF}

\insertRef{ghosal2018boosting}{piRF}

\insertRef{meinshausen2006quantile}{piRF}

\insertRef{romano2019conformalized}{piRF}

\insertRef{roy2019prediction}{piRF}

\insertRef{tung2014bias}{piRF}

\insertRef{zhang2019random}{piRF}

\insertRef{zhu2019hdi}{piRF}

ranger

rfinterval

library(piRF)

#functions to get average length and average coverage of output
getPILength <- function(x){
  #average PI length across each set of predictions
  l <- x[,2] - x[,1]
  avg_l <- mean(l)
  return(avg_l)
}

getCoverage <- function(x, response){
  #output coverage for test data
  coverage <- sum((response >= x[,1]) * (response <= x[,2]))/length(response)
  return(coverage)
}

#import airfoil self noise dataset
data(airfoil)
method_vec <- c("quantile", "Zhang", "Tung", "Romano", "Roy", "HDI", "Ghosal")
#generate train and test data
ratio <- .975
n_total <- nrow(airfoil)
n <- floor(n_total*ratio)
samp <- sample(1:n_total, size = n)
train <- airfoil[samp,]
test <- airfoil[-samp,]

#generate prediction intervals
res <- rfint(pressure ~ . , train_data = train, test_data = test,
             method = method_vec,
             concise= FALSE,
             num_threads = 1)

#empirical coverage, and average prediction interval length for each method
coverage <- sapply(res$int, FUN = getCoverage, response = test$pressure)
coverage
pi_length <- sapply(res$int, FUN = getPILength)
pi_length

#set a 2x2 plotting layout and save the previous par settings
opar <- par(mfrow = c(2,2))

#plotting intervals and predictions
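for(i in seq_along(method_vec)){
   #palette color 1 when the interval covers the true response, 2 otherwise
   covered <- (test$pressure >= res$int[[i]][,1]) &
      (test$pressure <= res$int[[i]][,2])
   col <- ifelse(covered, 1, 2)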
   plot(x = res$preds[[i]], y = test$pressure, pch = 20,
      col = "black", ylab = "true", xlab = "predicted", main = method_vec[i])
   abline(a = 0, b = 1)
   segments(x0 = res$int[[i]][,1], x1 = res$int[[i]][,2],
      y1 = test$pressure, y0 = test$pressure, lwd = 1, col = col)
}
par(opar)