Out-of-bag predictions and evaluation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 5, 
  fig.width = 7
)
library(aorsf)
library(survival)
library(SurvMetrics)

Out-of-bag data

In random forests, each tree is grown with a bootstrapped version of the training set. Because bootstrap samples are selected with replacement, each bootstrapped training set contains about two-thirds of instances in the original training set. The 'out-of-bag' data are instances that are not in the bootstrapped training set.

Out-of-bag predictions and error

Each tree in the random forest can make predictions for its out-of-bag data, and the out-of-bag predictions can be aggregated to make an ensemble out-of-bag prediction. Since the out-of-bag data are not used to grow the tree, the accuracy of the ensemble out-of-bag predictions approximate the generalization error of the random forest. Out-of-bag prediction error plays a central role for some routines that estimate variable importance, e.g. negation importance.

Let's fit an oblique random survival forest and plot the distribution of the ensemble out-of-bag predictions.

fit <- orsf(data = pbc_orsf, 
            formula = Surv(time, status) ~ . - id,
            oobag_pred_type = 'surv',
            n_tree = 5,
            oobag_pred_horizon = 2000)

hist(fit$pred_oobag, 
     main = 'Ensemble out-of-bag survival predictions at t=3,500')

Not surprisingly, all of the survival predictions are between 0 and 1. Next, let's check the out-of-bag accuracy of fit:

# what function is used to evaluate out-of-bag predictions?
fit$eval_oobag$stat_type

# what is the output from this function?
fit$eval_oobag$stat_values

The out-of-bag estimate of r fit$eval_oobag$stat_type (the default method to evaluate out-of-bag predictions) is r as.numeric(fit$eval_oobag$stat_values).

Monitoring out-of-bag error

As each out-of-bag data set contains about one-third of the training set, the out-of-bag error estimate usually converges to a stable value as more trees are added to the forest. If you want to monitor the convergence of out-of-bag error for your own oblique random survival forest, you can set oobag_eval_every to compute out-of-bag error at every oobag_eval_every tree. For example, let's compute out-of-bag error after fitting each tree in a forest of 50 trees:

fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id,
            n_tree = 20,
            tree_seeds = 2,
            oobag_pred_type = 'surv',
            oobag_pred_horizon = 2000,
            oobag_eval_every = 1)

plot(
 x = seq(1, 20, by = 1),
 y = fit$eval_oobag$stat_values, 
 main = 'Out-of-bag C-statistic computed after each new tree is grown.',
 xlab = 'Number of trees grown',
 ylab = fit$eval_oobag$stat_type
)

lines(x=seq(1, 20), y = fit$eval_oobag$stat_values)

In general, at least 500 trees are recommended for a random forest fit. We're just using 10 for illustration.

User-supplied out-of-bag evaluation functions

In some cases, you may want more control over how out-of-bag error is estimated. For example, let's use the Brier score from the SurvMetrics package:

oobag_fun_brier <- function(y_mat, w_vec, s_vec){

 # output is numeric vector of length 1
 as.numeric(
  SurvMetrics::Brier(
   object = Surv(time = y_mat[, 1], event = y_mat[, 2]), 
   pre_sp = s_vec,
   # t_star in Brier() should match oob_pred_horizon in orsf()
   t_star = 2000
  )
 )

}

There are two ways to apply your own function to compute out-of-bag error. First, you can apply your function to the out-of-bag survival predictions that are stored in 'aorsf' objects, e.g:

oobag_fun_brier(y_mat = pbc_orsf[,c('time', 'status')],
                s_vec = fit$pred_oobag)

Second, you can pass your function into orsf(), and it will be used in place of Harrell's C-statistic:

fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id,
            n_tree = 20,
            tree_seeds = 2,
            oobag_pred_horizon = 2000,
            oobag_fun = oobag_fun_brier,
            oobag_eval_every = 1)

plot(
 x = seq(1, 20, by = 1),
 y = fit$eval_oobag$stat_values, 
 main = 'Out-of-bag error computed after each new tree is grown.',
 sub = 'For the Brier score, lower values indicate more accurate predictions',
 xlab = 'Number of trees grown',
 ylab = "Brier score"
)

lines(x=seq(1, 20), y = fit$eval_oobag$stat_values)

Specific instructions on user-supplied functions

User-supplied functions must:

  1. have exactly three arguments named y_mat, w_vec, and s_vec.
  2. return a numeric output of length 1

If either of these conditions is not true, an error will occur. A simple test to make sure your user-supplied function will work with the aorsf package is below:

# Helper code to make sure your oobag_fun function will work with aorsf

# time and status values
test_time <- seq(from = 1, to = 5, length.out = 100)
test_status <- rep(c(0,1), each = 50)

# y-matrix is presumed to contain time and status (with column names)
y_mat <- cbind(time = test_time, status = test_status)
# s_vec is presumed to be a vector of survival probabilities
s_vec <- seq(0.9, 0.1, length.out = 100)

# see 1 in the checklist above
names(formals(oobag_fun_brier)) == c("y_mat", "w_vec", "s_vec")

test_output <- oobag_fun_brier(y_mat = y_mat, 
                               w_vec = w_vec,
                               s_vec = s_vec)

# test output should be numeric
is.numeric(test_output)
# test_output should be a numeric value of length 1
length(test_output) == 1

Notes

When evaluating out-of-bag error:



Try the aorsf package in your browser

Any scripts or data that you put into this service are public.

aorsf documentation built on Oct. 26, 2023, 5:08 p.m.