orsf: Oblique Random Survival Forest (ORSF)

View source: R/orsf.R

orsfR Documentation

Oblique Random Survival Forest (ORSF)

Description

Fit an oblique random survival forest

Usage

orsf(
  data,
  formula,
  control = orsf_control_fast(),
  weights = NULL,
  n_tree = 500,
  n_split = 5,
  n_retry = 3,
  n_thread = 1,
  mtry = NULL,
  sample_with_replacement = TRUE,
  sample_fraction = 0.632,
  leaf_min_events = 1,
  leaf_min_obs = 5,
  split_rule = "logrank",
  split_min_events = 5,
  split_min_obs = 10,
  split_min_stat = switch(split_rule, logrank = 3.841459, cstat = 0.5),
  oobag_pred_type = "surv",
  oobag_pred_horizon = NULL,
  oobag_eval_every = n_tree,
  oobag_fun = NULL,
  importance = "anova",
  group_factors = TRUE,
  tree_seeds = NULL,
  attach_data = TRUE,
  no_fit = FALSE,
  na_action = "fail",
  verbose_progress = FALSE,
  ...
)

orsf_train(object)

Arguments

data

a data.frame, tibble, or data.table that contains the relevant variables.

formula

(formula) The response on the left hand side should include a time variable, followed by a status variable, and may be written inside a call to Surv (see examples). The terms on the right are names of predictor variables.

control

(orsf_control) An object returned from one of the orsf_control functions:

  • orsf_control_fast (the default) uses a single iteration of Newton Raphson scoring to identify a linear combination of predictors.

  • orsf_control_cph uses Newton Raphson scoring until a convergence criteria is met.

  • orsf_control_net uses glmnet to identify linear combinations of predictors, similar to Jaeger (2019).

  • orsf_control_custom allows the user to apply their own function to create linear combinations of predictors.

weights

(numeric vector) Optional. If given, this input should have length equal to nrow(data). Values in weights are treated like replication weights, i.e., a value of 2 is the same thing as having 2 observations in data, each containing a copy of the corresponding person's data.

Use weights cautiously, as orsf will count the number of observations and events prior to growing a node for a tree, so higher values in weights will lead to deeper trees.

n_tree

(integer) the number of trees to grow. Default is n_tree = 500.

n_split

(integer) the number of cut-points assessed when splitting a node in decision trees. Default is n_split = 5.

n_retry

(integer) when a node can be split, but the current linear combination of inputs is unable to provide a valid split, orsf will try again with a new linear combination based on a different set of randomly selected predictors, up to n_retry times. Default is n_retry = 3. Set n_retry = 0 to prevent any retries.

n_thread

(integer) number of threads to use while growing trees, computing predictions, and computing importance. Default is one thread. To use the maximum number of threads that your system provides for concurrent execution, set n_thread = 0.

mtry

(integer) Number of predictors randomly included as candidates for splitting a node. The default is the smallest integer greater than the square root of the number of total predictors, i.e., ⁠mtry = ceiling(sqrt(number of predictors))⁠

sample_with_replacement

(logical) If TRUE (the default), observations are sampled with replacement when an in-bag sample is created for a decision tree. If FALSE, observations are sampled without replacement and each tree will have an in-bag sample containing sample_fraction% of the original sample.

sample_fraction

(double) the proportion of observations that each trees' in-bag sample will contain, relative to the number of rows in data. Only used if sample_with_replacement is FALSE. Default value is 0.632.

leaf_min_events

(integer) minimum number of events in a leaf node. Default is leaf_min_events = 1

leaf_min_obs

(integer) minimum number of observations in a leaf node. Default is leaf_min_obs = 5.

split_rule

(character) how to assess the quality of a potential splitting rule for a node. Valid options are

  • 'logrank' : a log-rank test statistic.

  • 'cstat' : Harrell's concordance statistic.

split_min_events

(integer) minimum number of events required in a node to consider splitting it. Default is split_min_events = 5

split_min_obs

(integer) minimum number of observations required in a node to consider splitting it. Default is split_min_obs = 10.

split_min_stat

(double) minimum test statistic required to split a node. Default is 3.841459 if split_rule = 'logrank' and 0.50 if split_rule = 'cstat'. If no splits are found with a statistic exceeding split_min_stat, the given node either becomes a leaf or a retry occurs (up to n_retry retries).

oobag_pred_type

(character) The type of out-of-bag predictions to compute while fitting the ensemble. Valid options are

  • 'none' : don't compute out-of-bag predictions

  • 'risk' : probability of event occurring at or before oobag_pred_horizon.

  • 'surv' : 1 - risk.

  • 'chf' : cumulative hazard function at oobag_pred_horizon.

  • 'mort' : mortality, i.e., the number of events expected if all observations in the training data were identical to a given observation.

oobag_pred_horizon

(numeric) A numeric value indicating what time should be used for out-of-bag predictions. Default is the median of the observed times, i.e., oobag_pred_horizon = median(time).

oobag_eval_every

(integer) The out-of-bag performance of the ensemble will be checked every oobag_eval_every trees. So, if oobag_eval_every = 10, then out-of-bag performance is checked after growing the 10th tree, the 20th tree, and so on. Default is oobag_eval_every = n_tree.

oobag_fun

(function) to be used for evaluating out-of-bag prediction accuracy every oobag_eval_every trees. When oobag_fun = NULL (the default), Harrell's C-statistic (1982) is used to evaluate accuracy. if you use your own oobag_fun note the following:

  • oobag_fun should have two inputs: y_mat and s_vec

  • y_mat is a two column matrix with first column named 'time', second named 'status'

  • s_vec is a numeric vector containing predicted survival probabilities.

  • oobag_fun should return a numeric output of length 1

For more details, see the out-of-bag vignette.

importance

(character) Indicate method for variable importance:

  • 'none': no variable importance is computed.

  • 'anova': compute analysis of variance (ANOVA) importance

  • 'negate': compute negation importance

  • 'permute': compute permutation importance

For details on these methods, see orsf_vi.

group_factors

(logical) Only relevant if variable importance is being estimated. if TRUE, the importance of factor variables will be reported overall by aggregating the importance of individual levels of the factor. If FALSE, the importance of individual factor levels will be returned.

tree_seeds

(integer vector) Optional. if specified, random seeds will be set using the values in tree_seeds[i] before growing tree i. Two forests grown with the same number of trees and the same seeds will have the exact same out-of-bag samples, making out-of-bag error estimates of the forests more comparable. If NULL (the default), no seeds are set during the training process.

attach_data

(logical) if TRUE, a copy of the training data will be attached to the output. This is helpful if you plan on using functions like orsf_pd_oob or orsf_summarize_uni to interpret the forest using its training data. Default is TRUE.

no_fit

(logical) if TRUE, model fitting steps are defined and saved, but training is not initiated. The object returned can be directly submitted to orsf_train() so long as attach_data is TRUE.

na_action

(character) what should happen when data contains missing values (i.e., NA values). Valid options are:

  • 'fail' : an error is thrown if data contains NA values

  • 'omit' : rows in data with incomplete data will be dropped

  • 'impute_meanmode' : missing values for continuous and categorical variables in data will be imputed using the mean and mode, respectively. Note that is this option is selected and attach_data is TRUE, the data attached to the output will be the imputed version of data.

verbose_progress

(logical) if TRUE, progress messages are printed in the console. If FALSE (the default), nothing is printed.

...

Further arguments passed to or from other methods (not currently used).

object

an untrained 'aorsf' object, created by setting no_fit = TRUE in orsf().

Details

This function is based on and similar to the ORSF function in the obliqueRSF R package. The primary difference is that this function runs much faster. The speed increase is attributable to better management of memory (i.e., no unnecessary copies of inputs) and using a Newton Raphson scoring algorithm to identify linear combinations of inputs rather than performing penalized regression using routines in glmnet.The modified Newton Raphson scoring algorithm that this function applies is an adaptation of the C++ routine developed by Terry M. Therneau that fits Cox proportional hazards models (see survival::coxph() and more specifically survival::coxph.fit()).

Value

an accelerated oblique RSF object (aorsf)

Details on inputs

formula:

  • The response in formula can be a survival object as returned by the Surv function, but can also just be the time and status variables. I.e., Surv(time, status) ~ . works just like time + status ~ .

  • A . symbol on the right hand side is short-hand for using all variables in data (omitting those on the left hand side of formula) as predictors.

  • The order of variables in the left hand side matters. i.e., writing status + time ~ . will make orsf assume your status variable is actually the time variable.

  • The response variable can be a survival object stored in data. For example, y ~ . is a valid formula if data$y inherits from the Surv class.

  • Although you can fit an oblique random survival forest with 1 predictor variable, your formula should have at least 2 predictors. The reason for this recommendation is that a linear combination of predictors is trivial if there is only one predictor.

mtry:

The mtry parameter may be temporarily reduced to ensure there are at least 2 events per predictor variable. This occurs when using orsf_control_cph because coefficients in the Newton Raphson scoring algorithm may become unstable when the number of covariates is greater than or equal to the number of events. This reduction does not occur when using orsf_control_net.

oobag_fun:

If oobag_fun is specified, it will be used in to compute negation importance or permutation importance, but it will not have any role for ANOVA importance.

n_thread:

If an R function must be called from C++ (i.e., user-supplied function to compute out-of-bag error or identify linear combinations of variables), n_thread will automatically be set to 1 because attempting to run R functions in multiple threads will cause the R session to crash.

What is an oblique decision tree?

Decision trees are developed by splitting a set of training data into two new subsets, with the goal of having more similarity within the new subsets than between them. This splitting process is repeated on the resulting subsets of data until a stopping criterion is met. When the new subsets of data are formed based on a single predictor, the decision tree is said to be axis-based because the splits of the data appear perpendicular to the axis of the predictor. When linear combinations of variables are used instead of a single variable, the tree is oblique because the splits of the data are neither parallel nor at a right angle to the axis

Figure : Decision trees for classification with axis-based splitting (left) and oblique splitting (right). Cases are orange squares; controls are purple circles. Both trees partition the predictor space defined by variables X1 and X2, but the oblique splits do a better job of separating the two classes.

tree_axis_v_oblique.png

What is a random forest?

Random forests are collections of de-correlated decision trees. Predictions from each tree are aggregated to make an ensemble prediction for the forest. For more details, see Breiman at el, 2001.

Training, out-of-bag error, and testing

In random forests, each tree is grown with a bootstrapped version of the training set. Because bootstrap samples are selected with replacement, each bootstrapped training set contains about two-thirds of instances in the original training set. The 'out-of-bag' data are instances that are not in the bootstrapped training set. Each tree in the random forest can make predictions for its out-of-bag data, and the out-of-bag predictions can be aggregated to make an ensemble out-of-bag prediction. Since the out-of-bag data are not used to grow the tree, the accuracy of the ensemble out-of-bag predictions approximate the generalization error of the random forest. Generalization error refers to the error of a random forest's predictions when it is applied to predict outcomes for data that were not used to train it, i.e., testing data.

Missing data

Data passed to aorsf functions are not allowed to have missing values. A user should impute missing values using an R package with that purpose, such as recipes or mlr3pipelines.

Examples

First we load some relevant packages

set.seed(329730)
suppressPackageStartupMessages({
 library(aorsf)
 library(survival)
 library(tidymodels)
 library(tidyverse)
 library(randomForestSRC)
 library(ranger)
 library(riskRegression) 
 library(obliqueRSF)
})

The entry-point into aorsf is the standard call to orsf():

fit <- orsf(pbc_orsf, Surv(time, status) ~ . - id)

printing fit provides quick descriptive summaries:

fit
## ---------- Oblique random survival forest
## 
##      Linear combinations: Accelerated
##           N observations: 276
##                 N events: 111
##                  N trees: 500
##       N predictors total: 17
##    N predictors per node: 5
##  Average leaves per tree: 21
## Min observations in leaf: 5
##       Min events in leaf: 1
##           OOB stat value: 0.84
##            OOB stat type: Harrell's C-statistic
##      Variable importance: anova
## 
## -----------------------------------------

Model control

For these examples we will make use of the orsf_control_ functions to build and compare models based on their out-of-bag predictions. We will also standardize the out-of-bag samples using the input argument tree_seeds

Accelerated linear combinations

The accelerated ORSF ensemble is the default because it has a nice balance of computational speed and prediction accuracy. It runs a single iteration of Newton Raphson scoring on the Cox partial likelihood function to find linear combinations of predictors.

fit_accel <- orsf(pbc_orsf, 
                  control = orsf_control_fast(),
                  formula = Surv(time, status) ~ . - id,
                  tree_seeds = 329)
Linear combinations with Cox regression

orsf_control_cph runs Cox regression in each non-terminal node of each survival tree, using the regression coefficients to create linear combinations of predictors:

fit_cph <- orsf(pbc_orsf, 
                control = orsf_control_cph(),
                formula = Surv(time, status) ~ . - id,
                tree_seeds = 329)
Linear combinations with penalized cox regression

orsf_control_net runs penalized Cox regression in each non-terminal node of each survival tree, using the regression coefficients to create linear combinations of predictors. This can be really helpful if you want to do feature selection within the node, but it is a lot slower than the other options.

# select 3 predictors out of 5 to be used in
# each linear combination of predictors.
fit_net <- orsf(pbc_orsf, 
                control = orsf_control_net(df_target = 3),
                formula = Surv(time, status) ~ . - id,
                tree_seeds = 329)
Linear combinations with your own function

Let’s make two customized functions to identify linear combinations of predictors.

  • The first uses random coefficients

    f_rando <- function(x_node, y_node, w_node){
     matrix(runif(ncol(x_node)), ncol=1) 
    }
    
  • The second derives coefficients from principal component analysis.

    f_pca <- function(x_node, y_node, w_node) { 
    
     # estimate two principal components.
     pca <- stats::prcomp(x_node, rank. = 2)
     # use the second principal component to split the node
     pca$rotation[, 1L, drop = FALSE]
    
    }
    
  • The third uses orsf() inside of orsf().

    # This approach is known as reinforcement learning trees.  
    # some special care is taken to prevent your R session from crashing.
    # Specifically, random coefficients are used when n_obs <= 10
    # or n_events <= 5. 
    
    f_aorsf <- function(x_node, y_node, w_node){
    
     colnames(y_node) <- c('time', 'status')
     colnames(x_node) <- paste("x", seq(ncol(x_node)), sep = '')
    
     data <- as.data.frame(cbind(y_node, x_node))
    
     if(nrow(data) <= 10 || sum(y_node[,'status']) <= 5) 
      return(matrix(runif(ncol(x_node)), ncol = 1))
    
     fit <- orsf(data, time + status ~ ., 
                 weights = as.numeric(w_node),
                 n_tree = 25,
                 importance = 'permute')
    
     out <- orsf_vi(fit)
    
     # drop the least two important variables
     n_vars <- length(out)
     out[c(n_vars, n_vars-1)] <- 0
    
     # ensure out has same variable order as input
     out <- out[colnames(x_node)]
    
     matrix(out, ncol = 1)
    
    }
    

We can plug these functions into orsf_control_custom(), and then pass the result into orsf():

fit_rando <- orsf(pbc_orsf,
                  Surv(time, status) ~ . - id,
                  control = orsf_control_custom(beta_fun = f_rando),
                  tree_seeds = 329)

fit_pca <- orsf(pbc_orsf,
                Surv(time, status) ~ . - id,
                control = orsf_control_custom(beta_fun = f_pca),
                tree_seeds = 329)

fit_rlt <- orsf(pbc_orsf, time + status ~ . - id, 
                control = orsf_control_custom(beta_fun = f_aorsf),
                tree_seeds = 329)

So which fit seems to work best in this example? Let’s find out by evaluating the out-of-bag survival predictions.

risk_preds <- list(
 accel = 1 - fit_accel$pred_oobag,
 cph   = 1 - fit_cph$pred_oobag,
 net   = 1 - fit_net$pred_oobag,
 rando = 1 - fit_rando$pred_oobag,
 pca   = 1 - fit_pca$pred_oobag,
 rlt   = 1 - fit_rlt$pred_oobag
)

sc <- Score(object = risk_preds, 
            formula = Surv(time, status) ~ 1, 
            data = pbc_orsf, 
            summary = 'IPA',
            times = fit_accel$pred_horizon)

The AUC values, from highest to lowest:

sc$AUC$score[order(-AUC)]
##    model times       AUC         se     lower     upper
## 1:   net  1788 0.9134593 0.02079935 0.8726933 0.9542253
## 2: accel  1788 0.9112315 0.02098077 0.8701099 0.9523530
## 3:   rlt  1788 0.9076685 0.02078732 0.8669261 0.9484109
## 4:   cph  1788 0.9063871 0.02165434 0.8639453 0.9488288
## 5: rando  1788 0.9023489 0.02218936 0.8588586 0.9458393
## 6:   pca  1788 0.8994220 0.02201713 0.8562692 0.9425748

And the indices of prediction accuracy:

sc$Brier$score[order(-IPA), .(model, times, IPA)]
##         model times       IPA
## 1:        net  1788 0.4916038
## 2:      accel  1788 0.4879683
## 3:        cph  1788 0.4751883
## 4:        rlt  1788 0.4570179
## 5:        pca  1788 0.4370592
## 6:      rando  1788 0.4258344
## 7: Null model  1788 0.0000000

From inspection,

  • net, accel, and rlt have high discrimination and index of prediction accuracy.

  • rando and pca do less well, but they aren’t bad.

tidymodels

This example uses tidymodels functions but stops short of using an official tidymodels workflow. I am working on getting aorsf pulled into the censored package and I will update this with real workflows if that happens!

Comparing ORSF with other learners

Start with a recipe to pre-process data

imputer <- recipe(pbc_orsf, formula = time + status ~ .) %>% 
 step_impute_mean(all_numeric_predictors()) %>%
 step_impute_mode(all_nominal_predictors()) 

Next create a 10-fold cross validation object and pre-process the data:

# 10-fold cross validation; make a container for the pre-processed data
analyses <- vfold_cv(data = pbc_orsf, v = 10) %>%
 mutate(recipe = map(splits, ~prep(imputer, training = training(.x))),
        train = map(recipe, juice),
        test = map2(splits, recipe, ~bake(.y, new_data = testing(.x))))

analyses
## #  10-fold cross-validation 
## # A tibble: 10 x 5
##    splits           id     recipe   train               test              
##    <list>           <chr>  <list>   <list>              <list>            
##  1 <split [248/28]> Fold01 <recipe> <tibble [248 x 20]> <tibble [28 x 20]>
##  2 <split [248/28]> Fold02 <recipe> <tibble [248 x 20]> <tibble [28 x 20]>
##  3 <split [248/28]> Fold03 <recipe> <tibble [248 x 20]> <tibble [28 x 20]>
##  4 <split [248/28]> Fold04 <recipe> <tibble [248 x 20]> <tibble [28 x 20]>
##  5 <split [248/28]> Fold05 <recipe> <tibble [248 x 20]> <tibble [28 x 20]>
##  6 <split [248/28]> Fold06 <recipe> <tibble [248 x 20]> <tibble [28 x 20]>
##  7 <split [249/27]> Fold07 <recipe> <tibble [249 x 20]> <tibble [27 x 20]>
##  8 <split [249/27]> Fold08 <recipe> <tibble [249 x 20]> <tibble [27 x 20]>
##  9 <split [249/27]> Fold09 <recipe> <tibble [249 x 20]> <tibble [27 x 20]>
## 10 <split [249/27]> Fold10 <recipe> <tibble [249 x 20]> <tibble [27 x 20]>

Define functions for a ‘workflow’ with randomForestSRC, ranger, and aorsf.

rfsrc_wf <- function(train, test, pred_horizon){
 
 # rfsrc does not like tibbles, so cast input data into data.frames
 train <- as.data.frame(train)
 test <- as.data.frame(test)
 
 rfsrc(formula = Surv(time, status) ~ ., data = train) %>% 
  predictRisk(newdata = test, times = pred_horizon) %>% 
  as.numeric()
 
}

ranger_wf <- function(train, test, pred_horizon){
 
 ranger(Surv(time, status) ~ ., data = train) %>% 
  predictRisk(newdata = test, times = pred_horizon) %>% 
  as.numeric()
 
}

aorsf_wf <- function(train, test, pred_horizon){
 
 train %>% 
  orsf(Surv(time, status) ~ .,) %>% 
  predict(new_data = test, pred_horizon = pred_horizon) %>% 
  as.numeric()
 
}

Run the ‘workflows’ on each fold:

# 5 year risk prediction
ph <- 365.25 * 5

results <- analyses %>% 
 transmute(test, 
           pred_aorsf = map2(train, test, aorsf_wf, pred_horizon = ph),
           pred_rfsrc = map2(train, test, rfsrc_wf, pred_horizon = ph),
           pred_ranger = map2(train, test, ranger_wf, pred_horizon = ph))

Next unnest each column to get back a tibble with all of the testing data and predictions.

results <- results %>% 
 unnest(everything())

glimpse(results)
## Rows: 276
## Columns: 23
## $ id          <int> 20, 47, 55, 57, 63, 65, 83, 121, 141, 181, 197, 201, 202, ~
## $ trt         <fct> placebo, placebo, d_penicill_main, d_penicill_main, placeb~
## $ age         <dbl> 59.95346, 47.42779, 65.76318, 53.57153, 46.62834, 40.20260~
## $ sex         <fct> f, f, m, f, f, f, f, m, f, f, f, f, f, f, m, f, m, f, f, f~
## $ ascites     <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ hepato      <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1~
## $ spiders     <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1~
## $ edema       <fct> 0, 0, 0, 0.5, 1, 0, 0.5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ bili        <dbl> 5.1, 0.5, 1.8, 2.3, 22.5, 1.2, 1.3, 1.3, 0.9, 1.4, 0.6, 0.~
## $ chol        <int> 374, 316, 416, 260, 932, 256, 250, 151, 346, 248, 266, 235~
## $ albumin     <dbl> 3.51, 3.65, 3.94, 3.18, 3.12, 3.60, 3.50, 3.08, 3.77, 3.58~
## $ copper      <int> 140, 68, 121, 231, 95, 74, 48, 73, 59, 63, 25, 26, 15, 12,~
## $ alk.phos    <dbl> 1919.0, 1716.0, 10165.0, 11320.2, 5396.0, 724.0, 1138.0, 1~
## $ ast         <dbl> 122.45, 187.55, 79.98, 105.78, 244.90, 141.05, 71.30, 46.5~
## $ trig        <int> 135, 71, 219, 94, 133, 108, 100, 49, 56, 106, 102, 67, 89,~
## $ platelet    <int> 322, 356, 213, 216, 165, 430, 81, 213, 336, 79, 201, 228, ~
## $ protime     <dbl> 13.0, 9.8, 11.0, 12.4, 11.6, 10.0, 12.9, 13.2, 10.6, 10.3,~
## $ stage       <ord> 4, 3, 3, 3, 3, 1, 4, 4, 2, 4, 2, 4, 2, 1, 2, 3, 2, 4, 3, 4~
## $ time        <int> 1356, 2576, 1360, 3282, 859, 3992, 4050, 191, 3050, 2556, ~
## $ status      <dbl> 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0~
## $ pred_aorsf  <dbl> 0.74325933, 0.07171843, 0.26549931, 0.65886824, 0.88328061~
## $ pred_rfsrc  <dbl> 0.60298524, 0.05715514, 0.19512613, 0.57592127, 0.75772791~
## $ pred_ranger <dbl> 0.61244412, 0.02793571, 0.17943870, 0.64002784, 0.74174298~

And finish by aggregating the predictions and computing performance in the testing data. Note that I am computing one statistic for all predictions instead of computing one statistic for each fold. This approach is fine when you have smaller testing sets and/or small event counts.

Score(
 object = list(aorsf = results$pred_aorsf,
               rfsrc = results$pred_rfsrc,
               ranger = results$pred_ranger),
 formula = Surv(time, status) ~ 1, 
 data = results, 
 summary = 'IPA',
 times = ph
)
## 
## Metric AUC:
## 
## Results by model:
## 
##     model times  AUC lower upper
## 1:  aorsf  1826 90.2  85.8  94.6
## 2:  rfsrc  1826 89.4  84.9  93.9
## 3: ranger  1826 89.8  85.5  94.1
## 
## Results of model comparisons:
## 
##    times  model reference delta.AUC lower upper   p
## 1:  1826  rfsrc     aorsf      -0.8  -2.3   0.6 0.3
## 2:  1826 ranger     aorsf      -0.4  -1.8   1.0 0.6
## 3:  1826 ranger     rfsrc       0.4  -0.7   1.6 0.4

## 
## NOTE: Values are multiplied by 100 and given in %.

## NOTE: The higher AUC the better.

## 
## Metric Brier:
## 
## Results by model:
## 
##         model   times Brier lower upper  IPA
## 1: Null model 1826.25  20.5  18.1  22.9  0.0
## 2:      aorsf 1826.25  11.0   8.7  13.3 46.2
## 3:      rfsrc 1826.25  12.0   9.8  14.2 41.3
## 4:     ranger 1826.25  11.8   9.7  14.0 42.3
## 
## Results of model comparisons:
## 
##      times  model  reference delta.Brier lower upper            p
## 1: 1826.25  aorsf Null model        -9.5 -12.1  -6.8 1.504600e-12
## 2: 1826.25  rfsrc Null model        -8.5 -10.8  -6.1 7.469601e-13
## 3: 1826.25 ranger Null model        -8.7 -11.0  -6.3 1.022843e-12
## 4: 1826.25  rfsrc      aorsf         1.0   0.3   1.7 5.805166e-03
## 5: 1826.25 ranger      aorsf         0.8   0.1   1.5 2.164528e-02
## 6: 1826.25 ranger      rfsrc        -0.2  -0.7   0.3 4.329685e-01

## 
## NOTE: Values are multiplied by 100 and given in %.

## NOTE: The lower Brier the better, the higher IPA the better.

From inspection,

  • aorsf obtained slightly higher discrimination (AUC)

  • aorsf obtained higher index of prediction accuracy (IPA)

mlr3 pipelines

Warning: this code may or may not run depending on your current version of mlr3proba. First we load some additional mlr3 libraries.

suppressPackageStartupMessages({
 library(mlr3verse)
 library(mlr3proba)
 library(mlr3extralearners)
 library(mlr3viz)
 library(mlr3benchmark)
})

Next we’ll define some tasks for our learners to engage with.

# Mayo Clinic Primary Biliary Cholangitis Data
task_pbc <- 
 TaskSurv$new(
  id = 'pbc',  
  backend = select(pbc_orsf, -id) %>% 
   mutate(stage = as.numeric(stage)),  
  time = "time", 
  event = "status"
 )

# Veteran's Administration Lung Cancer Trial
data(veteran, package = "randomForestSRC")

task_veteran <- 
 TaskSurv$new(
  id = 'veteran',  
  backend = veteran,  
  time = "time", 
  event = "status"
 )

# NKI 70 gene signature
data_nki <- OpenML::getOMLDataSet(data.id = 1228)

task_nki <- 
 TaskSurv$new(
  id = 'nki',  
  backend = data_nki$data,  
  time = "time", 
  event = "event"
 )

# Gene Expression-Based Survival Prediction in Lung Adenocarcinoma
data_lung <- OpenML::getOMLDataSet(data.id = 1245)

task_lung <- 
 TaskSurv$new(
  id = 'nki',  
  backend = data_lung$data %>% 
   mutate(OS_event = as.numeric(OS_event) -1),  
  time = "OS_years", 
  event = "OS_event"
 )


# Chemotherapy for Stage B/C colon cancer
# (there are two rows per person, one for death 
#  and the other for recurrence, hence the two tasks)

task_colon_death <-
 TaskSurv$new(
  id = 'colon_death',  
  backend = survival::colon %>%
   filter(etype == 2) %>% 
   drop_na() %>% 
   # drop id, redundant variables
   select(-id, -study, -node4, -etype),
   mutate(OS_event = as.numeric(OS_event) -1),  
  time = "time", 
  event = "status"
 )

task_colon_recur <-
 TaskSurv$new(
  id = 'colon_death',  
  backend = survival::colon %>%
   filter(etype == 1) %>% 
   drop_na() %>% 
   # drop id, redundant variables
   select(-id, -study, -node4, -etype),
   mutate(OS_event = as.numeric(OS_event) -1),  
  time = "time", 
  event = "status"
 )

# putting them all together
tasks <- list(task_pbc,
              task_veteran,
              task_nki,
              task_lung,
              task_colon_death,
              task_colon_recur,
              # add a few more pre-made ones
              tsk("actg"),
              tsk('gbcs'),
              tsk('grace'),
              tsk("unemployment"),
              tsk("whas"))

Now we can make a benchmark designed to compare our three favorite learners:

# Learners with default parameters
learners <- lrns(c("surv.ranger", "surv.rfsrc", "surv.aorsf"))

# Brier (Graf) score, c-index and training time as measures
measures <- msrs(c("surv.graf", "surv.cindex", "time_train"))

# Benchmark with 5-fold CV
design <- benchmark_grid(
  tasks = tasks,
  learners = learners,
  resamplings = rsmps("cv", folds = 5)
)

benchmark_result <- benchmark(design)

bm_scores <- benchmark_result$score(measures, predict_sets = "test")

Let’s look at the overall results:

bm_scores %>%
 select(task_id, learner_id, surv.graf, surv.cindex, time_train) %>%
 group_by(learner_id) %>% 
 filter(!is.infinite(surv.graf)) %>% 
 summarize(
  across(
   .cols = c(surv.graf, surv.cindex, time_train),
   .fns = mean, 
   na.rm = TRUE
  )
 )
## # A tibble: 3 x 4
##   learner_id  surv.graf surv.cindex time_train
##   <chr>           <dbl>       <dbl>      <dbl>
## 1 surv.aorsf      0.152       0.733      1.41 
## 2 surv.ranger     0.166       0.712      1.95 
## 3 surv.rfsrc      0.155       0.723      0.745

From inspection,

  • aorsf has a higher expected value for ‘surv.cindex’ (higher is better)

  • aorsf has a lower expected value for ‘surv.graf’ (lower is better)

References

Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the Yield of Medical Tests. JAMA 1982; 247(18):2543-2546. DOI: 10.1001/jama.1982.03320430047030

Breiman L. Random forests. Machine learning 2001 Oct; 45(1):5-32. DOI: 10.1023/A:1010933404324

Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Annals of applied statistics 2008 Sep; 2(3):841-60. DOI: 10.1214/08-AOAS169

Jaeger BC, Long DL, Long DM, Sims M, Szychowski JM, Min YI, Mcclure LA, Howard G, Simon N. Oblique random survival forests. Annals of applied statistics 2019 Sep; 13(3):1847-83. DOI: 10.1214/19-AOAS1261

Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM. Accelerated and interpretable oblique random survival forests. Journal of Computational and Graphical Statistics Published online 08 Aug 2023. DOI: 10.1080/10618600.2023.2231048


aorsf documentation built on Oct. 26, 2023, 5:08 p.m.