
aorsf

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Fit, interpret, and make predictions with oblique random survival forests (ORSFs).

Why aorsf?

Installation

You can install aorsf from CRAN using

install.packages("aorsf")

You can install the development version of aorsf from GitHub with:

# install.packages("remotes")
remotes::install_github("ropensci/aorsf")

What is an oblique decision tree?

Decision trees are developed by splitting a set of training data into two new subsets, with the goal of having more similarity within the new subsets than between them. The splitting process is repeated on resulting subsets of data until a stopping criterion is met.

When the new subsets of data are formed based on a single predictor, the decision tree is said to be axis-based because the splits of the data appear perpendicular to the axis of the predictor. When linear combinations of variables are used instead of a single variable, the tree is oblique because the splits of the data are neither parallel nor at a right angle to the axis.

Figure: Decision trees for classification with axis-based splitting (left) and oblique splitting (right). Cases are orange squares; controls are purple circles. Both trees partition the predictor space defined by variables X1 and X2, but the oblique splits do a better job of separating the two classes.
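To make the distinction concrete, here is a minimal sketch in base R (not aorsf code). It simulates two predictors whose class boundary is a diagonal line, then compares one axis-based split with one oblique split. The simulated data, the cut-points, and the coefficients 0.5 and 0.5 are assumptions chosen by hand for illustration; a real tree would estimate them from the data.

```r
set.seed(1)

# toy predictors; the true class boundary is the diagonal line x1 + x2 = 0
x1 <- runif(200, -1, 1)
x2 <- runif(200, -1, 1)
y  <- as.integer(x1 + x2 > 0)

# axis-based split: a cut-point on a single predictor
axis_split <- x1 > 0

# oblique split: a cut-point on a linear combination of predictors
oblique_split <- 0.5 * x1 + 0.5 * x2 > 0

# proportion of observations that each split sends to the correct side
acc_axis    <- mean(axis_split == y)
acc_oblique <- mean(oblique_split == y)

c(axis = acc_axis, oblique = acc_oblique)
```

Because the boundary in this toy example is diagonal, a single oblique split separates the classes, while an axis-based tree would need many splits to approximate the same boundary.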

Examples

The orsf() function can fit several types of ORSF ensembles. My personal favorite is the accelerated ORSF because it has a great combination of prediction accuracy and computational efficiency (see the JCGS paper, reference 2).


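library(aorsf)

# fix the random seed so the train/test split and the forest are reproducible
set.seed(329730)

# sample 150 rows for training; the remaining rows are held out for testing
index_train <- sample(nrow(pbc_orsf), 150)

pbc_orsf_train <- pbc_orsf[index_train, ]
pbc_orsf_test <- pbc_orsf[-index_train, ]

# use every column except id as a predictor;
# out-of-bag predictions are evaluated at 5 years (time is in days)
fit <- orsf(data = pbc_orsf_train,
            formula = Surv(time, status) ~ . - id,
            oobag_pred_horizon = 365.25 * 5)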
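A fitted forest can also predict risk for held-out data. The sketch below assumes the predict() method accepts new_data, pred_horizon, and pred_type arguments as described in the aorsf documentation. The name risk_test is illustrative, and the model is refit so the block runs on its own.

```r
library(aorsf)

set.seed(329730)

index_train <- sample(nrow(pbc_orsf), 150)

pbc_orsf_train <- pbc_orsf[index_train, ]
pbc_orsf_test <- pbc_orsf[-index_train, ]

fit <- orsf(data = pbc_orsf_train,
            formula = Surv(time, status) ~ . - id,
            oobag_pred_horizon = 365.25 * 5)

# predicted probability of an event within 5 years for each held-out row
risk_test <- predict(fit,
                     new_data = pbc_orsf_test,
                     pred_horizon = 365.25 * 5,
                     pred_type = "risk")

# rows correspond to observations in the test data
dim(risk_test)
```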

Inspect

Printing the output from orsf() will give some information and descriptive statistics about the ensemble.


fit
#> ---------- Oblique random survival forest
#> 
#>      Linear combinations: Accelerated
#>           N observations: 150
#>                 N events: 52
#>                  N trees: 500
#>       N predictors total: 17
#>    N predictors per node: 5
#>  Average leaves per tree: 10
#> Min observations in leaf: 5
#>       Min events in leaf: 1
#>           OOB stat value: 0.83
#>            OOB stat type: Harrell's C-statistic
#>      Variable importance: anova
#> 
#> -----------------------------------------

Variable importance

The importance of individual variables can be estimated in three ways using aorsf: negation, permutation, and ANOVA.

``` r
orsf_vi_negate(fit)
#>          bili           sex        copper         stage           age
#>  0.1162463868  0.0517905362  0.0375565841  0.0240450064  0.0239056901
#>           ast       protime        hepato         edema       ascites
#>  0.0191083400  0.0158014897  0.0139536512  0.0119264604  0.0100865906
#>       albumin          chol       spiders           trt          trig
#>  0.0085394443  0.0037903802  0.0030727468  0.0020617896  0.0018361632
#>      alk.phos      platelet
#>  0.0006586211 -0.0044967624
```

``` r
orsf_vi_permute(fit)
#>          bili        copper           age         stage           sex
#>  0.0523994364  0.0187964038  0.0152246586  0.0115192591  0.0110127557
#>           ast        hepato         edema       ascites       albumin
#>  0.0100104477  0.0082889176  0.0079183046  0.0077834483  0.0070642325
#>       protime          trig          chol       spiders      alk.phos
#>  0.0066513097  0.0015656325  0.0014474560  0.0006015308  0.0001369292
#>           trt      platelet
#> -0.0013984860 -0.0022427356
```

``` r
orsf_vi_anova(fit)
#>       bili    ascites      edema     copper      stage        sex        age
#> 0.48778004 0.44943820 0.41677872 0.31865585 0.26675095 0.26458616 0.25448430
#>        ast     hepato    albumin       chol    protime       trig    spiders
#> 0.21743929 0.19945726 0.18191604 0.15240328 0.15076561 0.13709677 0.11833550
#>   alk.phos   platelet        trt
#> 0.10113636 0.06302021 0.05019305
```

You can supply your own R function to estimate out-of-bag error when using negation or permutation importance (see the oob vignette).
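To build intuition for what negation and permutation importance measure, here is a small sketch in base R. It is not how aorsf computes importance. It uses an ordinary linear model on the built-in mtcars data, R-squared as the performance measure, and in-sample rather than out-of-bag predictions, all of which are assumptions made to keep the example short. It also assumes that negation importance multiplies a variable's coefficient by -1, whereas permutation importance shuffles the variable's values.

```r
set.seed(1)

model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
vars  <- c("wt", "hp", "qsec")

# performance measure: R-squared of predictions against the observed outcome
r_squared <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

baseline <- r_squared(predict(model, mtcars), mtcars$mpg)

# permutation: shuffle one predictor, re-predict, record the drop in R-squared
vi_permute <- sapply(vars, function(v) {
  shuffled <- mtcars
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - r_squared(predict(model, shuffled), mtcars$mpg)
})

# negation: flip the sign of one coefficient, re-predict, record the drop
vi_negate <- sapply(vars, function(v) {
  beta <- coef(model)
  beta[v] <- -beta[v]
  pred <- drop(model.matrix(model) %*% beta)
  baseline - r_squared(pred, mtcars$mpg)
})

sort(vi_negate, decreasing = TRUE)
```

In both cases, the larger the drop in performance, the more the model relies on that variable.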

Partial dependence (PD)

Partial dependence (PD) shows the expected prediction from a model as a function of a single predictor or multiple predictors. The expectation is marginalized over the values of all other predictors, giving something like a multivariable-adjusted estimate of the model’s prediction.

The summary function, orsf_summarize_uni(), computes PD for as many variables as you ask it to, using sensible values.


orsf_summarize_uni(fit, n_variables = 2)
#> 
#> -- bili (VI Rank: 1) ---------------------------
#> 
#>        |---------------- Risk ----------------|
#>  Value      Mean    Median     25th %    75th %
#>   0.70 0.1986719 0.1044026 0.04354701 0.2968290
#>    1.3 0.2132847 0.1210276 0.05245387 0.3208855
#>    3.2 0.2883814 0.1917119 0.11951296 0.4147258
#> 
#> -- sex (VI Rank: 2) ----------------------------
#> 
#>        |---------------- Risk ----------------|
#>  Value      Mean    Median     25th %    75th %
#>      m 0.3394141 0.2313787 0.13762339 0.5311308
#>      f 0.2390067 0.1112093 0.04782891 0.3773551
#> 
#>  Predicted risk at time t = 1826.25 for top 2 predictors
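If you want PD at values you choose yourself, orsf_pd_oob() takes a named list of predictor values. The sketch below assumes the pred_spec and pred_horizon arguments behave as described in the aorsf documentation. The bili values 1, 2, and 3 are arbitrary choices for illustration, and the model is refit so the block runs on its own.

```r
library(aorsf)

set.seed(329730)

index_train <- sample(nrow(pbc_orsf), 150)
pbc_orsf_train <- pbc_orsf[index_train, ]

fit <- orsf(data = pbc_orsf_train,
            formula = Surv(time, status) ~ . - id,
            oobag_pred_horizon = 365.25 * 5)

# out-of-bag partial dependence of 5-year risk on bili,
# evaluated at three hand-picked values
pd_bili <- orsf_pd_oob(fit,
                       pred_spec = list(bili = c(1, 2, 3)),
                       pred_horizon = 365.25 * 5)

pd_bili
```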

For more on PD, see the vignette.

Individual conditional expectations (ICE)

Unlike partial dependence, which shows the expected prediction as a function of one or multiple predictors, individual conditional expectations (ICE) show the prediction for an individual observation as a function of a predictor.
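ICE can be requested in much the same way as PD. The sketch below assumes that orsf_ice_oob() accepts pred_spec and pred_horizon like the PD functions do, and that it returns one row per observation and predictor value with the prediction stored in a column named pred. The bili values are arbitrary, and the model is refit so the block runs on its own.

```r
library(aorsf)

set.seed(329730)

index_train <- sample(nrow(pbc_orsf), 150)
pbc_orsf_train <- pbc_orsf[index_train, ]

fit <- orsf(data = pbc_orsf_train,
            formula = Surv(time, status) ~ . - id,
            oobag_pred_horizon = 365.25 * 5)

# out-of-bag ICE: one predicted 5-year risk per observation and bili value
ice_bili <- orsf_ice_oob(fit,
                         pred_spec = list(bili = c(1, 2, 3)),
                         pred_horizon = 365.25 * 5)

# averaging the individual predictions at each bili value gives a
# PD-style summary (assumes the prediction column is named pred)
tapply(ice_bili$pred, ice_bili$bili, mean)
```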

For more on ICE, see the vignette.

Comparison to existing software

Comparisons between aorsf and existing software are presented in our JCGS paper. The paper:

A more hands-on comparison of aorsf and other R packages is provided in the orsf examples.

References

  1. Jaeger BC, Long DL, Long DM, Sims M, Szychowski JM, Min YI, McClure LA, Howard G, Simon N. Oblique random survival forests. Annals of Applied Statistics. 2019 Sep; 13(3):1847-83. DOI: 10.1214/19-AOAS1261

  2. Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM. Accelerated and interpretable oblique random survival forests. Journal of Computational and Graphical Statistics. Published online 08 Aug 2023. DOI: 10.1080/10618600.2023.2231048

  3. Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA. On oblique random forests. Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2011 Sep 4; pp. 453-469. DOI: 10.1007/978-3-642-23783-6_29

Funding

The developers of aorsf receive financial support from the Center for Biomedical Informatics, Wake Forest University School of Medicine. We also receive support from the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR001420.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.




aorsf documentation built on Oct. 26, 2023, 5:08 p.m.