knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This is a quick example of running largeRCRF on a dataset, extracting some predictions from it, and calculating a measure of concordance error.

Source

The dataset originally comes from the Women's Interagency HIV Study [@wihs], but was obtained through the randomForestSRC [@IshwaranRfsrc] package.

Background

The Women's Interagency HIV Study is a dataset that followed HIV-positive women and recorded when one of three possible competing events occurred for each one: the woman began HAART treatment (event 1), the woman developed AIDS or died (event 2), or the observation was censored (event 0).

There are four different predictors available: age (ageatfda), history of injection drug use (idu), race (black), and the lowest recorded count of CD4 cells, a type of white blood cell (cd4nadir).

Getting the data

data(wihs, package = "largeRCRF")
names(wihs)

time and status are the two columns in wihs corresponding to the competing risks response, while ageatfda, idu, black, and cd4nadir are the predictors we wish to train on.
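
Before training, we can take a quick exploratory look at the data using plain base R (nothing here is specific to largeRCRF):

# First few rows, plus the breakdown of the competing event codes
head(wihs)
table(wihs$status)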

We train a forest by calling train.

library("largeRCRF")
model <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
               data = wihs, splitFinder = LogRankSplitFinder(1:2, 2), 
               ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
               randomSeed = 15)

We specify splitFinder = LogRankSplitFinder(1:2, 2), which indicates that we have event codes 1 and 2 to handle, but that we want to focus on optimizing splits for event 2 (which corresponds to the development of AIDS).

We specify that we want a forest of 100 trees (ntree = 100), that we want to try all possible splits when splitting on a variable (numberOfSplits = 0), that we want to try splitting on two predictors at a time (mtry = 2), and that terminal nodes should have an average size of at least 15 (nodeSize = 15; accomplished by not splitting any node containing fewer than 2 $\times$ nodeSize observations). randomSeed = 15 specifies a seed so that the results are deterministic; note that largeRCRF generates its random numbers separately from R and so is not affected by set.seed().
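
Since randomSeed fully determines largeRCRF's internal random number generation, retraining with the same seed should reproduce the forest exactly. Below is a quick sketch of such a check on our part, using the predict and extractMortalities functions introduced later in this vignette:

model2 <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
                data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
                ntree = 100, numberOfSplits = 0, mtry = 2, nodeSize = 15,
                randomSeed = 15)
# Identical seeds should yield identical out-of-bag mortality estimates
identical(extractMortalities(predict(model), time = 8, event = 2),
          extractMortalities(predict(model2), time = 8, event = 2))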

Printing model on its own doesn't do much except list the different components and parameters that make up the forest.

model

Next we'll make predictions on the training data. Since we're using the training data, largeRCRF will by default predict each observation using only those trees where that observation wasn't included in the bootstrap sample ('out-of-bag' predictions).

predictions <- predict(model)
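
predictions behaves like a list, with one out-of-bag prediction per row of wihs; a quick sanity check (assuming the list-like indexing used below):

# One prediction per observation; these two numbers should match
length(predictions)
nrow(wihs)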

Since this is competing risks data, each prediction consists of several functions, which can't be printed directly on screen. Instead, a message lists the functions we can use to extract the estimated survivor curve, the cause-specific cumulative incidence functions (CIFs), or the cause-specific cumulative hazard functions (CHFs).

predictions[[1]]

Here we extract the cause-specific functions for the AIDS event, as well as the overall survivor curve.

aids.cifs <- extractCIF(predictions, event = 2)
aids.chfs <- extractCHF(predictions, event = 2)
survivor.curves <- extractSurvivorCurve(predictions)
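
The extracted objects are step functions that can be evaluated like any other R function, at a single time or at a vector of times (the curve() calls below rely on this). For example, for the first observation:

# Estimated survivor probabilities and AIDS-specific cumulative hazards
# at times 1, 2, and 5
survivor.curves[[1]](c(1, 2, 5))
aids.chfs[[1]](c(1, 2, 5))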

Now we plot some of the functions we extracted, here for the third observation.

curve(aids.cifs[[3]](x), from = 0, to = 8, ylim = c(0, 1),
      type = "S", ylab = "CIF(t)", xlab = "Time (t)")

curve(aids.chfs[[3]](x), from = 0, to = 8,
      type = "S", ylab = "CHF(t)", xlab = "Time (t)")

Finally, we calculate the naive concordance error on the out-of-bag predictions. extractMortalities calculates a measure of mortality by integrating the specified event's cumulative incidence function from 0 to time, although users are free to substitute their own measures if desired (we sketch one such substitution after the code below). naiveConcordance then takes the true responses and compares them with the supplied mortality predictions, estimating the proportion of incorrect predictions for each event as described by @WolbersConcordanceCompetingRisks.

mortalities1 <- extractMortalities(predictions, time = 8, event = 1)
mortalities2 <- extractMortalities(predictions, time = 8, event = 2)
naiveConcordance(CR_Response(wihs$status, wihs$time),
                 list(mortalities1, mortalities2))
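
As a sketch of substituting our own measure (our illustration, not part of the package's API), we could use the value of the AIDS cumulative incidence function at time 8 directly, instead of its integral, as the event 2 mortality:

# Evaluate each observation's AIDS CIF at time 8
cif.at.8 <- sapply(aids.cifs, function(cif) cif(8))
naiveConcordance(CR_Response(wihs$status, wihs$time),
                 list(mortalities1, cif.at.8))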

We could continue by trying another model to see if we could lower the concordance error, or by integrating the above steps into some tuning algorithm.
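
For instance, a minimal grid search over mtry might look like the following sketch (illustrative only; we assume naiveConcordance returns one error per event, so errors becomes a matrix with one column per candidate value):

errors <- sapply(1:4, function(m) {
  fit <- train(CR_Response(status, time) ~ ageatfda + idu + black + cd4nadir,
               data = wihs, splitFinder = LogRankSplitFinder(1:2, 2),
               ntree = 100, numberOfSplits = 0, mtry = m, nodeSize = 15,
               randomSeed = 15)
  preds <- predict(fit)  # out-of-bag predictions
  naiveConcordance(CR_Response(wihs$status, wihs$time),
                   list(extractMortalities(preds, time = 8, event = 1),
                        extractMortalities(preds, time = 8, event = 2)))
})
errors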

References


