``` {r setup, include = FALSE, cache = FALSE, echo = FALSE}
## Not displayed ##
library("knitr")
knitr::render_sweave()

# Set global chunk options for knitr. These can be changed in the header
# of each individual R code chunk.
opts_chunk$set(fig.path = 'rfs-',
               fig.align = 'center',
               fig.pos = "!htb",
               fig.show = 'hold',
               # fig.height = 3,
               # fig.width = 4,
               size = 'footnotesize',
               prompt = TRUE,
               highlight = FALSE,
               comment = NA,
               echo = FALSE, # Change this to TRUE to see all the code examples
               results = FALSE,
               message = FALSE,
               warning = FALSE,
               error = FALSE)

# Set up the R environment.
options(object.size = Inf, expressions = 100000, memory = Inf,
        replace.assign = TRUE, width = 75, prompt = "R> ")
options(mc.cores = 1, rf.cores = 0, stringsAsFactors = FALSE)
```

# Introduction

Random forest [@Breiman:2001] (RF) is a non-parametric statistical method which requires no distributional assumptions on covariate relation to the response. RF is a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. Random Survival Forest (RSF) [@Ishwaran:2007a;@Ishwaran:2008] is an extension of Breiman's RF techniques to survival settings, allowing efficient non-parametric analysis of time to event data. The \pkg{randomForestSRC} package (http://CRAN.R-project.org/package=randomForestSRC) [@Ishwaran:RFSRC:2014] is a unified treatment of Breiman's random forest for survival, regression and classification problems.

Predictive accuracy makes RF an attractive alternative to parametric models, though the complexity and interpretability of the forest hinder wider application of the method. We introduce the \pkg{ggRandomForests} package (http://CRAN.R-project.org/package=ggRandomForests) for visually exploring random forest models. The \pkg{ggRandomForests} package is structured to extract intermediate data objects from \pkg{randomForestSRC} objects and generate figures using the \pkg{ggplot2} graphics package (http://CRAN.R-project.org/package=ggplot2) [@Wickham:2009].

Many of the figures created by the \pkg{ggRandomForests} package are also available directly from within the \pkg{randomForestSRC} package. However \pkg{ggRandomForests} offers the following advantages:

  * Separation of data and figures: \pkg{ggRandomForests} contains functions that operate on either the `rfsrc` forest object directly, or on the output from \pkg{randomForestSRC} post processing functions (i.e., `plot.variable`, `var.select`) to generate intermediate \pkg{ggRandomForests} data objects. \pkg{ggRandomForests} functions are provided to further process these objects and plot results using the \pkg{ggplot2} graphics package. Alternatively, users can use these data objects for their own custom plotting or analysis operations.

  * Each data object/figure is a single, self-contained unit. This allows simple modification and manipulation of the data or `ggplot` objects to meet users' specific needs and requirements.

 * We chose to use the \pkg{ggplot2} package for our figures for flexibility in modifying the output. Each \pkg{ggRandomForests} plot function returns either a single `ggplot` object, or a `list` of `ggplot` objects, allowing the use of additional \pkg{ggplot2} functions to modify and customize the final figures.

This document is structured as a tutorial for using the \pkg{randomForestSRC} package for building and post-processing random survival forest models  and using the \pkg{ggRandomForests} package for understanding how the forest is constructed. In this tutorial, we will build a random survival forest for the primary biliary cirrhosis (PBC) of the liver data set [@fleming:1991], available in the \pkg{randomForestSRC} package.

In \autoref{data-summary-primary-biliary-cirrhosis-pbc-data-set} we introduce the `pbc` data set and summarize the proportional hazards analysis of this data from Chapter 4 of [@fleming:1991]. In \autoref{random-survival-forest}, we describe how to grow a random survival forest with the \pkg{randomForestSRC} package. Random forest is not a parsimonious method, but uses all variables available in the data set to construct the response predictor. We demonstrate random forest variable selection techniques (\autoref{variable-selection}) using Variable Importance (VIMP) [@Breiman:2001] in \autoref{variable-importance} and Minimal Depth [@Ishwaran:2010] in \autoref{minimal-depth}. We then compare both methods with variables used in the [@fleming:1991] model.

Once we have an idea of which variables we are most interested in, we use dependence plots [@Friedman:2000] (\autoref{variableresponse-dependence}) to understand how these variables are related to the response. Variable dependence (\autoref{variable-dependence}) plots give us an idea of the overall trend of a variable/response relation, while partial dependence plots (\autoref{partial-dependence}) show us the risk adjusted relation by averaging out the effects of other variables. Dependence plots often show strongly non-linear variable/response relations that are not easily obtained through parametric modeling.

We then graphically examine forest variable interactions with the use of variable and partial dependence conditioning plots (coplots) [@chambers:1992; @cleveland:1993] (\autoref{conditional-dependence-plots}) and the analogous partial dependence surfaces (\autoref{partial-plot-surfaces}) before adding concluding remarks in \autoref{conclusion}.

``` {r libraries}
################## Load packages ##################
library("ggplot2")         # Graphics engine
library("RColorBrewer")    # Nice color palettes
library("plot3D")          # for 3d surfaces.
library("dplyr")           # Better data manipulations
library("parallel")        # mclapply for multicore processing

# Analysis packages.
library("randomForestSRC") # random forest for survival, regression and
                           # classification
library("ggRandomForests") # ggplot2 random forest figures (This!)

################ Default Settings ##################
theme_set(theme_bw())     # A ggplot2 theme with white background

## Set open circle for censored, and x for events
event.marks <- c(1, 4)
event.labels <- c(FALSE, TRUE)

## We want red for death events, so reorder this set.
strCol <- brewer.pal(3, "Set1")[c(2,1,3)]

```

# Data summary: primary biliary cirrhosis (PBC) data set

The \emph{primary biliary cirrhosis} of the liver (PBC) study consists of 424 PBC patients referred to Mayo Clinic between 1974 and 1984 who met eligibility criteria for a randomized placebo controlled trial of the drug D-penicillamine (DPCA). The data is described in \cite[Chapter 0.2]{fleming:1991} and a partial likelihood model (Cox proportional hazards) is developed in Chapter 4.4. The pbc data set, included in the \pkg{randomForestSRC} package, contains 418 observations, of which 312 patients participated in the randomized trial \cite[Appendix D]{fleming:1991}.

``` {r datastep, echo=TRUE}
data("pbc", package = "randomForestSRC")
```

``` {r data-clean}
library("tidyr")        # Transforming wide data into long data (gather)

## Not displayed ##
## Set modes correctly. For binary variables: transform to logical
## Check for range of 0, 1
## There is probably a better way to do this.
for(ind in 1:dim(pbc)[2]){
  if(!is.factor(pbc[, ind])){
    if(length(unique(pbc[which(!is.na(pbc[, ind])), ind]))<= 2) {
      if(sum(range(pbc[, ind], na.rm = TRUE) ==  c(0, 1)) ==  2){
        pbc[, ind] <- as.logical(pbc[, ind])
        }
  }
 }else{
  if(length(unique(pbc[which(!is.na(pbc[, ind])), ind]))<= 2) {
   if(sum(sort(unique(pbc[, ind])) ==  c(0, 1)) ==  2){
    pbc[, ind] <- as.logical(pbc[, ind])
   }
   if(sum(sort(unique(pbc[, ind])) ==  c(FALSE, TRUE)) ==  2){
    pbc[, ind] <- as.logical(pbc[, ind])
   }
  }
 }
 if(!is.logical(pbc[, ind]) &
    length(unique(pbc[which(!is.na(pbc[, ind])), ind]))<= 5) {
  pbc[, ind] <- factor(pbc[, ind])
 }
}
# Convert age and follow up time from days to years.
pbc$age <- pbc$age/365.25
pbc$years <- pbc$days/365.25
pbc <- pbc %>% select(-days)
pbc$treatment <- as.numeric(pbc$treatment)
pbc$treatment[which(pbc$treatment == 1)] <- "DPCA"
pbc$treatment[which(pbc$treatment == 2)] <- "placebo"
pbc$treatment <- factor(pbc$treatment)

cls <- sapply(pbc, class)

labels <- c("Event (F = censor, T = death)",
            "Treament (DPCA, Placebo)",
            "Age (years)",
            "Female = T",
            "Presence of Asictes",
            "Presence of Hepatomegaly",
            "Presence of Spiders",
            "Edema (0, 0.5, 1)",
            "Serum Bilirubin (mg/dl)",
            "Serum Cholesterol (mg/dl)",
            "Albumin (gm/dl)",
            "Urine Copper (ug/day)",
            "Alkaline Phosphatase (U/liter)",
            "SGOT (U/ml)",
            "Triglicerides (mg/dl)",
            "Platelets per cubic ml/1000",
            "Prothrombin time (sec)",
            "Histologic Stage",
            "Time (years)")

dta.labs <- data.frame(cbind(names = colnames(pbc), label = labels, type = cls))
# Put the "years" variable on top.
dta.labs <- rbind(dta.labs[nrow(dta.labs),], dta.labs[-nrow(dta.labs),])

st.labs <- as.character(dta.labs$label)
names(st.labs) <- rownames(dta.labs)
```

For this analysis, we modify some of the data for better formatting of our results. Since the data contains about 12 years of follow up, we prefer using years instead of days to describe survival. We also convert the age variable to years, and the treatment variable to a factor containing levels of `c("DPCA", "placebo")`. The variable names, type and description are given in \autoref{T:dataLabs}.

``` {r dta-table, results="asis"}
## Not displayed ##
# Create a data dictionary table.
tmp <- dta.labs
colnames(tmp) <- c("Variable name", "Description", "Type")
kable(tmp, format = "latex",
      caption = "pbc data set variable dictionary.\\label{T:dataLabs}",
      row.names = FALSE)
```

## Exploratory data analysis

It is good practice to view your data before beginning analysis. Exploratory Data Analysis (EDA) [@Tukey:1977] will help you to understand the data, and find outliers, missing values and other data anomalies within each variable before getting deep into the analysis. To this end, we use \pkg{ggplot2} figures with the `facet_wrap` function to create two sets of panel plots, one of histograms for categorical variables  (\autoref{fig:categoricalEDA}), and another of scatter plots for continuous variables (\autoref{fig:continuousEDA}). Variables are plotted along a continuous variable on the X-axis to separate the individual observations.

``` {r categoricalEDA, fig.width=7, fig.cap="EDA plots for categorical variables (logicals and factors). Bars indicate number of patients within 1 year of followup interval for each categorical variable. Colors correspond to class membership within each variable. Missing values are included in white."}
## Not displayed ##
# Use tidyr::gather to transform the data into long format.
cnt <- c(which(cls == "numeric" ), which(cls == "integer"))
fct <- setdiff(1:ncol(pbc), cnt) # The complement of numeric/integers.
fct <- c(fct, which(colnames(pbc) == "years"))
dta <- suppressWarnings(gather(pbc[,fct], variable, value, -years))

# Plot panels for each categorical variable, with bars colored by class membership.
ggplot(dta, aes(x = years, fill = value)) +
  geom_histogram(color = "black", binwidth = 1) +
  labs(y = "", x = st.labs["years"]) +
  scale_fill_brewer(palette="RdBu",na.value = "white" ) +
  facet_wrap(~variable, scales = "free_y", nrow = 2) +
  theme(legend.position = "none")

In categorical EDA plots (\autoref{fig:categoricalEDA}), we are looking for patterns of missing data (white portion of bars). We often use surgical date for our X-axis variable to look for possible periods of low enrollment. There is not a comparable variable available in the pbc data set, so instead we used follow up time (years). Another reasonable choice may have been to use the patient age variable for the X-axis. The important quality of the selected variable is to spread the observations out to aid in finding data anomalies.

``` {r continuousEDA, fig.cap="EDA plots for continuous variables. Symbols indicate observations with variable value on Y-axis against follow up time in years. Symbols are colored and shaped according to the death event (`status` variable). Missing values are indicated by rug marks along the X-axis.", fig.width=7, fig.height=4}
## Not displayed ##
# Use tidyr::gather to transform the data into long format.
cnt <- c(cnt, which(colnames(pbc) == "status"))
dta <- gather(pbc[,cnt], variable, value, -years, -status)

# Plot panels for each continuous variable, colored by the death event.
ggplot(dta %>% filter(!is.na(value)),
       aes(x = years, y = value, color = status, shape = status)) +
  geom_point(alpha = 0.4) +
  geom_rug(data = dta[which(is.na(dta$value)),], color = "grey50") +
  labs(y = "", x = st.labs["years"], color = "Death", shape = "Death") +
  scale_color_manual(values = strCol) +
  scale_shape_manual(values = event.marks) +
  facet_wrap(~variable, scales = "free_y", ncol = 4) +
  theme(legend.position = c(0.8, 0.2))
```

In continuous data EDA plots (\autoref{fig:continuousEDA}), we are looking for missingness (rug marks) and extreme or non-physical values. For survival settings, we color and shape the points as red `x` marks to indicate events, and blue circles to indicate censored observations.

Extreme value examples are evident in a few of the variables in \autoref{fig:continuousEDA}. We are typically looking for values that are outside of the biological range. This is often caused by measurements recorded in differing units, which can sometimes be corrected algorithmically. Since we cannot ask the original investigator to clarify these values in this particular study, we will continue without modifying the data.

``` {r missing, results="asis"}
## Not displayed ##
# create a missing data table
pbc.trial <- pbc %>% filter(!is.na(treatment))
st <- apply(pbc,2, function(rw){sum(is.na(rw))})
st.t <- apply(pbc.trial,2, function(rw){sum(is.na(rw))})
st <- data.frame(cbind(full = st, trial = st.t))
st <- st[which(st$full>0),]
colnames(st) <- c("pbc", "pbc.trial")

kable(st,
      format="latex",
      caption = "Missing value counts in `pbc` data set and pbc clinical trial observations (`pbc.trial`).\\label{T:missing}",
      digits = 3, booktabs=TRUE)
```

Both EDA figures indicate the pbc data set contains quite a bit of missing data. \autoref{T:missing} shows the number of missing values in each variable of the pbc data set. Of the `r ncol(pbc)` variables in the data, `r nrow(st)` have missing values. The pbc column details variables with missing data in the full pbc data set, though there are `r st["treatment", "full"]` patients that were not randomized into the trial. If we restrict the data to the trial only, most of the missing values are also removed, leaving only `r sum(st$pbc.trial>0)` variables with missing values. Therefore, we will focus on the `r nrow(pbc.trial)` observations from the clinical trial for the remainder of this document. We will discuss how \pkg{randomForestSRC} handles missing values in \autoref{random-forest-imputation}.

## PBC Model Summary

We conclude the data set investigation with a summary of the [@fleming:1991] model results from Chapter 4.4. We start by generating Kaplan--Meier (KM) survival estimates comparing the treatment groups of DPCA and placebo. We use the \pkg{ggRandomForests} `gg_survival` function to generate these estimates from the data set as follows.

``` {r gg_survival, echo=TRUE}
# Create the trial and test data sets.
pbc.trial <- pbc %>% filter(!is.na(treatment))
pbc.test <- pbc %>% filter(is.na(treatment))

# Create the gg_survival object.
gg_dta <- gg_survival(interval = "years", censor = "status",
                      by = "treatment", data = pbc.trial,
                      conf.int = 0.95)
```

This code block reduces the `pbc` data set to the `pbc.trial` set, which only includes observations from the clinical trial. The remaining observations are stored in the `pbc.test` data set for later use. The \pkg{ggRandomForests} package is designed to use a two step process in figure generation. The first step is data generation, where we store a `gg_survival` data object in the `gg_dta` object. The `gg_survival` function uses the `data` set, follow up `interval`, `censor` indicator and an optional grouping argument (`by`). By default `gg_survival` also calculates a $95\%$ confidence band, which we can control with the `conf.int` argument.

In the figure generation step, we use the \pkg{ggRandomForests} plot routine `plot.gg_survival` as shown in the following code block. The `plot.gg_survival` function uses the `gg_dta` data object to plot the survival estimate curves for each group and corresponding confidence interval ribbons. We have used additional \pkg{ggplot2} commands to modify the axis and legend labels (`labs`), the legend location (`theme`) and control the plot range of the y-axis (`coord_cartesian`) for this figure.

``` {r plot_gg_survival, fig.cap="Kaplan--Meier survival estimates comparing the DPCA treatment (red) with placebo (blue) groups for the pbc.trial data set. Median survival with shaded 95\\% confidence band.", echo=TRUE}
plot(gg_dta) +
  labs(y = "Survival Probability", x = "Observation Time (years)",
       color = "Treatment", fill = "Treatment") +
  theme(legend.position = c(0.2, 0.2)) +
  coord_cartesian(ylim = c(0, 1.01))
```

The `gg_survival` plot of \autoref{fig:plot_gg_survival} is analogous to [@fleming:1991] Figure 0.2.3 and Figure 4.4.1, showing there is little difference between the treatment and control groups.

The `gg_survival` function generates a variety of time-to-event estimates, including the cumulative hazard. The following code block creates a cumulative hazard plot \cite[Figure 0.2.1]{fleming:1991} in \autoref{fig:plot_gg_cum_hazard} using the same data object generated by the original `gg_survival` function call. The red DPCA line is directly comparable to Figure 0.2.1; we have added the cumulative hazard estimates for the placebo population in blue.

``` {r plot_gg_cum_hazard, fig.cap="Kaplan--Meier cumulative hazard estimates comparing the DPCA treatment (red) with placebo (blue) groups for the pbc data set.", echo=TRUE}
plot(gg_dta, type = "cum_haz") +
  labs(y = "Cumulative Hazard", x = "Observation Time (years)",
       color = "Treatment", fill = "Treatment") +
  theme(legend.position = c(0.2, 0.8)) +
  coord_cartesian(ylim = c(-0.02, 1.22))
```

In \autoref{fig:plot_gg_survival}, we demonstrated grouping on the categorical variable (`treatment`). To demonstrate plotting grouped survival on a continuous variable, we examine KM estimates of survival within stratified groups of bilirubin measures. The groupings are obtained directly from [@fleming:1991] Figure 4.4.2, where they presented univariate model results of predicting survival as a function of bilirubin.

We set up the `bili` groups on a temporary data set (`pbc.bili`) using the `cut` function with intervals matching the reference figure. For this example we combine the data generation and plot steps into a single line of code. The `error` argument of the `plot.gg_survival` function is used to control display of the confidence bands. We suppress the intervals for this figure with `error = "none"` and again modify the plot display with \pkg{ggplot2} commands to generate \autoref{fig:gg_survival-bili}.

``` {r gg_survival-bili, fig.cap="Kaplan--Meier survival estimates comparing different groups of Bilirubin measures (bili) for the pbc data set. Groups defined in Chapter 4 of [@fleming:1991].", echo=TRUE, fig.width=5.5}
pbc.bili <- pbc.trial
pbc.bili$bili_grp <- cut(pbc.bili$bili, breaks = c(0, 0.8, 1.3, 3.4, 29))

plot(gg_survival(interval = "years", censor = "status", by = "bili_grp",
                 data = pbc.bili), error = "none") +
  labs(y = "Survival Probability", x = "Observation Time (years)",
       color = "Bilirubin")

In Chapter 4, [@fleming:1991] use partial likelihood methods to build a linear model with log transformations on some variables. We summarize the final, biologically reasonable model in \autoref{T:FHmodel} for later comparison with our random forest results.

``` {r xtab, results="asis"}

Not displayed

Create a table summarizing the ph model from fleming and harrington 1991

fleming.table <- data.frame(matrix(ncol = 3, nrow = 5)) rownames(fleming.table) <- c("Age", "log(Albumin)", "log(Bilirubin)", "Edema", "log(Prothrombin Time)") colnames(fleming.table) <- c("Coef.", "Std. Err.", "Z stat.") fleming.table[,1] <- c(0.0333, -3.0553,0.8792, 0.7847, 3.0157) fleming.table[,2] <- c(0.00866, 0.72408,0.09873,0.29913,1.02380) fleming.table[,3] <- c(3.84,-4.22,8.9,2.62,2.95)

kable(fleming.table, format="latex", caption = "pbc proportional hazards model summary of 312 randomized cases in pbc.trial data set. (Table 4.4.3c [@fleming:1991])\label{T:FHmodel}", digits = 3)

# Random survival forest

A Random Forest [@Breiman:2001] is grown by \emph{bagging} [@Breiman:1996] a collection of \emph{classification and regression trees} (CART) [@cart:1984]. The method uses a set of $B$ \emph{bootstrap} [@bootstrap:1994] samples, growing an independent tree model on each sub-sample of the population. Each tree is grown by recursively partitioning the population based on optimization of a \emph{split rule} over the $p$-dimensional covariate space. At each split, a subset of $m \le p$ candidate variables are tested for the split rule optimization, dividing each node into two daughter nodes. Each daughter node is then split again until the process reaches the \emph{stopping criteria} of either \emph{node purity} or \emph{node member size}, which defines the set of \emph{terminal (unsplit) nodes} for the tree. In regression trees, node impurity is measured by mean squared error, whereas in classification problems, the Gini index is used [@Friedman:2000].
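
As a concrete reference for the two split criteria just named, here is a minimal sketch (ours, not from the original analysis) of the node impurity measures:

```r
# Node impurity measures used by CART-style trees.
impurity_mse <- function(y) mean((y - mean(y))^2)  # regression nodes
impurity_gini <- function(y) {                     # classification nodes
  p <- table(y) / length(y)  # class proportions within the node
  1 - sum(p^2)
}

impurity_mse(c(1.2, 0.8, 1.1))
impurity_gini(c("a", "a", "b", "b"))  # 0.5, a maximally impure 2-class node
```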

Random forest sorts each training set observation into one unique terminal node per tree. Tree estimates for each observation are constructed at each terminal node, among the terminal node members. The random forest estimate for each observation is then calculated by aggregating the terminal node results across the collection of $B$ trees, averaging for regression or voting for classification.

Random Survival Forest [@Ishwaran:2007; @Ishwaran:2008] (RSF) is an extension of Random Forest to analyze right censored, time to event data. A forest of survival trees is grown using a log-rank splitting rule to select the optimal candidate variables. Survival estimates for each observation are constructed with a Kaplan--Meier (KM) estimator within each terminal node, at each event time.

Random Survival Forests adaptively discover nonlinear effects and interactions and are fully nonparametric. Averaging over many trees enables RSF to approximate complex survival functions, including non-proportional hazards, while maintaining low prediction error. [@Ishwaran:2010a] showed that RSF is uniformly consistent and that survival forests have a uniform approximating property in finite-sample settings, a property not possessed by individual survival trees.

The \pkg{randomForestSRC} `rfsrc` function call grows the forest, determining the type of forest by the response supplied in the `formula` argument. In the following code block, we grow a random forest for survival, by passing a survival (`Surv`) object to the forest. The forest uses all remaining variables in the `pbc.trial` data set to generate the RSF survival model.

``` {r rfsrc-pbc, echo=TRUE}
rfsrc_pbc <- rfsrc(Surv(years, status) ~ ., data = pbc.trial,
                   nsplit = 10, na.action = "na.impute",
                   tree.err = TRUE, importance = TRUE)
```

The `print.rfsrc` function returns information on how the random forest was grown. Here the `family = "surv"` forest has `ntree = 1000` trees (the default `ntree` argument). The forest selected from $\lceil \sqrt{p} \rceil = \lceil \sqrt{17} \rceil = 5$ randomly chosen candidate variables for splitting at each node, stopping when a terminal node contained three or fewer observations. For continuous variables, we used a random logrank split rule, which randomly selects from `nsplit = 10` split point values, instead of optimizing over all possible values.
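
Although the original document does not show the call, the summary can be printed directly from the forest object; a minimal example:

```r
# Print the forest summary: family, ntree, mtry, terminal node size,
# splitting rule and OOB error estimate.
print(rfsrc_pbc)
```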

## Generalization error

One advantage of random forest is a built in generalization error estimate. Each bootstrap sample selects approximately $63.2\%$ of the population on average. The remaining $36.8\%$ of observations, the Out-of-Bag [@BreimanOOB:1996e] (OOB) sample, can be used as a hold out test set for each tree. An OOB prediction error estimate can be calculated for each observation by predicting the response over the set of trees which were not trained with that particular observation. Out-of-Bag prediction error estimates have been shown to be nearly identical to $n$--fold cross validation estimates [@StatisticalLearning:2009]. This feature of random forest allows us to obtain both model fit and validation in one pass of the algorithm.
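
We can sanity check the $63.2\%$ figure with a quick simulation. The following sketch is ours, not part of the original analysis; it uses only base R sampling.

```r
# Expected fraction of unique observations in a bootstrap sample:
# 1 - (1 - 1/n)^n, which approaches 1 - exp(-1) ~ 0.632 as n grows.
set.seed(42)
n <- 1000
in_bag <- replicate(500, length(unique(sample(n, n, replace = TRUE))) / n)
mean(in_bag)  # approximately 0.632; the remaining 36.8% are OOB
1 - exp(-1)   # theoretical limit
```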

The `gg_error` function operates on the random forest (`rfsrc_pbc`) object to extract the error estimates as a function of the number of trees in the forest. The following code block first creates a `gg_error` data object, then uses the `plot.gg_error` function to create a `ggplot` object for display in a single line of code.

``` {r errorPlot, fig.cap="Random forest OOB prediction error estimate as a function of the number of trees in the forest."}
plot(gg_error(rfsrc_pbc))
```

The gg_error plot of \autoref{fig:errorPlot} demonstrates that it does not take a large number of trees to stabilize the forest prediction error estimate. However, to ensure that each variable has enough of a chance to be included in the forest prediction process, we do want to create a rather large random forest of trees.

## Training Set Prediction

The `gg_rfsrc` function extracts the OOB prediction estimates from the random forest. This code block executes the data extraction and plotting in one line, since we are not interested in holding the prediction estimates for later reuse. Each of the \pkg{ggRandomForests} plot commands returns a `ggplot` object, which we can also store for modification or reuse later in the analysis (the `ggRFsrc` object). Note that we again use additional \pkg{ggplot2} commands to modify the display of the plot object.

``` {r rfsrc-plot, fig.cap="Random forest OOB predicted survival. Blue curves correspond to censored patients, red curves to patients experiencing a death event."}
ggRFsrc <- plot(gg_rfsrc(rfsrc_pbc), alpha = 0.2) +
  scale_color_manual(values = strCol) +
  theme(legend.position = "none") +
  labs(y = "Survival Probability", x = "Time (years)") +
  coord_cartesian(ylim = c(-0.01, 1.01))
show(ggRFsrc)
```

The gg_rfsrc plot of \autoref{fig:rfsrc-plot} shows the predicted survival from our RSF model. Each line represents a single patient in the training data set, where censored patients are colored blue, and patients who have experienced the event (death) are colored in red. We extend all predicted survival curves to the longest follow up time (12 years), regardless of the actual length of a patient's follow up time.

Interpretation of general survival properties from \autoref{fig:rfsrc-plot} is difficult because of the number of curves displayed. To get more interpretable results, it is preferable to plot a summary of the survival results. The following code block compares the predicted survival between treatment groups, as we did in \autoref{fig:plot_gg_survival}.

``` {r rfsrc-mean2, fig.cap="Random forest predicted survival comparing the DPCA treatment (red) with placebo (blue) groups. Median survival with shaded 95\\% confidence band."}
plot(gg_rfsrc(rfsrc_pbc, by = "treatment")) +
  theme(legend.position = c(0.2, 0.2)) +
  labs(y = "Survival Probability", x = "Time (years)") +
  coord_cartesian(ylim = c(-0.01, 1.01))
```

The `gg_rfsrc` plot of \autoref{fig:rfsrc-mean2} shows the median survival with a $95\%$ shaded confidence band for the DPCA group in red, and the placebo group in blue. When calling `gg_rfsrc` with either a `by` argument or a `conf.int` argument, the function calculates a bootstrap confidence interval around the median survival line. By default, the function calculates the `conf.int = 0.95` confidence interval, with the number of `bs.samples` equal to the number of observations.
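
As a usage sketch (not a figure in this document), the same confidence interval machinery can be invoked without grouping; the argument names follow the description above:

```r
# Median survival over all trial patients with a bootstrapped 90%
# confidence band. bs.samples is left at its default, the number of
# observations in the training data.
plot(gg_rfsrc(rfsrc_pbc, conf.int = 0.90)) +
  labs(y = "Survival Probability", x = "Time (years)")
```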

## Random forest imputation

There are two modeling issues when dealing with missing data values: how does the algorithm build a model when values are missing from the training data, and how does the algorithm predict a response when values are missing from the test data? The standard procedure for linear models is to either remove or impute the missing data values before modeling. Removing the missingness is done by either removing the variable with missing values (column wise) or removing the observations (row wise). Removal is a simple solution, but may bias results when either observations or variables are scarce.
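
To see why row wise removal is costly here, we can count the complete cases in the `pbc` data; a one-line check of ours, not part of the original analysis:

```r
# Complete-case analysis drops every observation with any missing value.
dim(pbc)           # all observations and variables
dim(na.omit(pbc))  # complete cases only
```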

The \pkg{randomForestSRC} package imputes missing values using \emph{adaptive tree imputation} [@Ishwaran:2008]. Rather than impute missing values before growing the forest, the algorithm takes a just-in-time approach. At each node split, the set of `mtry` candidate variables is checked for missing values. Missing values are then imputed by randomly drawing values from non-missing data within the node. The split statistic is then calculated using only the observed (non-missing) values. The imputed values are used to sort observations into the subsequent daughter nodes and are then discarded before the next split occurs. The process is repeated until the stopping criterion is reached and all observations are sorted into terminal nodes.

A final imputation step can be used to fill in missing values from within the terminal nodes. This step uses a process similar to the previous imputation but uses the OOB non-missing terminal node data for the random draws. These values are aggregated (averaging for continuous variables, voting for categorical variables) over the `ntree` trees in the forest to estimate an imputed data set. By default, the missing values are not filled into the training data, but are available within the forest object for later use if desired.
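
Based on our reading of the \pkg{randomForestSRC} documentation, those stored values can be retrieved directly from the forest object; a sketch:

```r
# Forests grown with na.action = "na.impute" carry imputed values along.
# imputed.indv indexes the observations that had missing values;
# imputed.data holds their (imputed) covariate rows.
head(rfsrc_pbc$imputed.indv)
head(rfsrc_pbc$imputed.data)
```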

Adaptive tree imputation still requires the missing at random assumption [@Rubin:1976]. At each imputation step, the random forest assumes that similar observations are grouped together within each node. The random draws used to fill in missing data do not bias the split rule, but only sort observations similar in non-missing data into like nodes. An additional feature of this approach is the ability to predict on test set observations with missing values.

## Test set predictions

The strength of adaptive tree imputation becomes clear when doing prediction on test set observations. If we want to predict survival for patients that did not participate in the trial using the model we created in \autoref{random-survival-forest}, we need to somehow account for the missing values detailed in \autoref{T:missing}.

The `predict.rfsrc` call takes the forest object (`rfsrc_pbc`) and the test data set (`pbc.test`), and returns predicted survival using the same forest imputation method for missing values within the test data set (`na.action = "na.impute"`).

``` {r predict, echo=TRUE}
rfsrc_pbc_test <- predict(rfsrc_pbc, newdata = pbc.test,
                          na.action = "na.impute",
                          importance = TRUE)
```

The forest summary indicates there are 106 test set observations with 36 deaths and the predicted error rate is $19.1\%$. We plot the predicted survival just as we did the training set estimates.

``` {r predictPlot, fig.cap="Random forest predicted survival for the test set observations. Blue curves correspond to censored patients, red curves to patients experiencing a death event."}
plot(gg_rfsrc(rfsrc_pbc_test), alpha = .2) +
  scale_color_manual(values = strCol) +
  theme(legend.position = "none") +
  labs(y = "Survival Probability", x = "Time (years)") +
  coord_cartesian(ylim = c(-0.01, 1.01))
```

The `gg_rfsrc` plot of \autoref{fig:predictPlot} shows the test set predictions, similar to the training set predictions in \autoref{fig:rfsrc-plot}, though with fewer patients the survival curves do not cover the same area of the figure. It is important to note that because \autoref{fig:rfsrc-plot} is constructed with OOB estimates, those survival results are directly comparable with the test set estimates from unseen observations in \autoref{fig:predictPlot}.

# Variable selection

Random forest is not a parsimonious method, but uses all variables available in the data set to construct the response predictor. Also, unlike parametric models, random forest does not require the explicit specification of the functional form of covariates to the response. Therefore there is no explicit $p$-value/significance test for variable selection with a random forest model. Instead, RF ascertains which variables contribute to the prediction through the split rule optimization, optimally choosing variables which separate observations.

The typical goal of a random forest analysis is to build a \emph{prediction} model, in contrast to extracting \emph{information} regarding the underlying process [@Breiman:twoCultures:2001]. There is not usually much care given in how variables are included into the training data set. Since the goal is prediction, investigators often include the ``kitchen sink'' if it can help.

In contrast, in survival settings we are typically also interested in how we can possibly improve the outcome of interest. To achieve this understandable inference, it is important to avoid both duplication and transformations of variables whenever possible when building our data sets. Duplication of variables, including multiple measures of a similar covariate, can reduce or mask the importance of the covariate. Transformations can also mask importance as well as make interpretation of the inference results difficult to impossible.

In this section, we explore two separate approaches to investigate the RF variable selection process: Variable Importance (\autoref{variable-importance}), a property related to variable misspecification, and Minimal Depth (\autoref{minimal-depth}), a property derived from the construction of the trees within the forest.

## Variable Importance

\emph{Variable importance} (VIMP) was originally defined in CART using a measure involving surrogate variables (see Chapter 5 of [@cart:1984]). The most popular VIMP method uses a prediction error approach involving ``noising-up'' each variable in turn. VIMP for a variable $x_v$ is the difference between prediction error when $x_v$ is randomly permuted, compared to prediction error under the observed values [@Breiman:2001; @Liaw:2002; @Ishwaran:2007; @Ishwaran:2008].

Since VIMP is the difference in OOB prediction error before and after permutation, a large VIMP value indicates that misspecification detracts from the predictive accuracy in the forest. VIMP close to zero indicates the variable contributes nothing to predictive accuracy, and negative values indicate the predictive accuracy \emph{improves} when the variable is misspecified. In the latter case, we assume noise is more informative than the true variable. As such, we ignore variables with negative and near zero values of VIMP, relying on large positive values to indicate that the predictive power of the forest is dependent on those variables.
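
The permutation idea is simple enough to sketch by hand. The following toy example is ours, uses the built-in `mtcars` data, and is only an approximation of the per-tree OOB calculation the package performs:

```r
# Toy permutation-importance sketch on a regression forest.
library("randomForestSRC")
set.seed(1)
rf <- rfsrc(mpg ~ ., data = mtcars)
err <- tail(rf$err.rate, 1)        # baseline OOB prediction error

noised <- mtcars
noised$wt <- sample(noised$wt)     # "noise up" wt by permutation
err_perm <- tail(predict(rf, newdata = noised)$err.rate, 1)
err_perm - err                     # large positive difference: wt matters
```

The package's own VIMP values are available in the forest object (here `rfsrc_pbc$importance`), since we grew the forest with `importance = TRUE`.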

The `gg_vimp` function extracts VIMP measures for each of the variables used to grow the forest. The `plot.gg_vimp` function shows the variables, in VIMP rank order, labeled with the named vector in the `lbls` argument.

``` {r rf-vimp, fig.cap="Random forest VIMP plot. Bars are colored by the sign of the VIMP measure (red for negative values)."}
plot(gg_vimp(rfsrc_pbc), lbls = st.labs) +
  theme(legend.position = c(0.8, 0.2)) +
  labs(fill = "VIMP > 0")

## Calculate for use in the document text.
ggda <- gg_vimp(rfsrc_pbc)
```

The `gg_vimp` plot of \autoref{fig:rf-vimp} details VIMP ranking for the `pbc.trial` baseline variables, from the largest (`r gsub(" \\(mg/dl\\)", "", st.labs[as.character(ggda$vars)[1]])`) at the top, to the smallest (`r gsub(" \\(mg/dl\\)", "", st.labs[as.character(ggda$vars)[nrow(ggda)]])`) at the bottom. VIMP measures are shown using bars to compare the scale of the error increase under permutation and are colored by the sign of the measure (red for negative values). Note that four of the five highest ranking variables by VIMP match those selected by the [@fleming:1991] model listed in \autoref{T:FHmodel}, with urine copper (2) ranking higher than age (8). We will return to this in \autoref{variable-selection-comparison}.

## Minimal Depth

In VIMP, prognostic risk factors are determined by testing the forest prediction under alternative data settings, ranking the most important variables according to their impact on predictive ability of the forest. An alternative method uses inspection of the forest construction to rank variables. \emph{Minimal depth} [@Ishwaran:2010; @Ishwaran:2011] assumes that variables with high impact on the prediction are those that most frequently split nodes nearest to the root node, where they partition the largest samples of the population.

Within each tree, node levels are numbered based on their relative distance to the root of the tree (with the root at 0). Minimal depth measures important risk factors by averaging the depth of the first split for each variable over all trees within the forest. The assumption in the metric is that smaller minimal depth values indicate the variable separates large groups of observations, and therefore has a large impact on the forest prediction.
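
For readers who want the raw numbers, our understanding of \pkg{randomForestSRC} is that the `max.subtree` function returns these forest-averaged first-split depths (the `var.select` function used below wraps this machinery); a sketch:

```r
# Forest-averaged minimal depth for each variable; smaller values
# indicate more important variables.
md_obj <- max.subtree(rfsrc_pbc)
sort(md_obj$order[, 1])  # first-order (minimal) depth, per variable
md_obj$threshold         # mean minimal depth threshold
```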

In general, to select variables according to VIMP, we examine the VIMP values, looking for some point along the ranking where there is a large difference in VIMP measures. Given minimal depth is a quantitative property of the forest construction, \cite{Ishwaran:2010} also derive an analytic threshold for evidence of variable impact. A simple optimistic threshold rule uses the mean of the minimal depth distribution, classifying variables with minimal depth lower than this threshold as important in forest prediction.

The \pkg{randomForestSRC} `var.select` function uses the minimal depth methodology for variable selection, returning an object with both minimal depth and VIMP measures. The \pkg{ggRandomForests} `gg_minimal_depth` function is analogous to the `gg_vimp` function. Variables are ranked from most important at the top (smallest minimal depth measure), to least at the bottom (largest minimal depth).

``` {r mindepth, echo=TRUE}
varsel_pbc <- var.select(rfsrc_pbc)
gg_md <- gg_minimal_depth(varsel_pbc, lbls = st.labs)
# print(gg_md)
```

The `gg_minimal_depth` summary mostly reproduces the output from the `var.select` function from the \pkg{randomForestSRC} package. We report the minimal depth threshold (threshold `r round(gg_md$md.obj$threshold, digits=3)`) and the number of variables with depth below that threshold (model size `r gg_md$modelsize`). We also list a table of the top `r gg_md$modelsize` selected variables, in minimal depth rank order with the associated VIMP measures. The minimal depth numbers indicate that `bili` tends to split between the first and second node level, and the next three variables (`albumin`, `copper`, `prothrombin`) split between the second and third levels on average.

``` {r mindepth-plot, fig.cap="Minimal Depth variable ranking. The dashed vertical line indicates the minimal depth threshold."}
plot(gg_md, lbls = st.labs)
```

The gg_minimal_depth plot of \autoref{fig:mindepth-plot} is similar to the gg_vimp plot in \autoref{fig:rf-vimp}, ranking variables from most important at the top (minimal depth measure), to least at the bottom (maximal minimal depth). The vertical dashed line indicates the minimal depth threshold where smaller minimal depth values indicate higher importance and larger values indicate lower importance.

## Variable selection comparison

Since the VIMP and Minimal Depth measures use different criteria, we expect the variable rankings to be somewhat different. We use the `gg_minimal_vimp` function to compare rankings between minimal depth and VIMP in \autoref{fig:depthVimp}.

``` {r depthVimp, fig.cap="Comparing Minimal Depth and VIMP variable rankings. Points above the dashed line are ranked higher by VIMP; points below are ranked higher by minimal depth."}
plot(gg_minimal_vimp(gg_md), lbls = st.labs) +
  theme(legend.position=c(0.8, 0.2))
```

The points along the red dashed line indicate where the measures are in agreement. Points above the red dashed line are ranked higher by VIMP than by minimal depth, indicating the variables are more sensitive to misspecification. Those below the line have a higher minimal depth ranking, indicating they are better at dividing large portions of the population. The further the points are from the line, the more the discrepancy between measures.

``` {r modelComp, results="asis"}
## Not displayed ##
# Compare the FH model, minimal depth and VIMP variable rankings.
fleming.table$nm <- c("age","albumin", "bili","edema", "prothrombin")
fh.model <- data.frame(cbind(names = fleming.table$nm,
                             FH = order(abs(fleming.table$`Z stat.`),
                                        decreasing = TRUE),
                             Variable=rownames(fleming.table),
                             Coeff=fleming.table$Coef.
                             ))
gg_v <- gg_vimp(rfsrc_pbc)
gg_v$rank <- 1:nrow(gg_v)
rownames(gg_v) <- gg_v$vars
md <- data.frame(cbind(names=gg_md$topvars))
md$rank <- 1:nrow(md)
rownames(md) <- gg_md$topvars
md$vimp <- gg_v[rownames(md),]$rank

md <- left_join(md, fh.model, by = "names")
md <- md[,c(1, 4, 2,3)]
colnames(md) <- c("Variable", "FH","Min depth", "VIMP" )
kable(md,
      format="latex",
      caption = "Comparison of variable selection criteria. Minimal depth ranking, VIMP ranking and [@fleming:1991] (FH) proportional hazards model ranked according to `abs(Z stat)` from Table \\ref{T:FHmodel}.\\label{T:modelComp}",
      align=c("l", "r","r","r"),
      digits = 3,
      row.names = FALSE)
```

We examine the ranking of the different variable selection methods further in \autoref{T:modelComp}. We can use the Z statistic from \autoref{T:FHmodel} to rank variables selected in the [@fleming:1991] model to compare with variables selected by minimal depth and VIMP. The table is constructed by taking the `r nrow(gg_md)` top ranked minimal depth variables (below the selection threshold) and matching the VIMP ranking and the [@fleming:1991] model transforms. We see all three methods indicate a strong relation of serum bilirubin to survival, and overall, the minimal depth and VIMP rankings agree reasonably well with the [@fleming:1991] model.

The minimal depth selection process reduced the number of variables of interest from `r ncol(pbc)-2` to `r length(varsel_pbc$topvars)`, which is still a rather large subset of interest. An obvious selection set is to examine the five variables selected by [@fleming:1991]. Combining the Minimal Depth and [@fleming:1991] model results, there may be evidence to keep the top 7 variables. Though minimal depth does not indicate the edema variable is very interesting, the VIMP ranking does agree with the proportional hazards model, indicating we might not want to remove the edema variable. Both minimal depth and VIMP suggest including copper, a measure associated with liver disease.

Regarding the `chol` variable, recall the missing data summary of \autoref{T:missing}. In the trial data set, there were 28 observations missing `chol` values. The forest imputation randomly sorts observations with missing values into daughter nodes when using the `chol` variable, which is also how \pkg{randomForestSRC} calculates VIMP. We therefore expect low VIMP values when a variable has a reasonable number of missing values.

Restricting our remaining analysis to the five [@fleming:1991] variables plus copper retains the biological sense of this analysis. We will now examine how these six variables are related to survival using variable dependence methods to determine the direction of the effect and verify that the log transforms used by [@fleming:1991] are appropriate.

# Variable/Response dependence

As random forest is not parsimonious, we have used minimal depth and VIMP to reduce the number of variables to a manageable subset. Once we have an idea of which variables contribute most to the predictive accuracy of the forest, we would like to know how the response depends on these variables.

Although often characterized as a \emph{black box} method, the forest predictor is a function of the predictor variables $\hat{f}_{RF} = f(x).$ We use graphical methods to examine the forest predicted response dependency on covariates. We again have two options, variable dependence plots (\autoref{variable-dependence}) are quick and easy to generate, and partial dependence plots (\autoref{partial-dependence}) are more computationally intensive but give us a risk adjusted look at variable dependence.

## Variable dependence

\emph{Variable dependence} plots show the predicted response relative to a covariate of interest, with each training set observation represented by a point on the plot. Interpretation of variable dependence plots can only be in general terms, as point predictions are a function of all covariates in that particular observation.

Variable dependence is straightforward to calculate, involving only getting the predicted response for each observation. In survival settings, we must account for the additional dimension of time. We plot the response at specific time points of interest, for example survival at 1 or 3 years.

``` {r rfsrc-plot3Mnth, fig.cap="Random forest predicted survival (as in \\autoref{fig:rfsrc-plot}) with vertical dashed lines indicating the 1 and 3 year survival estimates."}
ggRFsrc +
  geom_vline(aes(xintercept = 1), linetype = "dashed") +
  geom_vline(aes(xintercept = 3), linetype = "dashed") +
  coord_cartesian(xlim = c(0, 5))
```

The `gg_rfsrc` plot of \autoref{fig:rfsrc-plot3Mnth} is identical to \autoref{fig:rfsrc-plot} (stored in the `ggRFsrc` variable) with the addition of vertical dashed lines at the 1 and 3 year survival times. A variable dependence plot is generated from the predicted response value of each survival curve at the intersecting time line plotted against the covariate value for that observation. This can be visualized as taking a slice of the predicted response at each time line, and spreading the resulting points out along the variable of interest.

The `gg_variable` function extracts the training set variables and the predicted OOB response from `rfsrc` and `predict` objects. In the following code block, we store the `gg_variable` data object for later use (`gg_v`), as all remaining variable dependence plots can be constructed from this object.

``` {r variable-plotbili, fig.cap="Variable dependence of predicted survival at 1 and 3 years on the bili variable."}
gg_v <- gg_variable(rfsrc_pbc, time = c(1, 3),
                    time.labels = c("1 Year", "3 Years"))

plot(gg_v, xvar = "bili", alpha = 0.4) + #, se=FALSE
  labs(y = "Survival", x = st.labs["bili"]) +
  theme(legend.position = "none") +
  scale_color_manual(values = strCol, labels = event.labels) +
  scale_shape_manual(values = event.marks, labels = event.labels) +
  coord_cartesian(ylim = c(-0.01, 1.01))
```

The `gg_variable` plot of \autoref{fig:variable-plotbili} shows variable dependence for the Serum Bilirubin (`bili`) variable. Again, censored cases are shown as blue circles, events are indicated by the red `x` symbols. Each predicted point is dependent on the full combination of all other covariates, not only on the covariate displayed in the dependence plot. The smooth loess line [@cleveland:1981; @cleveland:1988] indicates the trend of the prediction over the change in the variable.

Examination of \autoref{fig:variable-plotbili} indicates most of the cases are grouped in the lower end of `bili` values. We also see that most of the higher values experienced an event. The ``normal'' range of Bilirubin is from 0.3 to 1.9 mg/dL, indicating the distribution from our population is well outside the normal range. These values make biological sense considering Bilirubin is a pigment created in the liver, the organ affected by the PBC disease. The figure also shows that the risk of death increases as time progresses. The risk at 3 years is much greater than that at 1 year for patients with high Bilirubin values compared to those with values closer to the normal range.

The `plot.gg_variable` function call operates on the `gg_variable` object controlled by the list of variables of interest in the `xvar` argument. By default, the `plot.gg_variable` function returns a list of `ggplot` objects, one figure for each variable named in `xvar`. The remaining arguments are passed to internal \pkg{ggplot2} functions controlling the display of the figure. The `se` argument is passed to the internal call to `geom_smooth` for fitting smooth lines to the data. The `alpha` argument lightens the color of the points in the `geom_point` call, making it easier to see point over plotting. We also demonstrate modification of the plot labels using the `labs` function and point attributes with the `scale_` functions.

An additional `plot.gg_variable` argument (`panel = TRUE`) can be used to combine multiple variable dependence plots into a single figure. In the following code block, we plot the remaining continuous variables of interest found in \autoref{T:modelComp}.

xvar <- c("bili", "albumin", "copper", "prothrombin", "age")
xvar.cat <- c("edema")

plot(gg_v, xvar = xvar[-1], panel = TRUE, alpha = 0.4) +
  labs(y = "Survival") +
  theme(legend.position = "none") +
  scale_color_manual(values = strCol, labels = event.labels) +
  scale_shape_manual(values = event.marks, labels = event.labels) +
  coord_cartesian(ylim = c(-0.05, 1.05))

The `gg_variable` plot in \autoref{fig:variable-plot} displays a panel of the remaining continuous variable dependence plots. The panels are sorted in the order of variables in the `xvar` argument and include a smooth loess line [@cleveland:1981; @cleveland:1988] to indicate the trend of the prediction dependence over the covariate values. The `se = FALSE` argument turns off the loess confidence band, and the `span = 1` argument controls the degree of smoothing.

The figures indicate that survival increases with albumin level, and decreases with bili, copper, prothrombin and age. Note the extreme value of prothrombin (> 16) influences the loess curve more than other points, which would make it a candidate for further investigation.

We expect survival at 3 years to be lower than at 1 year. However, comparing the two time plots for each variable does indicate a difference in response relation for bili, copper and prothrombin. The added risk for high levels of these variables at 3 years indicates a non-proportional hazards response. The similarity between the time curves for albumin and age indicates the effect of these variables is constant over the disease progression.

There is not a convenient method to panel scatter plots and boxplots together, so we recommend creating panel plots for each variable type separately. We plot the categorical variable (edema) in \autoref{fig:variable-plotCat} separately from the continuous variables in \autoref{fig:variable-plot}.

``` {r variable-plotCat, fig.cap="Variable dependence boxplots of predicted survival at 1 and 3 years for the categorical variable (edema)."}
plot(gg_v, xvar = xvar.cat, alpha = 0.4) + labs(y = "Survival") +
  theme(legend.position = "none") +
  scale_color_manual(values = strCol, labels = event.labels) +
  scale_shape_manual(values = event.marks, labels = event.labels) +
  coord_cartesian(ylim = c(-0.01, 1.02))
```

The gg_variable plot of \autoref{fig:variable-plotCat} for categorical variable dependence displays boxplots to examine the distribution of predicted values within each level of the variable. The points are plotted with a jitter to see the censored and event markers more clearly. The boxes are shown with horizontal bars indicating the median, 75th (top) and 25th (bottom) percentiles. Whiskers extend to 1.5 times the interquartile range. Points plotted beyond the whiskers are considered outliers.

When using categorical variables with linear models, we use boolean dummy variables to indicate class membership. In the case of `edema`, we would probably create two logical variables for `edema = 0.5` (complex Edema presence indicator) and `edema = 1.0` (Edema with diuretics), contrasted with the `edema = 0` variable (no Edema). Random forest can use factor variables directly, separating the populations into homogeneous groups of `edema` at nodes that split on that variable. \autoref{fig:variable-plotCat} indicates a similar survival response distribution between 1 and 3 years when `edema = 1.0`. The distribution of predicted survival does seem to spread out more than for the other values, again indicating a possible non-proportional hazards response.

## Partial dependence

\emph{Partial dependence} plots are a risk adjusted alternative to variable dependence. Partial plots are generated by integrating out the effects of all variables besides the covariate of interest. The figures are constructed by selecting points evenly spaced along the distribution of the variable of interest. For each of these points ($X = x$), we calculate the average RF prediction over all remaining covariates in the training set by

$$ \tilde{f}(x) = \frac{1}{n} \sum_{i = 1}^n \hat{f}(x, x_{i, o}), \label{E:partial} $$

where $\hat{f}$ is the predicted response from the random forest and $x_{i, o}$ is the value for all other covariates other than $X = x$ for observation $i$ [@Friedman:2000].

Generating partial dependence data is effectively averaging the response for a series of nomograms constructed for each observation by varying the variable of interest. The operation is computationally intensive, especially when there are a large number of observations. The default parameters for the `plot.variable` function generate partial dependence estimates at `npts = 25` points along the variable of interest. For each point of interest, the `plot.variable` function averages the $n$ response predictions. This process is repeated for each of the variables of interest.
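
The averaging in the equation above is easy to implement directly. The following sketch is ours, uses a toy regression forest rather than the survival model, and only mirrors what `plot.variable` computes for a single variable:

```r
# Direct implementation of the partial dependence average: fix X = x for
# every observation, predict, and average over the remaining covariates.
library("randomForestSRC")
set.seed(1)
rf <- rfsrc(mpg ~ ., data = mtcars)

x_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 25)
f_tilde <- sapply(x_grid, function(x) {
  dta <- mtcars
  dta$wt <- x                                 # set X = x for all i
  mean(predict(rf, newdata = dta)$predicted)  # average over x_{i,o}
})
plot(x_grid, f_tilde, type = "l", xlab = "wt", ylab = "partial effect")
```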

For time to event data, we also have to deal with the additional time dimension, as with variable dependence. The following code block uses the `mclapply` function from the \pkg{parallel} package to run the `plot.variable` function for three time points (`time` = 1, 3 and 5 years) in parallel. For RSF models, we calculate risk adjusted survival estimates (`surv.type = "surv"`), suppressing the internal base graphs (`show.plots = FALSE`), and store the point estimates in the `partial_pbc` list.

``` {r pbc-partial, echo=TRUE}
xvar <- c(xvar, xvar.cat)

time_index <- c(which(rfsrc_pbc$time.interest > 1)[1]-1,
                which(rfsrc_pbc$time.interest > 3)[1]-1,
                which(rfsrc_pbc$time.interest > 5)[1]-1)
partial_pbc <- mclapply(rfsrc_pbc$time.interest[time_index],
                        function(tm){
                          plot.variable(rfsrc_pbc, surv.type = "surv",
                                        time = tm, xvar.names = xvar,
                                        partial = TRUE,
                                        show.plots = FALSE)
                        })
```

Because partial dependence data is collapsed onto the risk adjusted response, we can show multiple time curves on a single panel. The following code block converts the `plot.variable` output into a list of `gg_partial` objects, and then combines these data objects, with descriptive labels, along each variable of interest using the `combine.gg_partial` function.

``` {r pbc-partial-combine, echo=TRUE}
gg_dta <- mclapply(partial_pbc, gg_partial)
pbc_ggpart <- combine.gg_partial(gg_dta[[1]], gg_dta[[2]],
                                 lbls = c("1 Year", "3 Years"))
```

We then segregate the continuous and categorical variables, and generate a panel plot of all continuous variables in the gg_partial plot of \autoref{fig:pbc-partial-panel}. The panels are ordered by minimal depth ranking. Since all variables are plotted on the same Y-axis scale, those that are strongly related to survival make other variables look flatter. The figures also confirm the strong non-linear contribution of these variables. Non-proportional hazard response is also evident in at least the bili and copper variables by noting the divergence of curves as time progresses.

``` {r pbc-partial-panel, fig.cap="Partial dependence panels of risk adjusted survival at 1 and 3 years for the continuous variables of interest, ordered by minimal depth ranking.", fig.width=7}
ggpart <- pbc_ggpart
ggpart$edema <- NULL

plot(ggpart, panel = TRUE) +
  geom_smooth(se = FALSE) +
  labs(x = "", y = "Survival", color = "Time", shape = "Time") +
  theme(legend.position = c(0.8, 0.2))
```

Categorical partial dependence is displayed as boxplots, similar to categorical variable dependence. Risk adjustment greatly reduces the spread of the response as expected, and may also move the mean response compared to the unadjusted results. The categorical gg_partial plot of \autoref{fig:pbc-partial-edema} indicates that, adjusting for other variables, survival decreases with rising edema values. We also note that the risk adjusted distribution does spread out as we move further out in time.

ggplot(pbc_ggpart[["edema"]], aes(y=yhat, x=edema, col=group))+
  geom_boxplot(notch = TRUE,
               outlier.shape = NA) + # panel=TRUE,
  labs(x = "Edema", y = "Survival (%)", color="Time", shape="Time") +
  theme(legend.position = c(0.1, 0.2))

Partial dependence is an extrapolation operation. By averaging over a series of nomograms, the algorithm constructs observations for all values of the variable of interest, regardless of the relation with other variables. In contrast, variable dependence only uses observations from within the training set. A simple example would be for a model including BMI, weight and height. When examining partial dependence of BMI, the algorithm only manipulates BMI values, not the height or weight values. The averaging operation is then confounded in two directions. First, dependence on height and weight is shared with BMI, making it difficult to see the true response dependence. Second, partial dependence is calculated over nomograms that cannot physically occur. For simple variable combinations, like BMI, it is not difficult to recognize this and modify the independent variable list to avoid these issues. However, care must be taken when interpreting more complex biological variables.
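
A small, purely hypothetical sketch of the BMI example makes the extrapolation concrete:

```r
# Hypothetical data: bmi is determined by height and weight.
dat <- data.frame(height = c(1.6, 1.8), weight = c(60, 90))
dat$bmi <- dat$weight / dat$height^2

# The partial dependence grid pairs every bmi value with every
# observation's height and weight, including combinations where
# bmi != weight / height^2, i.e., patients that cannot exist.
grid <- expand.grid(bmi = seq(18, 35, length.out = 5),
                    id = seq_len(nrow(dat)))
grid$height <- dat$height[grid$id]
grid$weight <- dat$weight[grid$id]
head(grid)
```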

## Temporal partial dependence

In the previous section, we calculated risk adjusted (partial) dependence at two time points (1 and 3 years). The selection of these points can be driven by biological times of interest (i.e., 1 year and 5 year survival in cancer studies) or by investigating time points of interest from a gg_rfsrc prediction plot. We typically restrict generating gg_partial plots to the variables of interest at two or three time points of interest due to computational constraints.

It is instructive to see a more detailed map of the risk adjusted response to get a feel for interpreting partial and variable dependence plots. In \autoref{fig:pbc-partial-panel}, we can visualize the two curves as extending into the plane of the page along a time axis. Filling in more partial dependence curves, it is possible to create a partial dependence surface.

For this exercise, we generate a series of 50 `gg_partial` plot curves for the `bili` variable. To fill in the surface, we also increase the number of points along the distribution of `bili` to `npts = 50`, creating a grid of $50 \times 50$ risk adjusted estimates of survival along time in one dimension and the `bili` variable in the other.

```r
# Restrict the time of interest to less than 5 years.
time_pts <- rfsrc_pbc$time.interest[which(rfsrc_pbc$time.interest <= 5)]

# Find 50 points in time, evenly spaced along the distribution of
# event times, for a series of partial dependence curves.
time_cts <- quantile_pts(time_pts, groups = 50)

# Generate partial coplot data (this takes a while).
partial_pbc_time <- lapply(time_cts, function(ct) {
  randomForestSRC::plot.variable(rfsrc_pbc, xvar.names = "bili",
                                 time = ct, npts = 50,
                                 show.plots = FALSE, partial = TRUE,
                                 surv.type = "surv")
})

# We need to attach the time points of interest to our data.
time.tmp <- do.call(c, lapply(time_cts, function(grp) {rep(grp, 50)}))

# Convert the list of plot.variable output to gg_partial objects.
partial_time <- do.call(rbind, lapply(partial_pbc_time, gg_partial))

# Attach the time data to the gg_partial_coplot object.
partial_time$time <- time.tmp

# Modify the figure margins to make the figure larger.
par(mai = c(0.5, 0.55, 0, 0))

# Transform the gg_partial_coplot object into a list of three named
# matrices for surface plotting with plot3D::surf3D.
srf <- surface_matrix(partial_time, c("time", "bili", "yhat"))

# Generate the figure.
surf3D(x = srf$x, y = srf$y, z = srf$z, col = heat.colors(25),
       colkey = FALSE, border = "black", bty = "b2",
       shade = 0.5, expand = 0.5, theta = 110, phi = 15,
       lighting = TRUE, lphi = -50, ticktype = "detailed",
       ylab = "Bilirubin", xlab = "Time", zlab = "Survival")

# Extract the 1 and 3 year points:
# find the indices of the points closest in time.
t.pts <- sapply(c(1, 3), function(pt) {min(abs(srf$x - pt), na.rm = TRUE)})
indx <- vector("list", length = 2)
indx[[1]] <- which(abs(srf$x - 1) < t.pts[1] + 1.e-5)
indx[[2]] <- which(abs(srf$x - 3) < t.pts[2] + 1.e-5)

# Generate curves along the 1 and 3 year partial dependence.
alt <- lapply(indx, function(ind) {
  lines3D(x = srf$x[ind], y = srf$y[ind], z = srf$z[ind],
          add = TRUE, col = "blue", lwd = 6)
})
```

The `gg_partial` surface of \autoref{fig:timeSurface3d} was constructed using the `surf3D` function from the \pkg{plot3D} package \citep[\url{http://CRAN.R-project.org/package=plot3D}]{plot3D:2014}.

The figure shows the partial dependence of survival (Z-axis) as a function of `bili` over a five year follow up period. Lines perpendicular to the Bilirubin axis are distributed along the `bili` variable. Lines parallel to the Bilirubin axis are taken at 50 training set event times, from the first event after $t=0$ at the back to the last event before $t=5$ years at the front. The time lines are also evenly distributed along the event time distribution, using the same procedure used to select points for the partial dependence curves.

The 2500 estimated partial dependence points are joined together with a simple straight line interpolation to create the surface, colored according to the survival estimates (yellow close to 1, red for lower values) to aid the visualization of 3 dimensions on a 2 dimensional page. The blue lines in \autoref{fig:timeSurface3d} correspond to the 1 and 3 year partial dependence, as shown in the `bili` panel of \autoref{fig:pbc-partial-panel}.

Viewed as a surface, we see how the partial dependence changes with time. For low values of `bili`, survival decreases at a constant rate. For higher values, the rate seems constant until somewhere near 2 years, where it increases rapidly before slowing again as we approach the 5 year point.

# Variable interactions

We could stop here, with the result that our RF analysis has found these six variables to be important in predicting survival, where the survival response decreases with increasing `bili`, `copper`, `prothrombin`, `age` and `edema`, and increases with increasing `albumin`. These results agree with the sign of the \cite{fleming:1991} model coefficients shown in \autoref{T:FHmodel}. The `gg_partial` plot in \autoref{fig:pbc-partial-panel} supports the `log` transform of `bili`, `albumin` and `prothrombin`, and suggests a similar transform for including the `copper` variable in a proportional hazards model. The `age` variable does seem to have a more linear response than the other continuous variables, and using dummy variables for `edema` would preclude the need for a transformation.

Using minimal depth, it is also possible to calculate measures of pairwise interactions among variables. Recall that the minimal depth measure is defined by averaging the tree depth of variable $i$ relative to the root node. To detect interactions, this calculation can be modified to measure the minimal depth of a variable $j$ with respect to the maximal subtree for variable $i$ \citep{Ishwaran:2010,Ishwaran:2011}.

The `randomForestSRC::find.interaction` function traverses the forest, calculating all pairwise minimal depth interactions, and returns a $p \times p$ matrix of interaction measures. The diagonal terms are normalized to the root node, and off diagonal terms are normalized measures of pairwise variable interaction.
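The \pkg{ggRandomForests} wrapper shown next relies on this function; as a sketch, the direct call might look like the following, where `method = "maxsubtree"` selects the minimal depth based interaction measure (argument defaults may vary between package versions).

```r
# Pairwise minimal depth interactions, returned as a p x p matrix.
# Diagonal entries are the normalized minimal depths of each variable;
# off-diagonal entries [i, j] measure the minimal depth of variable j
# within the maximal subtree of variable i.
interaction_pbc <- randomForestSRC::find.interaction(rfsrc_pbc,
                                                     method = "maxsubtree")
```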

```r
ggint <- gg_interaction(rfsrc_pbc)
```

The `gg_interaction` function wraps the `find.interaction` matrix for use with the \pkg{ggRandomForests} plot and print functions. The `xvar` argument restricts the variables of interest, and the `panel = TRUE` argument displays the results in a single figure.

```r
plot(ggint, xvar = xvar)
```

The `gg_interaction` plots in \autoref{fig:interactionPanel} show interactions for each target variable (shown with the red cross) with interaction scores for all remaining variables. We expect the covariate with the lowest minimal depth (`bili`) to be associated with almost all other variables, as it typically splits close to the root node, so viewed alone it may not be as informative as looking at a collection of interactive depth plots. Scanning across the panels, we see each successive target depth increasing, as expected. We also see the interactive variables increasing with increasing target depth.

# Conditional dependence plots

Conditioning plots (coplots) \citep{chambers:1992,cleveland:1993} are a powerful visualization tool for efficiently studying how a response depends on two or more variables. The method allows us to view data by grouping observations according to some conditional membership. The simplest example involves a categorical variable, where we plot our data conditional on class membership, for instance on groups of the `edema` variable. We can view a coplot as a stratified variable dependence plot, indicating trends in the RF prediction results within panels of group membership.

Interactions with categorical data can be generated directly from variable dependence plots. Recall the variable dependence for bilirubin shown in \autoref{fig:variable-plotbili}. We recreate the `gg_variable` plot in \autoref{fig:var_dep}, modified by adding a linear smooth, as we intend to segregate the data along conditional class membership.

```r
# Get variable dependence at 1 year.
ggvar <- gg_variable(rfsrc_pbc, time = 1)

# For labeling coplot membership.
ggvar$edema <- paste("edema = ", ggvar$edema, sep = "")

# Plot with a linear smooth (method argument).
var_dep <- plot(ggvar, xvar = "bili", alpha = 0.5) +
  geom_smooth(method = "glm", se = FALSE) +
  labs(y = "Survival", x = st.labs["bili"]) +
  theme(legend.position = "none") +
  scale_color_manual(values = strCol, labels = event.labels) +
  scale_shape_manual(values = event.marks, labels = event.labels) +
  coord_cartesian(ylim = c(-.01, 1.01))

var_dep
```

We can view the conditional dependence of survival against bilirubin, conditional on `edema` group membership (categorical variable) in \autoref{fig:coplot_bilirubin} by reusing the saved `ggplot` object (`var_dep`) and adding a call to the `facet_grid` function.

```r
var_dep + facet_grid(~edema)
```

Comparing \autoref{fig:var_dep} with the conditional panels of \autoref{fig:coplot_bilirubin}, we see the overall response is similar to the `edema = 0` response. The survival for `edema = 0.5` is slightly lower, though the slope of the smooth indicates a similar relation to `bili`. The `edema = 1` panel shows that survival for this (smaller) group of patients is worse, but still follows the trend of decreasing with increasing `bili`.

Conditional membership within a continuous variable requires stratification at some level. We can sometimes make these stratifications along some feature of the variable, for instance a variable with integer values, or 5 or 10 year age group cohorts. However, our variables of interest have no natural stratification. Therefore we arbitrarily stratify our variables into 6 groups of roughly equal population size using the `quantile_pts` function. We pass the break points located by `quantile_pts` to the `cut` function to create grouping intervals, which we can then add to the `gg_variable` object before plotting with the `plot.gg_variable` function. This time we use the `facet_wrap` function to generate the panels by grouping interval, which automatically sorts the six panels into two rows of three panels each.

```r
# Find intervals with similar numbers of observations and create groups.
albumin_cts <- quantile_pts(ggvar$albumin, groups = 6, intervals = TRUE)
ggvar$albumin_grp <- cut(ggvar$albumin, breaks = albumin_cts)

# Adjust naming for facets.
levels(ggvar$albumin_grp) <- paste("albumin =", levels(ggvar$albumin_grp))

plot(ggvar, xvar = "bili", alpha = 0.5) +
  labs(y = "Survival", x = st.labs["bili"]) +
  theme(legend.position = "none") +
  scale_color_manual(values = strCol, labels = event.labels) +
  scale_shape_manual(values = event.marks, labels = event.labels) +
  facet_wrap(~albumin_grp) +
  coord_cartesian(ylim = c(-.01, 1.01))
```

The `gg_variable` coplot of \autoref{fig:albumin-coplot} indicates that the effect of `bili` decreases conditional on membership within increasing `albumin` groups. To get a better feel for how the response depends on both of these variables together, it is instructive to look at the complement coplot of `albumin` conditional on membership in `bili` groups. We repeat the previous coplot process, plotting predicted survival as a function of the `albumin` variable, conditional on membership within 6 groups of `bili` intervals. The code to create the coplot of \autoref{fig:bili-coplot} is nearly identical to the code for creating \autoref{fig:albumin-coplot}.

```r
# Find intervals with similar numbers of observations.
bili_cts <- quantile_pts(ggvar$bili, groups = 6, intervals = TRUE)

# We need to move the minimal value so we include that observation.
bili_cts[1] <- bili_cts[1] - 1.e-7

# Create the conditional groups and add them to the gg_variable object.
bili_grp <- cut(ggvar$bili, breaks = bili_cts)
ggvar$bili_grp <- bili_grp

# Adjust naming for facets.
levels(ggvar$bili_grp) <- paste("bilirubin =", levels(bili_grp))

# plot.gg_variable
plot(ggvar, xvar = "albumin", alpha = 0.5) +
  labs(y = "Survival", x = st.labs["albumin"]) +
  theme(legend.position = "none") +
  scale_color_manual(values = strCol, labels = event.labels) +
  scale_shape_manual(values = event.marks, labels = event.labels) +
  facet_wrap(~bili_grp) +
  coord_cartesian(ylim = c(-0.01, 1.01))
```

The `gg_variable` coplot of \autoref{fig:bili-coplot} indicates that the probability of survival increases with increasing `albumin` and decreases within groups of increasing `bili`.

Typically, conditional plots for continuous variables include overlapping intervals along the grouped variable \citep{cleveland:1993}. We instead chose mutually exclusive intervals: they can be constructed directly from the `gg_variable` object without duplicating observations, each observation appears in exactly one panel, and the panels are simpler to compare and interpret.

It is still possible to augment the `gg_variable` object to include overlapping conditional membership with continuous variables by duplicating rows of the training set data within the `rfsrc$xvar` object, and then setting the conditional group membership as described. The `plot.gg_variable` recipe above could then be used to generate the panel plot, with panels ordered according to the factor levels of the grouping variable. A sketch of this approach follows; we otherwise leave it as an exercise for the reader.
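As a minimal sketch, assuming hypothetical overlapping `albumin` interval bounds (the `lower` and `upper` vectors below are illustrative, not taken from the analysis above):

```r
# Duplicate observations into overlapping albumin intervals by
# subsetting the gg_variable data once per interval and stacking the
# results. Interval bounds here are hypothetical.
lower <- c(2.0, 2.8, 3.4)
upper <- c(3.2, 3.8, 4.6)
overlap_grps <- lapply(seq_along(lower), function(i) {
  grp <- ggvar[ggvar$albumin >= lower[i] & ggvar$albumin < upper[i], ]
  grp$albumin_grp <- paste0("albumin in [", lower[i], ", ", upper[i], ")")
  grp
})
ggvar_overlap <- do.call(rbind, overlap_grps)

# Reuse the plot.gg_variable recipe, faceting on the overlapping groups:
# plot(ggvar_overlap, xvar = "bili", alpha = 0.5) + facet_wrap(~albumin_grp)
```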

# Partial dependence coplots

By characterizing conditional plots as stratified variable dependence plots, the next logical step is to generate an analogous conditional partial dependence plot. The process is similar to variable dependence coplots: first determine conditional group membership, then calculate the partial dependence estimates on each subgroup using the \code{plot.variable} function with a \code{subset} argument for each grouped interval. The \pkg{ggRandomForests} \code{gg_partial_coplot} function is a wrapper for generating conditional partial dependence data objects. Given a random forest (\code{rfsrc}) object and a \code{groups} vector for conditioning the training data set observations, \code{gg_partial_coplot} calls the \code{plot.variable} function on the training set observations conditional on \code{groups} membership. The function returns a \code{gg_partial_coplot} object, a subclass of the \code{gg_partial} object, which can be plotted with the \code{plot.gg_partial} function.

The following code block generates the data object for a partial dependence coplot of 1 year survival as a function of `bili`, conditional on membership within the 6 groups of `albumin` intervals that we examined in \autoref{fig:albumin-coplot}.

```r
partial_coplot_pbc <- gg_partial_coplot(rfsrc_pbc, xvar = "bili",
                                        groups = ggvar$albumin_grp,
                                        surv_type = "surv",
                                        time = rfsrc_pbc$time.interest[time_index[1]],
                                        show.plots = FALSE)

ggplot(partial_coplot_pbc, aes(x = bili, y = yhat, col = group, shape = group)) +
  geom_smooth(se = FALSE) +
  labs(x = st.labs["bili"], y = "Survival at 1 year (%)",
       color = "albumin", shape = "albumin")
```

The \code{gg_partial_coplot} of \autoref{fig:bili-albumin} shows point estimates of the risk adjusted survival as a function of `bili`, conditional on group membership defined by `albumin` intervals. The figure differs slightly from the `gg_partial` plot of \autoref{fig:pbc-partial-panel}, as each set of partial dependence estimates is calculated over a subset of the training data. We again connect the point estimates with a loess curve.

For completeness, we construct the complement coplot view of one year survival as a function of `albumin`, conditional on `bili` interval group membership, in \autoref{fig:albumin-bili}.

```r
partial_coplot_pbc2 <- gg_partial_coplot(rfsrc_pbc, xvar = "albumin",
                                         groups = ggvar$bili_grp,
                                         surv_type = "surv",
                                         time = rfsrc_pbc$time.interest[time_index[1]],
                                         show.plots = FALSE)
# Partial coplot
ggplot(partial_coplot_pbc2, aes(x = albumin, y = yhat, col = group, shape = group)) +
  geom_smooth(se = FALSE) +
  labs(x = st.labs["albumin"], y = "Survival at 1 year (%)",
       color = "Bilirubin", shape = "Bilirubin")
```

# Partial plot surfaces

Just as in partial dependence, we can view the partial coplot curves as slices along a surface that extends along an axis into the page. This visualization is made somewhat difficult by our choice of groups with similar population sizes, as the curves are not evenly spaced along the grouping variable. So, similar to the partial dependence surface we created along time in \autoref{fig:timeSurface3d}, we can examine the relation of these two variables using a partial dependence surface.

A difficulty with conditional dependence in this exercise is the reduced sample size available for calculating each slice of a coplot surface. Instead, we calculate the full partial dependence surface by generating 50 `albumin` values spaced evenly along the data distribution. For each value of `albumin`, we calculate the partial dependence on `bili` at `npts = 50` points with the `plot.variable` function. We generate the surface again using the `surf3D` function.

```r
# Find the quantile points to create 50 cut points.
alb_partial_pts <- quantile_pts(ggvar$albumin, groups = 50)

# Generate the partial dependence surface data (this takes a while).
partial_pbc_surf <- lapply(alb_partial_pts, function(ct) {
  # Fix albumin at the current value, then calculate the partial
  # dependence of 1 year survival on bili.
  rfsrc_pbc$xvar$albumin <- ct
  plot.variable(rfsrc_pbc, xvar = "bili",
                time = rfsrc_pbc$time.interest[time_index[1]],
                npts = 50, show.plots = FALSE, partial = TRUE,
                surv.type = "surv")
})

# Instead of groups, we want the raw albumin point values.
# To make the dimensions match, we repeat each value
# for each of the 50 points in the bili direction.
albumin.tmp <- do.call(c, lapply(alb_partial_pts, function(grp) {rep(grp, 50)}))

# Convert the list of plot.variable output to gg_partial objects.
partial_surf <- do.call(rbind, lapply(partial_pbc_surf, gg_partial))

# Attach the albumin data to the gg_partial_coplot object.
partial_surf$albumin <- albumin.tmp

# Modify the figure margins to make the figure larger.
par(mai = c(0, .3, 0, 0))

# Transform the gg_partial_coplot object into a list of three named
# matrices for surface plotting with plot3D::surf3D.
srf <- surface_matrix(partial_surf, c("bili", "albumin", "yhat"))

# Generate the figure.
surf3D(x = srf$x, y = srf$y, z = srf$z, col = topo.colors(25),
       colkey = FALSE, border = "black", bty = "b2",
       shade = 0.5, expand = 0.5, theta = 55, phi = 15,
       lighting = TRUE, lphi = -50,
       xlab = "Bilirubin", ylab = "Albumin", zlab = "Survival at 1 Year")

# Extract the albumin and bilirubin group cut points,
# removing the end points.
bli <- bili_cts[-c(1, 7)]
alb <- albumin_cts[-c(1, 7)]

# Find the distances to the surface points closest to the split points.
alb.pts <- lapply(alb, function(pt) {min(abs(srf$y - pt), na.rm = TRUE)})
bli.pts <- lapply(bli, function(pt) {min(abs(srf$x - pt), na.rm = TRUE)})

# Find the indices of the points closest to the split points.
indx.alb <- lapply(1:length(alb.pts), function(al) {
  which(abs(srf$y - alb[al]) < alb.pts[[al]] + 1.e-5)
})
indx.bli <- lapply(1:length(bli.pts), function(al) {
  which(abs(srf$x - bli[al]) < bli.pts[[al]] + 1.e-5)
})

# Draw the lines at the group cut points.
indx <- c(indx.alb, indx.bli)
st <- lapply(indx, function(ind) {
  lines3D(x = srf$x[ind], y = srf$y[ind], z = srf$z[ind],
          add = TRUE, col = "blue", lwd = 6)
})
```

The partial dependence surface of \autoref{fig:surface3d} shows partial dependence of 1 year survival on the Z-axis against values of Bilirubin and Albumin. We again use linear interpolation between the 2500 estimates, and color the surface by the response. Here blue corresponds to lower and yellow to higher risk adjusted survival. The blue lines are placed at the cut points between groups of `albumin` and `bili` used in the partial coplots of Figures \ref{fig:bili-albumin} and \ref{fig:albumin-bili} respectively.

To construct the partial coplot for groups of `albumin` in \autoref{fig:bili-albumin}, we arbitrarily segmented the training set into 6 groups of equal membership size. The segments between blue lines parallel to the Bilirubin axis indicate where on the surface these observations are located. Similarly, the blue lines perpendicular to the Bilirubin axis segment observations into the 6 groups of `bili` intervals. \autoref{fig:surface3d} indicates the arbitrary grouping for groups of `bili` in \autoref{fig:albumin-bili}.

The figure indicates that the partial dependence curves at higher `albumin` levels are similar, which results in the overplotting seen in \autoref{fig:bili-albumin}. The distribution is sparser at lower `albumin` levels, creating the larger surface area at the lowest `albumin` values, where the partial dependence changes the most.

# Conclusion

In this vignette, we have demonstrated the use of Random Survival Forest methods with the \pkg{ggRandomForests}~(\url{http://CRAN.R-project.org/package=ggRandomForests}) package. We have shown how to grow a random forest model and determine which variables contribute to the forest prediction accuracy using both VIMP and minimal depth measures. We outlined how to investigate variable associations with the response variable using variable dependence and risk adjusted partial dependence plots. We also explored variable interactions by using pairwise minimal depth interactions and directly viewed these interactions using variable dependence coplots and partial dependence coplots. Along the way, we demonstrated the use of additional commands from the \pkg{ggplot2} package \citep[\url{http://CRAN.R-project.org/package=ggplot2}]{Wickham:2009} for modifying and customizing plots from \pkg{ggRandomForests} functions.

## Computational details

This document is a package vignette for the \pkg{ggRandomForests} package for ``Visually Exploring Random Forests'' (\url{http://CRAN.R-project.org/package=ggRandomForests}). The \pkg{ggRandomForests} package is designed for use with the \pkg{randomForestSRC} package \citep[\url{http://CRAN.R-project.org/package=randomForestSRC}]{Ishwaran:RFSRC:2014} for growing survival, regression and classification random forest models and uses the \pkg{ggplot2} package \citep[\url{http://CRAN.R-project.org/package=ggplot2}]{Wickham:2009} for plotting diagnostic and variable association results. \pkg{ggRandomForests} is  structured to extract data objects from \pkg{randomForestSRC} objects and provides functions for printing and plotting these objects.

The vignette is a tutorial for using the \pkg{ggRandomForests} package with the \pkg{randomForestSRC} package for building and post-processing random survival forests. In this tutorial, we explore a random forest for survival model constructed for the primary biliary cirrhosis (PBC) of the liver data set \citep{fleming:1991}, available in the \pkg{randomForestSRC} package. We grow a random survival forest and demonstrate how \pkg{ggRandomForests} can be used when determining how the survival response depends on predictive variables within the model. The tutorial demonstrates the design and usage of many of \pkg{ggRandomForests} functions and features and also how to modify and customize the resulting `ggplot` graphic objects along the way.

The vignette is written in \LaTeX using the \pkg{knitr} package \citep[\url{http://CRAN.R-project.org/package=knitr}]{Xie:2015,Xie:2014,Xie:2013}, which facilitates weaving \proglang{R} \citep{rcore} code, results and figures into the document text.
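For reproducibility, it can also help to record the exact \proglang{R} and package versions used when rebuilding the vignette; a minimal way to do so is:

```r
# Record R and package versions (output not shown).
sessionInfo()
```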

This vignette is available within the \pkg{ggRandomForests} package on the Comprehensive R Archive Network (CRAN) \citep[\url{http://cran.r-project.org}]{rcore}. Once the package has been installed, the vignette can be viewed directly from within \proglang{R} with the following command:
```r
vignette("randomForestSRC-Survival", package = "ggRandomForests")
```

A development version of the \pkg{ggRandomForests} package is also available on GitHub (\url{https://github.com}). We invite comments, feature requests and bug reports for this package at \url{https://github.com/ehrlinger/ggRandomForests}.
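As a sketch, assuming the \pkg{devtools} package is installed, the development version can be installed directly from the GitHub repository:

```r
# Install the development version of ggRandomForests from GitHub.
devtools::install_github("ehrlinger/ggRandomForests")
```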

# Acknowledgement

This work was supported in part by the National Institutes of Health grant R01-HL103552-01A1.

# References


