knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.height = 5, 
  fig.width = 7,
  warning = FALSE,
  message = FALSE
)

Partial dependence (PD)

r aorsf:::roxy_pd_explain()

library(aorsf)
library(ggplot2)

You can compute PD and individual conditional expectation (ICE) in three ways:

Classification

Begin by fitting an oblique classification random forest:

set.seed(329)

index_train <- sample(nrow(penguins_orsf), 150) 

penguins_orsf_train <- penguins_orsf[index_train, ]
penguins_orsf_test <- penguins_orsf[-index_train, ]

fit_clsf <- orsf(data = penguins_orsf_train, 
                 formula = species ~ .)

Compute PD using out-of-bag data for flipper_length_mm = c(190, 210).

pred_spec <- list(flipper_length_mm = c(190, 210))

pd_oob <- orsf_pd_oob(fit_clsf, pred_spec = pred_spec)

pd_oob

Note that predicted probabilities are returned for each class and probabilities in the mean column sum to 1 if you take the sum over each class at a specific value of the pred_spec variables. For example,

sum(pd_oob[flipper_length_mm == 190, mean])

But this isn't the case for the median predicted probability!

sum(pd_oob[flipper_length_mm == 190, medn])

Regression

Begin by fitting an oblique regression random forest:

set.seed(329)

index_train <- sample(nrow(penguins_orsf), 150) 

penguins_orsf_train <- penguins_orsf[index_train, ]
penguins_orsf_test <- penguins_orsf[-index_train, ]

fit_regr <- orsf(data = penguins_orsf_train, 
                 formula = bill_length_mm ~ .)

Compute PD using new data for flipper_length_mm = c(190, 210).

pred_spec <- list(flipper_length_mm = c(190, 210))

pd_new <- orsf_pd_new(fit_regr, 
                      pred_spec = pred_spec,
                      new_data = penguins_orsf_test)

pd_new

You can also let pred_spec_auto pick reasonable values like so:

pred_spec = pred_spec_auto(species, island, body_mass_g)

pd_new <- orsf_pd_new(fit_regr, 
                      pred_spec = pred_spec,
                      new_data = penguins_orsf_test)

pd_new

By default, all combinations of all variables are used. However, you can also look at the variables one by one, separately, like so:

pd_new <- orsf_pd_new(fit_regr, 
                      expand_grid = FALSE,
                      pred_spec = pred_spec,
                      new_data = penguins_orsf_test)

pd_new

And you can also bypass all the bells and whistles by using your own data.frame for a pred_spec. (Just make sure you request values that exist in the training data.)

custom_pred_spec <- data.frame(species = 'Adelie', 
                               island = 'Biscoe')

pd_new <- orsf_pd_new(fit_regr, 
                      pred_spec = custom_pred_spec,
                      new_data = penguins_orsf_test)

pd_new

Survival

Begin by fitting an oblique survival random forest:

set.seed(329)

index_train <- sample(nrow(pbc_orsf), 150) 

pbc_orsf_train <- pbc_orsf[index_train, ]
pbc_orsf_test <- pbc_orsf[-index_train, ]

fit_surv <- orsf(data = pbc_orsf_train, 
                 formula = Surv(time, status) ~ . - id,
                 oobag_pred_horizon = 365.25 * 5)

Compute PD using in-bag data for bili = c(1,2,3,4,5):

pd_train <- orsf_pd_inb(fit_surv, pred_spec = list(bili = 1:5))
pd_train

If you don't have specific values of a variable in mind, let pred_spec_auto pick for you:

pd_train <- orsf_pd_inb(fit_surv, pred_spec_auto(bili))
pd_train

Specify pred_horizon to get PD at each value:

pd_train <- orsf_pd_inb(fit_surv, pred_spec_auto(bili),
                        pred_horizon = seq(500, 3000, by = 500))
pd_train

One variable, moving horizon

For the next few sections, we update orsf_fit to include all the data in pbc_orsf instead of just the training sample:

# a rare case of modify_in_place = TRUE
orsf_update(fit_surv, 
            data = pbc_orsf, 
            modify_in_place = TRUE)

fit_surv

What if the effect of a predictor varies over time? Partial dependence can show this.

pd_sex_tv <- orsf_pd_oob(fit_surv, 
                         pred_spec = pred_spec_auto(sex),
                         pred_horizon = seq(365, 365*5))

ggplot(pd_sex_tv) +
 aes(x = pred_horizon, y = mean, color = sex) + 
 geom_line() +
 labs(x = 'Time since baseline',
      y = 'Expected risk')

From inspection, we can see that males have higher risk than females and the difference in that risk grows over time. This can also be seen by viewing the ratio of expected risk over time:

library(data.table)

ratio_tv <- pd_sex_tv[
 , .(ratio = mean[sex == 'm'] / mean[sex == 'f']), by = pred_horizon
]

ggplot(ratio_tv, aes(x = pred_horizon, y = ratio)) + 
 geom_line(color = 'grey') + 
 geom_smooth(color = 'black', se = FALSE) + 
 labs(x = 'time since baseline',
      y = 'ratio in expected risk for males versus females')

To get a view of PD for any number of variables in the training data, use orsf_summarize_uni(). This function computes out-of-bag PD for the most important n_variables and returns a nicely formatted view of the output:

pd_smry <- orsf_summarize_uni(fit_surv, n_variables = 4)

pd_smry

This 'summary' object can be converted into a data.table for downstream plotting and tables.

head(as.data.table(pd_smry))

Multiple variables, jointly

Partial dependence can show the expected value of a model's predictions as a function of a specific predictor, or as a function of multiple predictors. For instance, we can estimate predicted risk as a joint function of bili, edema, and trt:

pred_spec = pred_spec_auto(bili, edema, trt)

pd_bili_edema <- orsf_pd_oob(fit_surv, pred_spec)

ggplot(pd_bili_edema) + 
 aes(x = bili, y = medn, col = trt, linetype = edema) + 
 geom_line() + 
 labs(y = 'Expected predicted risk')

From inspection,

```r

library(survival)

pbc_orsf$edema_05 <- ifelse(pbc_orsf$edema == '0.5', 'yes', 'no')

fit_cph <- coxph(Surv(time,status) ~ edema_05 * bili, data = pbc_orsf)

anova(fit_cph)

```

# drop so it won't interfere with downstream code

pbc_orsf$edema_05 <- NULL

Find interactions using PD

Random forests are good at using interactions, but less good at telling you about them. Use orsf_vint() to apply the method for variable interaction scoring with PD described by Greenwell et al (2018). This can take a little while if you have lots of predictors, and it seems to work best with continuous by continuous interactions. Interactions with categorical variables are sometimes over- or under- scored.

# use just the continuous variables
preds <- names(fit_surv$get_means())

vint_scores <- orsf_vint(fit_surv, predictors = preds)

vint_scores

The scores include partial dependence values that you can pull out and plot:

# top scoring interaction
pd_top <- vint_scores$pd_values[[1]]

# center pd values so it's easier to see the interaction effect
pd_top[, mean := mean - mean[1], by = var_2_value]

ggplot(pd_top) + 
 aes(x = var_1_value, 
     y = mean, 
     color = factor(var_2_value), 
     group = factor(var_2_value)) + 
 geom_line() + 
 labs(x = "albumin", 
      y = "predicted mortality (centered)",
      color = "protime")

Again we use a sanity check with coxph to see if these interactions are detected using a standard test:

# test the top score (expect strong interaction)
fit_cph <- coxph(Surv(time,status) ~ albumin * protime, 
                 data = pbc_orsf)

anova(fit_cph)

Note: Caution is warranted when interpreting statistical hypotheses that are motivated by the same data they are tested with. Results like the p-values for interaction shown above should be interpreted as exploratory.

Individual conditional expectations (ICE)

r aorsf:::roxy_ice_explain()

Classification

Compute ICE using out-of-bag data for flipper_length_mm = c(190, 210).

pred_spec <- list(flipper_length_mm = c(190, 210))

ice_oob <- orsf_ice_oob(fit_clsf, pred_spec = pred_spec)

ice_oob

There are two identifiers in the output:

Note that predicted probabilities are returned for each class and each observation in the data. Predicted probabilities for a given observation and given variable value sum to 1. For example,

ice_oob %>%
 .[flipper_length_mm == 190] %>% 
 .[id_row == 1] %>% 
 .[['pred']] %>% 
 sum()
1

Regression

Compute ICE using new data for flipper_length_mm = c(190, 210).

pred_spec <- list(flipper_length_mm = c(190, 210))

ice_new <- orsf_ice_new(fit_regr, 
                        pred_spec = pred_spec,
                        new_data = penguins_orsf_test)

ice_new

You can also let pred_spec_auto pick reasonable values like so:

pred_spec = pred_spec_auto(species, island, body_mass_g)

ice_new <- orsf_ice_new(fit_regr, 
                        pred_spec = pred_spec,
                        new_data = penguins_orsf_test)

ice_new

By default, all combinations of all variables are used. However, you can also look at the variables one by one, separately, like so:

ice_new <- orsf_ice_new(fit_regr, 
                        expand_grid = FALSE,
                        pred_spec = pred_spec,
                        new_data = penguins_orsf_test)

ice_new

And you can also bypass all the bells and whistles by using your own data.frame for a pred_spec. (Just make sure you request values that exist in the training data.)

custom_pred_spec <- data.frame(species = 'Adelie', 
                               island = 'Biscoe')

ice_new <- orsf_ice_new(fit_regr, 
                        pred_spec = custom_pred_spec,
                        new_data = penguins_orsf_test)

ice_new

Survival

Compute ICE using in-bag data for bili = c(1,2,3,4,5):

ice_train <- orsf_ice_inb(fit_surv, pred_spec = list(bili = 1:5))
ice_train

If you don't have specific values of a variable in mind, let pred_spec_auto pick for you:

ice_train <- orsf_ice_inb(fit_surv, pred_spec_auto(bili))
ice_train

Specify pred_horizon to get ICE at each value:

ice_train <- orsf_ice_inb(fit_surv, pred_spec_auto(bili),
                          pred_horizon = seq(500, 3000, by = 500))
ice_train

Multi-prediction horizon ice comes with minimal extra computational cost. Use a fine grid of time values and assess whether predictors have time-varying effects.

Visualizing ICE curves

Inspecting the ICE curves for each observation can help identify whether there is heterogeneity in a model's predictions. I.e., does the effect of the variable follow the same pattern for all the data, or are there groups where the variable impacts risk differently?

I am going to turn off boundary checking in orsf_ice_oob by setting boundary_checks = FALSE, and this will allow me to generate ICE curves that go beyond the 90th percentile of bili.

```r

pred_spec <- list(bili = seq(1, 10, length.out = 25))

ice_oob <- orsf_ice_oob(fit_surv, pred_spec, boundary_checks = FALSE)

ice_oob

```

For plots, it is helpful to scale the ICE data. I subtract the initial value of predicted risk (i.e., when bili = 1) from each observation's conditional expectation values. So,

```r

ice_oob[, pred_subtract := rep(pred[id_variable==1], times=25)] ice_oob[, pred := pred - pred_subtract]

```

Now we can visualize the curves.

ggplot(ice_oob, aes(x = bili, 
                    y = pred, 
                    group = id_row)) + 
 geom_line(alpha = 0.15) + 
 labs(y = 'Change in predicted risk') +
 geom_smooth(se = FALSE, aes(group = 1))

From inspection of the figure,

Limitations of PD

r aorsf:::roxy_pd_limitations()

References

  1. r aorsf:::cite("hooker_2021")


bcjaeger/aorsf documentation built on April 3, 2025, 4:16 p.m.