In brian-j-smith/MachineShop: Machine Learning Models and Tools

Melanoma Example

Use of the MachineShop package is demonstrated with a survival analysis example in which the response variable is a censored time to event outcome. Since survival outcomes are a combination of numerical (time to event) and categorical (event) variables, package features for both variable types are illustrated with the example. Support for outcomes other than survival, including nominal and ordinal factors as well as numeric vectors and matrices, is also discussed.

Survival analysis is performed with the Melanoma dataset from the MASS package [@andersen:1993:SMB]. This dataset provides survival time, in days, from disease treatment to (1) death from disease, (2) alive at study termination, or (3) death from other causes for 205 Denmark patients with malignant melanomas. Also provided are potential predictors of the survival outcomes. The analysis begins by loading required packages MachineShop, survival [@therneau:2020:SA], and MASS. For the analysis, a binary overall survival outcome is created by combining the two death categories (1 and 3) into one.

## Analysis libraries and dataset
library(MachineShop)
library(survival)
data(Melanoma, package = "MASS")

## Malignant melanoma analysis dataset
surv_df <- within(Melanoma, {
  y <- Surv(time, status != 2)
  remove(time, status)
})

Descriptive summaries of the study variables are given below in Table 1, followed by a plot of estimated overall survival probabilities and 95% confidence intervals.

Table 1. Variable summaries for the Melanoma survival analysis example.

surv_stats <- list(
  list("Number of subjects" = ~ length(y)),
  "time" = list("Median (Range)" = ~ median_range(y[, "time"])),
  "status" = list("1 = Dead" = ~ n_perc(y[, "status"] == 1),
                  "0 = Alive" = ~ n_perc(y[, "status"] == 0)),
  "sex" = list("1 = Male" = ~ n_perc(sex == 1),
               "0 = Female" = ~ n_perc(sex == 0)),
  "age" = list("Median (Range)" = ~ median_range(age)),
  "year" = list("Median (Range)" = ~ median_range(year)),
  "thickness" = list("Median (Range)" = ~ median_range(thickness)),
  "ulcer" = list("1 = Presence" = ~ n_perc(ulcer == 1),
                 "0 = Absence" = ~ n_perc(ulcer == 0))
)

summary_kbl(surv_stats, surv_df)

library(ggplot2)

col <- "#F8766D"
survfit(y ~ 1, data = surv_df) %>%
  with(data.frame(time, surv, lower, upper, censor = ifelse(n.censor > 0, time, NA))) %>%
  ggplot(aes(x = time, y = surv)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = col, alpha = 0.2) +
  geom_step(color = col) +
  geom_point(aes(x = censor), shape = 3, color = col) +
  coord_cartesian(ylim = c(0, 1)) +
  labs(x = "Follow-Up Time (Days)", y = "Overall Survival Probability",
       title = "Kaplan-Meier survival plot")

For the analyses, the dataset is split into a training set on which survival models will be fit and a test set on which predictions will be made and performance evaluated. A global formula surv_fo is defined to relate the predictors on the right hand side to the overall survival outcome on the left and will be used for all subsequent survival models.

## Training and test sets
set.seed(123)
train_indices <- sample(nrow(surv_df), nrow(surv_df) * 2 / 3)
surv_train <- surv_df[train_indices, ]
surv_test <- surv_df[-train_indices, ]

## Global formula for the analysis
surv_fo <- y ~ sex + age + year + thickness + ulcer