Use of the MachineShop package is demonstrated with a survival analysis example in which the response variable is a censored time to event outcome. Since survival outcomes are a combination of numerical (time to event) and categorical (event) variables, package features for both variable types are illustrated with the example. Support for outcomes other than survival, including nominal and ordinal factors as well as numeric vectors and matrices, is also discussed.
Survival analysis is performed with the Melanoma
dataset from the MASS package [@andersen:1993:SMB]. This dataset provides survival time, in days, from disease treatment to (1) death from disease, (2) alive at study termination, or (3) death from other causes for 205 Denmark patients with malignant melanomas. Also provided are potential predictors of the survival outcomes. The analysis begins by loading required packages MachineShop, survival [@therneau:2020:SA], and MASS. For the analysis, a binary overall survival outcome is created by combining the two death categories (1 and 3) into one.
## Analysis libraries and dataset library(MachineShop) library(survival) data(Melanoma, package = "MASS") ## Malignant melanoma analysis dataset surv_df <- within(Melanoma, { y <- Surv(time, status != 2) remove(time, status) })
Descriptive summaries of the study variables are given below in Table 1, followed by a plot of estimated overall survival probabilities and 95% confidence intervals.
surv_stats <- list( list("Number of subjects" = ~ length(y)), "time" = list("Median (Range)" = ~ median_range(y[, "time"])), "status" = list("1 = Dead" = ~ n_perc(y[, "status"] == 1), "0 = Alive" = ~ n_perc(y[, "status"] == 0)), "sex" = list("1 = Male" = ~ n_perc(sex == 1), "0 = Female" = ~ n_perc(sex == 0)), "age" = list("Median (Range)" = ~ median_range(age)), "year" = list("Median (Range)" = ~ median_range(year)), "thickness" = list("Median (Range)" = ~ median_range(thickness)), "ulcer" = list("1 = Presence" = ~ n_perc(ulcer == 1), "0 = Absence" = ~ n_perc(ulcer == 0)) ) summary_kbl(surv_stats, surv_df)
library(ggplot2) col <- "#F8766D" survfit(y ~ 1, data = surv_df) %>% with(data.frame(time, surv, lower, upper, censor = ifelse(n.censor > 0, time, NA))) %>% ggplot(aes(x = time, y = surv)) + geom_ribbon(aes(ymin = lower, ymax = upper), fill = col, alpha = 0.2) + geom_step(color = col) + geom_point(aes(x = censor), shape = 3, color = col) + coord_cartesian(ylim = c(0, 1)) + labs(x = "Follow-Up Time (Days)", y = "Overall Survival Probability", title = "Kaplan-Meier survival plot")
For the analyses, the dataset is split into a training set on which survival models will be fit and a test set on which predictions will be made and performance evaluated. A global formula surv_fo
is defined to relate the predictors on the right hand side to the overall survival outcome on the left and will be used for all subsequent survival models.
## Training and test sets set.seed(123) train_indices <- sample(nrow(surv_df), nrow(surv_df) * 2 / 3) surv_train <- surv_df[train_indices, ] surv_test <- surv_df[-train_indices, ] ## Global formula for the analysis surv_fo <- y ~ sex + age + year + thickness + ulcer
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.