knitr::opts_chunk$set(fig.width=7, warning=FALSE, tidy=TRUE, tidy.opts=list(width.cutoff=60))
If you are viewing the html version of this tutorial, you can find the R markdown file here: https://github.com/denisagniel/survivalsurrogate. It is the file called survivalsurrogate_tutorial.Rmd. In this tutorial, we will go through an example using the main functions in the survivalsurrogate package. First, install the package from CRAN and load it. We also load several other packages used in the document.
#install.packages("survivalsurrogate") library(survivalsurrogate) library(ggplot2) library(tidyr) library(dplyr) library(viridis) library(glue) library(gbm) library(mlr3) library(rpart) library(gbm) library(mlr3extralearners)
This package provides influence function-based methods to evaluate a longitudinal surrogate marker in a censored time-to-event outcome setting, with plug-in and targeted maximum likelihood estimation options. More details will be available in the future in: Agniel D and Parast L (2025+). "Robust Evaluation of Longitudinal Surrogate Markers with Censored Data." Journal of the Royal Statistical Society: Series B, In press.
The functions in this package expect the data in a certain format. Specifically, data must be a dataframe with variables that indicate: (1) the folds for crossfitting; (2) a unique observation identifier; (3) baseline covariates, if there are none you must have a variable with all values equal to 1 to provide to the argument x below; (4) a variable indicating treatment group which should be 1 for treatment and 0 for control; (5) a set of variables that contain the surrogate marker value at each time point up to and including the landmark time, denoted t0; (6) a set of variables that indicate observation status at each time point where the surrogate marker is measured, in addition to measurements beyond t0 up to a final time point t > t0; (7) a set of variables that indicate primary outcome status at each time point where the surrogate marker is measured, in addition to measurements beyond t0 up to a final time point t > t0. The exampledata in the package has the required format where t0=4 and t=5. Let's take a look.
data(exampledata) names(exampledata)
In this dataset, ID is the observation ID, X_0 is the single baseline covariate, G_0 is the treatment indicator, and ff indicates the folds. The variables Y with a number are the variable indicate the primary outcome status at each time. For example, Y_2 is 1 if the time of the primary outcome is greater than time 2. The variables A with a number indicate the observation status at each time. For example, A_2 is 1 if the individual is still observable (not censored) at time 2. The variables S with a number indicate the surrogate marker values at each time, up to and including t0=4. For example, S_2 is the surrogate marker value at time 2. Notice that here, t0=4 and t=5, and thus the dataset contains Y_5 and A_5 but does not contain S_5.
Let's take a look at these surrogate marker measurements over time. If you don't care about these pictures, you can skip them. The first picture shows individual surrogate trajectories, and the second shows the same data but colored by treatment group.
library(ggplot2) # Pivot the data to long format df_long <- exampledata %>% mutate(ID = as.factor(ID)) %>% # ensure ID is a factor pivot_longer( cols = starts_with("S_"), names_to = "Time", names_prefix = "S_", values_to = "S_value" ) %>% mutate( Time = as.numeric(Time) ) #surrogate marker values over time ggplot(df_long, aes(x = Time, y = S_value, group = ID, color = ID)) + geom_line(alpha = 0.8, linewidth = 1, na.rm = TRUE) + scale_color_viridis_d(option = "rocket") + labs(title = "Individual Surrogate Trajectories", x = "Time", y = "Surrogate Value") + theme_minimal() + theme(legend.position = "none") #now color by treatment group ggplot(df_long, aes(x = Time, y = S_value, group = ID, color = factor(G_0))) + geom_line(alpha = 0.6, linewidth = 1, na.rm = TRUE) + scale_color_manual(values = c("pink", "purple"), labels = c("Control", "Treatment")) + labs(title = "Surrogate Trajectories by Treatment Group", x = "Time", y = "Surrogate Value", color = "Treatment") + theme_minimal()
The goal is to estimate the proportion of the treatment effect on the primary outcome at time t=5 that is explained by the longitudinal surrogate marker up to t0=4. Values close to 1 indicate a strong surrogate whereas values close to 0 indicate a weak surrogate. The package offers estimation via a plug-in estimator or targeted maximum likelihood estimation (TMLE). Let's start with the plug-in estimator. In the code below, we set up t as tt=5 and t0 as t0=4; we set the learner to be gbm (generalized boosted model); and specify which variables indicate the Y, A, and S. We first estimate the treatment effect on the primary outcome (delta) using plugin_delta, and then the residual treatment effect (delta_s) using plugin_delta_s. Next, we give these results to the estimate_R function to calculate the proportion. The default method for variance estimation is "asymptotic", but you can change this to the bootstrap by specifying se_type = "bootstrap" in the arguments; if you do this, you will need to specify a number for the bootstrap sample, for example n_boot=200.
tt <- 5 t0 <- 4 yvars <- paste0('Y_', 0:tt) lrn <- 'gbm' lrnc <- glue('regr.{lrn}') lrnb <- glue('classif.{lrn}') p_deltahat <- plugin_delta( data = exampledata, folds = 'ff', id = 'ID', x = 'X_0', g = 'G_0', a = paste0('A_', 0:tt), y = yvars, s = paste0('S_', 0:t0), binary_lrnr = lrn(lrnb, predict_type = 'prob'), cont_lrnr = lrn(lrnc) ) p_deltahat_s <- plugin_delta_s( data = exampledata, folds = 'ff', id = 'ID', x = 'X_0', g = 'G_0', a = paste0('A_', 0:tt), y = yvars, s = paste0('S_', 0:t0), binary_lrnr = lrn(lrnb, predict_type = 'prob'), cont_lrnr = lrn(lrnc), t0=t0 ) estimate_R(p_deltahat$if_data[[1]]$eif, p_deltahat_s$if_data[[1]]$eif)
The output shows the estimate, standard error (se), and lower and upper bounds for the 95\% confidence interval (ci_l and ci_h), for each estimand, where R is the proportion of the treatment effect explained.
Now we use TMLE using the functions tmle_delta and tmle_delta_s:
tml_deltahat <- tmle_delta(data = exampledata, folds = 'ff', id = 'ID', x = 'X_0', g = 'G_0', a = paste0('A_', 0:tt), y = yvars, s = paste0('S_', 0:t0), binary_lrnr = lrn(lrnb, predict_type = 'prob'), cont_lrnr = lrn(lrnc)) tml_deltahat_s <- tmle_delta_s(data = exampledata, folds = 'ff', id = 'ID', x = 'X_0', g = 'G_0', a = paste0('A_', 0:tt), y = yvars, s = paste0('S_', 0:t0), binary_lrnr = lrn(lrnb, predict_type = 'prob'), cont_lrnr = lrn(lrnc)) estimate_R(tml_deltahat$if_data[[1]]$eif, tml_deltahat_s$if_data[[1]]$eif)
Again, the output shows the estimate, standard error (se), and lower and upper bounds for the 95\% confidence interval (ci_l and ci_h), for each estimand, where R is the proportion of the treatment effect explained.
That's all for now!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.