In MattWiener/heemodFits: Using Survival Fits with Heemod

library(heemod)
library(heemodFits)
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
library(flexsurv)

Running an oncology cost/benefit model in heemod involves several steps:

Fit available survival data under different distributional assumptions.
Examine the survival fits and decide which ones are most appropriate to use in the economic model.
Run the economic model in heemod with the chosen survival fits.

This document will demonstrate how to accomplish those steps in heemod, using simulated data sets included with the package.

base_dir <- system.file("surv", package = "heemodFits")
spec_file <- "example_oncSpecs_nomatrix.csv"

Other vignettes cover more general information on the files needed to run heemod from tabular input, and additional information on survival curve fitting. Here we will deal only with those aspects specific to fitting survival curves from data and using them to run economic models.

Specifying the data to fit

The source of the survival data, as well as the names and locations of the resulting fits, are specified in a file with one row for each (treatment, outcome) pair, where the outcome can be either progression-free survival (PFS), overall survival (OS), or time on treatment (ToT). Here’s an example of such a file, pointing to an example using randomly-generated data and two fake treatments, “A” and “B”:

surv_inf_file_name <- "survival_info.csv"
full_path <- file.path(base_dir, surv_inf_file_name)
surv_info <- heemodFits:::read_file(full_path)
print(surv_info)

The data_directory and data_file columns allow us to specify the location of data separately for each row. The fit_directory, fit_file, and fit_name columns, similarly, allow us specify where the fits should be kept, and what the R object containing the fits should be called. time_col, treatment_col, and censor_col give us the names of the columns in the data files that contain the three elements needed to specify the model. Together, these give us a great deal of flexibility. For example, we may keep our data in a set of directories that are, for most analysts, read-only, while still allowing them to save their fits (each analyst can even save their own fits, though it seems preferable to have them shared). In these cases, we may sometimes want to specify an absolute directory for our data files instead of a subdirectory of our base directory. heemod analyzes entries in the data_directory column to understand whether it is an absolute file path (for example, whether it begins with a double slash, or, on Windows, a letter followed by a colon), and treats them accordingly. We can have data for different treatments and outcomes in different files or in the same file - if they are in the same file, we can specify the columns to be used to specify the different outcomes. Because different data sets may specify events and censoring in different ways, the columns event_code and censor_code allow us to specify those. If those columns are omitted, event_code will be set to 1 and censor_code to 0.

An additional column, treatment_flag_col, specifies a column that tells whether given data points should be included in the analysis - typically based on either intention to treat or actual treatment delivery. If the column is not present, it will be set to 'ITTFL' (intention to treat flag) for rows in which the type is 'PFS' or 'OS', and 'TRTFL' (treatment flag) for rows in which the type is "ToT".

Specifying multiple subsets of data for survival models

Some analyses require fitting survival curves for different subsets of the data. These subsets may be based on demographic or potential biomarker variables - anything that could potentially influence survival. The file defining the subsets should be in the fit directory where the fits will be stored (which may often be the same place where the raw data is stored). That allows different people to fit different subsets if necessary, and also allows for defining subsets if the data directory is read-only for the analyst. Currently the file must be called "set_definitions.X", where "X" can be ".csv", ".xls", or ".xlsx" depending on the file format.

Below is an example:

subset_def_file_name <- "set_definitions.csv"
full_path <- file.path(base_dir, "survival_data", subset_def_file_name)
subset_info <- heemodFits:::read_file(full_path)
print(subset_info)

The file must contain three columns: treatment, set_name, and condition. "treatment" must match the treatment column in the survival information file, because that is how fits will be associated to the appropriate treatments. The set_name and condition files define the names of the subsets, which will be used later in specifying which fits to use in a particular analysis, and the condition defining the subset. The condition must be a statement that can be parsed in R, and variables used should match the relevant column names in the survival data file. For example, the file presented here uses "time" in rows 3 and 4 because that is the column name in our example data files. If the example data used "Time" or "t" or "event_time", the condition statements would need to use those names instead.

One notable case highlighted here is subsets based on time itself - for example, we may wish to fit the a survival model only for event times after some chosen time. In this case, we will generally want to fit the survival model as if the chosen time were time 0. We can do this by adding the column time_subract with the appropriate number. This currently requires the analyst to take care that the name of the subset, the condition defining the subset, and the time subtracted all match up. For example, we would not want to name a subset time.gt.50 but then define it using the condition time > 100, nor would we want to assign a time_subtract other than 50 to a subset with that name. In the future, we may automatically create the time condition based on time_subract, but names will still need to be checked by the analyst. If time_subtract is not provided in the file, or is missing in any given row, if it assumed to be 0.

It is also possible to define subsets not just by treatment, but separately for each survival curve type, PFS or OS. In the case below, the subset based on time (GT50) has been defined only for PFS, and omitted for OS. If the type column is included, then subsets desired for both PFS and OS must be specified twice. If the type column is omitted, as in the example above, then each subset will be fit for both PFS and OS.

subset_def_file_name <- "set_def_pfs_os.csv"
full_path <- file.path(base_dir, "survival_data", subset_def_file_name)
subset_info <- heemodFits:::read_file(full_path)
print(subset_info)

If the set_definition file is missing, then a single subset, called "all", with condition equal to TRUE (which takes all points) will be fit for each treatment. If the set_definition file is present, it should include the "all" subset, as in the first two rows of the example above (unless, of course, you don't want to fit a survival curve based on all the data).

The example also demonstrates that we need not define the same subsets, or even the same number of subsets, for each treatment. Here, although both A and B have values for "biomarker" in the files, we choose to define a subset based on "biomarker" only for treatment B.
All the fits for all the defined subsets will be fit at the same time, and stored in a data frame (actually a tibble, for those interested in the technical details) for future use.

Fitting survival models

The function partitioned_survival_from_tabular fits the survival models to data. Here we use save_fits = TRUE; it is also possible to set save_fits = FALSE to obtain the fits in R without saving them to a file. The specification file will be used again to run the economic model, but to create the fits it needs only the tm field. tm stands for "transition matrix", because the heemod package initially took only Markov transition matrices; now it can take the information to specify partitioned survival models also - in this case, the survival information file just discussed. If desired, the rest of the fields can be added later.

The distributions currently used for survival fits are exponential, Weibull, lognormal, log-logistic, gamma, Gompertz, and generalized gamma. If fitting fails for some distribution, heemod will note the error and continue, so results will be available for the other distributional assumptions. (Here we set save_fits = FALSE because we've already saved the fits in the package.)

temp1 <- survival_fits_from_tabular(base_dir, spec_file, 
                                    save_fits = FALSE)

heemod does not automatically choose a distribution to use in the survival model - the modeler must choose the most appropriate distribution for each set of survival data. Fit quality can be examined using the negative log-likelihood, the Aikake and Bayes information critera, and various kinds of graphs. Below we remove the fit, set_def, and time_subract columns so the fit metrics are visible. Note that fit metric values are not generally comparable across different subsets, which are fit on different subsets of data. They are more appropriate for comparing across different models fit to the same data.

names(temp1)

extract_surv_fit_metrics(temp1, c("AIC", "BIC" ,"m2LL"))[, -c(5,6, 7)]

Specifying which fits to use in a given analysis

When a modeler has decided to run the economic analysis using a particular set of fits, the specification must be included in a file. The name of the file goes in the specification file, in a row with data element use_fits.

The file has three or four columns: \itemize{ \item{.strategy, corresponding to the strategies in the model;} \item{.type, which for each row is either 'pfs' or 'os';} \item{dist, specifying the survival distribution to use;} \item{until (optionally), specifying when the distribution ceases to apply.} } until needs to be specified only if there is more than one row with the same .strategy and .type. In that case, each distribution will pick up where the previous one left off. (The rows need not be presented in order, though that is a good practice for readability; the code will sort distributions into order by until.) Here we don't require it.

If we define use_fits as

use_fits_file <- file.path(base_dir, "use_fits_example0.csv")
use_fits <- heemodFits:::read_file(use_fits_file)
use_fits

then the analysis will use an exponential survival model for overall survival for both strategies, but will use the gamma survival model for progression-free survival for A, and a lognormal survival model for progression-free survival for B.

We can also use fit('km') to get the Kaplan-Meier curve for a given fit.

Running the analysis using already-fit survival models

The same specification file that was used for survival_fits_from_tabular can be used for run_model_tabular. run_model_tabular uses the information in the specification file to pick up the desired fits from those created earlier. No survival fitting happens when running run_model_tabular, only selection. The selection is governed by the use_fits element of the reference file.

Here is the use_fits element of this file:

ref <- heemodFits:::read_file(file.path(base_dir, spec_file))
use_fits_file <- file.path(base_dir, ref[ref$data == "use_fits", "file"])
use_fits <- heemodFits:::read_file(use_fits_file)
use_fits

surv_example <- heemod::run_model_tabular(base_dir, spec_file,
                                  run_psa = FALSE, run_demo = FALSE,
                                  save = FALSE, overwrite= FALSE)

The model has run as requested:

plot(surv_example$model_runs)

We can also mix explicit survival distribution definitions with fits, as shown below.

spec_file <- "example_oncSpecs_mixed.csv"
ref <- heemodFits:::read_file(file.path(base_dir, spec_file))
mixed_use_fits_file <- file.path(base_dir, ref[ref$data == "use_fits", "file"])
mixed_use_fits <- heemodFits:::read_file(mixed_use_fits_file)
mixed_use_fits

surv_example_mixed <- heemod::run_model_tabular(base_dir, spec_file,
                                  run_psa = FALSE, run_demo = FALSE,
                                  save = FALSE, overwrite= FALSE)
plot(surv_example_mixed$model_runs)

We can also specify more complex models, using all of the modifications to survival fits that are described in the survival vignette. In particular, we can use the Kaplan-Meier curve until a particular time and then switch over to a fit, as shown below.

spec_file <- "example_oncSpecs_nomatrix_join.csv"
ref <- heemodFits:::read_file(file.path(base_dir, spec_file))
join_use_fits_file <- file.path(base_dir, ref[ref$data == "use_fits", "file"])
join_use_fits <- heemodFits:::read_file(join_use_fits_file)
join_use_fits

In this case, we see the less regular changes of the Kaplan-Meier curve until time 50, followed by the more regular changes of the fit model.

surv_example_mixed <- heemod::run_model_tabular(base_dir, spec_file,
                                  run_psa = FALSE, run_demo = FALSE,
                                  save = FALSE, overwrite= FALSE)
plot(surv_example_mixed$model_runs)