library(heemod) library(heemodFits) library(knitr) knitr::opts_chunk$set(echo = TRUE) library(flexsurv)
Running an oncology cost/benefit model in heemod involves several steps:
Fit available survival data under different distributional assumptions.
Examine the survival fits and decide which ones are most appropriate to use in the economic model.
Run the economic model in heemod with the chosen survival fits.
This document will demonstrate how to accomplish those steps in heemod, using simulated data sets included with the package.
base_dir <- system.file("surv", package = "heemodFits") spec_file <- "example_oncSpecs_nomatrix.csv"
Other vignettes cover more general information on the files needed to run heemod from tabular input, and additional information on survival curve fitting. Here we will deal only with those aspects specific to fitting survival curves from data and using them to run economic models.
The source of the survival data, as well as the names and locations of the resulting fits, are specified in a file with one row for each (treatment, outcome) pair, where the outcome can be either progression-free survival (PFS), overall survival (OS), or time on treatment (ToT). Here’s an example of such a file, pointing to an example using randomly-generated data and two fake treatments, “A” and “B”:
surv_inf_file_name <- "survival_info.csv" full_path <- file.path(base_dir, surv_inf_file_name) surv_info <- heemodFits:::read_file(full_path) print(surv_info)
The data_directory
and data_file
columns allow us to specify the location
of data separately for each row. The fit_directory
, fit_file
, and
fit_name
columns, similarly, allow us specify where the fits should be kept,
and what the R object containing the fits should be called. time_col
,
treatment_col
, and censor_col
give us the names of the columns in the data
files that contain the three elements needed to specify the model. Together,
these give us a great deal of flexibility. For example, we may keep our data
in a set of directories that are, for most analysts, read-only, while still
allowing them to save their fits (each analyst can even save their own fits,
though it seems preferable to have them shared). In these cases, we may
sometimes want to specify an absolute directory for our data files instead of
a subdirectory of our base directory. heemod
analyzes entries in the data_directory
column to understand whether it is an absolute file path (for example, whether
it begins with a double slash, or, on Windows, a letter followed by a colon),
and treats them accordingly. We can have data for
different treatments and outcomes in different files or in the same file -
if they are in the same file, we can specify the columns to be used
to specify the different outcomes. Because different data sets
may specify events and censoring in different ways, the columns event_code
and censor_code
allow us to specify those. If those columns are omitted,
event_code
will be set to 1 and censor_code
to 0.
An additional column, treatment_flag_col
, specifies a column that tells whether
given data points should be included in the analysis - typically based on either
intention to treat or actual treatment delivery. If the column is not present,
it will be set to 'ITTFL' (intention to treat flag) for rows in which the
type is 'PFS' or 'OS', and 'TRTFL' (treatment flag) for rows in which the type
is "ToT".
Some analyses require fitting survival curves for different subsets of the data. These subsets may be based on demographic or potential biomarker variables - anything that could potentially influence survival. The file defining the subsets should be in the fit directory where the fits will be stored (which may often be the same place where the raw data is stored). That allows different people to fit different subsets if necessary, and also allows for defining subsets if the data directory is read-only for the analyst. Currently the file must be called "set_definitions.X", where "X" can be ".csv", ".xls", or ".xlsx" depending on the file format.
Below is an example:
subset_def_file_name <- "set_definitions.csv" full_path <- file.path(base_dir, "survival_data", subset_def_file_name) subset_info <- heemodFits:::read_file(full_path) print(subset_info)
The file must contain three columns: treatment, set_name, and condition. "treatment" must match the treatment column in the survival information file, because that is how fits will be associated to the appropriate treatments. The set_name and condition files define the names of the subsets, which will be used later in specifying which fits to use in a particular analysis, and the condition defining the subset. The condition must be a statement that can be parsed in R, and variables used should match the relevant column names in the survival data file. For example, the file presented here uses "time" in rows 3 and 4 because that is the column name in our example data files. If the example data used "Time" or "t" or "event_time", the condition statements would need to use those names instead.
One notable case highlighted here is subsets based on time itself - for example,
we may wish to fit the a survival model only for event times after some chosen time.
In this case, we will generally want to fit the survival model as if the
chosen time were time 0. We can do this by adding the column time_subract
with the appropriate number. This currently requires the analyst to take
care that the name of the subset, the condition defining the subset, and
the time subtracted all match up. For example, we would not want to
name a subset time.gt.50
but then define it using the condition time > 100
,
nor would we want to assign a time_subtract
other than 50 to a subset with
that name. In the future, we may automatically create the time condition
based on time_subract
, but names will still need to be checked by the
analyst. If time_subtract
is not provided in the file, or is missing
in any given row, if it assumed to be 0.
It is also possible to define subsets not just by treatment, but separately for
each survival curve type,
PFS or OS. In the case below, the subset based on time (GT50)
has been defined only for PFS, and omitted for OS. If the type
column is
included, then subsets desired for both PFS and OS must be specified twice.
If the type
column is omitted, as in the example above, then each subset will
be fit for both PFS and OS.
subset_def_file_name <- "set_def_pfs_os.csv" full_path <- file.path(base_dir, "survival_data", subset_def_file_name) subset_info <- heemodFits:::read_file(full_path) print(subset_info)
If the set_definition
file is missing, then a single subset, called "all",
with condition equal to TRUE
(which takes all points) will be fit
for each treatment. If the set_definition
file is present, it should
include the "all" subset, as in the first two rows of the example above
(unless, of course, you don't want to fit a survival curve based on all the
data).
The example also demonstrates that we need not define the same subsets,
or even the same number of subsets, for each treatment. Here, although
both A and B have values for "biomarker" in the files, we choose to
define a subset based on "biomarker" only for treatment B.
All the fits for all the defined subsets will be fit at the same time,
and stored in a data frame (actually a tibble, for those interested
in the technical details) for future use.
The function partitioned_survival_from_tabular
fits the survival models
to data.
Here we use save_fits = TRUE
; it is also possible to set save_fits = FALSE
to obtain the fits in R without saving them to a file. The specification file
will be used again to run the economic model, but to create the fits it needs
only the tm
field. tm
stands for "transition matrix", because the heemod
package
initially took only Markov transition matrices; now it can take the information
to specify partitioned survival models also - in this case, the survival information
file just discussed. If desired, the rest of the fields can be
added later.
The distributions currently used for survival fits are exponential, Weibull,
lognormal, log-logistic, gamma, Gompertz, and generalized gamma.
If fitting fails for some distribution, heemod
will note the error and
continue, so results will be available for the other distributional assumptions.
(Here we set save_fits = FALSE
because we've already saved the fits in
the package.)
temp1 <- survival_fits_from_tabular(base_dir, spec_file, save_fits = FALSE)
heemod
does not automatically choose a distribution to use in the survival model - the modeler must choose the most appropriate distribution for each set of survival data.
Fit quality can be examined using the negative log-likelihood, the Aikake
and Bayes information critera, and various kinds of graphs. Below we remove the fit
, set_def
, and time_subract
columns so the fit metrics are visible. Note that fit metric values are not
generally comparable across different subsets, which are fit on different subsets of data.
They are more appropriate for comparing across different models fit to the same data.
names(temp1)
extract_surv_fit_metrics(temp1, c("AIC", "BIC" ,"m2LL"))[, -c(5,6, 7)]
When a modeler has decided to run the economic analysis using a particular set of fits, the specification must be included in a file. The name of the file goes in the
specification file, in a row with data
element use_fits
.
The file has three or four columns: \itemize{
\item{.strategy
, corresponding to the strategies in the model;}
\item{.type
, which for each row is either 'pfs' or 'os';}
\item{dist
, specifying the survival distribution to use;}
\item{until
(optionally), specifying when the distribution ceases to apply.}
}
until
needs to be specified only if there is more than one row with the same
.strategy
and .type
. In that case, each distribution will pick up where
the previous one left off. (The rows need not be presented in order, though
that is a good practice for readability; the code will sort distributions into
order by until
.) Here we don't require it.
If we define use_fits
as
use_fits_file <- file.path(base_dir, "use_fits_example0.csv") use_fits <- heemodFits:::read_file(use_fits_file) use_fits
then the analysis will use an exponential survival model for overall survival for both strategies, but will use the gamma survival model for progression-free survival for A, and a lognormal survival model for progression-free survival for B.
We can also use fit('km')
to get the Kaplan-Meier curve for a given fit.
The same specification file that was used for survival_fits_from_tabular
can be used for run_model_tabular
. run_model_tabular
uses the information in the specification file to pick up the desired fits from those created earlier. No survival fitting happens when running run_model_tabular, only selection. The selection is governed by the use_fits
element of the reference file.
Here is the use_fits element of this file:
ref <- heemodFits:::read_file(file.path(base_dir, spec_file)) use_fits_file <- file.path(base_dir, ref[ref$data == "use_fits", "file"]) use_fits <- heemodFits:::read_file(use_fits_file) use_fits
surv_example <- heemod::run_model_tabular(base_dir, spec_file, run_psa = FALSE, run_demo = FALSE, save = FALSE, overwrite= FALSE)
The model has run as requested:
plot(surv_example$model_runs)
We can also mix explicit survival distribution definitions with fits, as shown below.
spec_file <- "example_oncSpecs_mixed.csv" ref <- heemodFits:::read_file(file.path(base_dir, spec_file)) mixed_use_fits_file <- file.path(base_dir, ref[ref$data == "use_fits", "file"]) mixed_use_fits <- heemodFits:::read_file(mixed_use_fits_file) mixed_use_fits
surv_example_mixed <- heemod::run_model_tabular(base_dir, spec_file, run_psa = FALSE, run_demo = FALSE, save = FALSE, overwrite= FALSE) plot(surv_example_mixed$model_runs)
We can also specify more complex models, using all of the modifications to survival fits that are described in the survival vignette. In particular, we can use the Kaplan-Meier curve until a particular time and then switch over to a fit, as shown below.
spec_file <- "example_oncSpecs_nomatrix_join.csv" ref <- heemodFits:::read_file(file.path(base_dir, spec_file)) join_use_fits_file <- file.path(base_dir, ref[ref$data == "use_fits", "file"]) join_use_fits <- heemodFits:::read_file(join_use_fits_file) join_use_fits
In this case, we see the less regular changes of the Kaplan-Meier curve until time 50, followed by the more regular changes of the fit model.
surv_example_mixed <- heemod::run_model_tabular(base_dir, spec_file, run_psa = FALSE, run_demo = FALSE, save = FALSE, overwrite= FALSE) plot(surv_example_mixed$model_runs)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.