In adibender/pammtools: Piece-Wise Exponential Additive Mixed Modeling Tools for Survival Analysis

library(knitr)
opts_chunk$set(
  fig.align  = "center",
  fig.width  = 4,
  fig.height = 4,
  crop       = TRUE)

library(dplyr)
library(tidyr)
library(pammtools)

In this vignette we provide details on transforming data into a format suitable to fit piece-wise exponential (additive) models (PAM). Three main cases need to be distinguished

Standard time-to-event data

In the case of "standard" time-to-event the data, the transformation is relatively straight forward and handled by the as_ped (as piece-wise exponential data) function. This function internally calls survival::survSplit, thus all arguments available for the survSplit function are also available to the as_ped function.

Using defaults

As an example data set we first consider a data set of a stomach area tumor available from the pammtools package. Patients were operated on to remove the tumor. Some patients experienced complications during the operation:

data("tumor")
tumor <- tumor %>%
  mutate(id = row_number()) %>%
  select(id, everything())
head(tumor)

Each row contains information on the survival time (days), an event indicator (status) as well as information about comorbidities (charlson_score), age, sex, and covariates that might be relevant like whether or not complications occurred (complications) and whether the cancer metastasized (metastases).

To transform the data into piece-wise exponential data (ped) format, we need to

define $J+1$ interval break points $t_{\min} = \kappa_0 < \kappa_1 < \cdots < \kappa_J = t_{\max}$
create pseudo-observations for each interval $j = 1,\ldots, J$ in which subject $i, i = 1,\ldots,n$ was under risk.

Using the pammtools package this is easily achieved with the as_ped function as follows:

ped <- as_ped(Surv(days, status) ~ ., data = tumor, id = "id")
ped %>% filter(id %in% c(132, 141, 145)) %>%
  select(id, tstart, tend, interval, offset, ped_status, age, sex, complications)

When no cut points are specified, the default is to use the unique event times. As can be seen from the above output, the function creates an id variable, indicating the subjects $i = 1,\ldots,n$, with one row per interval that the subject "visited". Thus,

subject 132 was at risk during 4 intervals,
subject 141 was at risk during 12 intervals,
and subject 145 in 7 intervals.

In addition to the optional id variable the function also creates

tstart: the beginning of each interval
tend: the end of each interval
interval: a factor variable denoting the interval
offset: the log of the duration during which the subject was under risk in that interval

Additionally the time-constant covariates (here: age, sex, complications,) are repeated $n_i$ number of times, where $n_i$ is the number of intervals during which subject $i$ was in the risk set.

Custom cut points

Per default, the as_ped function uses all unique event times as cut points (which is a sensible default as for example the (cumulative) baseline hazard estimates in Cox-PH functions are also updated at event times). Also, PAMMs represent the baseline hazard via splines whose basis functions are evaluated at the cut points. Using unique events times automatically increases the density of cut points in relevant areas of the follow-up.

In some cases, however, one might want to reduce the number of cut points, to reduce the size of the resulting data and/or faster estimation times. To do so the cut argument has to be specified. Following the above example, we can use only 4 cut points $\kappa_0 = 0, \kappa_1 = 3, \kappa_2 = 7, \kappa_3 = 15$, resulting in $J = 3$ intervals:

ped2 <- as_ped(Surv(days, status) ~ ., data = tumor, cut = c(0, 3, 7, 15), id = "id")
ped2 %>%
  filter(id %in% c(132, 141, 145)) %>%
  select(id, tstart, tend, interval, offset, ped_status, age, sex, complications)

Note that now subjects 141 and 145 have both three rows in the data set. The fact that subject 141 was in the risk set of the last interval for a longer time than subject 145 is, however, still accounted for by the offset variable. Note that the offset for subject 145 is only 0, because the last interval for this subject starts at $t=7$ and the subject experienced and event at $t=8$, thus $\log(8-7) = 0$

tumor %>% slice(c(85, 112))

Left-truncated Data

In left truncated data, subjects enter the risk set at some time-point unequal to zero. Such data usually ships in the so called start-stop notation, where for each subject the start time specifies the left-truncation time (can also be 0) and the stop time the time-to-event, which can be right-censored.

For an illustration, we use a data set on infant mortality and maternal death available in the eha package (see ?eha::infants) for details.

data(infants, package = "eha")
head(infants)

As children were included in the study after death of their mother, their survival time is left-truncated at the entry time (variable enter). When transforming such data to the PED format, we need to create a interval cut-point at each entry time. In general, each interval has to be constructed such that only subjects already at risk have an entry in the PED data. The previously introduced function as_ped can be also used for this setting, however, the LHS of the formula must be in start-stop format:

infants_ped <- as_ped(Surv(enter, exit, event) ~ mother + age, data = infants)
infants_ped %>% filter(id == 4)

For analysis of such data refer to the Left-truncation vignette

Data with time-dependent covariates

In case of data with time-dependent covariates (that should not be modeled as cumulative effects), the follow-up is usually split at desired cut-points and additionally at the time-points at which the TDC changes its value.

We assume that the data is provided in two data sets, one that contains time-to-event data and time-constant covariates and one data set that contains information of time-dependent covariates.

For illustration, we use the pbc data from the survival package. Before the actual data transformation we perform the data preprocessing in vignette("timedep", package="survival"):

data("pbc", package = "survival") # loads pbc and pbcseq
pbc <- pbc %>% filter(id <= 312) %>%
  mutate(status = 1L * (status == 2)) %>%
  select(id:sex, bili, protime)

We want to transform the data such that we could fit a concurrent effect of the time-dependent covariates bili and protime, therefore, the RHS of the formula passed to as_ped needs to contain an additional component, separated by | and all variables for which should be transformed to the concurrent format specified within the concurrent special. The data sets are provided within a list. In concurrent, we also have to specify the name of the variable which stores information on the time points at which the TDCs are updated.

ped_pbc <- as_ped(
  data    = list(pbc, pbcseq),
  formula = Surv(time, status) ~ . + concurrent(bili, protime, tz_var = "day"),
  id      = "id")

Multiple data sets can be provided within the list and multiple concurrent terms with different tz_var arguments can be specified with different lags. This will increase the data set, as an (additional) split point will be generated for each (lagged) time-point at which the TDC was observed:

pbc_ped <- as_ped(data = list(pbc, pbcseq),
  formula = Surv(time, status) ~ . + concurrent(bili, tz_var = "day") +
    concurrent(protime, tz_var = "day", lag = 10),
    id = "id")

If different tz_var arguments are provided, the union of both will be used as cut points in addition to usual cut points.

Warning: For time-points on the follow up that have no observation of the TDC, the last observation will be carried forward until the next update is available.

Data with time-dependent covariates (with cumulative effects)

Here we demonstrate data transformation of data with TDCs ($\mathbf{z}={z(t_{z,q}),q=1,\ldots,Q}$, that subsequently will be modeled as cumulative effects defined as

$$ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)} h(t, t_z, z(t_z))\mathrm{d}t_z $$ Here

the three-variate function $h(t,t_z,z(t_z))$ defines the so-called partial effects of the TDC $z(t_z)$ observed at exposure time $t_z$ on the hazard at time $t$ [@Bender2018]. The general partial effect definition given above is only one possibility, other common choices are
- $h(t-t_z)z(t_z)$ (This is the WCE model by @Sylvestre2009)
- $h(t-t_z, z(t_z))$ (This is the DLNM model by @Gasparrini2017)
the cumulative effect $g(\mathbf{z}, t)$ at follow-up time $t$ is the integral (or sum) of the partial effects over exposure times $t_z$ contained within $\mathcal{T}(t)$
the so called lag-lead window (or window of effectiveness) is denoted by $\mathcal{T}(t)$. This represents the set of exposure times at which exposures can affect the hazard rate at time $t$. The most common definition is $\mathcal{T}(t) = {t_{z,q}: t \geq t_{z,q}, q=1,\ldots, Q}$, which means that all exposures that were observed prior to $t$ or at $t$ are eligible.

As before we use the function as_ped to transform the data and additionally use the formula special cumulative (for functional covariate) to specify the partial effect structure on the right-hand side of the formula.

Let $t$ (time) the follow up time, $\mathbf{z}1$ (z1) and $\mathbf{z}_2$ (z2) two TDCs observed at different exposure time grids $t_z$ (tz1) and $s_z$ (tz2) with lag-lead windows $\mathcal{T}_1(t) = {t{z,q}: t \geq t_{z,q}}$ and $\mathcal{T}2(t) = {s{z,k}: t \geq s_{z,k} + 2}$ (defined by ll2 <- function(t, tz) { t >= tz + 2}).

The table below gives a selection of possible partial effects and the usage of cumulative to create matrices needed to estimate different types of cumulative effects of $\mathbf{z}_1$ and $\mathbf{z}_2$:

| partial effect | as_ped specification | |------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------| | $\int_{\mathcal{T}1}h(t-t_z, z_1(t_z))$ | cumulative(latency(tz1), z1, tz_var= "tz1") | | $\int{\mathcal{T}1}h(t, t-t_z, z_1(t_z))$ | cumulative(time, latency(tz1), z1, tz_var="tz1") | | $\int{\mathcal{T}1}h(t, t_z, z_1(t_z))$ | cumulative(time, tz1, z1, tz_var="tz1") | | $\int{\mathcal{T}1}h(t, t_z, z_1(t_z)) + \int{\mathcal{T}_2}h(t-s_z, z_2(s_z))$ | cumulative(time, tz1, z1, tz_var="tz1") + cumulative(latency(tz2), z2, tz_var="tz2", ll_fun=ll2) | | ... | ... |

Note that

the variable representing follow-up time $t$ in cumulative (here time) must match the time variable specified on the left-hand side of the formula
the variable representing exposure time $t_z$ (here tz1 and tz2) must be wrapped within latency() to tell as_ped to calculate $t-t_z$
by default, $\mathcal{T}(t)$ is defined as function(t, tz) {t >= tz}, thus for $\mathcal{T}_1$ it is not necessary to explicitly specify the lag-lead window. In case you want to use a custom lag-lead window, provide the respective function to the ll_fun argument in cumulative (see ll2 in the examples above)
cumulative makes no difference between partial effects such as $h(t-t_z, z(t_z))$ and $h(t-t_z)z(t_z)$ as the necessary data transformation is the same in both cases. Later, however, when fitting the model, a distinction must be made
more than one $z$ variable can be provided to cumulative, which can be convenient if multiple covariates should have the same time components and the same lag-lead window
multiple cumulative terms can be specified, having different exposure times t_z, s_z and/or different lag-lead windows for different covariates $\mathbf{z}_1$, $\mathbf{z}_2$
To tell cumulative which of the variables specified is the exposure time $t_z$, the tz_var argument must be specified within each cumulative term. The follow-up time component $t$ will be automatically recognized via the left-hand side of the formula.

One time-dependent covariate

For illustration we use the ICU patients data sets patient and daily provided in the pammtools package, with follow-up time survhosp ($t$), exposure time Study_Day ($t_z$) and TDCs caloriesPercentage $\mathbf{z}_1$ and proteinGproKG ($\mathbf{z}_2$):

head(patient)
head(daily)

Below we illustrate the usage of as_ped and cumulative to obtain covariate matrices for the latency matrix of follow up time and exposure time and a matrix of the TDC (caloriesPercentage). The follow up will be split in 30 intervals with interval breaks points 0:30. By default, the exposure time provided to tz_var will be used as a suffix for the names of the new matrix columns to indicate the exposure they refer to, however, you can override the suffix by specifying the suffix argument to cumulative.

ped <- as_ped(
  data    = list(patient, daily),
  formula = Surv(survhosp, PatientDied) ~ . +
    cumulative(latency(Study_Day), caloriesPercentage, tz_var = "Study_Day"),
  cut     = 0:30,
  id = "CombinedID")
str(ped, 1)

As you can see, the data is transformed to the PED format as before and two additional matrix columns are added, Study_Day_latency and caloriesPercentage.

Two time-dependent covariates observed on the same exposure time grid

Using the same data, we show a slightly more complex example with two functional covariate terms with partial effects $h(t, t-t_z, z_1(t_z))$ and $h(t-t_z, z_2(t_z))$. As in the case of ICU patient data, both TDCs were observed on the same exposure time grid, we want to use the same lag-lead window and the latency term $t-t_z$ occurs in both partial effects, we only need to specify one cumulative term.

ped <- as_ped(
  data = list(patient, daily),
  formula = Surv(survhosp, PatientDied) ~ . +
    cumulative(survhosp, latency(Study_Day), caloriesPercentage, proteinGproKG,
      tz_var = "Study_Day"),
  cut = 0:30,
  id = "CombinedID")
str(ped, 1)

Multiple TDCs observed at different exposure time grids

To illustrate data transformation when TDCs $\mathbf{z}_1$ and $\mathbf{z}_2$ were observed on different exposure time grids ($t_z$ and $s_z$) and are assumed to have different lag_lead windows $\mathcal{T}_1(t)$ and $\mathcal{T}_2(t)$, we use simulated data simdf_elra contained in pammtools (see example in sim_pexp for data generation).

simdf_elra
simdf_elra %>% slice(1) %>% select(id, tz1) %>% unnest()
simdf_elra %>% slice(1) %>% select(id, tz2) %>% unnest()

Note that tz1 has a maximum length of 10, while tz2 has a maximum length of 11 and while tz1 was observed during the follow up, tz2 was partially observed before the beginning of the follow up.

If we wanted to estimate the following cumulative effects

$g(\mathbf{z}1, t) = \int{\mathcal{T}_1(t)} h(t, t-t_z, z_1(s_z))$ and
$g(\mathbf{z}2, t) = \int{\mathcal{T}_2(t)} h(t-t_z, z_2(s_z))$

with $\mathcal{T}_1(t) = {t_z: t \geq t_z + 2}$ and $\mathcal{T}_2(t) = {s_z: t \geq s_z}$ the data transformation function must reflect the different exposure times as well as different lag-lead window specifications as illustrated below:

ped_sim <- as_ped(
  data = simdf_elra,
  formula = Surv(time, status) ~ . +
    cumulative(time, latency(tz1), z.tz1, tz_var = "tz1") +
    cumulative(latency(tz2), z.tz2, tz_var = "tz2"),
  cut = 0:10,
  id = "id")
str(ped_sim, 1)

Note how the matrix covariates associated with tz1 have 10 columns while matrix covariates associated with tz2 have 11 columns. Note also that the lag-lead matrices differ