library(knitr) opts_chunk$set( fig.align = "center", fig.width = 4, fig.height = 4, crop = TRUE)
library(dplyr) library(tidyr) library(pammtools)
In this vignette we provide details on transforming data into a format suitable to fit piece-wise exponential (additive) models (PAM). Three main cases need to be distinguished
In the case of "standard" time-to-event the data, the transformation is relatively straight forward and handled by the as_ped
(as piece-wise exponential data) function. This function internally calls survival::survSplit
, thus all arguments available for the survSplit
function are also available to the as_ped
function.
As an example data set we first consider a data set of a stomach area tumor
available from the pammtools
package. Patients were operated on to remove the tumor. Some patients experienced complications during the operation:
data("tumor") tumor <- tumor %>% mutate(id = row_number()) %>% select(id, everything()) head(tumor)
Each row contains information on the survival time (days
), an event indicator
(status
) as well as information about comorbidities (charlson_score
), age
, sex
, and covariates that might be relevant like whether or not complications occurred (complications
) and whether the cancer metastasized (metastases
).
To transform the data into piece-wise exponential data (ped
) format, we need to
define $J+1$ interval break points $t_{\min} = \kappa_0 < \kappa_1 < \cdots < \kappa_J = t_{\max}$
create pseudo-observations for each interval $j = 1,\ldots, J$ in which subject $i, i = 1,\ldots,n$ was under risk.
Using the pammtools
package this is easily achieved with the as_ped
function
as follows:
ped <- as_ped(Surv(days, status) ~ ., data = tumor, id = "id") ped %>% filter(id %in% c(132, 141, 145)) %>% select(id, tstart, tend, interval, offset, ped_status, age, sex, complications)
When no cut points are specified, the default is to use the unique event times.
As can be seen from the above output, the function creates an id
variable,
indicating the subjects $i = 1,\ldots,n$, with one row per interval that
the subject "visited". Thus,
132
was at risk during 4 intervals,141
was at risk during 12 intervals,145
in 7 intervals.In addition to the optional id
variable the function also creates
tstart
: the beginning of each intervaltend
: the end of each intervalinterval
: a factor variable denoting the intervaloffset
: the log of the duration during which the subject was under risk in
that intervalAdditionally the time-constant covariates (here: age
, sex
, complications
,)
are repeated $n_i$ number of times, where $n_i$ is the number of intervals during
which subject $i$ was in the risk set.
Per default, the as_ped
function uses all unique event times as cut points
(which is a sensible default as for example the (cumulative) baseline hazard
estimates in Cox-PH functions are also updated at event times). Also, PAMMs represent the baseline hazard via splines whose basis functions are evaluated at
the cut points. Using unique events times automatically increases the density
of cut points in relevant areas of the follow-up.
In some cases, however, one might want to reduce the number of cut points,
to reduce the size of the resulting data and/or faster estimation times.
To do so the cut
argument has to be specified.
Following the above example, we can use only 4 cut points
$\kappa_0 = 0, \kappa_1 = 3, \kappa_2 = 7, \kappa_3 = 15$,
resulting in $J = 3$ intervals:
ped2 <- as_ped(Surv(days, status) ~ ., data = tumor, cut = c(0, 3, 7, 15), id = "id") ped2 %>% filter(id %in% c(132, 141, 145)) %>% select(id, tstart, tend, interval, offset, ped_status, age, sex, complications)
Note that now subjects 141
and 145
have both three rows in the data set.
The fact that subject 141
was in the risk set of the last interval for a longer time than subject 145
is, however, still accounted for by the offset
variable.
Note that the offset
for subject 145
is only 0
, because the last interval for this subject starts at $t=7$ and the subject experienced and event at $t=8$, thus $\log(8-7) = 0$
tumor %>% slice(c(85, 112))
In left truncated data, subjects enter the risk set at some time-point unequal to zero. Such data usually ships in the so called start-stop notation, where for each subject the start time specifies the left-truncation time (can also be 0) and the stop time the time-to-event, which can be right-censored.
For an illustration, we use a data set on infant mortality and maternal death
available in the eha
package (see ?eha::infants
) for details.
data(infants, package = "eha") head(infants)
As children were included in the study after death of their mother, their survival
time is left-truncated at the entry time (variable enter
). When transforming such data to the PED format, we need to create a interval cut-point at each entry time.
In general, each interval has to be constructed such that only subjects already at risk have an entry in the PED data. The previously introduced function as_ped
can be also used for this setting, however, the LHS of the formula must be in start-stop format:
infants_ped <- as_ped(Surv(enter, exit, event) ~ mother + age, data = infants) infants_ped %>% filter(id == 4)
For analysis of such data refer to the Left-truncation vignette
In case of data with time-dependent covariates (that should not be modeled as cumulative effects), the follow-up is usually split at desired cut-points and additionally at the time-points at which the TDC changes its value.
We assume that the data is provided in two data sets, one that contains time-to-event data and time-constant covariates and one data set that contains information of time-dependent covariates.
For illustration, we use the pbc
data from the survival
package.
Before the actual data transformation we perform the data preprocessing
in vignette("timedep", package="survival")
:
data("pbc", package = "survival") # loads pbc and pbcseq pbc <- pbc %>% filter(id <= 312) %>% mutate(status = 1L * (status == 2)) %>% select(id:sex, bili, protime)
We want to transform the data such that we could fit a concurrent effect of
the time-dependent covariates bili
and protime
, therefore, the RHS of
the formula passed to as_ped
needs to contain an additional component,
separated by |
and all variables for which should be transformed to the
concurrent format specified within the concurrent
special. The data sets
are provided within a list. In concurrent
, we also have to specify the name of
the variable which stores information on the time points at which the TDCs are
updated.
ped_pbc <- as_ped( data = list(pbc, pbcseq), formula = Surv(time, status) ~ . + concurrent(bili, protime, tz_var = "day"), id = "id")
Multiple data sets can be provided within the list and multiple concurrent
terms with different tz_var
arguments can be specified with different
lags. This will increase the data set, as an (additional) split point will be generated for each (lagged) time-point at which the TDC was observed:
pbc_ped <- as_ped(data = list(pbc, pbcseq), formula = Surv(time, status) ~ . + concurrent(bili, tz_var = "day") + concurrent(protime, tz_var = "day", lag = 10), id = "id")
If different tz_var
arguments are provided, the union of both will be used as
cut points in addition to usual cut points.
Warning: For time-points on the follow up that have no observation of the TDC, the last observation will be carried forward until the next update is available.
Here we demonstrate data transformation of data with TDCs ($\mathbf{z}={z(t_{z,q}),q=1,\ldots,Q}$, that subsequently will be modeled as cumulative effects defined as
$$ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)} h(t, t_z, z(t_z))\mathrm{d}t_z $$ Here
the three-variate function $h(t,t_z,z(t_z))$ defines the so-called
partial effects of the TDC $z(t_z)$ observed at exposure time
$t_z$ on the hazard at time $t$ [@Bender2018]. The general partial effect definition
given above is only one possibility, other common choices are
the cumulative effect $g(\mathbf{z}, t)$ at follow-up time $t$ is the
integral (or sum) of the partial effects over exposure times $t_z$ contained
within $\mathcal{T}(t)$
the so called lag-lead window (or window of effectiveness) is denoted by $\mathcal{T}(t)$. This represents the set of exposure times at which exposures can affect the hazard rate at time $t$. The most common definition is $\mathcal{T}(t) = {t_{z,q}: t \geq t_{z,q}, q=1,\ldots, Q}$, which means that all exposures that were observed prior to $t$ or at $t$ are eligible.
As before we use the function as_ped
to transform the data and additionally
use the formula special cumulative
(for functional covariate) to
specify the partial effect structure on the right-hand side of the formula.
Let $t$ (time
) the follow up time, $\mathbf{z}1$ (z1
) and $\mathbf{z}_2$ (z2
)
two TDCs observed at different exposure time grids $t_z$ (tz1
) and $s_z$ (tz2
)
with lag-lead windows $\mathcal{T}_1(t) = {t{z,q}: t \geq t_{z,q}}$
and $\mathcal{T}2(t) = {s{z,k}: t \geq s_{z,k} + 2}$ (defined by
ll2 <- function(t, tz) { t >= tz + 2}
).
The table below gives a selection of possible partial effects and the usage of
cumulative
to create matrices needed to estimate different types of cumulative
effects of $\mathbf{z}_1$ and $\mathbf{z}_2$:
| partial effect | as_ped
specification |
|------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| $\int_{\mathcal{T}1}h(t-t_z, z_1(t_z))$ | cumulative(latency(tz1), z1, tz_var= "tz1")
|
| $\int{\mathcal{T}1}h(t, t-t_z, z_1(t_z))$ | cumulative(time, latency(tz1), z1, tz_var="tz1")
|
| $\int{\mathcal{T}1}h(t, t_z, z_1(t_z))$ | cumulative(time, tz1, z1, tz_var="tz1")
|
| $\int{\mathcal{T}1}h(t, t_z, z_1(t_z)) + \int{\mathcal{T}_2}h(t-s_z, z_2(s_z))$ | cumulative(time, tz1, z1, tz_var="tz1") + cumulative(latency(tz2), z2, tz_var="tz2", ll_fun=ll2)
|
| ... | ... |
Note that
the variable representing follow-up time $t$ in cumulative
(here time
) must
match the time variable specified on the left-hand side of the formula
the variable representing exposure time $t_z$ (here tz1
and tz2
) must be
wrapped within latency()
to tell as_ped
to calculate $t-t_z$
by default, $\mathcal{T}(t)$ is defined as function(t, tz) {t >= tz}
,
thus for $\mathcal{T}_1$ it is not necessary to explicitly specify the lag-lead window.
In case you want to use a custom lag-lead window, provide the respective
function to the ll_fun
argument in cumulative
(see ll2
in the examples above)
cumulative
makes no difference between partial effects such as $h(t-t_z, z(t_z))$
and $h(t-t_z)z(t_z)$ as the necessary data transformation is the same in both
cases. Later, however, when fitting the model, a distinction must be made
more than one $z$ variable can be provided to cumulative
, which can be convenient
if multiple covariates should have the same time components and the
same lag-lead window
multiple cumulative
terms can be specified, having different exposure times
t_z
, s_z
and/or different lag-lead windows for different covariates
$\mathbf{z}_1$, $\mathbf{z}_2$
To tell cumulative
which of the variables specified is the exposure time $t_z$,
the tz_var
argument must be specified within each cumulative
term. The
follow-up time component $t$ will be automatically recognized via the left-hand
side of the formula
.
For illustration we use the ICU patients
data sets patient
and daily
provided in the pammtools
package, with
follow-up time survhosp
($t$), exposure time Study_Day
($t_z$) and TDCs
caloriesPercentage
$\mathbf{z}_1$ and proteinGproKG
($\mathbf{z}_2$):
head(patient) head(daily)
Below we illustrate the usage of as_ped
and cumulative
to obtain covariate
matrices for the latency matrix of follow up time and exposure time and
a matrix of the TDC (caloriesPercentage
). The follow up will be split
in 30 intervals with interval breaks points 0:30
. By default,
the exposure time provided to tz_var
will be used as a suffix for the names of the new
matrix columns to indicate the exposure they refer to,
however, you can override the suffix by specifying the
suffix
argument to cumulative
.
ped <- as_ped( data = list(patient, daily), formula = Surv(survhosp, PatientDied) ~ . + cumulative(latency(Study_Day), caloriesPercentage, tz_var = "Study_Day"), cut = 0:30, id = "CombinedID") str(ped, 1)
As you can see, the data is transformed to the PED format as before and
two additional matrix columns are added, Study_Day_latency
and
caloriesPercentage
.
Using the same data, we show a slightly more complex example with two
functional covariate terms with partial effects
$h(t, t-t_z, z_1(t_z))$ and $h(t-t_z, z_2(t_z))$.
As in the case of ICU patient data, both TDCs were observed on the same exposure
time grid, we want to use the same lag-lead window and the latency term
$t-t_z$ occurs in both partial effects, we only need to specify one cumulative
term.
ped <- as_ped( data = list(patient, daily), formula = Surv(survhosp, PatientDied) ~ . + cumulative(survhosp, latency(Study_Day), caloriesPercentage, proteinGproKG, tz_var = "Study_Day"), cut = 0:30, id = "CombinedID") str(ped, 1)
To illustrate data transformation when TDCs $\mathbf{z}_1$ and $\mathbf{z}_2$
were observed on different exposure time grids ($t_z$ and $s_z$) and
are assumed to have different lag_lead windows $\mathcal{T}_1(t)$ and
$\mathcal{T}_2(t)$, we use simulated data simdf_elra
contained in
pammtools
(see example in sim_pexp
for data generation).
simdf_elra simdf_elra %>% slice(1) %>% select(id, tz1) %>% unnest() simdf_elra %>% slice(1) %>% select(id, tz2) %>% unnest()
Note that tz1
has a maximum length of 10, while tz2
has a maximum length
of 11
and while tz1
was observed during the follow up, tz2
was partially
observed before the beginning of the follow up.
If we wanted to estimate the following cumulative effects
with $\mathcal{T}_1(t) = {t_z: t \geq t_z + 2}$ and $\mathcal{T}_2(t) = {s_z: t \geq s_z}$ the data transformation function must reflect the different exposure times as well as different lag-lead window specifications as illustrated below:
ped_sim <- as_ped( data = simdf_elra, formula = Surv(time, status) ~ . + cumulative(time, latency(tz1), z.tz1, tz_var = "tz1") + cumulative(latency(tz2), z.tz2, tz_var = "tz2"), cut = 0:10, id = "id") str(ped_sim, 1)
Note how the matrix covariates associated with tz1
have 10 columns while
matrix covariates associated with tz2
have 11 columns. Note also that
the lag-lead matrices differ
gg_laglead(ped_sim)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.