library(knitr) opts_chunk$set( fig.align = "center", fig.width = 4, fig.height = 4, crop = TRUE, cache = TRUE) ## output hooks, e.g., to truncate rows of output hook_output <- knit_hooks$get("output") knit_hooks$set(output = function(x, options) { lines <- options$output.lines if (is.null(lines)) { return(hook_output(x, options)) # pass to default hook } x <- unlist(strsplit(x, "\n")) more <- "..." if (length(lines) == 1) { if (length(x) > lines) { # truncate the output, but add .... x <- c(head(x, lines), more) } } else { x <- c(if (abs(lines[1]) > 1) more else NULL, x[lines], if (length(x) > lines[abs(length(lines))]) more else NULL ) } # paste these lines together x <- paste(c(x, ""), collapse = "\n") hook_output(x, options) })
This vignette provides a short reference for the estimation and interpretation
of cumulative effects using pammtools
. For a more detailed overview see
@Bender2018f. In the first section, a
short introduction to cumulative effects is provided. In the
second section,
we present a worked example on a real data set, including necessary data
transformation, model estimation and visualization.
R Setup
library(dplyr) library(tidyr) library(ggplot2) theme_set(theme_bw()) library(mgcv) library(pammtools)
Cumulative effects can be thought of as an extension to the modeling of time-constant effects of time-dependent covariates (TDCs; see vignette on time-dependent covariates). Standard modeling of TDCs assumes that only the current (or one lagged) value of the covariate can affect the hazard at time $t$, i.e., $$\lambda(t|z(t)) = \lambda_0(t)\exp(\beta z(t)).$$
Cumulative effects on the other hand allow the hazard at time $t$ to depend on multiple past observations of the covariate, such that the effect at time $t$ is the sum of all weighted effects of past observations (partial effects). to make this more concrete, consider the following quantities:
Given these definitions, one possible specification of a model with a cumulative effect was suggested by @Sylvestre2009 and is given below:
$$ \lambda(t|\mathbf{z}) = \lambda_0(t)\exp\left(\sum_{t_z \leq t} h(t-t_z)z(t_z)\right) $$
Here, past observations are weighted by a (potentially) non-linear function $h$ of the latency ($t-t_z$), i.e., the time that has passed since covariate $z(t_z)$ was observed, relative to the time of interest $t$. The cumulative effect is then the sum of all partial effects $h(t-t_z)z(t_z)$.
The above model also makes some simplifying assumptions, e.g.:
A more flexible cumulative effect can be defined as follows:
$$ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)}h(t, t_z, z(t_z))\mathrm{d}t_z, $$
with
A visual representation of exemplary lag-lead windows is depicted below:
p_ll1 <- gg_laglead(0:10, 0:9, ll_fun = function(t, tz) t >= tz) my_ll_fun <- function(t, tz, tlag = 2, tlead = 5) { t >= tz + tlag & t < tz + tlag + tlead } p_ll2 <- gg_laglead(0:10, 0:9, ll_fun = function(t, tz) t == tz)
gridExtra::grid.arrange(p_ll1, p_ll2, nrow = 1L)
The left panel represents the minimal requirement $\mathcal{T}(t)={t_{z}: t_z \leq t}$. For example, the observation of the TDC at time $t_z=4$, starts to contribute to the cumulative effect in interval $(4,5]$ and continues to contribute until the last interval $(9,10]$, although with potentially different weights, as, in general, $h(t=5, t_z = 4, z(4))$ can be different to $h(t=10, t_z=4, z(4))$.
The right panel illustrates that a "standard" model for time-to-event data with a constant effect follows as a special case of the general definition of the cumulative effect with lag-lead window $\mathcal{T}(t)={t_{z}: t_z = t}$ and $h(t, t_z, z(t_z)) \equiv \beta z(t)$
For an illustration of the functionality that pammtools
provides to
work with cumulative effects, we use (a sample) of the
data analyzed in @Heyard2018 and provided in the package TBFmultinomial
(note that their analyses had a totally different goal, including modeling
of competing risks; here we simply use the data for illustrative purposes).
In the following analysis, we will model the (cause-specific) hazard for
the event extubation
, considering all competing events as censoring events.
Covariates include gender
, type
(admission type), SAPSadmission
(SAPS score at admission) and the time-dependent covariate SOFA
score
(lower scores indicate better health).
To use the data for our purposes, we first perform some preprocessing, which produces two data sets:
event_df
: Data in "standard" format (one row per subject), that only
includes time-constant covariatestdc_df
: Data containing information on time-dependent covariates
(here SOFA
) and exposure time $t_z$R-Code data preprocessing
data(VAP_data, package = "TBFmultinomial") # create unique IDs VAP_data <- VAP_data %>% mutate(tmp_id = paste(ID, day, sep = ".")) %>% group_by(ID) %>% mutate(csdup = cumsum(duplicated(tmp_id))) %>% ungroup() %>% mutate(ID = ifelse(csdup > 0, ID + 1000, ID)) # assume constant SOFA score between updates VAP_complete <- VAP_data %>% group_by(ID) %>% mutate(time = max(day)) %>% ungroup() %>% complete(ID, day = full_seq(day, 1)) %>% fill(gender, type, SAPSadmission, SOFA, outcome, time, .direction = "down") %>% filter(day <= time) event_df <- VAP_complete %>% select(ID, gender, type, SAPSadmission, outcome, time) %>% group_by(ID) %>% slice(n()) %>% mutate(outcome = 1 * (outcome == "extubated")) %>% ungroup() tdc_df <- VAP_complete %>% select(ID, day, SOFA) %>% mutate(day = day - 1) %>% # assume that SOFA available at the beginning of the day filter(day <= 49)
An overview of the preprocessed data is given below:
event_df %>% head() tdc_df %>% head()
tdc_df %>% group_by(ID) %>% mutate( sort = 1000 * sum(!is.na(SOFA)) + rep(mean(SOFA, na.rm = TRUE), sum(!is.na(SOFA)))) %>% ungroup() %>% mutate(ID = factor(ID) %>% reorder(sort, mean)) %>% ggplot(aes(x = day, fill = SOFA, y = ID)) + geom_raster() + scale_fill_viridis_c() + theme(axis.text.y = element_blank()) + scale_y_discrete("patients", breaks = NULL)
For illustration, we fit a DLNM [@Gasparrini2017] with cumulative effect of the SOFA score ($z$) as defined below:
$$ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)}h(t-t_z, z(t_z))\mathrm{d}t_z, $$
with $\mathcal{T}(t) = {t_z: t_z \leq t \leq t_z + 5}$, which means that only the SOFA scores observed within the last five days can affect the hazard at time $t$.
Remember that the specification of the partial effect $h(t,t_z, z(t_z))$ as well as the definition of the lag-lead window are important parts of the analysis and need to be considered at the beginning of the analysis as this information goes into the code for the data transformation:
ped <- as_ped( list(event_df, tdc_df), Surv(time, outcome) ~ . + cumulative(latency(day), SOFA, tz_var = "day", ll_fun = function(t, tz) t >= tz & t <= tz + 5), cut = 0:49, # administrative censoring at t = 49 id = "ID") ped$day_latency <- ped$day_latency * ped$LL
Above, we
cumulative
to inform as_ped
to perform data
preprocessing in order to fit cumulative effectsday
($t_z$) within the latency
function to calculate
$t - t_z$ll_fun
argument to match the desired specification of
$\mathcal{T}(t)$Note that the data now contains 3 matrix columns (day_latency
, SOFA
and
LL
; the latter stores the lag-lead matrix $\mathcal{T}(t)$):
str(ped, 1)
ll_df <- get_laglead(ped) ll_df <- ll_df %>% filter(t <= 10, tz <= 10) class(ll_df) <- c("LL_df", class(ll_df)) gg_laglead(ll_df)
After successful data transformation with as_ped
, the model can be estimated
directly using mgcv::gam
. Including matrix columns into the model specification
will inform gam
to estimate cumulative effects. In our case the the following
call estimates the DLNM with a cumulative effect (of the SOFA score) as specified
above:
mod <- gam(ped_status ~ s(tend) + type + gender + SAPSadmission + te(day_latency, SOFA, by = LL), method = "REML", offset = offset, family = poisson(), data = ped) summary(mod)
Interpretation of the estimated cumulative effects is best performed by
visualization of either the partial effects or the cumulative effect.
pammtools
provides a couple of convenience functions that facilitate this
process:
gg_tensor
: This visualizes 2D effect surfaces as heat maps/contour plots and
is based on the output of mgcv::plot.gam
gg_partial
and gg_partial_ll
: The former visualizes the estimated partial
effect, e.g., $\hat h(t-t_z, z(t_z))$ for each combination of the (specified)
input, the latter visualizes the partial effects as a heat-map within the
specified lag-lead window (this makes it easier to see which partial effects
actually contribute to the cumulative effect in a specific interval)gg_slice
: Plots 1D slices of 2D surfaces and can be used more generally
to plot covariate effects conditional on pre-specified values of other covariates.gg_cumu_eff
: Either plots the estimated cumulative effect $\hat g(\mathbf{z}, t)$ for a given $\mathbf{z}$ or the (log-) hazard ratio $\log\left(\frac{\hat g(\mathbf{z}_1, t)}{\hat g(\mathbf{z}_2, t)}\right)$Note that for each of these plot functions there are respective functions to
retrieve the data used for plotting such that the plots can be generated and
customized manually (see especially examples in make_newdata
).
The code below visualizes the estimated 2D partial effect $\hat h(t-t_z, z(t_z))$ for each combination of latency and SOFA score (including confidence intervals). The gray areas indicate combinations that did not occur in the data.
gg_tensor(mod, ci = TRUE) + ylab("SOFA") + xlab("latency (day)") + theme(legend.position = "bottom")
This illustrates that a low SOFA score substantially increases the log-hazard
if it was observed recently (latency $<$ 2 days), while partial effects of the
SOFA score observed further in the past (latency $>$ 3 days) go towards zero
(the event of interest is "extubation", therefore increased (log-)hazards
imply increased probabilities of extubation). Note that for this graphic,
the function $\hat h(t-t_z, z(t_z))$ is evaluated at values of $t-t_z$ (day_latency
)
that are not present in the data.
Alternatively, the function gg_partial
can be used to produce similar
visualizations and allows more control over the inputs. For example, the
following code produces the partial effect surface plot evaluated only at
latencies 0 through 5 and calculates the partial effects relative to a patient
with SOFA score 10 (everything else being equal):
gg_partial(ped, mod, term = "SOFA", day_latency = 0:5, SOFA = seq_range(SOFA, n = 20), LL = c(1), reference = list(SOFA = 10))
See also gg_partial_ll
that shows the lag-lead window and partial effects
that contribute to the cumulative effect in each interval. Due to the definition
of the partial effect, however, these are constant in this case, as no
time-variation was specified.
p_slice_latency <- gg_slice(ped, mod, term = "SOFA", day_latency = unique(day_latency), SOFA = c(0, 5, 10)) p_slice_sofa <- gg_slice(ped, mod, term = "SOFA", day_latency = c(0, 1, 4), SOFA = unique(SOFA)) gridExtra::grid.arrange(p_slice_latency, p_slice_sofa, nrow = 1)
These plots contain essentially the same information as the 2D surface, but focus on isolating the effects of the individual variables. This can be especially useful for three-dimensional partial effects which are hard to visualize otherwise.
Since it can be difficult to assess how the estimated partial effects actually
affect estimated hazard rates, pammtools
provides additional functions to
visualize and compare estimated cumulative effects on the level of the
(log-)hazard rates for given TDCs $\mathbf{z}$.
Below, we visualize the cumulative effect in each interval of the follow-up.
Since the effect depends on a TDC $\mathbf{z}$, its values must be provided in
the call to gg_cumu_eff
as z1
argument (either length 1 for time-constant
values or length equal to maximum number of times $z(t_z)$ was observed).
If z2
is additionally specified, the log-hazard ratio is calculated
as $\log\left(\frac{\hat g(\mathbf{z}_1, t)}{\hat g(\mathbf{z}_2, t)}\right)$.
Below, two such log-hazard ratios are visualized for two different comparisons
of SOFA score profiles:
p_cumu1 <- gg_cumu_eff(ped, mod, term = "SOFA", z1 = seq(20, 0, length.out = 50), z2 = 10) + geom_point(size = rel(.5)) + geom_vline(xintercept = c(5, 25), lty = 2) p_cumu2 <- gg_cumu_eff(ped, mod, term = "SOFA", z1 = seq(0, 20, length.out = 50), z2 = c(rep(0, 10), rep(20, 20), rep(5, 20))) + geom_point(size = rel(.5)) + geom_vline(xintercept = c(10, 30), lty = 2) gridExtra::grid.arrange(p_cumu1, p_cumu2, nrow = 1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.