library(knitr)
opts_chunk$set(
  fig.align  = "center",
  fig.width  = 4,
  fig.height = 4,
  crop       = TRUE,
  cache      = TRUE)
## output hooks, e.g., to truncate rows of output
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
   lines <- options$output.lines
   if (is.null(lines)) {
     return(hook_output(x, options))  # pass to default hook
   }
   x <- unlist(strsplit(x, "\n"))
   more <- "..."
   if (length(lines) == 1) {
     if (length(x) > lines) {
       # truncate the output, but add ....
       x <- c(head(x, lines), more)
     }
   } else {
     x <- c(if (abs(lines[1]) > 1) more else NULL,
            x[lines],
            if (length(x) > lines[abs(length(lines))]) more else NULL
           )
   }
   # paste these lines together
   x <- paste(c(x, ""), collapse = "\n")
   hook_output(x, options)
 })

This vignette provides a short reference for the estimation and interpretation of cumulative effects using pammtools. For a more detailed overview see @Bender2018f. In the first section, a short introduction to cumulative effects is provided. In the second section, we present a worked example on a real data set, including necessary data transformation, model estimation and visualization.

R Setup

library(dplyr)
library(tidyr)
library(ggplot2)
theme_set(theme_bw())
library(mgcv)
library(pammtools)

Introduction

Cumulative effects can be thought of as an extension to the modeling of time-constant effects of time-dependent covariates (TDCs; see vignette on time-dependent covariates). Standard modeling of TDCs assumes that only the current (or one lagged) value of the covariate can affect the hazard at time $t$, i.e., $$\lambda(t|z(t)) = \lambda_0(t)\exp(\beta z(t)).$$

Cumulative effects on the other hand allow the hazard at time $t$ to depend on multiple past observations of the covariate, such that the effect at time $t$ is the sum of all weighted effects of past observations (partial effects). to make this more concrete, consider the following quantities:

Given these definitions, one possible specification of a model with a cumulative effect was suggested by @Sylvestre2009 and is given below:

$$ \lambda(t|\mathbf{z}) = \lambda_0(t)\exp\left(\sum_{t_z \leq t} h(t-t_z)z(t_z)\right) $$

Here, past observations are weighted by a (potentially) non-linear function $h$ of the latency ($t-t_z$), i.e., the time that has passed since covariate $z(t_z)$ was observed, relative to the time of interest $t$. The cumulative effect is then the sum of all partial effects $h(t-t_z)z(t_z)$.

The above model also makes some simplifying assumptions, e.g.:

A more flexible cumulative effect can be defined as follows:

$$ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)}h(t, t_z, z(t_z))\mathrm{d}t_z, $$

with

A visual representation of exemplary lag-lead windows is depicted below:

R-Code lag-lead windows

p_ll1 <- gg_laglead(0:10, 0:9, ll_fun = function(t, tz) t >= tz)
my_ll_fun <- function(t, tz, tlag = 2, tlead = 5) {
  t >= tz + tlag & t < tz + tlag + tlead
}
p_ll2 <- gg_laglead(0:10, 0:9, ll_fun = function(t, tz) t == tz)

gridExtra::grid.arrange(p_ll1, p_ll2, nrow = 1L)

Example: SOFA-score and extubation

For an illustration of the functionality that pammtools provides to work with cumulative effects, we use (a sample) of the data analyzed in @Heyard2018 and provided in the package TBFmultinomial (note that their analyses had a totally different goal, including modeling of competing risks; here we simply use the data for illustrative purposes).

In the following analysis, we will model the (cause-specific) hazard for the event extubation, considering all competing events as censoring events. Covariates include gender, type (admission type), SAPSadmission (SAPS score at admission) and the time-dependent covariate SOFA score (lower scores indicate better health).

To use the data for our purposes, we first perform some preprocessing, which produces two data sets:

R-Code data preprocessing

data(VAP_data, package = "TBFmultinomial")
# create unique IDs
VAP_data <- VAP_data %>%
  mutate(tmp_id =  paste(ID, day, sep = ".")) %>%
  group_by(ID) %>%
  mutate(csdup = cumsum(duplicated(tmp_id))) %>%
  ungroup() %>%
  mutate(ID = ifelse(csdup > 0, ID + 1000, ID))

# assume constant SOFA score between updates
VAP_complete <- VAP_data %>%
  group_by(ID) %>%
  mutate(time = max(day)) %>%
  ungroup() %>%
  complete(ID, day = full_seq(day, 1)) %>%
  fill(gender, type, SAPSadmission, SOFA, outcome, time, .direction = "down") %>%
  filter(day <= time)

event_df <- VAP_complete %>%
  select(ID, gender, type, SAPSadmission, outcome, time) %>%
  group_by(ID) %>%
  slice(n()) %>%
  mutate(outcome = 1 * (outcome == "extubated")) %>%
  ungroup()

tdc_df <- VAP_complete %>%
  select(ID, day, SOFA) %>%
  mutate(day = day - 1) %>% # assume that SOFA available at the beginning of the day
  filter(day <= 49)

 

An overview of the preprocessed data is given below:

event_df %>% head()
tdc_df %>% head()
tdc_df %>% group_by(ID) %>%
  mutate(
    sort = 1000 * sum(!is.na(SOFA)) +
      rep(mean(SOFA, na.rm = TRUE), sum(!is.na(SOFA)))) %>%
  ungroup() %>%
  mutate(ID = factor(ID) %>%  reorder(sort, mean)) %>%
  ggplot(aes(x = day, fill = SOFA, y = ID)) +
  geom_raster() + scale_fill_viridis_c() +
  theme(axis.text.y = element_blank()) +
  scale_y_discrete("patients", breaks = NULL)

Data transformation

For illustration, we fit a DLNM [@Gasparrini2017] with cumulative effect of the SOFA score ($z$) as defined below:

$$ g(\mathbf{z}, t) = \int_{\mathcal{T}(t)}h(t-t_z, z(t_z))\mathrm{d}t_z, $$

with $\mathcal{T}(t) = {t_z: t_z \leq t \leq t_z + 5}$, which means that only the SOFA scores observed within the last five days can affect the hazard at time $t$.

Remember that the specification of the partial effect $h(t,t_z, z(t_z))$ as well as the definition of the lag-lead window are important parts of the analysis and need to be considered at the beginning of the analysis as this information goes into the code for the data transformation:

ped <- as_ped(
  list(event_df, tdc_df),
  Surv(time, outcome) ~ . + cumulative(latency(day), SOFA, tz_var = "day",
      ll_fun = function(t, tz) t >= tz &  t <= tz + 5),
  cut = 0:49, # administrative censoring at t = 49
  id = "ID")
ped$day_latency <- ped$day_latency * ped$LL

Above, we

Note that the data now contains 3 matrix columns (day_latency, SOFA and LL; the latter stores the lag-lead matrix $\mathcal{T}(t)$):

str(ped, 1)
ll_df <- get_laglead(ped)
ll_df <- ll_df %>% filter(t <= 10, tz <= 10)
class(ll_df) <- c("LL_df", class(ll_df))
gg_laglead(ll_df)

Model estimation

After successful data transformation with as_ped, the model can be estimated directly using mgcv::gam. Including matrix columns into the model specification will inform gam to estimate cumulative effects. In our case the the following call estimates the DLNM with a cumulative effect (of the SOFA score) as specified above:

mod <- gam(ped_status ~ s(tend) + type + gender + SAPSadmission +
   te(day_latency, SOFA, by = LL),
  method = "REML", offset = offset, family = poisson(), data = ped)
summary(mod)

Interpretation

Interpretation of the estimated cumulative effects is best performed by visualization of either the partial effects or the cumulative effect. pammtools provides a couple of convenience functions that facilitate this process:

Note that for each of these plot functions there are respective functions to retrieve the data used for plotting such that the plots can be generated and customized manually (see especially examples in make_newdata).

2D partial effect surface

The code below visualizes the estimated 2D partial effect $\hat h(t-t_z, z(t_z))$ for each combination of latency and SOFA score (including confidence intervals). The gray areas indicate combinations that did not occur in the data.

gg_tensor(mod, ci = TRUE) + ylab("SOFA") + xlab("latency (day)") +
  theme(legend.position = "bottom")

This illustrates that a low SOFA score substantially increases the log-hazard if it was observed recently (latency $<$ 2 days), while partial effects of the SOFA score observed further in the past (latency $>$ 3 days) go towards zero (the event of interest is "extubation", therefore increased (log-)hazards imply increased probabilities of extubation). Note that for this graphic, the function $\hat h(t-t_z, z(t_z))$ is evaluated at values of $t-t_z$ (day_latency) that are not present in the data.

Alternatively, the function gg_partial can be used to produce similar visualizations and allows more control over the inputs. For example, the following code produces the partial effect surface plot evaluated only at latencies 0 through 5 and calculates the partial effects relative to a patient with SOFA score 10 (everything else being equal):

gg_partial(ped, mod, term = "SOFA", day_latency = 0:5,
  SOFA = seq_range(SOFA, n = 20), LL = c(1), reference = list(SOFA = 10))

See also gg_partial_ll that shows the lag-lead window and partial effects that contribute to the cumulative effect in each interval. Due to the definition of the partial effect, however, these are constant in this case, as no time-variation was specified.

1D conditional effects (slices through the 2D surface):

p_slice_latency <- gg_slice(ped, mod, term = "SOFA",
  day_latency = unique(day_latency), SOFA = c(0, 5, 10))
p_slice_sofa <- gg_slice(ped, mod, term = "SOFA",
  day_latency = c(0, 1, 4), SOFA = unique(SOFA))
gridExtra::grid.arrange(p_slice_latency, p_slice_sofa, nrow = 1)

These plots contain essentially the same information as the 2D surface, but focus on isolating the effects of the individual variables. This can be especially useful for three-dimensional partial effects which are hard to visualize otherwise.

Cumulative effects

Since it can be difficult to assess how the estimated partial effects actually affect estimated hazard rates, pammtools provides additional functions to visualize and compare estimated cumulative effects on the level of the (log-)hazard rates for given TDCs $\mathbf{z}$.

Below, we visualize the cumulative effect in each interval of the follow-up. Since the effect depends on a TDC $\mathbf{z}$, its values must be provided in the call to gg_cumu_eff as z1 argument (either length 1 for time-constant values or length equal to maximum number of times $z(t_z)$ was observed). If z2 is additionally specified, the log-hazard ratio is calculated as $\log\left(\frac{\hat g(\mathbf{z}_1, t)}{\hat g(\mathbf{z}_2, t)}\right)$. Below, two such log-hazard ratios are visualized for two different comparisons of SOFA score profiles:

p_cumu1 <- gg_cumu_eff(ped, mod, term = "SOFA",
  z1 = seq(20, 0, length.out = 50), z2 = 10) +
  geom_point(size = rel(.5)) + geom_vline(xintercept = c(5, 25), lty = 2)
p_cumu2 <- gg_cumu_eff(ped, mod, term = "SOFA",
    z1 = seq(0, 20, length.out = 50),
    z2 = c(rep(0, 10), rep(20, 20), rep(5, 20))) +
  geom_point(size = rel(.5)) + geom_vline(xintercept = c(10, 30), lty = 2)
gridExtra::grid.arrange(p_cumu1, p_cumu2, nrow = 1)

References



adibender/pammtools documentation built on Feb. 27, 2024, 8:40 a.m.