library(knitr) opts_chunk$set( fig.align = "center", crop = TRUE, cache = TRUE)
library(dplyr) library(mgcv) library(pammtools) library(ggplot2) theme_set(theme_bw())
Survival analysis entails a set of standard tasks beyond model estimation which need to be performed in most real world applications. These task include:
The pammtools
package provides many convenience functions to facilitate
these tasks. The overall philosophy of the package is to provide functions that
return the underlying data used for visualization in a tidy format [@wickham2014],
such that everybody can use familiar tools for further processing. In addition,
some (high level) convenience functions for visualization are also included.
To illustrate the usual workflow and some of the convenience functions provided
by pammtools
, we will use the tumor
in the following sections.
For a more complete overview see @Bender2018f.
Often one is interested in calculating the hazard, cumulative hazard or
survival probability given specific values for some covariates (while other
covariates are fixed at their mean values). For PAMMs, one usually also needs
to create a data set where all intervals occur, in order to be able to calculate
predicted values at different time points. pammtools
provides a flexible
interface to create such data sets (make_newdata
) and the functions
add_hazard
add_cumu_hazard
andadd_surv_prob
to add respective predicted values (including confidence intervals) of the
respective quantities. We illustrate the workflow using the tumor
data and
the model given below:
ped <- tumor %>% as_ped(Surv(days, status)~ age + sex + complications, id = "id") pam <- gam(ped_status ~ s(tend) + sex + age + complications, data = ped, family = poisson(), offset = offset)
In the following, we create a new data set that contains all intervals as well
as mean values for all other covariates. Note that within make_newdata
you
can specify the desired covariate values by using any function applicable to
the data type of the respective column. The new data will contain one row for
each combination of the two variables (similar to expand.grid
):
ped_df <- ped %>% make_newdata(tend = unique(tend), age = seq_range(age, n = 3))
Calling the respective functions, we can add the (predicted) hazard to the newly created data.
ped_df <- ped_df %>% add_hazard(pam) ped_df %>% select(interval, hazard, ci_lower, ci_upper) %>% head() ped_df %>% select(interval, hazard, ci_lower, ci_upper) %>% tail()
Which in turn can be conveniently plotted using respective functions, e.g.
ggplot(ped_df, aes(x = tend, group = age)) + geom_stephazard(aes(y = hazard, col = age)) + geom_stepribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = age), alpha = 0.2) + ylab(expression(hat(lambda)(t))) + xlab(expression(t))
The hazard ratio can be calculated by specifying the reference
argument:
# the code below gives hazard reatio of a 30 year old relative to a 58 year old (cp) # ped %>% # make_newdata(age = c(30)) %>% # add_hazard(pam, reference = list(age = c(58))) %>% # select(hazard, ci_lower, ci_upper) # -> 30 year old has about half the risk of a 58 year old
This might seem a little convoluted, but can be easily generalized in case of time-varying effects, time-dependent covariates, cumulative effects, etc.
Analogously, we can obtain the cumulative hazard (piece-wise linear function). Note that for the cumulative hazard and survival probability it is important to group the data accordingly, such that the cumulative hazard is cumulated for each value of the grouping variable. Otherwise it would be cumulated over the whole data set (which is usually not intended, except if the specified variable is a time-dependent covariate).
ped_df %>% group_by(age) %>% add_cumu_hazard(pam) %>% ggplot(aes(x = tend, y = cumu_hazard, ymin = cumu_lower, ymax = cumu_upper, group = age)) + geom_hazard(aes(col = age)) + geom_ribbon(aes(fill = age), alpha = 0.2) + ylab(expression(hat(Lambda)(t))) + xlab(expression(t))
ped_df %>% group_by(age) %>% add_surv_prob(pam) %>% ggplot(aes(x = tend, y = surv_prob, ymax = surv_lower, ymin = surv_upper, group = age)) + geom_line(aes(col = age)) + geom_ribbon(aes(fill = age), alpha = 0.2) + ylab(expression(hat(S)(t))) + xlab(expression(t))
This approach is very convenient, as it can be extended to arbitrary covariate specifications:
ped_df_sex <- ped %>% make_newdata(tend = unique(tend), age = seq_range(age, 3), complications = levels(complications)) %>% group_by(complications, age) %>% add_surv_prob(pam) ggplot(ped_df_sex, aes(x = tend, y = surv_prob, group = age)) + geom_ribbon(aes(ymin = surv_lower, ymax = surv_upper, fill = age), alpha = 0.3) + geom_surv(aes(col = age)) + ylab(expression(hat(S)(t))) + xlab(expression(t)) + facet_wrap(~complications, labeller = label_both)
Data sets for the visualization of the (non-linear) effect of a continuous
covariates can be created in a similar fashion. For illustration, we add
a non-linear age effect to the previous model (see also high level
functions?gg_smooth
and ?gg_tensor
and gg_re
for mgcv::plot.gam
like plots):
ped <- tumor %>% as_ped(Surv(days, status)~ age + sex + complications + charlson_score, id = "id") pam <- gam(ped_status ~ s(tend) + sex + s(age) + complications , data = ped, family = poisson(), offset = offset)
Using make_newdata
and add_term
we obtain:
term_charlson <- ped %>% make_newdata(age = seq_range(age, n = 25)) %>% add_term(pam, term = "age", reference = list(age = mean(.$age))) ggplot(term_charlson, aes(x = age, y = fit)) + geom_line() + geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2)
Similarly, we could use high level function gg_slice
to obtain
the estimated effect of the age variable, relative to the mean age and
age of 30, respectively:
gg_slice(ped, pam, "age", age = seq_range(age, n = 25), reference = list(age = mean(ped$age))) gg_slice(ped, pam, "age", age = seq_range(age, n = 25), reference = list(age = c(30)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.