Description Duration-Based Quantities of Interest for the Cox Model Simulating duration data for the Cox model Author(s) References Examples

The Cox proportional hazards model (implemented in R with `coxph`

in the `survival`

package or with `cph`

in the `rms`

package) is one of the most frequently used estimators
in duration (survival) analysis. Because it is estimated using only the observed durations' rank ordering, typical
quantities of interest used to communicate results of the Cox model come from the hazard function (e.g., hazard ratios
or percentage changes in the hazard rate). These quantities are substantively vague and difficult for many audiences
of research to understand. The `coxed`

package introduces a suite of methods to address these problems.
The package allows researchers to calculate duration-based quantities from Cox model results, such as the expected
duration (or survival time) given covariate values and marginal changes in duration for a specified change in a
covariate. These duration-based quantities often match better with researchers' substantive interests and are
easily understood by most readers.
In addition, no standard method exists for simulating durations directly from the Cox model's data generating
process because it does not assume a distributional form for the baseline hazard function. The `coxed`

package
also contains functions to simulate general duration data that does not rely on an assumption of any particular parametric
hazard function.

The `coxed`

function generates expected durations for individual
observations and/or marginal changes in expected duration given a change in a covariate
from the Cox proportional hazards model. Specifically, the methods can compute (1) the
expected duration for each observation used to fit the Cox model, given the covariates,
(2) the expected duration for a "new" observation with a covariate profile set by the
analyst, or (3) the first difference, or change, in expected duration given two new data frames.

There are two different methods of generating duration-based quantities in the package.
`coxed`

with `type="npsf"`

calculates expected durations by using the method proposed by
Cox and Oakes (1984, 107-109) for estimating the cumulative baseline hazard function. This
method is nonparametric and results in a step-function representation of the cumulative
baseline hazard. Cox and Oakes (1984, 108) show that the cumulative baseline hazard function can be estimated
after fitting a Cox model by

*\hat{H}_0(t) = ∑_{τ_j < t}\frac{d_j}{∑_{l\in\Re(τ_j)}\hat{ψ}(l)},*

where *τ_j* represents time points earlier than *t*, *d_j* is a count of the
total number of failures at *τ_j*, *\Re(τ_j)* is the remaining risk set at *τ_j*,
and *\hat{ψ}(l)* represents the ELP from the Cox model for observations still in the
risk set at *τ_j*. This equation is used calculate the cumulative baseline hazard at
all time points in the range of observed durations. This estimate is a stepwise function
because time points with no failures do not contribute to the cumulative hazard, so the function
is flat until the next time point with observed failures.

We extend this method to obtain expected durations by first calculating the baseline survivor function from the cumulative hazard function, using

*\hat{S}_0(t) = \exp[-\hat{H}_0(t)].*

Each observation's survivor function is related to the baseline survivor function by

*\hat{S}_i(t) = \hat{S}_0(t)^{\hat{ψ}(i)},*

where *\hat{ψ}(i)* is the exponentiated linear predictor (ELP) for observation *i*.
These survivor functions can be used directly to calculate expected durations for each
observation. The expected value of a non-negative random variable can be calculated by

*E(X) = \int_0^{∞} \bigg(1 - F(t)\bigg)dt,*

where *F(.)* is the cumulative distribution function for *X*. In the case of a
duration variable *t_i*, the expected duration is

*E(t_i) = \int_0^T S_i(t)\,dt,*

where *T* is the largest possible duration and *S(t)* is the individual's survivor
function. We approximate this integral with a right Riemann-sum by calculating the survivor
functions at every discrete time point from the minimum to the maximum observed durations,
and multiplying these values by the length of the interval between time points with observed failures:

*E(t_i) \approx ∑_{t_j \in [0,T]} (t_j - t_{j-1})S_i(t_j).*

`coxed`

with `type="gam"`

employs a generalized additive model (GAM) to map the model's estimated linear
predictor values to duration times and proceeds according to five steps. First, it uses coefficient
estimates from the Cox model, so researchers must first estimate the model just as they always have.
Then the method computes expected values of risk for each observation by matrix-multiplying the
covariates by the estimated coefficients from the model, then exponentiating the result. This creates
the exponentiated linear predictor (ELP). Then the observations are ranked from smallest to largest
according to their values of the ELP. This ranking is interpreted as the expected order of failure;
the larger the value of the ELP, the sooner the model expects that observation to fail, relative to
the other observations. The next step is to connect the model's expected risk for each observation (ELP) to duration time
(the observed durations). A `gam`

fits a model to data by using a series of locally-estimated polynomial
splines set by the user (see, for example, Wood, Pya, and Saefken 2016). It is a flexible means of allowing for
the possibility of nonlinear relationships between variables. `coxed`

with `type="gam"`

uses a GAM to model the observed
durations as a function of the linear predictor ranks generated in the previous step. More specifically, the method
utilizes a cubic regression spline to draw a smoothed line summarizing the bivariate relationship between
the observed durations and the ranks. The GAM fit can be used directly to compute expected durations, given the covariates, for each observation
in the data.

See Kropko and Harden (2018) for further details about generating expected durations and marginal changes in expected
duration from the Cox model. The `coxed`

function can also generate these quantities from data with time-varying
covariates (see `coxed.npsf.tvc`

and `coxed.gam.tvc`

).

The `sim.survdata`

function generates simulated duration data. It can accept a user-supplied
hazard function, or else it uses the flexible-hazard method described in Harden and Kropko (2018) to generate
a hazard that does not necessarily conform to any parametric hazard function. It can generate data with time-varying
covariates or coefficients. For time-varying covariates `type="tvc"`

it employs the permutational algorithm by Sylvestre and Abrahamowicz (2008).
For time-varying coefficients with `type="tvbeta"`

, the first beta coefficient that is either supplied by the user or generated by
the function is multiplied by the natural log of the failure time under consideration.

The flexible-hazard method employed when `hazard.fun`

is `NULL`

generates a unique baseline hazard by fitting a curve to
randomly-drawn points. This produces a wide variety
of shapes for the baseline hazard, including those that are unimodal, multimodal, monotonically increasing or decreasing, and many other
shapes. The method then generates a density function based on each baseline hazard and draws durations from it in a way that circumvents
the need to calculate the inverse cumulative baseline hazard. Because the shape of the baseline hazard can vary considerably, this approach
matches the Cox model’s inherent flexibility and better corresponds to the assumed data generating process (DGP) of the Cox model. Moreover,
repeating this process over many iterations in a simulation produces simulated samples of data that better reflect the considerable
heterogeneity in data used by applied researchers. This increases the generalizability of the simulation results. See Harden and Kropko (2018)
for more detail.

When generating a marginal effect, first the user specifies a covariate by typing its column number in the `X`

matrix into the `covariate`

argument, then specifies the high and low values at which to fix this covariate. The function calculates the differences in expected duration for each
observation when fixing the covariate to the high and low values. If `compare`

is `median`

, the function reports the median of these differences,
and if `compare`

is `mean`

, the function reports the median of these differences, but any function may be employed that takes a vector as input and
outputs a scalar.

If `censor.cond`

is `FALSE`

then a proportion of the observations specified by `censor`

is randomly and uniformly selected to be right-censored.
If `censor.cond`

is `TRUE`

then censoring depends on the covariates as follows: new coefficients are drawn from normal distributions with mean 0 and
standard deviation of 0.1, and these new coefficients are used to create a new linear predictor using the `X`

matrix. The observations with the largest
(100 x `censor`

) percent of the linear predictors are designated as right-censored.

Finally, `link[coxed]{survsim.plot}`

is useful for visualizing the baseline functions, including hazard, that result from
`link[coxed]{sim.survdata}`

for a particular draw of simulated duration data. The function uses `ggplot`

to create line plots for the baseline failure PDF, the baseline failure CDF, the baseline survivor function,
and the baseline hazard function over time. The baseline functions and time are attributes of the
`sim.survdata`

output.

Jonathan Kropko <jkropko@virginia.edu> and Jeffrey J. Harden <jharden2@nd.edu>

Harden, J. J. and Kropko, J. (2018) Simulating Duration Data for the Cox Model.
*Political Science Research and Methods* https://doi.org/10.1017/psrm.2018.19

Kropko, J. and Harden, J. J. (2018) Beyond the Hazard Ratio: Generating Expected
Durations from the Cox Proportional Hazards Model. *British Journal of Political Science*
https://doi.org/10.1017/S000712341700045X

Box-Steffensmeier, J. M. (1996)
A Dynamic Analysis of The Role of War Chests in Campaign Strategy.
*American Journal of Political Science* **40** 352-371

Hyman, J. M. (1983) Accurate monotonicity preserving cubic interpolation. *SIAM J. Sci. Stat. Comput.* **4**, 645–654.
https://doi.org/10.1137/0904045

Martin, L. W and Vanberg, G. (2003) Wasting Time?
The Impact of Ideology and Size on Delay in Coalition Formation.
*British Journal of Political Science* **33** 323-344
https://doi.org/10.1017/S0007123403000140

Sylvestre M.-P., Abrahamowicz M. (2008) Comparison of algorithms to generate event times conditional on time-dependent
covariates. *Statistics in Medicine* **27(14)**:2618–34. https://doi.org/10.1002/sim.3092

Wood, S.N., N. Pya and B. Saefken (2016) Smoothing parameter and model selection for general smooth models (with discussion). *Journal of the American Statistical Association* **111**, 1548-1575
http://dx.doi.org/10.1080/01621459.2016.1180986

1 | ```
## See the examples for coxed, sim.survdata, and survsim.plot
``` |

```
Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:base’:
format.pval, units
Loading required package: SparseM
Attaching package: ‘SparseM’
The following object is masked from ‘package:base’:
backsolve
Loading required package: mgcv
Loading required package: nlme
This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
```

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.