sieveSurv
: Nonparametrically estimating survival with B-spline sievesThe complete R package sieveSurv
and code for example data anlysis.
To install the package, run the following in your R
console:
devtools::install_github(repo = "sarahlotspeich/sieveSurv")
lung
datasetObjective: We first estimate unconditional survival, before considering conditional survival given sex
and/or age
. The Kaplan-Meier estimator is used for comparison in the unconditional case, or when conditioning on just binary sex
; survival based on the Cox proportional hazards model with the Breslow estimator is used for comparison otherwise.
```{r, message = FALSE} library(sieveSurv) library(survival) data(lung)
We draw upon the following variables from the `lung` dataset from the `survival` package:
- `time`: Survival time in days
- `status`: Censoring status `1=censored`, `2=dead`
- `age`: Age in years
- `sex`: Patient sex `Male=1`, `Female=2`
Note that we have releveled `status` so that `0=censored` and `1=dead`.
```{r, message = FALSE}
lung$status <- lung$status - 1
# Kaplan-Meier estimator of S(t)
km <- survival::survfit(formula = survival::Surv(time = time, event = status) ~ 1, data = lung)
Since we are not conditioning on any additional covariates, there is no need for B-splines. However, to use the software we supply a single column bs1
which is a constant of ones.
# Sieve estimator of S(t)
lung$bs1 <- 1
sieve <- sieve_surv(time = "time", event = "status", bspline = "bs1", data = lung)
From the plot below, we see that the Kaplan-Meier and sieve estimators of the survival function line up for nearly all values of time
; the only exception seems to be how they handle time points beyond the latest observed death at time = 883
. Both are valid - the Kaplan-Meier simply carries forward S(t=883) whereas the sieve estimator continues to extrapolate the survival curve so that survival at the latest observed time
hits zero.
# Kaplan-Meier estimator of S(t|sex)
km_m <- survival::survfit(formula = survival::Surv(time = time, event = status) ~ 1, data = lung, subset = sex == 1)
km_f <- survival::survfit(formula = survival::Surv(time = time, event = status) ~ 1, data = lung, subset = sex == 2)
To estimate the survival function from a Cox proportional hazards model, we use Breslow's estimator. The function breslow_estimator()
for Breslow's baseline survival estimator is available from the imputeCensoRd
package, which can be installed and used as follows:
# Install and load imputeCensoRd
devtools::install_github(repo = "kylefred/Imputing-Censored-Covariates", subdir = "imputeCensoRd")
library(imputeCensoRd)
# Cox proportional hazards estimator of h(t|sex)
cox <- survival::coxph(formula = survival::Surv(time = time, event = status) ~ sex, data = lung)
lung$hr <- exp(cox$coefficients * as.numeric(lung$sex == 2))
base_surv <- breslow_estimator(time = "time", event = "status", hr = "hr", data = lung)
We are now conditioning on additional covariate, sex
, but since it is binary there is still not a need for B-splines. To align with the software implementation, though, we define two columns, bs1
and bs2
, as indicators of sex = Male
and sex = Female
, respectively.
# Sieve estimator of S(t|sex)
lung$bs1 <- as.numeric(lung$sex == 1)
lung$bs2 <- as.numeric(lung$sex == 2)
sieve <- sieve_surv(time = "time", event = "status", bspline = c("bs1", "bs2"), data = lung)
Performance remains comparable when comparing separate Kaplan-Meier estimators fit to male and female patients or the Cox model + Breslow estimator survival estimate against the sieve estimator of the conditional survival. Once again, the only noticeable deviation between the two estimators comes in the tail of time
.
# Cox proportional hazards estimator of h(t|sex)
cox <- survival::coxph(formula = survival::Surv(time = time, event = status) ~ age, data = lung)
lung$hr <- exp(cox$coefficients[1] * lung$age)
base_surv <- imputeCensoRd::breslow_estimator(time = "time", event = "status", hr = "hr", data = lung)
First, we have to tune the B-splines for age
. We consider linear (q = 2), quadratic (q = 3), and cubic B-splines (q = 4) with bn = 10, 15, or 20 interior knots. There are two potential methods to select q and bn:
Given the small sample size in the lung
dataset, we focus only on tuning with AIC; splitting the n = 228 subjects into two or more folds could make method 2. unreliable.
For illustration, we also plot the sieve survival estimation against the Cox model estimate for each B-spline parameterization. The plot is based on the survival functions conditional on the median age of r median(lung$age)
years old. Note: This plot represents the fit for a single subject in the sample and should not be used to singlehandedly assess fit of the B-spline basis.
Define the Akaike Information Criterion (AIC) for the B-spline coefficients as $AIC({\hat{p}{kj}}) = -2l_n({\hat{p}{kj}}) + 2K$, where $K$ is the number of parameters being estimated. Since the columns of the B-spline basis are constrained such that $\sum_{k=1}^{m}p_{kj} = 1$ ($j = 1, \dots, s_n$), it follows that $K = s_n(m - 1)$. We want to choose $q$ and $b_n$ to minimize the $AIC({\hat{p}_{kj}})$.
Based on this result, we place a linear B-spline (q = 2) with bn = 5 interior knots on age
.
B <- splines::bs(x = lung$age, degree = 1, df = 5, intercept = TRUE, Boundary.knots = range(lung$age))
colnames(B) <- paste0("bs", 1:ncol(B))
colSums(B > 0)
# Sieve estimator of S(t|sex)
sieve <- sieveSurv(time = "time", event = "status", covar = "age", bspline = colnames(B), data = cbind(lung, B))
# Cox proportional hazards estimator of h(t|sex)
cox <- survival::coxph(formula = survival::Surv(time = time, event = status) ~ age + sex, data = lung)
lung$hr <- exp(cox$coefficients[1] * lung$age + cox$coefficients[2] * lung$sex)
base_surv <- breslow_estimator(time = "time", event = "status", hr = "hr", data = lung)
We once again tune the B-spline specification by finding (q, bn) to minimize the AIC. Since approximately 60% of the sample were men, 60% of the bn interior knots in each parameterization are allocated to the men; the remaining 40% are designated to women accordingly. Note: We cannot consider bn < 10 for $q \geq 2$ because they require more than 10 splines.
The lowest AIC was attained with q = 5 and bn = 5; this is the same parameterization as above, but now we are stratifying on sex
and placing separate linear B-splines on age
.
nsieve <- 5
B <- matrix(data = 0, nrow = nrow(lung), ncol = nsieve)
B[lung$sex == 1, 1:(nsieve * 0.6)] <- splines::bs(x = lung$age[lung$sex == 1], degree = 1,
df = (nsieve * 0.6), intercept = TRUE,
Boundary.knots = range(lung$age[lung$sex == 1]))
B[lung$sex == 2, (nsieve * 0.6 + 1):nsieve] <- splines::bs(x = lung$age[lung$sex == 2], degree = 1,
df = (nsieve * 0.4), intercept = TRUE,
Boundary.knots = range(lung$age[lung$sex == 2]))
colnames(B) <- paste0("bs", 1:nsieve)
# Sieve estimator of S(t|sex,age)
sieve <- sieveSurv(time = "time", event = "status", covar = c("age", "sex"), bspline = paste0("bs", 1:nsieve), data = cbind(lung, B))
Visual comparisons are once again made based on the survival functions conditional on either sex at the median age of r median(lung$age)
years old.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.