README.md

sieveSurv: Nonparametrically estimating survival with B-spline sieves

Lotspeich and Garcia (2021)

The complete R package sieveSurv and code for example data anlysis.

Install

To install the package, run the following in your R console:

devtools::install_github(repo = "sarahlotspeich/sieveSurv")

Example analysis using the lung dataset

Objective: We first estimate unconditional survival, before considering conditional survival given sex and/or age. The Kaplan-Meier estimator is used for comparison in the unconditional case, or when conditioning on just binary sex; survival based on the Cox proportional hazards model with the Breslow estimator is used for comparison otherwise.

Setup

```{r, message = FALSE} library(sieveSurv) library(survival) data(lung)


We draw upon the following variables from the `lung` dataset from the `survival` package:

  - `time`: Survival time in days
  - `status`:   Censoring status `1=censored`, `2=dead`
  - `age`:  Age in years
  - `sex`:  Patient sex `Male=1`, `Female=2`

Note that we have releveled `status` so that `0=censored` and `1=dead`. 

```{r, message = FALSE}
lung$status <- lung$status - 1

Unconditional survival, S(t)

# Kaplan-Meier estimator of S(t)
km <- survival::survfit(formula = survival::Surv(time = time, event = status) ~ 1, data = lung)

Since we are not conditioning on any additional covariates, there is no need for B-splines. However, to use the software we supply a single column bs1 which is a constant of ones.

# Sieve estimator of S(t)
lung$bs1 <- 1
sieve <- sieve_surv(time = "time", event = "status", bspline = "bs1", data = lung)

From the plot below, we see that the Kaplan-Meier and sieve estimators of the survival function line up for nearly all values of time; the only exception seems to be how they handle time points beyond the latest observed death at time = 883. Both are valid - the Kaplan-Meier simply carries forward S(t=883) whereas the sieve estimator continues to extrapolate the survival curve so that survival at the latest observed time hits zero.

Unconditional survival functions

Conditional survival with a binary covariate, S(t|sex)

# Kaplan-Meier estimator of S(t|sex)
km_m <- survival::survfit(formula = survival::Surv(time = time, event = status) ~ 1, data = lung, subset = sex == 1)
km_f <- survival::survfit(formula = survival::Surv(time = time, event = status) ~ 1, data = lung, subset = sex == 2)

To estimate the survival function from a Cox proportional hazards model, we use Breslow's estimator. The function breslow_estimator() for Breslow's baseline survival estimator is available from the imputeCensoRd package, which can be installed and used as follows:

# Install and load imputeCensoRd
devtools::install_github(repo = "kylefred/Imputing-Censored-Covariates", subdir = "imputeCensoRd")
library(imputeCensoRd)

# Cox proportional hazards estimator of h(t|sex)
cox <- survival::coxph(formula = survival::Surv(time = time, event = status) ~ sex, data = lung)
lung$hr <- exp(cox$coefficients * as.numeric(lung$sex == 2))
base_surv <- breslow_estimator(time = "time", event = "status", hr = "hr", data = lung)

We are now conditioning on additional covariate, sex, but since it is binary there is still not a need for B-splines. To align with the software implementation, though, we define two columns, bs1 and bs2, as indicators of sex = Male and sex = Female, respectively.

# Sieve estimator of S(t|sex)
lung$bs1 <- as.numeric(lung$sex == 1)
lung$bs2 <- as.numeric(lung$sex == 2)
sieve <- sieve_surv(time = "time", event = "status", bspline = c("bs1", "bs2"), data = lung)

Performance remains comparable when comparing separate Kaplan-Meier estimators fit to male and female patients or the Cox model + Breslow estimator survival estimate against the sieve estimator of the conditional survival. Once again, the only noticeable deviation between the two estimators comes in the tail of time.

Conditional survival functions given sex

Conditional survival with a continuous covariate, S(t|age)

# Cox proportional hazards estimator of h(t|sex)
cox <- survival::coxph(formula = survival::Surv(time = time, event = status) ~ age, data = lung)
lung$hr <- exp(cox$coefficients[1] * lung$age)
base_surv <- imputeCensoRd::breslow_estimator(time = "time", event = "status", hr = "hr", data = lung)

First, we have to tune the B-splines for age. We consider linear (q = 2), quadratic (q = 3), and cubic B-splines (q = 4) with bn = 10, 15, or 20 interior knots. There are two potential methods to select q and bn:

  1. Akaike Information Criterion (AIC) or
  2. Cross-Validation.

Given the small sample size in the lung dataset, we focus only on tuning with AIC; splitting the n = 228 subjects into two or more folds could make method 2. unreliable.

For illustration, we also plot the sieve survival estimation against the Cox model estimate for each B-spline parameterization. The plot is based on the survival functions conditional on the median age of r median(lung$age) years old. Note: This plot represents the fit for a single subject in the sample and should not be used to singlehandedly assess fit of the B-spline basis.

Conditional survival for a subject of median age using different B-spline specifications

Tuning B-Splines with AIC

Define the Akaike Information Criterion (AIC) for the B-spline coefficients as $AIC({\hat{p}{kj}}) = -2l_n({\hat{p}{kj}}) + 2K$, where $K$ is the number of parameters being estimated. Since the columns of the B-spline basis are constrained such that $\sum_{k=1}^{m}p_{kj} = 1$ ($j = 1, \dots, s_n$), it follows that $K = s_n(m - 1)$. We want to choose $q$ and $b_n$ to minimize the $AIC({\hat{p}_{kj}})$.

Akaike Information Criterion (AIC) with different B-spline specifications for age

Based on this result, we place a linear B-spline (q = 2) with bn = 5 interior knots on age.

B <- splines::bs(x = lung$age, degree = 1, df = 5, intercept = TRUE, Boundary.knots = range(lung$age))
colnames(B) <- paste0("bs", 1:ncol(B))
colSums(B > 0)
# Sieve estimator of S(t|sex)
sieve <- sieveSurv(time = "time", event = "status", covar = "age", bspline = colnames(B), data = cbind(lung, B))

Conditional survival functions given median age

Conditional survival with binary and continuous covariates, S(t|sex, age)

# Cox proportional hazards estimator of h(t|sex)
cox <- survival::coxph(formula = survival::Surv(time = time, event = status) ~ age + sex, data = lung)
lung$hr <- exp(cox$coefficients[1] * lung$age + cox$coefficients[2] * lung$sex)
base_surv <- breslow_estimator(time = "time", event = "status", hr = "hr", data = lung)

We once again tune the B-spline specification by finding (q, bn) to minimize the AIC. Since approximately 60% of the sample were men, 60% of the bn interior knots in each parameterization are allocated to the men; the remaining 40% are designated to women accordingly. Note: We cannot consider bn < 10 for $q \geq 2$ because they require more than 10 splines.

Akaike Information Criterion (AIC) with different B-spline specifications for sex and age

The lowest AIC was attained with q = 5 and bn = 5; this is the same parameterization as above, but now we are stratifying on sex and placing separate linear B-splines on age.

nsieve <- 5
B <- matrix(data = 0, nrow = nrow(lung), ncol = nsieve)
B[lung$sex == 1, 1:(nsieve * 0.6)] <- splines::bs(x = lung$age[lung$sex == 1], degree = 1, 
                                                      df = (nsieve * 0.6), intercept = TRUE,
                                                      Boundary.knots = range(lung$age[lung$sex == 1]))
B[lung$sex == 2, (nsieve * 0.6 + 1):nsieve] <- splines::bs(x = lung$age[lung$sex == 2], degree = 1,
                                                           df = (nsieve * 0.4), intercept = TRUE, 
                                                           Boundary.knots = range(lung$age[lung$sex == 2]))
colnames(B) <- paste0("bs", 1:nsieve)

# Sieve estimator of S(t|sex,age)
sieve <- sieveSurv(time = "time", event = "status", covar = c("age", "sex"), bspline = paste0("bs", 1:nsieve), data = cbind(lung, B))

Visual comparisons are once again made based on the survival functions conditional on either sex at the median age of r median(lung$age) years old.

Conditional survival functions given sex and median age



sarahlotspeich/sieveSurv documentation built on Feb. 14, 2022, 5:10 a.m.