knitr::opts_chunk$set(prompt = TRUE, comment = "")

In this vignette, we demonstrate the usage of the aftsrr() function from the aftgee package [@R:aftgee; @chiou2014fitting]. The vignette is based on aftgee version r packageVersion("aftgee"). {R aftsrr-args} library(aftgee) packageVersion("aftgee")

AFT model

The univariate semiparametric semiparametric accelerated failure time (AFT) model is specified as $$\log(T_i) = X_i^\top\beta + \epsilon_i, i = 1, \ldots, n, $$ where $T_i$ is the failure time, $X_i$ is the $p\times1$ covariate vector, $\beta$ is the $p\times1$ regression coefficient, and $\epsilon_i$'s are independent and identically distributed random variables with an unspecified distribution. In the presence of right censoring, the observed data are independent copies of $(Y_i, \Delta_i, X_i), i = 1, \ldots, n$, where $Y_i = \min(T_i, C_i), \Delta_i = I(T_i < C_i)$, and $I(\cdot)$ is the indicator function.

Rank-based estimator

The regression parameters, $\beta$, can be estimated by solving the rank-based weighted estimating equation, $$U_{n, \varphi}(\beta) = \sum_{i = 1}^nw_i\varphi_i(\beta)\Delta_i \left[X_i - \frac{\sum_{j = 1}^nw_jX_jI{e_j(\beta)\ge e_i(\beta)}}{\sum_{j = 1}^nw_jI{e_j(\beta)\ge e_i(\beta)}}\right] = 0,$$ where $e_i(\beta) = Y_i - X_i^\top\beta$, $w_i$ is a sampling weight, and $\varphi(\beta)$ is a possibly data-dependent non-negative weight function with values between 0 and 1. Let $\widehat{F}_{e_i(\beta)}(t)$ be the estimated cumulative distribution based on the residuals $e_i(\beta)$'s. Common choices of $\varphi_i(\beta)$ implemented in the aftgee package are.

The solution to $U_{n, \varphi}(\beta) = 0$, denoted by $\widehat\beta_{n, \varphi}$, is consistent to the true parameter, $\beta_0$, and is asymptotically normal [e.g., @Tsia:esti:1990]. We used the Kaplan-Meier estimator to obtain $\widehat{F}{e_i(\beta)}(t)$. The Barzilai-Borwein spectral method implemented in package BB package [@R:BB] is the used to solve $U{n, \varphi}(\beta) = 0$ directly.

Smoothed rank-based estimator

A computationally more efficient approach is the induced smoothing procedure [@brown2005standard; @brown2007induced]. The idea is to replace the nonsmooth estimating equations with a smoothed version, whose solutions are asymptotically equivalent to those of the former. Define an independent $p\times1$ standard normal random vector $Z$ and a $p\times p$ matrix $\Gamma_n$ such that $\Gamma_n^2 = \Sigma_n$, where $\Sigma_n$ is a symmetric positive definite matrix. The induced smoothing procedure replaces $U_{n, \varphi}(\beta)$ with $E_Z[U_{n, \varphi}(\beta + n^{-1/2}\Gamma_nZ)]$, where the expectation is taken with respect to $Z$. With the Gehan's weight, the induced smooth estimating equation is $$\widetilde{U}{n, G}(\beta) = \sum{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i(X_i - X_j) \Phi\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0, $$ where $r_{ij}^2 = (X_i - X_j)^\top\Sigma_n(X_i - X_j)$ and $\Phi(\cdot)$ is the standard normal cumulative distribution function. The smooth Gehan estimating equation is monotone and continuously differentiable with respect to $\beta$, hence, its root can be found with standard numerical methods such as the Barzilai-Borwein spectral method.

On the other hand, deriving the smoothed estimating equations with general weights is challenging because $E_Z[U_{n, \varphi}(\beta + n^{-1/2}\Gamma_nZ)]$ involves the expectation of the ratio of two random quantities. The aftgee package implements the iterative induced smoothing procedure proposed in @chiou2015rank. For general weights, the regression parameters, $\beta$, can be estimated with the following steps:

  1. Obtain an initial estimate $\widetilde\beta_{n, \varphi}^{(0)} = b_n$ of $\beta$ and initialize with $m = 1$.
  2. Update $\widetilde\beta_{n, \varphi}^{(m)}$ by solving $\widetilde{U}{n, \varphi}(\widetilde\beta{n, \varphi}^{(m-1)} , \widetilde\beta_{n, \varphi}^{(m)}) = 0$.
  3. Increase $m$ by one and repeat Step 2 until $\left|\widetilde\beta_{n, \varphi, q}^{(m-1)} - \widetilde\beta_{n, \varphi, q}^{(m)}\right| < t$ for all $q = 1, \ldots, p$, where $\widetilde\beta_{n, \varphi, q}^{(m)}$ is the $q$th component of $\widetilde\beta_{n, \varphi}^{(m)}$ and $t$ is a prefixed tolerance.

The smoothed estimating equation, $\widetilde{U}{n, \varphi}(b, \beta)$, has the form $$\widetilde{U}{n, \varphi}(b, \beta) = \sum_{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i\phi_i(b)(X_i - X_j)\Phi\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0.$$ A simple choice of the initial estimator is the easy-to-compute Gehan's estimator, $\widetilde\beta_{n, G}$.

The aftsrr() interface

The aftsrr() is used to fit a semiparametric AFT model with rank-based approaches. The function interface of aftsrr() is

args(aftsrr)

The arguments are:

The equation types

The arguments eqType = "is"estimates the regression parameters by solving the non-smoothed estimating equation, $U_{n, \varphi}(\beta)$ directly. The arguments eqType = "is"estimates the regression parameters by solving the approximated induced smoothing estimating equation $$U_{n, \varphi}(\beta) = \sum_{i = 1}^nw_i\varphi_i(\beta)\Delta_i \left[X_i - \frac{\sum_{j = 1}^nw_jX_j\Phi{e_j(\beta)\ge e_i(\beta)}}{\sum_{j = 1}^nw_j\Phi{e_j(\beta)\ge e_i(\beta)}}\right] = 0.$$ The arguments eqType = "mis"estimates the regression parameters using the above mentioned iterative procedure with $$\widetilde{U}{n, \varphi}(b, \beta) = \sum{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i\phi_i(b)(X_i - X_j)\Phi\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0.$$ The arguments eqType = "mis"estimates the regression parameters using the same algorithm except with the induced smoothing $\widetilde{U}{n, \varphi}(b, \beta)$ replaced with $$\widehat{U}{n, \varphi}(b, \beta) = \sum_{i = 1}^n\sum_{j = 1}^nw_iw_j\Delta_i\phi_i(b)(X_i - X_j)I\left[\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right] = 0.$$

Variance estimation

The aftsrr() allows users to estimate the variance-covariance matrix of $\beta$ through the nonparametric bootstrap or the multiplier resampling procedure. Bootstrap samples that failed to converge are removed when computing the empirical variance matrix. When bootstrap is not called, we assume the variance-covariance matrix has a sandwich form $$\Sigma = A^{-1}V(A^{-1})^\top,$$ where $V$ is the asymptotic variance of the estimating function and $A$ is the slope matrix. In aftgee, we provide seveal methods to estimate the variance-covariance matrix via this sandwich form, depending on how $V$ and $A$ are estimated. Specifically, the asymptotic variance, $V$, can be estimated by either a closed-form formulation (CF) or through bootstrap the estimating equations (MB). On the other hand, the methods to estimate the slope matrix $A$ are the induced smoothing approach (IS), Zeng and Lin's approach (ZL) [@zeng2008efficient], and the smoothed Huang's approach (sH) [@huang2002calibration].

The control list

The control list controls equation solver, maxiter, tolerance, and resampling variance estimation. The available equation solvers are BBsolve and dfsane of the BB package. The default algorithm control parameters are used when these functions are called. However, the monotonicity parameter, M, can be specified by users via the control list. When M is specified, the merit parameter, noimp, is set at 10 * M. The readers are refered to the BB package for details. Instead of searching for the zero crossing, options including BBoptim and optim will return solution from maximizing the corresponding objective function. When se = "bootstrap" or se = "MB", an additional argument parallel = TRUE can be specified to enable parallel computation. The number of CPU cores can be specified with parCl, the default number of CPU cores is the integer value of detectCores() / 2.

An simulated example

The following function generates simulated data following an AFT model $$\log(T) = 2 + X_1 + X_2 + \epsilon, $$ where $X_1$ is a Bernoulli random variable with $p = 0.5$, $X_2$ is a standard normal random variable, and $\epsilon$ is an independent standard normal random variable.

datgen <- function(n = 100) {
  x1 <- rbinom(n, 1, 0.5)
  x2 <- rnorm(n)
  e <- rnorm(n)
  tt <- exp(2 + x1 + x2 + e)
  cen <- runif(n, 0, 100)
  data.frame(Time = pmin(tt, cen), status = 1 * (tt < cen),
             x1 = x1, x2 = x2, id = 1:n)
}
set.seed(1); head(dat <- datgen(n = 50))

The observed data, $(Y, \Delta, X)$, correspond to Time, status, x1, and x2. The following gives the induced smoothing Gehan estimate (default) of $\beta$ with the variance matrix estimated by the induced smoothing approach and Zeng and Lin's approach.

```{R, cache = TRUE} system.time(fit1 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, se = c("ISMB", "ZLMB"))) summary(fit1)

The non-smooth counterpart presented below yield similar results.
```{R, cache = TRUE}
system.time(fit2 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, 
                           eqType = "ns", se = c("ISMB", "ZLMB")))
summary(fit2)

The Prentice-Wilcoxon estimator obtained by the iterative procedure also yields similar results. ```{R, cache = TRUE} system.time(fit3 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, rankWeights = "PW", se = c("ISMB", "ZLMB"))) summary(fit3)

The following provides an example with the variance estimate obtained by bootstrap while taking advantage of parallel computing. 
The bootstrap variance estimator is compatible with the sandwich variance estimator.

```{R, cache = TRUE}
system.time(fit4 <- aftsrr(Surv(Time, status) ~ x1 + x2, data = dat, se = "bootstrap", 
                           control = list(parallel = TRUE)))
summary(fit4)

National Wilm's Tumor Study

We demonstrate the performance of the weighted approach with an application to the cohort study conducted by the National Wilm's Tumor Study Group (NWTSG) [@breslow1999design]. The interest of the study was to assess the relationship between the tumor histology and the outcome, time to tumor relapse. The data set can be loaded with the following code.

data(nwtco, package = "survival")
head(nwtco)

The variables of interest are the following.

There were a total of 4028 subjects in the full cohort. Among them, 571 were cases who experienced the relapse of tumor; a censoring rate of about 86\%. The following codes converts the variables, histol, stage, and study into a factor and transform age to year.

nwtco$age <- nwtco$age / 12
nwtco$histol <- factor(nwtco$histol)
nwtco$stage <- factor(nwtco$stage)
nwtco$study <- factor(nwtco$study)

We first fit the full cohort data. ```{R, cache = T} system.time(fit.full <- aftsrr(Surv(edrel, rel) ~ histol + age + study, data = nwtco, se = "ISMB")) summary(fit.full)

We now create an artificial case-cohort data to evaluate the performance of the weighted approach.
The following creates a case0cohort with 668 patients selected as sub-cohort sample to form the total case-cohort sample size of 1154.
We also created the case-cohort weight accordingly.
```{R, cache = T}
set.seed(1)
subinx <- sample(1:nrow(nwtco), 668, replace = FALSE)
nwtco$subcohort <- 0
nwtco$subcohort[subinx] <- 1
pn <- mean(nwtco$subcohort)
nwtco$hi <- nwtco$rel + ( 1 - nwtco$rel) * nwtco$subcohort / pn

The case-cohort analysis is provided below. Point estimates are close to these in case-cohort analysis, with their standard errors taken into consideration. All the standard errors decrease compared to the case-cohort analyses, which is expected as full information became available for all covariates. {R, cache = T} system.time(fit.case <- aftsrr(Surv(edrel, rel) ~ histol + age + study, weights = hi, data = nwtco, subset = subcohort > 0, se = "ISMB")) summary(fit.case)

Reference



stc04003/aftgee documentation built on Oct. 12, 2024, 4:36 a.m.