Home

/

GitHub

/

In stc04003/aftgee: Accelerated Failure Time Model with Generalized Estimating Equations

\newcommand{\dif}{\mathrm{d}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\X}{\mathbf{X}} \newcommand{\Y}{\mathbf{Y}} \newcommand{\C}{\mathbf{C}} \newcommand{\T}{\mathbf{T}} \newcommand{\bfDelta}{\mathbf{\Delta}}

knitr::opts_chunk$set(prompt = TRUE, comment = "")

In this vignette, we demonstrate the usage of the aftgee() function from the aftgee package[@R:aftgee; @chiou2014fitting]. The vignette is based on aftgee version r packageVersion("aftgee"). {R aftsrr-args} library(aftgee) packageVersion("aftgee")

AFT model

The univariate semiparametric semiparametric accelerated failure time (AFT) model is specified as $$\log(T_i) = X_i^\top\beta + \epsilon_i, i = 1, \ldots, n, $$ where $T_i$ is the failure time, $X_i$ is the $p\times1$ covariate vector, $\beta$ is the $p\times1$ regression coefficient, and $\epsilon_i$'s are independent and identically distributed random variables with an unspecified distribution. In the presence of right censoring, the observed data are independent copies of $(Y_i, \Delta_i, X_i), i = 1, \ldots, n$, where $Y_i = \min(T_i, C_i), \Delta_i = I(T_i < C_i)$, and $I(\cdot)$ is the indicator function.

Least-squares estimator

With survival data from right censoring, @buckley1979linear proposed a least-squares approach that replaces each response $T_i$ with the conditional expectation $Y_i(\beta) = E_\beta(T_i|Y_i, \Delta_i, X_i)$. The theoretical properties of the Buckley and James estimator have been studied by @ritov1990estimation and @lai1991large. However, the Buckley and James estimator is rarely used in practice due to numerical challenges. @jin2006least proposed an iterative procedure to generalize the Buckley and James estimator, making the latter more practical. Recently, @chiou2014marginal embedded the generalized estimating equation (GEE) procedure to generalized the iterative procedure to accommodates within-cluster dependence in cluster survival data. The GEE embedded approach accounts for multivariate dependence through working correlation structures to improve efficiency. To illustrate the idea, we expand the notations to multivariate AFT model, $$\log(T_{ik}) = X_{ik}^\top\beta + \epsilon_{ik}, i = 1, \ldots, n, k = 1, \ldots, K_i, $$ where $K_i$ is the $i$ cluster size, $\beta$ is an unknown $p\times1$ vector of regression parameters and the error terms, $\epsilon_{i}= {\epsilon_{i1}, \ldots, \epsilon_{iK_i}}$ are independent and identically distributed random variables with an unspecified distribution throughout clusters. In the presence of censoring, the observed data consists of copies of ${Y_{ik}, \Delta_{ik}, X_{ik}}$ for $i = 1, \ldots, n$, and $k = 1, \ldots, K_i$, where $Y_{ik} = \min(T_{ik}, C_{ik})$ and $\Delta_{ik} = I(T_{ik} < C_{ik})$.

Define $\Omega_{i}^{-1}\big(\alpha(b)\big)$ to be an $K_i\times K_i$ nonsingular working weight matrix that may involve additional working parameters $\alpha$ that depends on an initial value $b$ of $\beta_0$. For $i = 1, \ldots, n$ and $k = 1, \ldots, K_i$, let $\widehat{\Y}i(b)$, $\Y_i$, $\T_i$, $\C_i$ and $\bfDelta_i$ be $K_i\times 1$ vector formed by stacking $\widehat{Y}{ik}(b)$, $Y_{ik}$, $C_{ik}$ and $\Delta_{ik}$, where \begin{equation} \widehat{Y}{ik}(b) = \Delta{ik}Y_{ik} + (1 - \Delta_{ik}) \left[\frac{\int_{e_{ik}(b)}^\infty u \dif \widehat{F}{e{ik}(\beta)}}{1 - \widehat{F}{e{ik}(\beta)}{e_{ik}(b)}} + X_{ik}^\top b \right]. \end{equation} A generalization of iterative least-square estimating equation is \begin{equation} \label{equ:gee} U_{n, GEE}(\beta, b, \alpha) = \sum_{i=1}^n(\X_i - \bar{\X})^\top\Omega_i^{-1}\left{\alpha(b)\right}{\widehat{\Y}i(b) - \X_i\beta} = 0, \end{equation} where $\bar{\X}_i = \sum{i=1}^n X_i / n$. Given $\alpha$ and $b$, the solution to $U_{n, GEE}(\beta, b, \alpha) = 0$ has a closed form \begin{equation} L_n(b, \alpha) = \left [ \sum_{i=1}^{n}(\X_{i}-\bar{\X})^{\top} \Omega_{i}^{-1}\big(\alpha(b)\big) (\X_{i}-\bar{\X}) \right ]^{-1} \left [ \sum_{i=1}^{n}(\X_{i}-\bar{X})^\top \Omega_{i}^{-1}\big(\alpha(b)\big) \left(\widehat{\Y}_{i}(b)-\bar{\Y}(b)\right)\right ], \end{equation} where $\bar{\Y}(b) = \sum_i^n\widehat{\Y}_i(b)/n$.

The GEE estimator, denoted by $\widehat{\beta}_{n, GEE}$, can be obtained from an iterative procedure:

Obtain an initial estimate $\widehat{\beta}^{(0)}_{n, GEE} = b_n$ of $\beta$ and initialize with $m = 1$.
Obtain an estimate $\widehat\alpha_n$ of $\alpha$ given $\widehat\beta^{(m-1)}{n, GEE}$, $\widehat\alpha_n(\widehat{\beta}{n, GEE}^{(m-1)})$.
Update with $\widehat\beta^{(m)}{n, GEE} = L_n\big(\widehat\beta^{(m-1)}{n, GEE}, \widehat\alpha_n(\widehat\beta^{(m-1)}_{n, GEE})\big)$.
Increase $m$ by one and repeat 2 and 3 until convergence.

The iteration proceeds with the aid of function \code{geese} in \pkg{geepack} \citep{Rpkg:geepack, Hale:Hojs:Yan:gee:2006}. The estimator reduces to the least squares estimator of \citet{Jin:Lin:Wei:Ying:rank:2006} when the working weight matrix $\Omega_i$'s are the identity matrix. We refer to \citet{Chio:Kang:Kim:Yan:marg:2014} for more details. The working parameter estimate $\widehat{\alpha}_n$ does not affect the consistency of the GEE estimator, but may affect its efficiency. Higher efficiency can be achieved if $\Omega_i$ is closer to the covariance matrix of $\widehat{\Y}_i(b)$ and even an imperfect working weight still improves the efficiency \citep{Chio:Kang:Kim:Yan:marg:2014}. The variance of the estimator can again be estimated by resampling procedures.

The `aftgee()` interface

The aftgee() is used to fit a semiparametric AFT model with rank-based approaches. The function interface of aftgee() is

args(aftgee)

The arguments are:

formula: a formula expression, of the form response ~ predictors. The response is a Surv object object with right censoring.
data: an optional data.frame in which to interpret the variables occurring in the formula.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
id: an optional vector used to identify the clusters. If missing, each individual row of data is presumed to represent a distinct subject.
contrasts: an optional list.
weights: an optional vector of observation weights.(This is still under development).
margin: an optional vector to specify the margins; default at 1.
corstr: a character string specifying the correlation structure. The following are permitted: "independence", "exchangeable", "ar", "unstructured", "fixed".
binit: an optional vector can be either a numeric vector or a character string specifying the initial slope estimator. When binit is a vector, its length should be the same as the dimension of covariates. When binit is a character string, it should be either lm for simple linear regression, or srrgehan for smoothed Gehan weight estimator. The default value is "srrgehan".
B: a numeric value specifies the resampling number. When B = 0, only the beta estimate will be displayed.
control: controls maxiter and tolerance.

In the uncensored case, aftgee() with independent working correlation structure will return an ordinary least squares estimate. In the multivariate case, efficiency can be improved in aftgee when the working correlation structure is close to the true correlation even in the absent of censoring.

An simulated example

The following function generates simulated data following an AFT model $$\log(T) = 2 + X_1 + X_2 + \epsilon, $$ where $X_1$ is a Bernoulli random variable with $p = 0.5$, $X_2$ is a standard normal random variable, and $\epsilon$ is an independent standard normal random variable.

datgen <- function(n = 100, tau = 0.3, dim = 2) {
  x1 <- rbinom(dim * n, 1, 0.5)
  x2 <- rnorm(dim * n)
  e <- c(t(exp(MASS::mvrnorm(n = n, mu = rep(0, dim), Sigma = tau + (1 - tau) * diag(dim)))))
  tt <- exp(2 + x1 + x2 + e)
  cen <- runif(n, 0, 100)
  data.frame(Time = pmin(tt, cen), status = 1 * (tt < cen),
             x1 = x1, x2 = x2, id = rep(1:n, each = dim))
}

In datgen(), the error term $\epsilon_i$ follows a multivariate normal distribution with mean 0 and variance covariance matrix $(1 - \tau) I_{K_i} + \tau J_{K_i}$, where $I_d$ is the $d$-dimensional identity matrix and $J_d$ is a $d\times d$ matrix of ones. The arguments dim controls the cluster size, $K_i$. The observed data, $(Y, \Delta, X)$, correspond to Time, status, x1, and x2. The following gives an example a simulated data from an AFT model generated by datgen().

set.seed(1); head(dat <- datgen(n = 100, dim = 2))

The following fits the GEE embedded with independence and exchangeable correlation structures. {R, cache = T} system.time(fit.ind <- aftgee(Surv(Time, status) ~ x1 + x2, data = dat, id = id, corstr = "ind")) summary(fit.ind) system.time(fit.ex <- update(fit.ind, corstr = "ex")) summary(fit.ex) The standard errors decrease slightly when the exchangeable correlation structure is specified over the independent correlation structure.

Reference

stc04003/aftgee documentation built on Oct. 12, 2024, 4:36 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

stc04003/aftgee
Accelerated Failure Time Model with Generalized Estimating Equations

In stc04003/aftgee: Accelerated Failure Time Model with Generalized Estimating Equations

AFT model

Least-squares estimator

The `aftgee()` interface

An simulated example

Reference

R Package Documentation

Browse R Packages

We want your feedback!

stc04003/aftgee Accelerated Failure Time Model with Generalized Estimating Equations

In stc04003/aftgee: Accelerated Failure Time Model with Generalized Estimating Equations

AFT model

Least-squares estimator

The aftgee() interface

An simulated example

Reference

R Package Documentation

Browse R Packages

We want your feedback!

stc04003/aftgee
Accelerated Failure Time Model with Generalized Estimating Equations

The `aftgee()` interface