\newcommand{\dif}{\mathrm{d}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\X}{\mathbf{X}} \newcommand{\Y}{\mathbf{Y}} \newcommand{\C}{\mathbf{C}} \newcommand{\T}{\mathbf{T}} \newcommand{\bfDelta}{\mathbf{\Delta}}
knitr::opts_chunk$set(prompt = TRUE, comment = "")
In this vignette, we demonstrate the usage of the aftgee()
function from the aftgee
package[@R:aftgee; @chiou2014fitting].
The vignette is based on aftgee
version r packageVersion("aftgee")
.
{R aftsrr-args}
library(aftgee)
packageVersion("aftgee")
The univariate semiparametric semiparametric accelerated failure time (AFT) model is specified as $$\log(T_i) = X_i^\top\beta + \epsilon_i, i = 1, \ldots, n, $$ where $T_i$ is the failure time, $X_i$ is the $p\times1$ covariate vector, $\beta$ is the $p\times1$ regression coefficient, and $\epsilon_i$'s are independent and identically distributed random variables with an unspecified distribution. In the presence of right censoring, the observed data are independent copies of $(Y_i, \Delta_i, X_i), i = 1, \ldots, n$, where $Y_i = \min(T_i, C_i), \Delta_i = I(T_i < C_i)$, and $I(\cdot)$ is the indicator function.
With survival data from right censoring, @buckley1979linear proposed a least-squares approach that replaces each response $T_i$ with the conditional expectation $Y_i(\beta) = E_\beta(T_i|Y_i, \Delta_i, X_i)$. The theoretical properties of the Buckley and James estimator have been studied by @ritov1990estimation and @lai1991large. However, the Buckley and James estimator is rarely used in practice due to numerical challenges. @jin2006least proposed an iterative procedure to generalize the Buckley and James estimator, making the latter more practical. Recently, @chiou2014marginal embedded the generalized estimating equation (GEE) procedure to generalized the iterative procedure to accommodates within-cluster dependence in cluster survival data. The GEE embedded approach accounts for multivariate dependence through working correlation structures to improve efficiency. To illustrate the idea, we expand the notations to multivariate AFT model, $$\log(T_{ik}) = X_{ik}^\top\beta + \epsilon_{ik}, i = 1, \ldots, n, k = 1, \ldots, K_i, $$ where $K_i$ is the $i$ cluster size, $\beta$ is an unknown $p\times1$ vector of regression parameters and the error terms, $\epsilon_{i}= {\epsilon_{i1}, \ldots, \epsilon_{iK_i}}$ are independent and identically distributed random variables with an unspecified distribution throughout clusters. In the presence of censoring, the observed data consists of copies of ${Y_{ik}, \Delta_{ik}, X_{ik}}$ for $i = 1, \ldots, n$, and $k = 1, \ldots, K_i$, where $Y_{ik} = \min(T_{ik}, C_{ik})$ and $\Delta_{ik} = I(T_{ik} < C_{ik})$.
Define $\Omega_{i}^{-1}\big(\alpha(b)\big)$ to be an $K_i\times K_i$ nonsingular working weight matrix that may involve additional working parameters $\alpha$ that depends on an initial value $b$ of $\beta_0$. For $i = 1, \ldots, n$ and $k = 1, \ldots, K_i$, let $\widehat{\Y}i(b)$, $\Y_i$, $\T_i$, $\C_i$ and $\bfDelta_i$ be $K_i\times 1$ vector formed by stacking $\widehat{Y}{ik}(b)$, $Y_{ik}$, $C_{ik}$ and $\Delta_{ik}$, where \begin{equation} \widehat{Y}{ik}(b) = \Delta{ik}Y_{ik} + (1 - \Delta_{ik}) \left[\frac{\int_{e_{ik}(b)}^\infty u \dif \widehat{F}{e{ik}(\beta)}}{1 - \widehat{F}{e{ik}(\beta)}{e_{ik}(b)}} + X_{ik}^\top b \right]. \end{equation} A generalization of iterative least-square estimating equation is \begin{equation} \label{equ:gee} U_{n, GEE}(\beta, b, \alpha) = \sum_{i=1}^n(\X_i - \bar{\X})^\top\Omega_i^{-1}\left{\alpha(b)\right}{\widehat{\Y}i(b) - \X_i\beta} = 0, \end{equation} where $\bar{\X}_i = \sum{i=1}^n X_i / n$. Given $\alpha$ and $b$, the solution to $U_{n, GEE}(\beta, b, \alpha) = 0$ has a closed form \begin{equation} L_n(b, \alpha) = \left [ \sum_{i=1}^{n}(\X_{i}-\bar{\X})^{\top} \Omega_{i}^{-1}\big(\alpha(b)\big) (\X_{i}-\bar{\X}) \right ]^{-1} \left [ \sum_{i=1}^{n}(\X_{i}-\bar{X})^\top \Omega_{i}^{-1}\big(\alpha(b)\big) \left(\widehat{\Y}_{i}(b)-\bar{\Y}(b)\right)\right ], \end{equation} where $\bar{\Y}(b) = \sum_i^n\widehat{\Y}_i(b)/n$.
The GEE estimator, denoted by $\widehat{\beta}_{n, GEE}$, can be obtained from an iterative procedure:
The iteration proceeds with the aid of function \code{geese} in \pkg{geepack} \citep{Rpkg:geepack, Hale:Hojs:Yan:gee:2006}. The estimator reduces to the least squares estimator of \citet{Jin:Lin:Wei:Ying:rank:2006} when the working weight matrix $\Omega_i$'s are the identity matrix. We refer to \citet{Chio:Kang:Kim:Yan:marg:2014} for more details. The working parameter estimate $\widehat{\alpha}_n$ does not affect the consistency of the GEE estimator, but may affect its efficiency. Higher efficiency can be achieved if $\Omega_i$ is closer to the covariance matrix of $\widehat{\Y}_i(b)$ and even an imperfect working weight still improves the efficiency \citep{Chio:Kang:Kim:Yan:marg:2014}. The variance of the estimator can again be estimated by resampling procedures.
aftgee()
interfaceThe aftgee()
is used to fit a semiparametric AFT model with rank-based approaches.
The function interface of aftgee()
is
args(aftgee)
The arguments are:
formula
: a formula expression, of the form response ~ predictors
. The response
is a Surv
object object with right censoring. data
: an optional data.frame
in which to interpret the variables occurring in the formula
.subset
: an optional vector specifying a subset of observations to be used in the fitting process.id
: an optional vector used to identify the clusters. If missing, each individual row of data
is presumed to represent a distinct subject. contrasts
: an optional list.weights
: an optional vector of observation weights.(This is still under development).margin
: an optional vector to specify the margins; default at 1. corstr
: a character string specifying the correlation structure. The following are permitted:
"independence"
, "exchangeable"
, "ar"
, "unstructured"
, "fixed"
.binit
: an optional vector can be either a numeric vector or a character string specifying the
initial slope estimator.
When binit
is a vector, its length should be the same as the dimension of covariates.
When binit
is a character string, it should be either lm
for simple linear regression,
or srrgehan
for smoothed Gehan weight estimator. The default value is "srrgehan".B
: a numeric value specifies the resampling number. When B = 0
, only the beta estimate will be displayed.control
: controls maxiter and tolerance.In the uncensored case, aftgee()
with independent working correlation structure will return
an ordinary least squares estimate. In the multivariate case, efficiency can be improved in
aftgee when the working correlation structure is close to the true correlation even in the
absent of censoring.
The following function generates simulated data following an AFT model $$\log(T) = 2 + X_1 + X_2 + \epsilon, $$ where $X_1$ is a Bernoulli random variable with $p = 0.5$, $X_2$ is a standard normal random variable, and $\epsilon$ is an independent standard normal random variable.
datgen <- function(n = 100, tau = 0.3, dim = 2) { x1 <- rbinom(dim * n, 1, 0.5) x2 <- rnorm(dim * n) e <- c(t(exp(MASS::mvrnorm(n = n, mu = rep(0, dim), Sigma = tau + (1 - tau) * diag(dim))))) tt <- exp(2 + x1 + x2 + e) cen <- runif(n, 0, 100) data.frame(Time = pmin(tt, cen), status = 1 * (tt < cen), x1 = x1, x2 = x2, id = rep(1:n, each = dim)) }
In datgen()
, the error term $\epsilon_i$ follows a multivariate normal distribution with mean 0 and
variance covariance matrix $(1 - \tau) I_{K_i} + \tau J_{K_i}$,
where $I_d$ is the $d$-dimensional identity matrix and $J_d$ is a $d\times d$ matrix of ones.
The arguments dim
controls the cluster size, $K_i$.
The observed data, $(Y, \Delta, X)$, correspond to Time
, status
, x1
, and x2
.
The following gives an example a simulated data from an AFT model generated by datgen()
.
set.seed(1); head(dat <- datgen(n = 100, dim = 2))
The following fits the GEE embedded with independence and exchangeable correlation structures.
{R, cache = T}
system.time(fit.ind <- aftgee(Surv(Time, status) ~ x1 + x2, data = dat, id = id, corstr = "ind"))
summary(fit.ind)
system.time(fit.ex <- update(fit.ind, corstr = "ex"))
summary(fit.ex)
The standard errors decrease slightly when the exchangeable correlation structure is specified
over the independent correlation structure.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.