knitr::opts_chunk$set(cache = FALSE, fig.align = "center") suppressPackageStartupMessages({ library(dplyr) library(SurvUtils) })
The perturbation bootstrap (Zhezhen 2001, Das 2019) is a method of approximating the distribution of $\hat{\theta} - \theta$. The general approach is first to obtain an influence function representation: $$ \sqrt{n}\big(\hat{\theta} - \theta\big) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_{i} + o_{p}(1) $$
Next, the influence contributions of all subjects ${\hat{\psi}{i}}{i=1}^{n}$ are estimated. For $b$ in $1, \dots, B$ iterations, mean 1, variance 1 weights ${\omega_{i}^{(b)}}{i=1}^{n}$ are sampled from a known distribution, independently of the observed data. These weights are used to perturb the sum: $$ (\theta{(b)}^{*} - \hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\hat{\psi}{i}\omega{i}^{(b)} $$
Upon evaluating the right-hand side, each set of weights provides a realization of $(\theta_{(b)}^{} - \hat{\theta})$, the deviation of a parameter estimate $\theta_{(b)}^{}$ based on the perturbed data around the observed value $\hat{\theta}$. It has been shown that the distribution of $(\theta_{(b)}^{} - \hat{\theta})$ can be used to approximate that of $(\hat{\theta} - \theta)$ (Zhezhen 2001, Das 2019). For example, the standard deviation of the $B$ realizations ${(\theta_{(b)}^{} - \hat{\theta})}_{b=1}^{B}$ is an estimate for the standard error of $\hat{\theta}$.
Consider survival data of the form $\big{(U_{i}, \delta_{i})\big}{i=1}^{n}$, where $U{i} = \min(T_{i}, C_{i})$ is the minimum of the event time $T_{i}$ and the censoring time $C_{i}$, and $\delta_{i} = \mathbb{I}(T_{i} \leq C_{i})$ is the status indicator. The Kaplan-Meier estimate has an influence function representation of the form (Andersen 1993): $$ \begin{gathered} \sqrt{n}\big{\hat{S}(t) - S(t)\big} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_{i}(t) + o_{p}(1), \ \psi_{i}(t) = -S(t) \int_{0}^{t}\frac{dM_{i}(u)}{n^{-1}\sum_{i=1}^{n}Y_{i}(u)}. \end{gathered} $$
Here $Y_{i} = \mathbb{I}(U_{i} \geq t)$ is an at risk indicator, $dM_{i}(t) = dN_{i}(t) - Y_{i}(t)dA(t)$ is the increment in the counting process martingale, $N_{i}(t) = \mathbb{I}(U_{i} \leq t, \delta_{i} = 1)$ is the event counting process, and $A(t) = \int_{0}^{t}\alpha(u)du$ is the cumulative hazard.
Data subject to censoring are generated according to an exponential distribution with rate $\lambda = 1$. Consider estimation of the survival probability at time $\tau = 1$. The exact calculation is $S(1) = \exp(-1) = 0.368$.
set.seed(101) # Generate data. n <- 1e3 data <- SurvUtils::GenData(n = n) tau <- 1.0 # Estimated survival probability. prob <- SurvUtils::OneSampleRates(data, tau = tau) %>% dplyr::mutate_if(is.numeric, function(x) {round(x, digits = 3)}) show(prob)
The influence contributions at time $t$ can be estimated as:
$$
\begin{gathered}
\hat{\psi}{i}(t) = -\hat{S}(t) \sum{u \leq t}\frac{dN_{i}(u) - Y_{i}(u)d\hat{A}(u)}{n^{-1}\sum_{j=1}^{n}Y_{j}(u)}, \
d\hat{A}(u) = \frac{\sum_{j=1}^{n}dN_{j}(u)}{\sum_{j=1}^{n}Y_{j}(u)}
\end{gathered}
$$
where the sum is taken over unique event times $u$ up to time $t$, and $d\hat{A}(u)$ is the estimated increment in the cumulative hazard at time $u$. The influence function contributions can be estimated using KMInfluence
:
# Estimate influence contributions. psi <- SurvUtils::KMInfluence(data = data, tau = tau) head(round(psi, digits = 3))
The mean of the influence function contributions will approximate zero:
head(round(mean(psi), digits = 10))
Writing: $$ \hat{S}(t) - S(t) = \frac{1}{n}\sum_{i=1}^{n}\psi_{i}(t) + o_{p}(n^{-1/2}) $$
the variance of $\hat{S}(t) - S(t)$ is: $$ \mathbb{V}\big{\hat{S}(t) - S(t)\big} = \frac{1}{n^2}\sum_{i=1}^{n}\mathbb{V}(\psi_{i}) + o_{p}(n^{-1}) $$ As the influence function contributions have mean zero $\mathbb{E}(\psi_{i}) = 0$, their variance $\mathbb{V}(\psi_{i}) = \mathbb{E}(\psi_{i}^{2})$. The sampling variance of $\hat{S}(t)$ can be approximated as: $$ \hat{\mathbb{V}}\big{\hat{S}(t) - S(t)\big} = \frac{1}{n} \cdot \frac{1}{n}\sum_{i=1}^{n}\psi_{i}^{2}. $$
\newpage Empirically, the estimated standard error of $\hat{S}(t)$ based on ${\hat{\psi}{i}}{i=1}^{n}$ is:
# Estimate sampling variance and standard error. sampling_var <- (1 / n) * mean(psi^2) se <- sqrt(sampling_var) show(round(se, digits = 4))
For comparison, the standard error provided by the survival
package is:
# Reference standard error. km_fit <- survival::survfit( survival::Surv(time, status) ~ 1, data = data) km_fit <- summary(km_fit) closest_time <- max(km_fit$time[km_fit$time <= tau]) ref_se <- km_fit$std.err[km_fit$time == closest_time] show(round(ref_se, digits = 4))
When estimating standard errors with survfit
, note that those provided by default (i.e. before applying summary
) are not properly scaled, see the discussion by Magirr 2022.
To estimate the standard error by perturbation bootstrap, the summand is perturbed with mean 1, variance 1 weights ${\omega_{i}^{(b)}}{i=1}^{n}$, generated from a known distribution, such as $\Gamma(1, 1)$, independently of the observed data: $$ (\theta{(b)}^{*} - \hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\hat{\psi}{i}\omega{i}^{(b)} $$
Upon taking the weighted sum, each set of weights yields a realization of $(\theta_{(b)}^{} - \hat{\theta})$. From the collection of ${(\theta_{(b)}^{} - \hat{\theta})}_{b=1}^{B}$, properties of $(\hat{\theta} - \theta)$ may be approximated, notably the standard error:
# Perturbation bootstrap. bootstraps <- 2e3 perturb <- rep(0, bootstraps) for (b in 1:bootstraps) { draw <- rgamma(n = n, shape = 1) perturb[b] <- mean(psi * draw) } perturb_se <- sd(perturb) show(round(perturb_se, digits = 4))
The case of the restricted mean survival time (RMST) is similar. Based on the simulated exponential $\lambda = 1$, the analytical RMST at time $\tau = 1$ is $\int_{0}^{1}e^{-t}dt = 1 - e^{-1} = 0.632$. The estimated RMST is:
# Estimated RMST. tau <- 1 rmst <- SurvUtils::OneSampleRMST(data, tau = tau) show(round(rmst, digits = 3))
The influence function of the RMST at time $t$ is (Andersen 1993): $$ \begin{gathered} \sqrt{n}\big{\hat{R}(t) - R(t)\big} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi_{i}(t) + o_{p}(1), \ \psi_{i}(t) = -\int_{0}^{t}\frac{\mu_{t}(u)dM_{i}(u)}{n^{-1}\sum_{i=1}^{n}Y_{i}(u)}, \ \mu_{t}(u) = \int_{u}^{t}S(t)dt. \end{gathered} $$
The influence function contributions may be estimated using RMSTInfluence
:
# Estimate influence contributions. psi <- SurvUtils::RMSTInfluence(data = data, tau = tau) head(round(psi, digits = 3))
The standard error estimated directly from the influence function is:
# Estimate sampling variance and standard error. sampling_var <- (1 / n) * mean(psi^2) se <- sqrt(sampling_var) show(round(se, digits = 4))
The standard error estimated by perturbation bootstrap is:
# Perturbation bootstrap. bootstraps <- 2e3 perturb <- rep(0, bootstraps) for (b in 1:bootstraps) { draw <- rgamma(n = n, shape = 1) perturb[b] <- mean(psi * draw) } perturb_se <- sd(perturb) show(round(perturb_se, digits = 4))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.