knitr::knit_hooks$set(
  plot_2x1 = function(before, options, envir){
    if(before){
      options$fig.width = options$fig.height * 2
      par(mfcol = c(2, 1), mar = c(5, 4, 2, 2) + 0.1)
    }
  },
  plot_2x3 = function(before, options, envir){
    if(before){
      options$fig.width = options$fig.height * 2
      par(mfcol = c(2, 3))
    }
  })
knitr::opts_chunk$set(
  echo = TRUE, fig.height = 5, dpi = 144, dev = "png", cache = TRUE,
  cache.path = "cache/bootstrap/", fig.path = "figure/bootstrap/")
library(survival)
library(dynamichazard)
library(boot)
library(timereg)
This vignette shows how to bootstrap a ddhazard
object. It is recommended to read or skim vignette("ddhazard", "dynamichazard")
first. You can get the version used to make this vignette by calling:
# install.packages("remotes")
remotes::install_github("boennecd/dynamichazard")
You can also get the latest version on CRAN by calling:
install.packages("dynamichazard")
CRAN requires that options
be reset at some point.
Thus, we store the old settings here and restore them at the end.
old_options <- options()
options(ddhazard_max_threads = max(parallel::detectCores() - 2, 1))
We will use the TRACE
data set from the timereg
package. See ?TRACE
for a description of the data set and ?timereg::aalen
for additive hazard models used with the data set -- at least in version 1.9.1. Some of these models are similar to the ones we fit in the sequel. The TRACE
data set is used here to illustrate the bootstrap methods and not as an example of how to analyze the data set. I have not looked at the details of the data set or the model fits. We fit the model as follows:
library(dynamichazard)
data(TRACE, package = "timereg")
dd_fit <- ddhazard(
  Surv(time, status == 9) ~ ddFixed_intercept() + ddFixed(age) +
    ddFixed(sex) + ddFixed(diabetes) + chf + vf,
  TRACE, max_T = 7, by = .25, model = "exponential",
  Q_0 = diag(10, 2), Q = diag(.1^2, 2),
  control = ddhazard_control(eps = .001, n_max = 25))
We use the exponential arrival time model with the extended Kalman filter (the default estimation method). A plot of the estimates is given below. The dashed lines are 95% pointwise confidence intervals using the variance estimates from the extended Kalman filter with smoothing:
plot(dd_fit)
summary(dd_fit)
We can bootstrap the estimates in the model by using the ddhazard_boot
function as done below:
set.seed(7451)
R <- 999 # number of bootstrap samples
boot_out <- ddhazard_boot(dd_fit, R = R)
The returned list has the same structure and class as the list returned by boot::boot
plus a few additional elements:
class(boot_out)
str(boot_out)
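To see what structure ddhazard_boot mimics, here is a generic boot::boot run on a simple statistic (the mean of a small simulated sample; the data and statistic are made up for illustration). The key elements are t0, the estimate on the original sample, and t, a matrix with one row per bootstrap replicate:

```r
library(boot)

set.seed(1)
x <- rnorm(50)
# statistic takes the data and an index vector of the resampled cases
b <- boot(x, statistic = function(data, idx) mean(data[idx]), R = 199)

class(b)  # "boot"
dim(b$t)  # 199 replicates of a one-dimensional statistic
b$t0      # the estimate on the original sample
```

ddhazard_boot returns the coefficients at all time points side by side, so its t matrix has one column per coefficient per time point rather than a single column.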
Above, we bootstrap the model by sampling the individuals. I.e. individuals will have weights of $0,1,2,\dots$ in the estimation. We can plot 95% confidence bounds from the bootstrapped coefficients with the percentile bootstrap method as follows:
plot(dd_fit, ddhazard_boot = boot_out, level = 0.95)
The solid black lines are the original estimates, the dashed lines are the 5% and 95% quantiles of the bootstrapped coefficients taken at each point, and the transparent black lines each represent a bootstrap estimate. Linear interpolation on the normal quantile scale is used if no quantile matches exactly.
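The pointwise percentile bounds can be sketched on a synthetic replicate matrix; each column plays the role of one coefficient at one time point, and we take empirical quantiles column by column. (This uses R's default quantile interpolation as a stand-in; boot's own norm.inter interpolates order statistics on the normal quantile scale, as described above.)

```r
set.seed(2)
# 999 synthetic bootstrap replicates of 4 "coefficients"
t_mat <- matrix(rnorm(999 * 4), nrow = 999, ncol = 4)

# lower and upper percentile bounds per column
ci <- apply(t_mat, 2, quantile, probs = c(.025, .975))
ci # a 2 x 4 matrix: first row lower bounds, second row upper bounds
```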
Recall that the fixed effects are estimated to be:
dd_fit$fixed_effects
We can get confidence bounds for these with the boot.ci
function from the boot
library as shown below:
library(boot)

# print confidence intervals for this variable
colnames(boot_out$t)[ncol(boot_out$t)]

boot.ci(
  boot_out, index = ncol(boot_out$t), type = c("norm", "basic", "perc"))
You can provide a strata variable to perform stratified sampling. This is done by setting the strata
argument in the call to ddhazard_boot
. Notice that this has to be on the individual level (one stratum indicator per individual), not the observation level (not one indicator per row in the data set). You can use the unique_id
argument to match the individual entries with the entries in strata
. This is not needed for this data set, though, as we do not have time-varying covariates. As an example, we stratify by the chf
value below:
# all observations are unique, i.e. all individuals have one record.
# Otherwise we would have to make a strata vector with one entry per
# individual -- not one per record in the data.frame used in the estimation
sum(duplicated(TRACE$id))

# use stratified bootstrap
set.seed(101)
boot_out_with_strata <- ddhazard_boot(
  dd_fit, R = R, unique_id = TRACE$id, strata = TRACE$chf)

plot(dd_fit, ddhazard_boot = boot_out_with_strata, level = 0.95)
We may also want simultaneous confidence intervals. An easy way to get them is with the envelope
function from the boot
library. For instance, we can get simultaneous confidence intervals for the vf
coefficient as follows:
# find the indices that correspond to the coefficient we want
is_vf <- grep("^vf:", colnames(boot_out$t))

# use the envelope function
envelopes <- envelope(boot_out, level = 0.95, index = is_vf)

# plot curves
plot(dd_fit, ylim = c(-1.5, 4),
     cov_index = grep("^vf$", colnames(dd_fit$state_vecs)))
lines(dd_fit$times, envelopes$point[1, ], col = "blue")
lines(dd_fit$times, envelopes$point[2, ], col = "blue")
lines(dd_fit$times, envelopes$overall[1, ], col = "red")
lines(dd_fit$times, envelopes$overall[2, ], col = "red")
The dashed black lines are from the smoothed covariance matrix. The blue lines are pointwise confidence intervals using the percentile method from the envelope
function. The red lines are the simultaneous confidence bounds using the envelope method in equation (4.17) of [@davison97]. The latter curves are formed by creating an envelope over each of the pointwise confidence intervals, hence the name.
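The overall-envelope idea can be sketched as follows: start from pointwise order-statistic bounds and widen them until the required fraction of bootstrap curves lies entirely inside. This is a minimal sketch in the spirit of boot::envelope's overall bounds on a synthetic curve matrix, not its exact algorithm:

```r
set.seed(3)
R <- 999; tp <- 20
curves <- matrix(rnorm(R * tp), R, tp)  # each row is one bootstrap curve
sorted <- apply(curves, 2, sort)        # order statistics per time point

level <- .95
for(k in floor(R * .025):1){            # widen from the pointwise choice
  lo <- sorted[k, ]                     # k-th smallest at each time point
  hi <- sorted[R + 1 - k, ]             # k-th largest at each time point
  # a curve is inside if it never crosses either bound
  inside <- rowSums(curves < rep(lo, each = R) |
                    curves > rep(hi, each = R)) == 0
  if(mean(inside) >= level)
    break
}
mean(inside) # simultaneous coverage of the envelope, at least 0.95
```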
rm(list = ls())
gc()
In this section, we test the coverage of the pointwise confidence intervals from the smoothed covariance matrix and from the bootstrap percentile method in a simulation study. The simulation mimics a situation where we assume that the coefficients are not random (unlike what the model assumes) but we do not know the shape of the coefficient curves across time. We set up the parameters for the experiment below and plot the coefficients without noise:
if(requireNamespace("dichromat", quietly = TRUE, warn.conflicts = FALSE) &&
   requireNamespace("colorspace", quietly = TRUE, warn.conflicts = FALSE)){
  cols <- dichromat::colorschemes$Categorical.12[(1:5)*2]
  cols1 <- colorspace::readhex(
    file = textConnection(paste(cols, collapse = "\n")),
    class = "RGB")
  # transform to hue/lightness/saturation colorspace
  cols1 <- as(cols1, "HLS")
  # multiplicative decrease of lightness
  cols1@coords[, "L"] <- cols1@coords[, "L"] * 0.5
  # going via rgb seems to work better
  cols1 <- as(cols1, "RGB")
  cols <- colorspace::hex(cols1)
} else
  cols <- c("#BC5C00", "#BEBE00", "#23BC00", "#0082BC", "#3500C1")
tmax <- 22        # Number of periods
n_start_grps <- 3 # Number of "start groups" - see text
# Number of multiples of tmax - 1 in each start group
mlt <- 30
n <- (tmax - 1) * mlt * n_start_grps # Total number of individuals
n

# Define the noise free coefficients
beta <- cbind(
  x1 = rep(-2, (tmax - 1) + 1),
  x2 = (0:(tmax - 1) - (tmax - 1)/2) / ((tmax - 1) / 2),
  x3 = ((tmax - 1):0 - (tmax - 1)/2) / ((tmax - 1) / 2),
  x4 = - sin(pi / 7 * (0:(tmax - 1))),
  x5 = sin(pi / 7 * (0:(tmax - 1))))

# Plot noise free coefficients
matplot(beta, type = "l", lty = 1, ylab = "coefficient without noise",
        col = cols)
There will be a total of n = `r n` individuals in three start groups. We start observing the groups at times 0, 7, and 14, respectively. I.e. we have delayed entry. We do so to have a "stable" number of individuals through the experiment. The experiment ends after tmax = `r tmax` periods.
base_mat <- do.call(
  rbind,
  sapply(1:n, function(i){
    n_vals <- ceiling(i / ((tmax - 1) * mlt)) *
      ((tmax - 1) / n_start_grps) + 1
    data.frame(id = rep(i, n_vals), tstart = tmax - n_vals:1)
  }, simplify = FALSE))
base_mat$tstop <- base_mat$tstart + 1
nrows <- nrow(base_mat)

beta_sd <- .1
x_range <- 1

base_mat$x1 <- 1
for(s in c("x2", "x3", "x4", "x5", "eta", "dies"))
  base_mat[s] <- NA

sim_func <- function(){
  # Simulate covariates
  sim_mat <- base_mat
  for(i in 1:n)
    sim_mat[sim_mat$id == i, c("x2", "x3", "x4", "x5")] <-
      sapply(runif(4, min = -x_range, max = x_range), rep,
             times = sum(sim_mat$id == i))

  # Add noise to coefficients
  beta_w_err <- beta + rnorm(length(beta), sd = beta_sd)

  # Compute linear predictors and simulate outcome
  for(i in 1:tmax){
    is_in <- which(base_mat$tstop == i)
    sim_mat[is_in, "eta"] <- as.matrix(
      sim_mat[is_in, c("x1", "x2", "x3", "x4", "x5")]) %*% beta_w_err[i, ]
  }
  sim_mat$dies <- 1/(1 + exp(-sim_mat$eta)) > runif(nrows)

  # Remove rows after they have died
  has_died <- unlist(tapply(
    sim_mat$dies, sim_mat$id, function(x){
      is_death <- which(x)
      if(length(is_death) == 0)
        rep(FALSE, length(x)) else 1:length(x) > is_death[1]
    }))
  sim_mat <- sim_mat[!has_died, ]

  # Find last time
  max_time <- tapply(sim_mat$tstop, sim_mat$id, max)

  # Remove redundant rows
  do_die <- tapply(sim_mat$dies, sim_mat$id, any)
  is_first <- unlist(tapply(
    sim_mat$tstart, sim_mat$id, function(x) x == min(x)))
  sim_mat <- sim_mat[is_first, ]
  sim_mat$tstop <- max_time
  sim_mat$dies <- do_die
  rownames(sim_mat) <- NULL

  list(sims = sim_mat, beta_w_err = beta_w_err)
}
We add a bit of normally distributed noise with mean zero and standard deviation `r beta_sd` to the coefficients. The individuals' covariates are simulated from the uniform distribution on the interval [`r -x_range`, `r x_range`]. The function sim_func
is used to make the simulation. The definition of the function can be found in the R Markdown file for this vignette on the GitHub site. We simulate a series below, show the data matrix, and plot the coefficients with noise added to them:
old_digits <- getOption("digits")
options(digits = 2)
# Simulate
set.seed(122044)
sim_list <- sim_func()

# Show data matrix
head(sim_list$sims, 10)
tail(sim_list$sims, 10)

# Plot coefficients with noise
matplot(sim_list$beta_w_err, type = "l", lty = 1,
        ylab = "coefficient with noise", col = cols)
options(digits = old_digits)
We are now able to estimate the model as follows:
# Estimate model
fit_expression <- expression({
  fit <- ddhazard(
    Surv(tstart, tstop, dies) ~ -1 + x1 + x2 + x3 + x4 + x5,
    data = sim_list$sims, id = sim_list$sims$id, max_T = tmax, by = 1,
    Q_0 = diag(1e4, 5), Q = diag(.1, 5), a_0 = rep(0, 5),
    control = ddhazard_control(eps = .001, n_max = 25))
})
eval(fit_expression)

# Plot estimates with pointwise confidence bounds from smoothed covariance
# matrix
for(i in 1:5){
  plot(fit, cov_index = i)
  points(sim_list$beta_w_err[, i], pch = 16, col = "red")
}
The plots show the estimated coefficients with 95% pointwise confidence intervals from the smoothed covariance matrix. The red dots are the actual values (i.e. those with noise added to them). A bootstrap estimate of the confidence bounds is made below:
```{R, plot_2x3=TRUE}
boot_out <- ddhazard_boot(fit, R = 999)

for(i in 1:5){
  plot(fit, cov_index = i, ddhazard_boot = boot_out)
  points(sim_list$beta_w_err[, i], pch = 16, col = "red")
}
```

```r
alpha <- .05
compute_coverage <- function(fit, boot_out, actual){
  # Compute coverage of error bounds from smoothed covariance matrix
  bds <- qnorm(1 - alpha / 2) *
    sqrt(t(apply(fit$state_vars, 3, diag))[-1, ])
  res_smooth <- colMeans(
    actual >= fit$state_vecs[-1, ] - bds &
    actual <= fit$state_vecs[-1, ] + bds)

  # Compute coverage from bootstrap
  enve <- sapply(1:ncol(boot_out$t), function(i){
    bt_ci <- boot.ci(boot_out, index = i, type = c("perc"),
                     conf = 1 - alpha)
    c(bt_ci$percent[, 4:5])
  })
  enve_lb <- enve[1, ]
  enve_ub <- enve[2, ]
  dim(enve_lb) <- dim(enve_ub) <- dim(fit$state_vecs)
  res_boot <- colMeans(
    actual >= enve_lb[-1, ] & actual <= enve_ub[-1, ])

  list(smooth = res_smooth, boot = res_boot)
}
```
We can now ask how good the pointwise coverage is for each coefficient. For this purpose, we have defined the function compute_coverage
, which can also be found in the R Markdown file for this vignette on the GitHub site:
compute_coverage(fit, boot_out, sim_list$beta_w_err)
compute_coverage
outputs a list with the true coverage of the `r (1 - alpha)*100`% confidence intervals from the smoothed covariance matrix and from the percentile method of the bootstrap. That is, the fractions of red dots from the previous plots that are within the confidence intervals. These are respectively the smooth
and boot
elements of the list. We can now repeat the above M
times (defined below) as follows:
set.seed(520920)
R <- 999 # Number of bootstrap estimates in each trial
M <- 100 # Number of trials

# Define matrices for output
coverage_boot <- coverage_smooth <- matrix(
  NA_real_, nrow = M, ncol = ncol(fit$state_vecs))

# Sometimes estimation fails. We use this counter to keep track of the
# number of times it does
n_fails <- 0

LRs <- 1.1^(0:(-6)) # Learning rates to try in order to get a fit

# We save this as an expression as we will re-run it later
boot_exp <- expression({
  for(i in 1:M){
    # Simulate data set
    sim_list <- sim_func()

    # Fit on whole data set
    did_succed <- F
    try({
      eval(fit_expression)
      did_succed <- T
    })
    if(!did_succed){
      n_fails <- n_fails + 1
      next
    }

    # Bootstrap fits
    boot_out <- ddhazard_boot(
      fit, strata = as.factor(sim_list$sims$tstart),
      do_stratify_with_event = FALSE, do_sample_weights = FALSE, R = R,
      LRs = LRs)

    # Compute coverage and add to output
    coverage <- compute_coverage(fit, boot_out, sim_list$beta_w_err)
    coverage_smooth[i, ] <- coverage$smooth
    coverage_boot[i, ] <- coverage$boot
  }
})
eval(boot_exp)
n_fails # number of failed estimations
The mean coverage fractions for the two methods are printed below. That is, the mean of the fractions for each coefficient from each run that did not fail:
colMeans(coverage_smooth, na.rm = TRUE)
colMeans(coverage_boot, na.rm = TRUE)
Finally, we can make a boxplot of the coverage fractions in each trial as follows:
boxplot(coverage_smooth, ylim = c(.6, 1), main = "Smoothed covariance")
abline(h = .95, lty = 1)

boxplot(coverage_boot, ylim = c(.6, 1), main = "Bootstrap")
abline(h = .95, lty = 1)
We alter the learning rates in the previous simulation in order to get a fit when we bootstrap. An alternative is to not allow this, as done below, where failed fits are excluded instead:
n_fails <- 0
LRs <- 1 # Changed to one value only
eval(boot_exp)
n_fails # number of failed estimations
The means and box plot are given below:
colMeans(coverage_smooth, na.rm = TRUE)
colMeans(coverage_boot, na.rm = TRUE)
boxplot(coverage_smooth, ylim = c(.6, 1), main = "Smoothed covariance")
abline(h = .95, lty = 1)

boxplot(coverage_boot, ylim = c(.6, 1), main = "Bootstrap")
abline(h = .95, lty = 1)
We reset options
here as per CRAN policy.
options(old_options)