knitr::knit_hooks$set(
  plot_2x1 = function(before, options, envir){
    if(before){
      options$fig.width = options$fig.height * 2
      par(mfcol = c(2, 1), mar = c(5, 4, 2, 2) + 0.1)
    }
  },
  plot_2x3 = function(before, options, envir){
    if(before){
      options$fig.width = options$fig.height * 2
      par(mfcol = c(2, 3))
      }
    })

knitr::opts_chunk$set(
  echo = TRUE, fig.height = 5, dpi = 144, 
  dev = "png",
  cache = TRUE, 
  cache.path = "cache/bootstrap/",
  fig.path = "figure/bootstrap/")
library(survival)
library(dynamichazard)
library(boot)
library(timereg)

Introduction

This vignette will show how to use bootstrap a ddhazard object. It is recommended to read or skim vignette("ddhazard", "dynamichazard") first. You can get the version used to make this vignette by calling:

# install.packages("remotes")
remotes::install_github("boennecd/dynamichazard")

You can also get the latest version on CRAN by calling:

install.packages("dynamichazard")

Has to be done

CRAN requires that options is reset somewhere. Thus, we get the old settings and reset them at the end.

old_options <- options()

# set digits
options(ddhazard_max_threads = max(parallel::detectCores() - 2, 1))

TRACE

We will use the TRACE data set from the timreg package. See ?TRACE for a description of the dataset and ?timereg::aalen for a additive hazard models used with data set -- at least in version 1.9.1. Some of them are (kinda) like the models we fit next in the sequal. The TRACE data set is used here to illustrate the bootstrap methods and not as example of how to analysis the data set. I have not looked at the details of the data set or the model fits. We fit the model as follows:

library(dynamichazard)
data(TRACE, package = "timereg")
dd_fit <- ddhazard(
  Surv(time, status == 9) ~ ddFixed_intercept() + 
    ddFixed(age) + ddFixed(sex) + ddFixed(diabetes) + chf + vf,
  TRACE, max_T = 7, by = .25, model = "exponential", 
  Q_0 = diag(10, 2), Q = diag(.1^2, 2), 
  control = ddhazard_control(eps = .001, n_max = 25))

We use the exponential arrival times models with the extended Kalman filter (the default) estimation method. A plot of the estimates is given below. The dashed lines are 95% point-wise confidence intervals using the variances estimates from the Extended Kalman filter with smoothing

plot(dd_fit)
summary(dd_fit)

Sampling individuals

We can bootstrap the estimates in the model by using the ddhazard_boot function as done below:

set.seed(7451)
R <- 999 # number of bootstrap samples
boot_out <- ddhazard_boot(dd_fit, R = R) 

The list has the same structure and class as the list returned by boot::boot a few other elements:

class(boot_out)
str(boot_out)

Above, we bootstrap the model by sampling the individuals. I.e. individuals will have weights of $0,1,2,\dots$ in the estimation. We can plot 95% confidence bounds from the bootstrap coefficients with the percentile bootstrap method as follows:

plot(dd_fit, ddhazard_boot = boot_out, level = 0.95)

The completely black line is the original estimates, the dashed lines are 5% and 95% quantiles of the bootstrap coefficient taken at each point and the transparent black lines each represent a bootstrap estimate. Linear interpolation on the normal quantile scale is used if we do not have a quantile that match exactly.

Fixed effects

Recall that the fixed effects are estimated to be:

dd_fit$fixed_effects

We can get confidence bounds for these with the boot.ci function from the boot library as shown below:

library(boot)

# print confidence intervals for 
colnames(boot_out$t)[ncol(boot_out$t)] # this variable
boot.ci(
  boot_out, index = ncol(boot_out$t), type = c("norm", "basic", "perc"))

Strata

You can provide a strata variable to perform stratified sampling. This is done by setting the strata argument in the call to ddhazard_boot. Notice that this has to be on an individual level (one indicator variable per individual) not observation level (not one indicator variable per row in the data set). You can use the unique_id argument to match the individual entries with the entries in strata. Though, this is not needed for this data set as we do not have time-varying covariates. As an example, we stratify by the chf value below:

# all observations are unique. I.e. all other individuals have one record. 
# Otherwise we had to make a strata with an entry for each individual -- not 
# each record in the data.frame used in the estimation
sum(duplicated(TRACE$id))

# use strafied bootstrap
set.seed(101)
boot_out_with_strata <- ddhazard_boot(
  dd_fit, R = R, unique_id = TRACE$id, strata = TRACE$chf)

plot(dd_fit, ddhazard_boot = boot_out_with_strata, level = 0.95)

Boot envelope

We may also want to get simultaneous confidence intervals. An easy way to get such confidence intervals is with the envelope function in the boot library. For instance, we can get simultaneous confidence intervals for the vf coefficient as follows:

# find the indices that correspondents to the coefficient we want
is_vf <- grep("^vf:", colnames(boot_out$t))

# use the envelope
envelopes <- envelope(boot_out, level = 0.95 ,index = is_vf)

# plot curves
plot(dd_fit, ylim = c(-1.5, 4), 
     cov_index = grep("^vf$", colnames(dd_fit$state_vecs)))
lines(dd_fit$times, envelopes$point[1, ], col = "blue")
lines(dd_fit$times, envelopes$point[2, ], col = "blue")

lines(dd_fit$times, envelopes$overall[1, ], col = "red")
lines(dd_fit$times, envelopes$overall[2, ], col = "red")

The dashed black lines are from the smoothed covariance matrix. The blue lines are pointwise confidence intervals using the percentile method from the envelope function. The red line is the simultaneous confidence bounds using the envelope method in equation (4.17) of [@davison97]. The latter curves are formed by creating an envelope over each of the pointwise confidence intervals and hence the name.

How good is the coverage

rm(list = ls())
gc()

In this section, we will test the coverage of the pointwise confidence intervals using the smoothed covariance matrix and the bootstrap percentile method. We will test these in a simulation study where:

The simulation is to mimic a situation where we assume that the coefficients are not random (as the model implies) but we do not know the shape of the coefficient curves across time. We setup the parameters for the experiment below and plot the coefficients without noise:

if(requireNamespace("dichromat", quietly = TRUE, warn.conflicts = FALSE) && 
   requireNamespace("colorspace", quietly = TRUE, warn.conflicts = FALSE)
   ){
  cols <- dichromat::colorschemes$Categorical.12[(1:5)*2]

  cols1 <- colorspace::readhex(
    file = textConnection(paste(cols, collapse = "\n")),
    class = "RGB")

  #transform to hue/lightness/saturation colorspace
  cols1 <- as(cols1, "HLS")
  #multiplicative decrease of lightness
  cols1@coords[, "L"] <- cols1@coords[, "L"] * 0.5
  #going via rgb seems to work better  
  cols1 <- as(cols1, "RGB")
  cols <- colorspace::hex(cols1)
} else
  cols <- c("#BC5C00", "#BEBE00", "#23BC00", "#0082BC", "#3500C1")
tmax <- 22                           # Number of periods
n_start_grps <- 3                    # Number of "start group" - see text
                                     # Number of multiple of tmax - 1 in each 
mlt <- 30                            # start group
n <- (tmax - 1) * mlt * n_start_grps # Total number of individuals
n

# Define the noise free coefficients
beta <- cbind(
  x1 = rep(-2, (tmax - 1) + 1), 
  x2 = (0:(tmax - 1) - (tmax - 1)/2) / ((tmax - 1) / 2),
  x3 = ((tmax - 1):0 - (tmax - 1)/2) / ((tmax - 1) / 2),
  x4 = - sin(pi / 7 * (0:(tmax - 1))), 
  x5 = sin(pi / 7 * (0:(tmax - 1))))

# Plot noise free coefficients
cols <- c("#BC5C00", "#BEBE00", "#23BC00", "#0082BC", "#3500C1")
matplot(beta, type = "l", lty = 1, ylab = "coefficient without noise", 
        col = cols)

There will be a total of n = r n individuals in groups of three. We start observing each group at time 0, 7 and 14. I.e. we have random random delayed entry. We do so to have a "stable" number of individual through the experiment. The experiment ends after tmax = r tmax.

base_mat <- do.call(
  rbind, 
  sapply(1:n, function(i){
    n_vals <- ceiling(i / ((tmax - 1) * mlt)) * ((tmax - 1) / n_start_grps) + 1
    data.frame(id = rep(i, n_vals),
               tstart = tmax - n_vals:1)
  }, simplify = FALSE))
base_mat$tstop <- base_mat$tstart + 1
nrows <- nrow(base_mat)

beta_sd <- .1
x_range <- 1

base_mat$x1 <- 1
for(s in c("x2", "x3", "x4", "x5", "eta", "dies"))
  base_mat[s] <- NA

sim_func <- function(){
  # Simulate covariates
  sim_mat <- base_mat
  for(i in 1:n)
    sim_mat[sim_mat$id == i, c("x2", "x3", "x4", "x5")] <- 
      sapply(runif(4, min = -x_range, max = x_range), rep, times =  sum(sim_mat$id == i))

  # Add noise to coefficients
  beta_w_err <- beta + rnorm(length(beta), sd = beta_sd)

  # Compute linear predictors and simulate outcome
  for(i in 1:tmax){
    is_in <- which(base_mat$tstop == i)
    sim_mat[is_in, "eta"] <- 
      as.matrix(sim_mat[is_in, c("x1", "x2", "x3", "x4", "x5")]) %*% 
      beta_w_err[i, ]
  }
  sim_mat$dies <- 1/(1 + exp(-sim_mat$eta)) > runif(nrows)

  # Remove rows after they have died
  has_died <- unlist(tapply(
    sim_mat$dies, sim_mat$id, function(x){
      is_death <- which(x)
      if(length(is_death) == 0) rep(FALSE, length(x)) else 1:length(x) > is_death[1]
    }))
  sim_mat <- sim_mat[!has_died, ]

  # Find last time
  max_time <- tapply(sim_mat$tstop, sim_mat$id, max)

  # Remove redundant rows
  do_die <- tapply(sim_mat$dies, sim_mat$id, any)
  is_first <- unlist(tapply(sim_mat$tstart, sim_mat$id, function(x) x == min(x)))
  sim_mat <- sim_mat[is_first, ]
  sim_mat$tstop <- max_time
  sim_mat$dies <- do_die

  rownames(sim_mat) <- NULL

  list(sims = sim_mat, beta_w_err = beta_w_err)
}

We add a bit of normally distributed noise to the coefficients with mean zero and standard deviation r beta_sd. The individuals' covariates are simulated from the uniform distribution from the range [r -x_range, r x_range]. The function sim_func is used to make the simulation. The definition of the function can be found in the markdown file for this vignette on the github site. We simulate a series below, illustrate the data matrix and plot the coefficients with noise added to them:

old_digits <- getOption("digits")
options(digits = 2)
# Simulate
set.seed(122044)
sim_list <- sim_func()

# Show data matrix
head(sim_list$sims, 10) 
tail(sim_list$sims, 10)

# Plot coefficients with noise
matplot(sim_list$beta_w_err, type = "l", lty = 1, ylab = "coefficient with noise", 
        col = cols)
options(digits = old_digits)

We are now able to estimate the model as follows:

# Estimate model
fit_expression <- expression({
  fit <- ddhazard(
    Surv(tstart, tstop, dies) ~ -1 + x1 + x2 + x3 + x4 + x5,
    data = sim_list$sims, id = sim_list$sims$id, max_T = tmax, 
    by = 1, Q_0 = diag(1e4, 5), Q = diag(.1, 5),
    a_0 = rep(0, 5), control = ddhazard_control(eps = .001, n_max = 25))
})
eval(fit_expression)

# Plot estimates with pointwise confidence bounds from smoothed covariance
# matrix
for(i in 1:5){
  plot(fit, cov_index = i)
  points(sim_list$beta_w_err[, i], pch = 16, col = "red")
}

The plots shows the estimated coefficient with 95% pointwise confidence intervals from the smoothed covariance matrix. The dots are the actual values (i.e. those with noise added to them). A bootstrap estimate of the confidence bounds is made below:

```{R, plot_2x3=TRUE}

bootstrap with resampling individuals

boot_out <- ddhazard_boot(fit, R = 999)

Plot estimated confidence bounds

for(i in 1:5){ plot(fit, cov_index = i, ddhazard_boot = boot_out) points(sim_list$beta_w_err[, i], pch = 16, col = "red") }

```r
alpha <- .05

compute_coverage <- function(fit, boot_out, actual){
  # Compute coverage of error bounds from smoothed covariance matrix
  bds <- qnorm(1 - alpha / 2) * sqrt(t(apply(fit$state_vars, 3, diag))[-1, ])
  res_smooth <- colMeans(
    actual >= fit$state_vecs[-1, ] - bds & actual <= fit$state_vecs[-1, ] + bds)

  # Compute coverage from bootstrap
  enve <- sapply(1:ncol(boot_out$t), function(i){
      bt_ci <- boot.ci(boot_out, index = i, type = c("perc"), conf = 1 - alpha)
      c(bt_ci$percent[, 4:5])
    })
  enve_lb <- enve[1, ]
  enve_ub <- enve[2, ]
  dim(enve_lb) <- dim(enve_ub) <- dim(fit$state_vecs)

  res_boot <- colMeans(
    actual >= enve_lb[-1, ] & actual <= enve_ub[-1, ])

  list(smooth = res_smooth, boot = res_boot)
}

We can now pose the question how the pointwise coverage is for each coefficient. For this reason, we have defined the function compute_coverage which is not included but can be found in the markdown for this vignette on the github site:

compute_coverage(fit, boot_out, sim_list$beta_w_err)

compute_coverage outputs a list of the true coverage of the r (1 - alpha)*100% confidence intervals from the smoothed covariance matrix and the percentile method from the bootstrap. That is, the fractions of red dots from the previous plot that are within the 95% confidence interval. The two elements of the list is for the the percentile method from the bootstrap. These are respectively the smooth and boot elements of the list. We can now repeat the above M times (defined below) as follows:

set.seed(520920)
R <- 999  # Number of bootstrap estimates in each trials 
M <- 100  # Number of trials

# Define matrix for output
coverage_boot <- coverage_smooth <- matrix(
  NA_real_, nrow = M, ncol = ncol(fit$state_vecs))

# Sometimes estimations fails. We use this counter to keep track of the number
# of times
n_fails <- 0
LRs <- 1.1^(0:(-6)) # Learning rates to try in order to get a fit

# We save this as an epxression as we will re-run it later
boot_exp <- expression({
  for(i in 1:M){
    # Simulate data set
    sim_list <- sim_func()

    # Fit on whole data set
    did_succed <- F
    try({
      eval(fit_expression)
      did_succed <- T
    })
    if(!did_succed){
      n_fails <- n_fails + 1
      next
    }

    # Bootstrap fits
    boot_out <- ddhazard_boot(fit,
                              strata = as.factor(sim_list$sims$tstart),
                              do_stratify_with_event = FALSE,
                              do_sample_weights = FALSE, R = R, 
                              LRs = LRs)

    # Compute coverage and add to output
    coverage <- compute_coverage(fit, boot_out, sim_list$beta_w_err)
    coverage_smooth[i, ] <- coverage$smooth
    coverage_boot[i, ] <- coverage$boot
  }

  })
eval(boot_exp)

n_fails # number of failed estimations 

The mean of the fraction of the overages for the two methods are printed below. That is, the mean of the fraction for each coefficient from each run that did not fail:

colMeans(coverage_smooth, na.rm = TRUE)
colMeans(coverage_boot, na.rm = TRUE)

Finally, we can make a boxplot of the fraction of coverage in each trail as follows:

boxplot(coverage_smooth, ylim = c(.6, 1), main = "Smoothed covariance")
abline(h = .95, lty = 1)

boxplot(coverage_boot, ylim = c(.6, 1), main = "Bootstrap")
abline(h = .95, lty = 1)

We do alter the learning rate in the previous simulation in order to get a fit when we bootstrap. An alternative could be not to allow for this as done below where failed fits are excluded:

n_fails <- 0
LRs <- 1     # Changed to one value only

eval(boot_exp)

n_fails # number of failed estimations 

The means and box plot are given below:

colMeans(coverage_smooth, na.rm = TRUE)
colMeans(coverage_boot, na.rm = TRUE)
boxplot(coverage_smooth, ylim = c(.6, 1), main = "Smoothed covariance")
abline(h = .95, lty = 1)

boxplot(coverage_boot, ylim = c(.6, 1), main = "Bootstrap")
abline(h = .95, lty = 1)

Has to be done

We reset options here as per CRAN policy.

options(old_options)

References



boennecd/dynamichazard documentation built on Oct. 11, 2022, 2:41 p.m.