knitr::opts_chunk$set(
  echo = TRUE, 
  fig.height = 4, fig.width = 7, dpi = 128,
  cache.path = "cache/firstOrderVAR-cache/", fig.path = "fig/firstOrderVAR-fig/", 
  error = FALSE)
options(digits = 4, scipen = 10, width = 70)

Introduction

We will simulate and estimate a first-order vector auto-regression model in this example using the particle filter and smoother. For details, see this vignette, which can also be found by calling vignette("Particle_filtering", package = "dynamichazard"). The models we are going to simulate from and estimate are of the form

$$ \begin{split} y_{it} &\sim g(\cdot\vert\eta_{it}) & \\ \vec{\eta}_t &= X_tR^+\vec{\alpha}_t + Z_t\vec{\beta} + \vec{o}_t \\ \vec{\alpha}_t &= F\vec{\alpha}_{t - 1} + R\vec{\epsilon}_t & \quad \vec{\epsilon}_t \sim N(\vec{0}, Q) \\ & & \quad \vec{\alpha}_0 \sim N(\vec{a}_0, Q_0) \end{split}, \qquad \begin{array}{l} i = 1, \dots, n_t \\ t = 1, \dots, d \end{array} $$

where $y_{it}$ is individual $i$'s indicator at time $t$ for whether the individual dies in the interval $(t - 1, t]$. The indicators, $y_{it}$, are binomially distributed with the complementary log-log link function conditional on the log times at risk, $\vec{o}_1,\dots,\vec{o}_d$, the covariates, $X_t$ and $Z_t$, and the latent states, $\vec{\alpha}_1,\dots,\vec{\alpha}_d$. The total survival time of individual $i$ is $T_i$, which is piecewise-constant exponentially distributed conditional on the latent states. Further, we set $Z_t = X_t$ so the states have a non-zero mean. The true values are

$$ F = \begin{pmatrix}0.9 & 0 \\ 0 & 0.9 \end{pmatrix}, \quad Q = \begin{pmatrix}0.33^2 & 0 \\ 0 & 0.33^2 \end{pmatrix}, \quad R = I_2, \quad \vec{\beta} = (-6.5, -2)^\top $$

git_key <- system("git rev-parse --short HEAD", intern = TRUE)
git_bra <- system("git branch", intern = TRUE)
regexp <- "^(\\*\ )(.+)$"
git_bra <- git_bra[grepl(regexp, git_bra)]
git_bra <- gsub(regexp, "\\2", git_bra)

where $I_2$ is the two-dimensional identity matrix and $\vec{a}_0$ and $Q_0$ are given by the invariant distribution. The unknown parameters to be estimated are everything but $Q_0$ and $R$, since we fix $Q_0$ during the estimation and set $\vec{a}_0 = (0, 0)^\top$. This example is run on the git branch "r git_bra" with ID "r git_key". The code can be found on the GitHub site for the package, along with all functions whose assignments are not shown and which are not in the dynamichazard package.
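
To make the observation model concrete (a standard identity restated here, not taken from the original text): with a piecewise constant hazard and the log time at risk entering $\eta_{it}$ as an offset through $\vec{o}_t$, the probability that individual $i$ has an event in $(t - 1, t]$ given that the individual is at risk there is

$$ P(y_{it} = 1 \mid \vec{\alpha}_t) = 1 - \exp\left(-\exp(\eta_{it})\right) $$

that is, a binomial model with the complementary log-log link.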

Simulation

We start by simulating the data. Feel free to skip this part as the specifications are given above. First we assign the parameters for the simulation

library(dynamichazard)
n_obs     <- 1000L
n_periods <- 200L

Fmat <- matrix(c(.9, 0, 0, .9), 2)
Rmat <- diag(1    , 2)
Qmat <- diag(.33^2, 2)
Q_0  <- get_Q_0(Qmat, Fmat)
beta <- c(-6.5, -2)

get_Q_0 is a function which returns the covariance matrix of the invariant distribution of the state equation; a quick check of this matrix follows before we simulate and plot the latent states.
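
The check below is not part of the original example: $Q_0$ should satisfy the discrete Lyapunov equation $Q_0 = FQ_0F^\top + RQR^\top$, which with $R = I_2$ we can verify directly

# should be TRUE: Q_0 solves the discrete Lyapunov equation
all.equal(Q_0, Fmat %*% Q_0 %*% t(Fmat) + Qmat)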

set.seed(54432125)
betas <- matrix(nrow = n_periods + 1, ncol = 2)
betas[1, ] <- rnorm(2) %*% chol(Q_0)
for(i in 2:(n_periods + 1))
  betas[i, ] <- Fmat %*% betas[i - 1, ] + drop(rnorm(2) %*% chol(Qmat))

# add the fixed effects so each row of betas is the coefficient vector used below
betas <- t(t(betas) + beta)

# plot of latent variables
cols <- c("black", "darkblue")
matplot(betas, type = "l", lty = 1, col = cols)
for(i in 1:2)
  abline(h = beta[i], lty = 2, col = cols[i])

We simulate the observations as follows

df <- replicate(n_obs, {
  # delayed entry (left-truncation)
  tstart <- max(0L, sample.int((n_periods - 1L) * 2L, 1) - n_periods + 1L)

  # covariates
  x <- runif(1, -1, 1)
  covars <- c(1, x)

  # outcome (stop time and event indicator)
  y <- FALSE
  for(tstop in (tstart + 1L):n_periods){
    fail_time <- rexp(1) / exp(covars %*% betas[tstop + 1L, ])
    if(fail_time <= 1){
      y <- TRUE
      tstop <- tstop - 1L + fail_time
      break
    }
  }

  c(tstart = tstart, tstop = tstop, x = x, y = y)
})
df <- data.frame(t(df))
head(df, 10)

We stagger the entry times (delayed entry, i.e., left-truncation) since we may otherwise end up with few individuals at risk towards the end.
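
As a small illustration of this point (not part of the original code), we can count how many individuals are still at risk in the final period

# number of individuals at risk in the last interval (n_periods - 1, n_periods]
sum(df$tstop > n_periods - 1)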

Model without latent variables

We can fit a model without the latent variables (i.e., a constant coefficient model) as follows

surv_fit <- survreg(Surv(tstop - tstart, y) ~ x, df, dist = "exponential")
summary(surv_fit) # signs are flipped
logLik(surv_fit)

The signs are flipped as stated in help("survreg"). They are, though, close to $\vec{\beta}$ as expected. Further, we can compare the log-likelihood of this model with the log-likelihood approximation we get from the particle filter in the next section.
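
As a short illustration (not in the original code): survreg fits an accelerated failure time model, so for the exponential distribution the hazard is $\exp(-\vec{x}^\top\vec{b})$ where $\vec{b}$ are the survreg coefficients. Hence, the negated coefficients should be comparable to $\vec{\beta}$

# the negated survreg coefficients should be roughly equal to beta
-coef(surv_fit)
beta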

Particle filter and smoother

We start off by using the generalized two-filter smoother from @briers09. We estimate the parameters with a Monte Carlo EM (MCEM) algorithm by calling PF_EM as follows

n_threads <- 6

# you can replace this with e.g.,
# max(parallel::detectCores(logical = FALSE), 2)
set.seed(30520116)
system.time(pf_Brier <- PF_EM(
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE), df,
  Q_0 = diag(1, 2), Q = diag(1, 2), Fmat = matrix(c(.1, 0, 0, .1), 2), 
  by = 1, type = "VAR", model = "exponential", max_T = n_periods,
  control = PF_control(
    N_fw_n_bw = 750, N_smooth = 1, N_first = 2000, eps = .001,
    method = "AUX_normal_approx_w_cloud_mean", fix_seed = FALSE,
    n_max = 100, smoother = "Brier_O_N_square",
    # we add some extra variation to the proposal distributions
    nu = 8, covar_fac = 1.1, n_threads = n_threads)))

system.time is used to show the computation time. The estimated parameters are

pf_Brier$F
show_covar <- function(Q){
  cat("Standard deviations\n")
  print(sqrt(diag(Q)))
  cat("Lower correlation matrix\n")
  tmp <- cov2cor(Q)
  print(tmp[-1, -ncol(tmp)])
}
show_covar(pf_Brier$Q)
pf_Brier$fixed_effects

The effective sample size in the smoothing step at each time point is

plot(pf_Brier$effective_sample_size$smoothed_clouds, type = "h", 
     ylim = range(0, pf_Brier$effective_sample_size$smoothed_clouds), 
     ylab = "Effective sample size")
plot_cloud <- function(
  pf_fit, type = "smoothed_clouds", start_at_zero = FALSE){
  par_old <- par(no.readonly = TRUE)
  on.exit(par(par_old))

  # plot random effects
  jpeg(tempfile()) # avoid output from the plot
  out <- plot(pf_fit, type = type, col = c("black", "blue"))
  dev.off()
  par(par_old)

  # then we use it and add the fixed effects
  y <- t(t(out$mean) + pf_fit$fixed_effects)
  ylim <- range(betas, y)
  x <- if(start_at_zero) 1:nrow(out$mean) - 1  else 1:nrow(out$mean)
  matplot(x, y, type = "l",
          lty = 1, col = c("black", "blue"), ylim = ylim)

  # add the true curves
  matplot(0:n_periods, betas, type = "l", add = TRUE,
          lty = 2, col = c("black", "blue"), ylim = ylim)

  # show places where we had a hard time sampling
  ess <- pf_fit$effective_sample_size[[type]]
  idx <- which(ess <= 100)
  for(i in idx - start_at_zero)
    abline(v = i, lty = 2)
  invisible(ess)
}

The smoothed estimates of the latent states look as follows (the vertical dashed lines are points where we did not sample well)

plot_cloud(pf_Brier)

The dashed non-vertical lines are the true curves and the continuous lines are the smoothed estimates. The approximate log-likelihoods from the particle filter at each EM iteration are

plot(pf_Brier$log_likes)

We can compare the output above with the smoother from @fearnhead10

set.seed(30520116)
system.time(pf_Fear <- PF_EM(
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE), df,
  Q_0 = diag(1, 2), Q = diag(1, 2), Fmat = matrix(c(.1, 0, 0, .1), 2), 
  by = 1, type = "VAR", model = "exponential", max_T = n_periods,
  control = PF_control(
    N_fw_n_bw = 1000, N_smooth = 2000, N_first = 2000, eps = .001,
    method = "AUX_normal_approx_w_cloud_mean", fix_seed = FALSE,
    n_max = 100, smoother = "Fearnhead_O_N",
    nu = 8, covar_fac = 1.1, n_threads = n_threads)))

The estimates are similar

pf_Fear$F
show_covar(pf_Fear$Q)
pf_Fear$fixed_effects

The effective sample sizes are better now, though

plot(pf_Fear$effective_sample_size$smoothed_clouds, type = "h", 
     ylim = range(0, pf_Fear$effective_sample_size$smoothed_clouds), 
     ylab = "Effective sample size")

However, there are many options to reduce the computation time of the former smoother, as mentioned in, e.g., @briers09, but these are not implemented. The smoothed estimates are also rather close to what we saw before

plot_cloud(pf_Fear)

The log-likelihoods are almost the same

plot(pf_Fear$log_likes)

The log-likelihoods are flat toward the end and not monotonically increasing

plot(tail(pf_Fear$log_likes, 20))

This cannot happen in an exact EM algorithm but may happen in an MCEM algorithm due to the Monte Carlo error. We may want to take a few more EM iterations with more particles. We can do this as follows

set.seed(30520116)
pf_Fear_more <- PF_EM(
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE), df,
  Q_0 = diag(1, 2), Q = pf_Fear$Q, Fmat = pf_Fear$F, 
  fixed_effects = pf_Fear$fixed_effects, by = 1, type = "VAR", 
  model = "exponential", max_T = n_periods,
  control = PF_control(
    N_fw_n_bw = 5000, N_smooth = 10000, N_first = 5000, eps = .001,
    method = "AUX_normal_approx_w_cloud_mean", fix_seed = FALSE,
    n_max = 10, smoother = "Fearnhead_O_N",   
    nu = 8, covar_fac = 1.1, n_threads = n_threads))

The log-likelihoods from these iterations are

plot(pf_Fear_more$log_likes)

and the final estimates are

pf_Fear_more$F
show_covar(pf_Fear_more$Q)
pf_Fear_more$fixed_effects

Averaging

Averaging is sometimes argued for in MCEM algorithms; see, e.g., @cappe05, section 11.1.2.2. The idea is to take an average of the estimates or of the sufficient statistics after some iteration number. As of this writing, the present implementation does not allow one to change the number of particles at each iteration as suggested in @cappe05.
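
A rough sketch of the idea (the notation is ours, not the original's): with parameter estimates $\hat{\theta}^{(1)}, \hat{\theta}^{(2)}, \dots$ from the MCEM iterations and averaging starting after iteration $k_0$ (the averaging_start argument below), one reports

$$ \bar{\theta}^{(k)} = \frac{1}{k - k_0}\sum_{j = k_0 + 1}^{k} \hat{\theta}^{(j)} $$

instead of the last iterate, which reduces the Monte Carlo noise in the final estimate.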

set.seed(30520116)
system.time(pf_Fear_avg <- PF_EM(
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE), df,
  Q_0 = diag(1, 2), Q = diag(1, 2), Fmat = matrix(c(.1, 0, 0, .1), 2), 
  by = 1, type = "VAR", model = "exponential", max_T = n_periods, 
  control = PF_control(
    N_fw_n_bw = 200, N_smooth = 400, N_first = 1000, eps = 1e-5,
    method = "AUX_normal_approx_w_cloud_mean", fix_seed = FALSE, 
    n_max = 1000L, smoother = "Fearnhead_O_N", averaging_start = 150L,
    nu = 8, covar_fac = 1.1, n_threads = n_threads)))
pf_Fear_avg$F
show_covar(pf_Fear_avg$Q)
pf_Fear_avg$fixed_effects

plot(pf_Fear_avg$log_likes, type = "l")

Stochastic gradient descent

We will run stochastic gradient descent as in @polyak92, which is described in @cappe05 [page 414].
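
The sgd helper defined below uses the Adam update of Kingma and Ba (2015). As a sketch of what the code does: with score $s_k$ at iteration $k$, decay rates $\beta_1$ (the mp argument) and $\beta_2$ (vp), learning rate $\gamma$ (lr), and $\epsilon = 10^{-8}$, the update of a parameter vector $\theta$ is

$$ \begin{split} m_k &= \beta_1 m_{k - 1} + (1 - \beta_1) s_k, \quad v_k = \beta_2 v_{k - 1} + (1 - \beta_2) s_k^2 \\ \hat{m}_k &= m_k / (1 - \beta_1^k), \quad \hat{v}_k = v_k / (1 - \beta_2^k) \\ \theta_k &= \theta_{k - 1} + \gamma \hat{m}_k / (\sqrt{\hat{v}_k} + \epsilon) \end{split} $$

with a plus sign since we maximize the log-likelihood. The code additionally halves the step when the proposed $F$ is not stationary or the proposed $Q$ is not positive definite.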

#####
# fit to start from
set.seed(30520116)
pf_start <- PF_EM(
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE), df,
  Q_0 = diag(1, 2), Q = diag(1, 2), Fmat = matrix(c(.1, 0, 0, .1), 2), 
  by = 1, type = "VAR", model = "exponential", max_T = n_periods, 
  control = PF_control(
    N_fw_n_bw = 200, N_smooth = 400, N_first = 1000, eps = 1e-5,
    method = "AUX_normal_approx_w_cloud_mean", fix_seed = FALSE,
    n_max = 1L, smoother = "Fearnhead_O_N",
    nu = 8, covar_fac = 1.1, n_threads = n_threads))
# Function to perform stochastic gradient descent. Uses the Adam algorithm
# suggested by Kingma and Ba (2015).
# 
# Args: 
#   object: an object of class PF_EM. 
#   n_runs: number of iterations. 
#   ns: number of particles at each iteration. 
#   debug: TRUE if information should be printed during estimation. 
#   use_O_n_sq: TRUE if O(N^2) method should be used. 
#   lr: learning rate. 
#   mp: decay rate for first moment. 
#   vp: decay rate for second moment.
# 
# Returns: 
#   list with final estimates, log-likelihood approximations at each iteration,
#   and estimates at each iteration.
sgd <- function(object, n_runs, ns, debug = FALSE, use_O_n_sq = FALSE, 
                lr = .001, mp = .9, vp = .999)
{
  #####
  # get object to perform computation and matrices and vectors for output
  comp_obj <- PF_get_score_n_hess(object, use_O_n_sq = use_O_n_sq)
  state <- matrix(
    NA_real_, n_runs + 1L, length(object$F) + length(object$Q))

  # setup matrix to map from gradient to the state (the covariance part is only
  # the lower diagonal)
  dF <- length(object$F)
  dQ <- NCOL(object$Q)
  nQ <- (dQ * (dQ + 1L)) / 2L
  K <- matrix(0., ncol(state), length(object$F) + nQ)
  K[  1:dF ,   1:dF] <- diag(dF)
  # one can use e.g., the matrixcalc package instead to get the duplication matrix
  K[-(1:dF), -(1:dF)] <- dynamichazard:::.get_dup_mat(dQ)

  dfix <- length(object$fixed_effects)
  obs   <- matrix(
    NA_real_, n_runs + 1L, dfix)
  state[1, ] <- c(object$F, object$Q)
  obs  [1, ] <- object$fixed_effects
  lls <- rep(NA_real_, n_runs)

  mv <- vv <- rep(0., ncol(K) + dfix)

  for(i in 1:n_runs){
    comp_obj$set_n_particles(N_fw = ns[i], N_first = ns[i])
    fw <- comp_obj$run_particle_filter()
    lls[i] <- logLik(fw)
    score <- comp_obj$get_get_score_n_hess(only_score = TRUE)
    sc <- score$score

    mv <- mp * mv + (1 - mp) * sc
    vv <- vp * vv + (1 - vp) * sc^2
    muse <- mv / (1 - mp^i)
    vuse <- vv / (1 - vp^i)
    direc <- muse / (sqrt(vuse) + 1e-8)

    # step-half until we have a valid set of parameters
    n_halfs <- 0L
    n_halfs_max <- 25L
    lri <- lr
    while((n_halfs <- n_halfs + 1L) <= n_halfs_max){
      state_i <- 
        state[i + 1L, ]  <- state[i, ]  + lri * drop(K %*% direc[-(1:dfix)])
      obs_i   <- 
        obs  [i + 1L, ]  <- obs  [i, ]  + lri * direc[1:dfix]
      lri <- lri / 2

      # check that system is stationary
      Ftmp <- matrix(state_i[1:dQ^2], dQ, dQ)
      if(any(Mod(eigen(Ftmp)$values) >= 1))
        next

      # set parameters
      o <- comp_obj$set_parameters(state = state_i, obs = obs_i)

      # Q is positive definite 
      if(all(eigen(o$Q)$values > 0.))
        break
    }

    if(n_halfs > n_halfs_max)
      stop("step halvings failed")

    if(debug){
      msg <- paste0(sprintf("It %4d Log-likelihood %10.2f (max %10.2f)", i, 
                            lls[i], max(lls, na.rm = TRUE))) 
      cat(msg, "\n", rep("-", nchar(msg)), "\n", sep = "")
      cat(sprintf("Gradient norm %14.4f\n", norm(t(sc))))
      print(o$Fmat)
      print(o$Q)
      print(o$fixed_effects)
      cat("\n")
    }
  }

  list(o = o, lls = lls, state = state, obs = obs)
}

# number of iterations
n_runs <- 400L
# number of particles to use
ns <- ceiling(exp(seq(log(100), log(1000), length.out = n_runs)))

out <- sgd(n_runs = n_runs, ns = ns, object = pf_start, lr = .005)
lls <- out$lls
o <- out$o
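
Since @polyak92 concerns averaging of the iterates, one may (as a hedged suggestion not used in the original text) also average, say, the last 100 parameter iterates stored in out$state and out$obs to reduce the Monte Carlo noise

# average the last 100 iterates; the columns of out$state hold c(F, Q) and the
# columns of out$obs hold the fixed effects
state_avg <- colMeans(tail(out$state, 100))
obs_avg   <- colMeans(tail(out$obs,   100))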

The results are illustrated below

# approximate log-likelihood at each iteration
par(mar = c(5, 4, 1, 1))
plot(lls, xlab = "Iteration", ylab = "Log-likelihood", type = "l")
print(max(lls), digits = 5)

# the estimates
o$Fmat
show_covar(o$Q)
o$fixed_effects

We can also use the method from @Poyiadjis11. The present implementation scales poorly in the number of particles, but the method yields estimates whose variance may grow linearly in the number of time periods instead of at least quadratically. This is important for long time series.

# number of iterations
n_runs <- 400L
# number of particles to use
ns <- ceiling(exp(seq(log(200), log(1000), length.out = n_runs)))

out <- sgd(n_runs = n_runs, ns = ns, use_O_n_sq = TRUE,  object = pf_start, 
           lr = .005)
lls <- out$lls
o <- out$o

The results are illustrated below (the log-likelihood approximations from the forward particle filter are poor due to the low number of particles).

# approximate log-likelihood at each iteration
par(mar = c(5, 4, 1, 1))
plot(lls, xlab = "Iteration", ylab = "Log-likelihood", type = "l")
print(max(lls), digits = 5)

# the estimates
o$Fmat
show_covar(o$Q)
o$fixed_effects

We can also approximate standard errors, e.g., from the observed information matrix; the code and results are not shown here.


Log-likelihood evaluation

We may wonder what the log-likelihood is at the true parameters. We can use the PF_forward_filter function to perform an approximate log-likelihood evaluation
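
As a brief aside (standard particle filter theory, not stated in the original text), the log-likelihood approximation from a forward particle filter with $N$ particles is roughly of the form

$$ \widehat{\log L} = \sum_{t = 1}^d \log\left(\frac{1}{N}\sum_{i = 1}^N \tilde{w}_t^{(i)}\right) $$

where $\tilde{w}_t^{(i)}$ are the unnormalized importance weights at time $t$; the details differ slightly for the auxiliary particle filter used here.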

fw_precise <- PF_forward_filter( 
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE),
  N_fw = 100000, N_first = 100000, df, type = "VAR",
  model = "exponential", max_T = n_periods, by = 1,
  control = PF_control(
    N_fw_n_bw = 1, N_smooth = 1, N_first = 1,
    method = "AUX_normal_approx_w_cloud_mean",
    smoother = "Fearnhead_O_N",
    n_threads = n_threads, nu = 8, covar_fac = 1.1),
  Fmat = Fmat, a_0 = c(0, 0), Q = Qmat, Q_0 = Q_0, R = diag(1, 2),
  fixed_effects = beta)
logLik(fw_precise)

The log-likelihoods from the final models we estimated are

print(tail(pf_Fear_more$log_likes, 1), digits = 7)
print(tail(pf_Brier$log_likes, 1), digits = 7)

or we can get a more precise estimate by calling the following (though the log-likelihood is evaluated at the final parameter estimates and not at those from one iteration earlier as above)

print( 
  logLik(PF_forward_filter(pf_Fear_more, N_fw = 20000, N_first = 20000)), 
  digits = 7)

A natural question is what happens if we start at the true parameter values. We may expect to take only a few EM iterations and end up at the MLE or at another local maximum not too far from the true parameters

set.seed(30520116)
pf_Fear_close <- PF_EM(
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE), df,
  Fmat = Fmat, a_0 = c(0, 0), Q = Qmat, Q_0 = Q_0, fixed_effects = beta,
  by = 1, type = "VAR", model = "exponential", max_T = n_periods,
  control = PF_control(
    N_fw_n_bw = 500, N_smooth = 2000, N_first = 2000, eps = .001,
    method = "AUX_normal_approx_w_cloud_mean", fix_seed = FALSE,
    n_max = 100, smoother = "Fearnhead_O_N",
    nu = 8, covar_fac = 1.1, n_threads = n_threads))

The final estimates are

pf_Fear_close$F
show_covar(pf_Fear_close$Q)
pf_Fear_close$fixed_effects

The log-likelihoods at the EM iterations look as follows

plot(pf_Fear_close$log_likes, type = "l")

Restricted model

The state part of the model can be written in terms of two parameters as

$$ F = \begin{pmatrix} \theta & 0 \\ 0 & \theta \end{pmatrix}, \quad Q = \begin{pmatrix} \exp(2\psi) & 0 \\ 0 & \exp(2\psi) \end{pmatrix} $$

We can estimate the restricted model as follows (see the vignette mentioned in the beginning for details about the arguments to PF_EM).

# G maps theta to vec(F), J maps psi to the log standard deviations of Q, and
# K maps phi to the correlation parameters (there are none here)
G <- matrix(0, 2^2, 1)
G[1, 1] <- G[4, 1] <- 1
J <- matrix(1, 2, 1)
K <- matrix(nrow = 2 * (2 - 1) / 2, ncol = 0)

pf_Fear_restrict <- PF_EM(
  Surv(tstart, tstop, y) ~ x + ddFixed(x) + ddFixed_intercept(TRUE), df,
  Q_0 = diag(1, 2), G = G, J = J, K = K, theta = .1, psi = 0, phi = numeric(),
  by = 1, type = "VAR", model = "exponential", max_T = n_periods,
  control = PF_control(
    N_fw_n_bw = 500, N_smooth = 2000, N_first = 2000, eps = .001,
    method = "AUX_normal_approx_w_cloud_mean", fix_seed = FALSE,
    n_max = 100, smoother = "Fearnhead_O_N",
    nu = 8, covar_fac = 1.1, n_threads = n_threads))

The final estimates are

pf_Fear_restrict$F
show_covar(pf_Fear_restrict$Q)
pf_Fear_restrict$fixed_effects

and a log-likelihood approximation is

print( 
  logLik(PF_forward_filter(pf_Fear_restrict, N_fw = 20000, N_first = 20000)), 
  digits = 7)

We have $2^2 + 2(2 + 1) / 2 - 2 = 5$ fewer parameters, so the difference in log-likelihood relative to the full model seems reasonable.

References


