```r
knitr::opts_chunk$set(
  echo = TRUE, fig.height = 4, fig.width = 7, dpi = 128,
  cache.path = "cache/RW-cache/", fig.path = "fig/RW-fig/",
  error = FALSE)
options(digits = 4, scipen = 10, width = 70)
palette(c("black", "darkblue"))
```
We will simulate and estimate a random walk model in this example using the
particle filter and smoother. For details, see this vignette, which can also be found by
calling `vignette("Particle_filtering", package = "dynamichazard")`. The models
we are going to simulate from and estimate are of the form
$$
\begin{split}
y_{it} &\sim g(\cdot\vert\eta_{it}) & \\
\vec{\eta}_t &= X_t R^+ \vec{\alpha}_t \\
\vec{\alpha}_t &= F\vec{\alpha}_{t - 1} + R\vec{\epsilon}_t & \quad \vec{\epsilon}_t \sim N(\vec{0}, Q) \\
& & \quad \vec{\alpha}_0 \sim N(\vec{a}_0, Q_0)
\end{split}, \qquad
\begin{array}{l} i = 1, \dots, n_t \\ t = 1, \dots, d \end{array}
$$
where the $y_{it}$'s are binary event indicators for individual $i$ at
time $t$. The indicator is one if individual $i$ dies in the time period $(t - 1, t]$.
$X_t$ are covariates and $\vec{\alpha}_0,\dots,\vec{\alpha}_d$ are the latent states.
The probability that $y_{it}$ is one conditional on no prior event ($y_{ij} = 0$ for $j = 0,\dots, t - 1$)
is the inverse logit function of $\eta_{it}$.
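To make this concrete, here is a minimal sketch of the mapping from the linear predictor to the conditional event probability (the `inv_logit` helper below is our own and not part of the package):

```r
# inverse logit: maps a linear predictor eta to a probability
inv_logit <- function(eta)
  1 / (1 + exp(-eta))

# e.g., the conditional event probabilities at a few values of eta
inv_logit(c(-6, -4, -2))
```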
Thus, the model is related to a so-called conditional logistic regression. The difference is
that we have a distributional assumption about the intercept and we allow some
of the other coefficients to vary. The true values are
$$ F = \begin{pmatrix}1 & 0 \\ 0 & 1 \end{pmatrix}, \quad Q = \begin{pmatrix}0.33^2 & -3/4 \cdot 0.33^2 \\ -3/4 \cdot 0.33^2 & 0.33^2 \end{pmatrix}, \quad R = I_2, \quad \vec{a}_0 = (-6, -2)^\top $$
where $I_2$ is the two-dimensional identity matrix. The unknown
parameters to be estimated are $Q$ and possibly $\vec{a}_0$.

```r
# get the commit ID and branch this example is run on
git_key <- system("git rev-parse --short HEAD", intern = TRUE)
git_bra <- system("git branch", intern = TRUE)
regexp  <- "^(\\* )(.+)$"
git_bra <- git_bra[grepl(regexp, git_bra)]
git_bra <- gsub(regexp, "\\2", git_bra)
```

This example is run on the git branch "`r git_bra`" with ID "`r git_key`".
The code can be found on the github site for the package.
All functions whose assignments are not shown and which are not in the
`dynamichazard` package can be found on the github site.
We start by simulating the data. Feel free to skip this part as the specifications are given above. First, we assign the parameters for the simulation
```r
n_obs     <- 1000L
n_periods <- 100L

Qmat <- matrix(c(.33^2, -.75 * .33^2, -.75 * .33^2, .33^2), 2)
a_0  <- c(-6, -2)
```
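As a quick sanity check (not in the original code), the implied correlation between the two state innovations is $-3/4$:

```r
# implied correlation matrix of the state innovations
cov2cor(Qmat)
```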
Then we simulate and plot the latent states
```r
set.seed(98734059)
alphas <- matrix(nrow = n_periods + 1, ncol = 2)
alphas[1, ] <- a_0
# simulate the random walk
for(i in 2:(n_periods + 1))
  alphas[i, ] <- alphas[i - 1, ] + drop(rnorm(2) %*% chol(Qmat))

# plot of latent variables
cols <- c("black", "darkblue")
matplot(alphas, type = "l", lty = 1, col = cols)
for(i in 1:2)
  abline(h = a_0[i], lty = 2, col = cols[i])
```
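Since the increments are i.i.d. $N(\vec{0}, Q)$, the sample covariance of the simulated increments should be close to `Qmat`; this check is an addition of ours:

```r
# sample covariance of the simulated state increments
cov(diff(alphas))
```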
We simulate the observations as follows
```r
df <- replicate(n_obs, {
  # left-censoring
  tstart <- max(0L, sample.int((n_periods - 1L) * 2L, 1) - n_periods + 1L)

  # covariates
  x <- runif(1, 0, 2)
  covars <- c(1, x)

  # outcome (stop time and event indicator)
  y <- FALSE
  for(tstop in (tstart + 1L):n_periods){
    inv_logit <- 1 / (1 + exp(-covars %*% alphas[tstop + 1L, ]))
    if(inv_logit > runif(1)){
      y <- TRUE
      break
    }
  }

  c(tstart = tstart, tstop = tstop, x = x, y = y)
})
df <- data.frame(t(df))
head(df, 10)
```
We left-censor the observations since we may otherwise end up with few observations at risk towards the end.
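To see what the sampling scheme implies, we can add a quick look at the distribution of the entry times:

```r
# fraction entering at time zero and a summary of all entry times
mean(df$tstart == 0)
summary(df$tstart)
```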
We can fit a model without the latent variables (i.e., a constant coefficient model) as follows
```r
library(dynamichazard)
stat_fit <- static_glm(
  Surv(tstart, tstop, y) ~ x, data = df, by = 1, max_T = n_periods,
  family = "logit")
summary(stat_fit)$coefficients
logLik(stat_fit)
```
The estimated coefficients seem plausible when looking at the previous plot. Further, we can compare the log-likelihood of this model with the log-likelihood approximation we get from the particle filter in the next section.
We start by using the particle filter and smoother from [@fearnhead10]
```r
n_threads <- 6 # you can replace this with e.g.,
               # max(parallel::detectCores(logical = FALSE), 2)

# assign control variable
dd_ctrl <- PF_control(
  N_fw_n_bw = 500, N_smooth = 10000, N_first = 5000, eps = .0001,
  method = "AUX_normal_approx_w_cloud_mean", est_a_0 = FALSE,
  n_max = 500, smoother = "Fearnhead_O_N", averaging_start = 100L,
  nu = 6, covar_fac = 1.1, n_threads = n_threads)
```
```r
set.seed(26564868)
system.time(pf_fearn <- PF_EM(
  Surv(tstart, tstop, y) ~ x, df, Q_0 = diag(3^2, 2), Q = diag(1, 2),
  by = 1, type = "RW", model = "logit", max_T = n_periods,
  control = dd_ctrl))
```
`system.time` is used to show the computation time. The final estimates and true values are
```r
# show the estimated and true covariance matrix along with their
# Cholesky factors
print_Q_est <- function(x)
  list(est = x$Q, actual = Qmat,
       est_chol = chol(x$Q), actual_chol = chol(Qmat))
print_Q_est(pf_fearn)
```
The effective sample size at each time point looks as follows
```r
plot(pf_fearn$effective_sample_size$smoothed_clouds, type = "h",
     ylim = range(0, pf_fearn$effective_sample_size$smoothed_clouds))
```
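A numeric summary may be easier to compare across runs. Assuming the element is a numeric vector with one entry per time point, we can do:

```r
# summary of the effective sample sizes in the smoothed clouds
summary(pf_fearn$effective_sample_size$smoothed_clouds)
```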
The approximate log-likelihoods at each EM iteration are
```r
plot(pf_fearn$log_likes)
```
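As mentioned earlier, we can compare this with the static model. A minimal sketch, assuming `log_likes` holds one approximation per EM iteration:

```r
# static model's log-likelihood versus the particle smoother's
# approximation at the final EM iteration
logLik(stat_fit)
tail(pf_fearn$log_likes, 1)
```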
We can plot the smoothed estimates as follows
```r
plot(pf_fearn, qlvls = c(.05, .5, .95))
matplot(0:n_periods, alphas, type = "l", lty = 2, add = TRUE)
```
The crosses are the 5%, 50%, and 95% (pointwise) quantile estimates (due to the values we pass
to the `qlvls` argument of `plot.PF_EM`). The continuous lines are the
smoothed mean estimates. The dashed lines are the true state variables.
The pointwise confidence bounds are broader towards the end. The cause may be that there are few events towards the end. E.g., the number of events after time 70 is
```r
sum(df$y & df$tstop > 70)
```
while there are quite a few who are still alive past time 70
```r
sum(df$tstop > 70)
```
We can also estimate a model where we try to estimate $\vec{a}_0$ (though I do not think we can expect to do this consistently). To do so, we re-use the control list from before but with `est_a_0 = TRUE`
```r
# same control settings as before but with est_a_0 = TRUE
dd_ctrl_w_a_0 <- dd_ctrl
dd_ctrl_w_a_0$est_a_0 <- TRUE

set.seed(26564868)
pf_fearn_w_a_0 <- PF_EM(
  Surv(tstart, tstop, y) ~ x, df, Q_0 = diag(3^2, 2), Q = diag(1, 2),
  by = 1, type = "RW", model = "logit", max_T = n_periods,
  control = dd_ctrl_w_a_0)
```
The estimates are
```r
print_Q_est(pf_fearn_w_a_0)
pf_fearn_w_a_0$a_0
```
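For reference, we can place the estimate next to the value used in the simulation (assuming the `a_0` element is a numeric vector of length two):

```r
# estimated versus true starting value
rbind(estimate = drop(pf_fearn_w_a_0$a_0), actual = a_0)
```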
The approximate log-likelihoods at each EM iteration are
```r
plot(pf_fearn_w_a_0$log_likes)
```
We can plot the smoothed estimates like before
```r
plot(pf_fearn_w_a_0, qlvls = c(.05, .5, .95))
matplot(0:n_periods, alphas, type = "l", lty = 2, add = TRUE)
```
We can compare the results above with the extended Kalman filter, which is implemented in the package
```r
dd_fit <- ddhazard(
  Surv(tstart, tstop, y) ~ x, df, model = "logit", max_T = n_periods,
  Q_0 = diag(3^2, 2), Q = diag(1, 2), by = 1, order = 1,
  control = ddhazard_control(eps = .001, method = "EKF", NR_eps = 1e-4))
```
The estimates are
```r
print_Q_est(dd_fit)
dd_fit$state_vecs[1, ]
```
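To place the two methods side by side, we can gather the estimated covariance matrices along with the true one; a small convenience not in the original code:

```r
# the particle smoother's, the EKF's, and the true covariance matrix
list(PF = pf_fearn$Q, EKF = dd_fit$Q, actual = Qmat)
```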
The smoothed estimates look as follows
```r
plot(dd_fit, cov_index = 1, level = .9)
lines(0:n_periods, alphas[, 1], lty = 2)
plot(dd_fit, cov_index = 2, col = 2, level = .9)
lines(0:n_periods, alphas[, 2], lty = 2, col = 2)
```
Above, the "outer" dashed lines are 5% and 95% pointwise confidence bounds.