knitr::opts_chunk$set( echo = TRUE, cache = TRUE, message = FALSE, warning = FALSE)
library(dynamichazard); library(survival); library(mgcv) options(ddhazard_max_threads = parallel::detectCores(), digits = 4)
This note has four objectives. The first objective is to test how the ddhazard
fits compare with a Generalized Additive models (GAM) and a "static" logistic model with simulated data. We will look at the following models/estimation methods from ddhazard
function in the dynamichazard
package:
The second objective is to show how to estimate various models with the function ddhazard
. For this reason, the note contains intermediate R
code which is not needed to understand the simulation results. Thus, we will use * in the headers of section to distinguish the content. The headers marked with no * indicates sections with results of simulation or contains important comments. Headers with an * and ** shows increasingly less important code to understand the simulation. Consequently, you can skip to the headers with no * if you are only interested in the results.
The third objective is to illustrate how the various methods performs for out-of-time prediction (forecasting). By out-of-time we mean that we only observe outcomes up to given time, $d$, and then predict the outcome for future observations at time $d+1$.
The fourth objective is to show that both the EKF and UKF scales linearly with the number individuals (series).
All method use the logistic link function. We will do three runs of experiments in the following order:
where correct or incorrect binning intervals refers to whether or not we bin at the same time where the coefficient are simulated to change. For example, we bin correctly where we simulate the coefficients to change at time $1,2,\dots,d$ and we estimate the coefficient at time $1,2,\dots,d$. The models will be compared in terms of Brier score, median absolute residuals and standard deviation of the absolute residuals. All metrics will be reported on out-sample data or out-of-time data. All plots will have true coefficients as continuous lines while dashed lines are estimates.
You can install the version of the library used to make this vignettes from github with the remotes
library as follows:
# install.packages("remotes") remotes::install_github("boennecd/dynamichazard")
You can also get the latest version on CRAN by calling:
install.packages("dynamichazard")
Moreover, you will also find the source code for the vignette at the github page. The note is not meant to be self contained. It is recommend to see ddhazard vignette for an introduction to the models and methods in the dynamichazard
package.
For clarity, here is a list of used notions:
The findings are:
You will see that the the estimation sometimes fails. It is worth stressing that it is my experience that you can always do "trail-and-error" with the initial covariance matrix in the state equation, the covariance matrix at time zero and tuning parameters in order to get a model to fit a given dataset. Of course, it is a disadvantage that any given data set may require some tuning by the user. Although as will be shown, tuning by the user is not often needed with data sets like those presented here.
The following values will be used in the simulation:
ns <- c(200, 2000) # Number of series n_beta <- 5 # Number of covariates T_max <- 20 # The last time we observe n_sims <- 100 # Number of simulation in each run gsub("(^.+)(/dynamichazard.+$)", "...\\2", getwd()) source("../../R/test_utils.R")
ns
is the number of series (individuals) we will estimate in each of the simulation in each of the runs. Thus, we will perform simulations with a total of r paste(ns[-length(ns)], collapse = ", ")
and r tail(ns, 1)
series in each. Each simulation will have n_beta =
r n_beta
covariates plus an intercept. Each run will simulate n_sims =
r n_sims
times. Finally, we source the test.utils.R
file to define the simulation function. You can find this script on the github site. T_max
is the number of bins/intervals we observe. Thus, we have 1, 2, ..., T_max
+ 1 covariate vectors (+1 for the time zero coefficient vector).
We will make runs for various number of individuals in this section where we estimate a model where all effects are time varying and we use the correct binning intervals. Thus, the only models that are miss specified are the model with one time varying effect (which will be x2
) and the model were we use a second order random walk.
do.call
functionWe will use do.call
in this vignette. To my knowledge, do.call
is not standard so this section is included to give a brief introduction to do.call
for users who are not familiar with do.call
. We can take an example with the mean
function. We will make the following call where we set na.rm
to TRUE
:
mean(x = c(1, 2, NA, 6), na.rm = T)
This call can also be made as follows do.call
:
arg_ex <- list(x = c(1, 2, NA, 6), na.rm = T) do.call(mean, arg_ex)
Hence, do.call
is useful in situation where we make calls where almost all the arguments are the same. For example, in the setting where we have arguments a1
, a2
, ...
, up to a1000
and we only want to change argument a101
say. This can then be done as follows:
# Not runable arg_ex <- list(a1 = x, a2 = y, ..., # enter all the other values a1000 = z) do.call(some_func, arg_ex) # change only a101 argument and keep the rest as the arguments as is arg_ex$a101 <- some_specific_value do.call(some_func, arg_ex)
Below, we define a list of default_args
(default arguments) to our simulation function which we can later use using do.call
.
# Default arguments for simulation default_args <- list( n_vars = n_beta, # Number of betas not including intercept beta_start = c(-1, -.5, 0, 1.5, 2), # start value of coeffecients intercept_start = -5, # start value of intercept sds = c(.1, rep(.5, n_beta)), # std. deviations in state equation t_max = T_max, # Largest time we observe x_range = 1, # range of covariates x_mean = .5, # mean of covariates tstart_sampl_func = # we randomly draw the start time of each serie function(...) max(runif(1, min = -10, max = 18), 0) )
Let $\vec{\beta}t$ denote the time varying covariates at time $t$. Then the beta_start
is the time 0 values of the coefficients and intercept_start
is the starting value of the intercept. The sds
are the standard deviations, $\sigma_i$, in the state equation. Hence,
$$\beta{j,t} = \beta_{j,t - 1} + \epsilon_{j,t},\qquad \epsilon_{j,t} \sim N(0,\sigma_j^2)$$
where each margin is independent of the others. The x_mean
and x_range
defines how the covariate values are simulated. The above setting implies that $x_{itj} = \text{Unif}(0.5 - 1/2, 0.5 + 1/2)$ where $x_{itj}$ is the $i$'th individuals covariate $j$ at time $t$. The covariate vector $\vec{x}_{it}$ is updated at time differences of $1+\eta$ where $\eta \sim \text{Exp}(1)$ and $\eta$s are drawn separately for each individual for each covariate vector. The motivation for this behavior is that we can have different covariate update times than our binning time in a given study. For instance, say we are looking at a medical study and the covariates are laboratory values. The time of laboratory values from an individual's visit the doctor can differ from whatever binning periods we use in the state-space model. Further, the time when laboratory values are updated can differ between patients. One might see his doctor every week or so while another only sees his doctor every year.
Below we illustrate how the coefficients vectors from a simulation can look:
# We can simulate by set.seed(51231) sims <- do.call(test_sim_func_logit, c(list(n_series = max(ns)), default_args)) # This is how the state vectors look # We define a function so we can re-use it later plot_func <- function(ylim = c()){ # we define a function here so we can use it later matplot(sims$betas, type = "l", lty = 1, ylab = expression(beta), xlab = "t", ylim = range(sims$betas, ylim), col = 1:(n_beta + 1)) # Add rug plot to illustrate when people die rug(jitter(sims$res$tstop[sims$res$event==1], amount = .25) + 1, col = rgb(0, 0, 0, .05)) } plot_func()
The black line is the intercept while the colored lines are the coefficients for the covariates. The lines on the x-axis illustrate when we observe that individuals die. There is one line for each death. Next, we can look at the number of failures in each simulation:
# We get a "decent" amount of failures and survivers in some of the simulations # We use do.call to avoid repeating the above argument list set.seed(468249) n_fails_in_sim <- rep(NA_real_, 15) for(i in seq_along(n_fails_in_sim)){ sims <- do.call(test_sim_func_logit, c( default_args, c(list(n_series = max(ns))))) # Take largest amount of series n_fails_in_sim[i] <- sum(sims$res$event) } n_fails_in_sim # number of failures in each simulation
We will define functions to estimate the different models with a data frame as the first argument where the data frame is from a test_sim_func_logit
call. This will reduce the amount of code later.
Below, we define function to fit a model where the coefficients are fixed ($\vec{\beta}_t = \vec{\beta}$). It is estimated using glm.fit
:
library(survival); library(dynamichazard) # Set up function for static fit fit_funcs = list() fit_funcs$static <- function(s = sims$res) static_glm(formula = Surv(tstart, tstop, event) ~ x1 + x2 + x3 + x4 + x5, data = s, max_T = T_max, by = 1, id = s$id, speedglm = FALSE) fit <- fit_funcs$static() class(fit) # returns a glm object # Estimates seems plausible plot_func(ylim = fit$coefficients) abline(h = fit$coefficients, col = 1:(n_beta + 1), lty = 2)
ddhazard
fit functionsBelow, we define a function to fit a first order random walk model with a given learning rate and potential extra iterations in the scoring step (see the ddhazard vignette for details):
library(survival); library(dynamichazard) # Set up function ddhazard fit function for convenience # LR: learning rate in correction step # NR_eps: tolerance in correction step. NULL yields no extra iterations fit_funcs$dd <- function(s = sims$res, LR = 1, NR_eps = NULL) tryCatch({ ddhazard( formula = Surv(tstart, tstop, event) ~ x1 + x2 + x3 + x4 + x5, data = s, max_T = T_max, by = 1, id = s$id, Q_0 = diag( # We set the Q_0 argument lower when we take multiple iterations # See the ddhazard vignette under the GMA model for arguments herefor if(is.null(NR_eps)) 1000000 else 1, n_beta + 1), Q = diag(.01, n_beta + 1), control = list(LR = LR, NR_eps = NR_eps, eps = 0.01)) }, error = function(...) NA) # Return NA if fails fit <- fit_funcs$dd() # Plot estimates and actual coffecients plot_func(ylim = fit$state_vecs) matplot(fit$state_vecs, col = 1:(n_beta + 1), lty = 2, type = "l", add = T) # Same call with extra iterations fit <- fit_funcs$dd(LR = .5, NR_eps = .01) # Look at new plot plot_func(ylim = fit$state_vecs) matplot(fit$state_vecs, col = 1:(n_beta + 1), lty = 2, type = "l", add = T)
Below, we define a function to fit a first order random walk model with the UKF method:
# Fitting with UKF fit_funcs$dd_UKF <- function(s = sims$res, alpha = 1, beta = 0){ tryCatch({ ddhazard( formula = Surv(tstart, tstop, event) ~ x1 + x2 + x3 + x4 + x5, data = s, max_T = T_max, by = 1, id = s$id, Q_0 = diag(1, n_beta + 1), Q = diag(.01, n_beta + 1), control = list( eps = 0.1, alpha = alpha, # Set tuning parameter beta = beta, # Set tuning parameter method = "UKF")) # Set estimation method (EKF is default) }, error = function(...) NA) # Return NA if fails } fit <- fit_funcs$dd_UKF() # Look at new plot plot_func(ylim = fit$state_vecs) matplot(fit$state_vecs, col = 1:(n_beta + 1), lty = 2, type = "l", add = T)
Below, we define a function to estimate a first order random walk model where only one parameter (x2
) is time varying:
# Fitting with fixed effects fit_funcs$dd_fixed <- function( s = sims$res, LR = 1, NR_eps = NULL, fixed_terms_method = "M_step"){ # The method to use to estimate the fixed # fixed effects tryCatch({ ddhazard( formula = Surv(tstart, tstop, event) ~ ddFixed_intercept() + # Fix intercept ddFixed(x1) + x2 + # Note x2 is time varying ddFixed(x3) + ddFixed(x4) + ddFixed(x5), data = s, max_T = T_max, by = 1, id = s$id, Q_0 = diag(1, 1), Q = diag(.01, 1), control = list(LR = LR, NR_eps = NR_eps, eps = 0.1, fixed_terms_method = fixed_terms_method)) }, error = function(...) NA) # Return NA if fails } fit <- fit_funcs$dd_fixed() # Look at new plot plot_func(ylim = range(fit$state_vecs, fit$fixed_effects)) matplot(fit$state_vecs, col = 3, lty = 2, type = "l", add = T) abline(h = fit$fixed_effects, col = c(1:2, 4:6), lty = 2)
Next, we define a function to fit the model with a second order random walk:
# Fitting with second order fit_funcs$dd_2_order <- function(s = sims$res, LR = 1, NR_eps = NULL){ tryCatch({ ddhazard( formula = Surv(tstart, tstop, event) ~ x1 + x2 + x3 + x4 + x5, data = s, max_T = T_max, by = 1, id = s$id, # Q_0 and Q needs more elements Q_0 = diag(c(rep(1, n_beta + 1), rep(0.5, n_beta + 1))), Q = diag(c(rep(.01, n_beta + 1))), order = 2, # specify the order control = list(LR = LR, NR_eps = NR_eps, eps = 0.1)) }, error = function(...) NA) # Return NA if fails } fit <- fit_funcs$dd_2_order() # Look at new plot plot_func(ylim = fit$state_vecs) matplot(fit$state_vecs[, 1:6], col = 1:(n_beta + 1), lty = 2, type = "l", add = T)
We define the estimation method for the Generalized additive model in the next code snippet. We use bam
function from the mgcv
package which corresponds to gam
but for large datasets.
library(mgcv) fit_funcs$gam <- function(s = sims$res){ # get data frame for fitting dat_frame <- get_survival_case_weights_and_data( formula = Surv(tstart, tstop, event) ~ x1 + x2 + x3 + x4 + x5, data = s, max_T = T_max, by = 1, id = s$id, use_weights = F)$X # fit model bam( formula = Y ~ # cr is cubic basis with k knots s(t, bs = "cr", k = 10, by = x1) + s(t, bs = "cr", k = 10, by = x2) + s(t, bs = "cr", k = 10, by = x3) + s(t, bs = "cr", k = 10, by = x4) + s(t, bs = "cr", k = 10, by = x5), family = binomial, data = dat_frame, method = "GCV.Cp", control = gam.control(nthreads = parallel::detectCores() - 1)) # Use parallel } # fit model fit <- fit_funcs$gam() # Compare plot layout(matrix(1:6, nrow = 2)) for(i in 1:n_beta){ plot(fit, pages = 0, rug = F, col = i + 1, select = i, lty = 2, main = paste0("(", i, ")")) lines(sims$betas[-1, i + 1], col = i + 1) }
The following code snippets define predictions methods for each of the estimation methods. We start off by defining a split function such that we can sample individuals (series) into a test set and a training test:
split_func <- function(s = sims$res){ # Sample ids test_ids <- sample( unique(s$id), floor(length(unique(s$id)) / 2), replace = F) # Return seperate data frames return(list(test_dat = s[s$id %in% test_ids, ], fit_dat = s[!s$id %in% test_ids, ])) } # Illustrate use tmp <- split_func() # No ids intersect in the two sets length(intersect(tmp$test_dat$id, tmp$fit_dat$id)) # The union is exactly the number of ids we simulated length(union(tmp$test_dat$id, tmp$fit_dat$id))
Having defined the splitting method, we turn to the prediction functions. The idea is to define the brier_funcs$general
function which takes in a prediction function, a fit and a data frame. Next, we then define individual prediction functions for each of the models which will be passed to brier_funcs$general
:
# Define general prediction function brier_funcs <- list() brier_funcs$general <- function(brier_func, fit, eval_data_frame){ d_frame <- get_survival_case_weights_and_data( formula = Surv(tstart, tstop, event) ~ x1 + x2 + x3 + x4 + x5, data = eval_data_frame, max_T = T_max, by = 1, id = eval_data_frame$id, use_weights = F)$X # Change start and stop times d_frame$tstart <- d_frame$t - 1 d_frame$tstop <- d_frame$t # Compute residuals resids <- brier_func(fit, d_frame) # Return estimates with( resids, list(brier = mean(response^2), median_abs_res = median(abs(response)), sd_res = sd(response), median_deviance = median(deviance), sd_deviance = sd(deviance))) } # Prediction method for static model brier_funcs$static <- function(fit, d_frame){ preds <- predict(fit, newdata = d_frame, type = "response") list( response = d_frame$Y - preds, deviance = d_frame$Y * log(preds) + (1 - d_frame$Y) * log(1 - preds)) } # Test function fit <- fit_funcs$static(tmp$fit_dat) cols <- c("brier", "median_abs_res", "sd_res", "median_deviance", "sd_deviance") unlist( # in sample stats brier_funcs$general(brier_funcs$static, fit, tmp$fit_dat)[cols]) unlist( # out sample stats brier_funcs$general(brier_funcs$static, fit, tmp$test_dat)[cols])
# Define prediction function for ddhazard model brier_funcs$dd <- function(fit, d_frame){ preds <- predict(fit, new_data = d_frame, tstart = "tstart", tstop = "tstop") # We truncate as we can get zero-one outcome if the model diverges preds$fits <- pmax(pmin(preds$fits, 1 - 1e-14), 1e-14) list( response = d_frame$Y - preds$fits, deviance = d_frame$Y * log(preds$fits) + (1 - d_frame$Y) * log(1 - preds$fits)) } fit <- fit_funcs$dd(tmp$fit_dat) unlist( # in sample stats brier_funcs$general(brier_funcs$dd, fit, tmp$fit_dat)[cols]) unlist( # out sample stats brier_funcs$general(brier_funcs$dd, fit, tmp$test_dat)[cols])
# Define prediction function for gam model brier_funcs$gam <- function(fit, d_frame){ preds <- predict(fit, newdata = d_frame, type = "response") list( response = d_frame$Y - preds, deviance = d_frame$Y * log(preds) + (1 - d_frame$Y) * log(1 - preds)) } fit <- fit_funcs$gam(tmp$fit_dat) unlist( # in sample stats brier_funcs$general(brier_funcs$gam, fit, tmp$fit_dat)[cols]) unlist( # out sample stats brier_funcs$general(brier_funcs$gam, fit, tmp$test_dat)[cols])
To make things easier, we define a function that takes in a function to simulate from. Given a function to simulate with, the new function perform n_sims =
r n_sims
simulations for each of values ns
(r ns[-length(ns)]
and r ns[length(ns)]
):
simulate_n_print_res <- function( sim_func, # Function that takes one argment which is number of series NR_eps = c(.0001, NA), # Tolerance in scoring step file_prefix) # file_prefix for output { for(n in ns){ file_name <- paste0(file_prefix, "_", n, ".RDS") do_compute <- !file.exists(file_name) if(do_compute){ out <- array(NA_real_, dim = c(n_sims, 8, 5), dimnames = list( NULL, c("static", "Extra correction", "Single correction", "2 order EKF", "Fixed E-step", "Fixed M-step", "UKF", "gam"), c("Brier", "Abs res", "Sd res", "Dev", "Sd dev"))) n_failures_and_surviers <- array( NA_integer_, dim = c(2, n_sims), dimnames = list(c("# failures", "# survivers"), NULL)) #******* # Progress bar for inpatient people (me) pb <- winProgressBar( paste("Estimating with n =", n), "", 0, n_sims, 50) #******* for(i in 1:n_sims){ #******* info <- sprintf("%.2f%% done", 100 * (i - 1) / n_sims) setWinProgressBar(pb, i - 1, paste("Estimating with n =", n), info) #******* # Sample until we get an outcome have sufficient amount of deaths and # survivers repeat{ sims <- sim_func(n) # We want some survivers and some deaths if(sum(sims$res$event) > 25 && n - sum(sims$res$event) > 25) break } n_failures_and_surviers["# failures", i] <- sum(sims$res$event) n_failures_and_surviers["# survivers", i] <- n - sum(sims$res$event) # Split data sim_split <- split_func(sims$res) # Fit static model static_fit <- fit_funcs$static(sim_split$fit_dat) # Fit dd model dd_fits <- list(rep(NA, length(NR_eps))) for(k in seq_along(NR_eps)){ dd_fits[[k]] <- fit_funcs$dd( sim_split$fit_dat, NR_eps = if(is.na(NR_eps[k])) NULL else NR_eps[k]) } # Fit second order dd_2_order <- fit_funcs$dd_2_order(sim_split$fit_dat) # Fit fixed effect dd_fixed_E_step <- fit_funcs$dd_fixed(sim_split$fit_dat, fixed_terms_method = "E_step") dd_fixed_M_step <- fit_funcs$dd_fixed(sim_split$fit_dat, fixed_terms_method = "M_step") # UKF fit dd_UKF <- fit_funcs$dd_UKF(sim_split$fit_dat) # Fit gam model gam_fit <- fit_funcs$gam(sim_split$fit_dat) # Evalute on test data models <- c(list(static_fit), dd_fits, list(dd_2_order, dd_fixed_E_step, dd_fixed_M_step, dd_UKF, gam_fit)) eval_funcs = c(brier_funcs$static, replicate(length(dd_fits) + 4, brier_funcs$dd), brier_funcs$gam) for(j in seq_along(models)){ if(length(models[[j]]) == 1 && is.na(models[[j]])) next # We have to skip model fits that failed metrics <- brier_funcs$general( eval_funcs[[j]], models[[j]], sim_split$test_dat) out[i, j, "Brier"] <- metrics$brier out[i, j, "Abs res"] <- metrics$median_abs_res out[i, j, "Sd res"] <- metrics$sd_res out[i, j, "Dev"] <- metrics$median_deviance out[i, j, "Sd dev"] <- metrics$sd_deviance } } #******* close(pb) #******* # Save results saveRDS(out, file = file_name) } # Load results out <- readRDS(file = file_name) # Print results did_fit <- apply(out[, , 1], 2, function(x) n_sims - sum(is.na(x))) n_cases_all_success <- sum(complete.cases(out[, , 1])) metric_where_all_fit <- t(apply(out[complete.cases(out[, , 1]), , , drop = F], 3, colMeans)) metric_where_all_fit <- formatC(metric_where_all_fit ,format="f", digits=3) n_cases_all_success <- formatC(n_cases_all_success, format="d") print(knitr::kable(cbind( t(metric_where_all_fit), "# succesful fits" = did_fit), caption = paste( "Mean of metrics with", n/2, "series in test and fit data. 'Abs res' is the median of the absolute residuals, 'Sd res' is the standard deviation of the residuals, 'Dev' is median of the deviance residuals and 'Sd dev' is the standard deviation of deviance residuals. Only simulations that succeeds for all setups are included. There are", n_cases_all_success, "of these simulations. The last column shows the number of successful fits for each setup."), align = "r")) cat("\n") # Make boxplot of std deviance residuals par(mar = c(5, 8, 4, 2)) boxplot( out[complete.cases(out[, , 1]), , "Sd dev"], main = paste0("Std of deviance residuals w/ ", n/2 ," series"), cex.axis = .75, horizontal = TRUE, las = 2, ylim = c(min(out[,, "Sd dev"], na.rm = T), min(max(out[,, "Sd dev"], na.rm = T), 2))) } }
We are now able to simulate from the model where all effects are time varying and we use the correct binning intervals with the code below:
set.seed(1243) # Use simulation function simulate_n_print_res( sim_func = function(n) do.call(test_sim_func_logit, c(default_args, c(list(n_series = n)))), file_prefix = "logit_sim_all_varying")
Notice that we only compare across methods with mean metrics where all succeeded to fit.
All models perform better than the static model in terms of Brier score. The UKF and EKF with extra iterations also does well on the Brier score compared with the GAM model. The finding is slight different if we look at the box plot of the standard deviations of the deviance residuals. The GAM model does worse here compared to the other methods.
In this part, we will look at the performs when only singe coefficient (x2
) varies. Thus, we can see if the models where only (x2
) is modeled as varying performs performs better.
We start by defining the simulation function. The main change here is that we only set a single standard deviation and that we set it larger than before:
# Use simulation function set.seed(9999) sim_one_varying <- function(n){ test_sim_func_logit( n_series = n, sds = c(sqrt(3)), # Large variance is_fixed = c(1:2, 4:6), # All but param three (x2) is fixed # Same values as before n_vars = n_beta, beta_start = c(-1, -.5, 0, 1.5, 2), intercept_start = -4, t_max = T_max, x_range = 1, x_mean = .5) }
# We get a more variable number of failures and survivers (we simulate 200 # series) replicate(10, sum(sim_one_varying(200)$res$event)) # print number of failures # Here is an example of a series tmp <- sim_one_varying(200) matplot(tmp$betas, type = "l", lty = 1, ylab = "Beta", xlab = "Time")
We can simulate with the following call:
# Use simulation function set.seed(8080) simulate_n_print_res( sim_func = sim_one_varying, file_prefix = "logit_sim_one_variying")
The main interest here is how the models labeled Fixed ...
The results though are similar to what we saw before.
Now, what happens if we get the binning (intervals lengths) wrong? This is the next experiment we will perform. Specifically, we will set the bin length to 0.1
instead 1
when we simulate. Thus, coefficients are updated at time $0, 0.1, 0.2, \dots$ and whether an individual dies is evaluated at the same times when we simulate. However, the fitted model will still be based on bins of length $1$.
set.seed(9001) sim_finer_binning <- function(n){ time_denom = 10 # how much finer do we want to bin? res <- test_sim_func_logit( n_series = n, # We multiply through appropiately beta_start = c(-1, -.5, 0, 1.5, 2), intercept_start = - 8, # Note, we changed the intercept sds = c(.1, rep(1, n_beta)) / sqrt(time_denom), t_max = T_max * time_denom, lambda = 1 / time_denom, # We change the time when covariates are updated # (the lambda parem in the rate ~ Exp(.) in the # time increaments) n_vars = n_beta, x_range = 1, x_mean = .5) # Change time denominator res$res$tstart <- res$res$tstart / time_denom res$res$tstop <- res$res$tstop / time_denom res }
# We get more variable outcomes (we simulate 200 series) replicate(10, sum(sim_finer_binning(200)$res$event)) # Number of failures # Here is an example of the series tmp <- sim_finer_binning(200) matplot((1:nrow(tmp$betas) - 1) / 10, tmp$betas, type = "l", lty = 1, ylab = "Beta", xlab = "Time")
We are now able to simulate with the following call:
# Use simulation function set.seed(747) simulate_n_print_res( sim_func = sim_finer_binning, file_prefix = "logit_sim_diff_binning")
The UKF and extra iteration seems to perform well in both settings in terms of Brier score.
In the following paragraphs, we will investigate how the different estimation method performs when the following period have to be predicted. Thus, we cannot use the GAM model because it uses in-sample splines. Though, we can still use the state-space models as we can predict the following state vector given the previous. Further, we can use the static model to compare with.
We start by defining a simulation function and a function to split the data into the first time period which we will use for estimation and the later time period which we will use for the test:
# Define simulation function out_sample_args <- default_args out_sample_args$t_max <- 21 sim_func <- function(n_series = 200) do.call(test_sim_func_logit, c(list(n_series = n_series), out_sample_args)) # Define split function split_data_func <- function(d_frame, split_time = 20){ # Find data before split_time and set event flag and stop time in_sample <- d_frame[d_frame$tstart < split_time, ] in_sample$event <- in_sample$event & in_sample$tstop <= split_time in_sample$tstop <- pmin(in_sample$tstop, split_time) # Find data that ends after split_time and set start time out_sample <- d_frame[split_time < d_frame$tstop, ] out_sample$tstart <- pmax(out_sample$tstart, split_time) # Return list(in_sample = in_sample, out_sample = out_sample) }
We extend the period (t_max
) by one which is the only difference in the simulation. Notice that individuals can be in both estimation data and test data. Any failure beyond time 20 will only count as a failure in the test data. Thus, we need to change the event flag for these in the in_sample
data if the stop time is beyond time 20. Below, we illustrate how this looks for an individual who do die beyond time 20:
set.seed(1119) tmp <- sim_func() library(testthat) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$id)), c(25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$tstart)), c( 0.00, 1.31, 4.33, 6.06, 7.29, 9.24, 12.00, 13.66, 15.59, 16.61, 19.48 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$tstop)), c( 1.31, 4.33, 6.06, 7.29, 9.24, 12.00, 13.66, 15.59, 16.61, 19.48, 21.00 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$event)), c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$x1)), c(0.6626980784349144, 0.8070358897093683, 0.2006717855110765, 0.1358326515182853, 0.3608680919278413, 0.2015429562889040, 0.9212381672114134, 0.6043020864017308, 0.1504346556030214, 0.6358598717488348, 0.6404316008556634 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$x2)), c(0.98269869317300618, 0.20332843926735222, 0.88774081016890705, 0.46823416533879936, 0.74556523724459112, 0.69450344028882682, 0.09217978385277092, 0.69848375208675861, 0.14431357802823186, 0.04437874257564545, 0.52934150700457394 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$x3)), c(0.5245066978968680, 0.8488548814784735, 0.8259815813507885, 0.1891250370536000, 0.8190492528956383, 0.5959769308101386, 0.5131845688447356, 0.4787189916241914, 0.9336126816924661, 0.6532274778001010, 0.3123779222369194 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$x4)), c(0.86037779343314469, 0.23903768113814294, 0.96499744337052107, 0.01255022338591516, 0.67799357161857188, 0.07062705862335861, 0.54526174115017056, 0.92422958696261048, 0.10567326820455492, 0.85825872048735619, 0.70302122551947832 )) expect_equal(unname(c(tmp$res[tmp$res$id == 25, ]$x5)), c(0.04008953366428614, 0.14597435668110847, 0.62326966854743659, 0.04519627522677183, 0.04682684061117470, 0.08569942740723491, 0.90543313208036125, 0.84058298333548009, 0.98582848836667836, 0.17005247063934803, 0.72474981611594558 ))
# Illustrate with example set.seed(1119) tmp <- sim_func() # Illustrate for individual 25 tmp$res[tmp$res$id == 25, ] # Split data d_split <- split_data_func(tmp$res) # In sample data (notice event flag is changed and last tstop) d_split$in_sample[d_split$in_sample$id == 25, ] # Out sample data (notice tstart is changed) d_split$out_sample[d_split$out_sample$id == 25, ]
We can now run the simulation with the following code. We end the code by printing the mean Brier score for the test data:
# Setup N <- 100 # number of simulations n <- 1000 # number of series out <- matrix(NA_real_, nrow = N, ncol = 4) # matrix for output # Run simulation set.seed(42) for(i in 1:N){ # Simulate data and split repeat{ sims <- sim_func(n) # We want some survivers and some deaths if(sum(sims$res$event) > 50 && n - sum(sims$res$event) > 50) break } d_split <- split_data_func(sims$res) # Estimate models static_fit <- fit_funcs$static(d_split$in_sample) ekf_fit <- fit_funcs$dd(d_split$in_sample) ekf_extra_fit <- fit_funcs$dd(d_split$in_sample, NR_eps = .01) ukf_fit <- fit_funcs$dd_UKF(d_split$in_sample) # Predict outcome error <- list( static = predict(static_fit, d_split$out_sample, type = "response"), ekf = if(is.na(ekf_fit)) NA else predict(ekf_fit, new_data = d_split$out_sample, tstart = "tstart", tstop = "tstop")$fits, ekf_extra = if(is.na(ekf_extra_fit)) NA else predict(ekf_extra_fit, new_data = d_split$out_sample, tstart = "tstart", tstop = "tstop")$fits, ukf = if(is.na(ukf_fit)) NA else predict(ukf_fit, new_data = d_split$out_sample, tstart = "tstart", tstop = "tstop")$fits) # Compute Brier score error <- unlist(lapply( error, function(x) if(is.na(x)) NA else mean.default((x - d_split$out_sample$event)^2))) # Save results out[i, ] <- error } # Print mean for cases where all could fit colnames(out) <- c("Static", "EKF", "EKF with extra correction", "UKF") colMeans(out[complete.cases(out), ]) # Print median apply(out[complete.cases(out), ], 2, median) # Print number of cases where all methods succeed to estimate sum(complete.cases(out))
Above, we do r N
simulations with r n
series in each simulation. It seems to that the different EKF methods and the UKF performs comparably. Another question is how often the various method got a given rank within a simulation in terms of their Brier score. We answer this question below (the rank are given as the first printed value such that one implies being the lowest Brier score in a given simulation):
# Look at number of cases where each method got each rank knitr::kable(apply(t(apply(out[complete.cases(out), ], 1, rank)), 2, function(x) xtabs(~x)), caption = "Number of times each set got a given rank in terms of Brier Score", row.names = T)
The EKF method does better with these specification in terms of getting the lowest mean out-sample Brier score and getting the lowest Brier score in most of the simulation.
We will illustrate that the EKF and UKF have linear time complexity in the number of observation. This is particularly easy because the simulation function start of by simulating the coefficients as shown below (hence, variation will not be due to different coefficients vectors and only the number of series):
some_seed <- 69284 set.seed(some_seed) res_1 <- test_sim_func_logit(100) set.seed(some_seed) res_2 <- test_sim_func_logit(1000) # different number of series all.equal(res_1$betas, res_2$betas) # Coeffecients are equal
Next, we plot the computation time versus the number of simulation for the EKF and UKF method. Further, we print the linear regression slope for the log-log regression. The slope is close to one implying that the linear time complexity is linear in the number of observations
# Define function to record run time for a given number of series run_time_func <- function(n, sim_args = default_args){ set.seed(7851348) # Use the same seed sim_args$n_series <- n sims <- do.call(test_sim_func_logit, sim_args) time_EKF <- system.time(fit_EKF <- fit_funcs$dd(sims$res)) time_UKF <- system.time( fit_UKF <- ddhazard( formula = Surv(tstart, tstop, event) ~ x1 + x2 + x3 + x4 + x5, data = sims$res, max_T = T_max, by = 1, id = sims$res$id, Q_0 = diag(.1, n_beta + 1), Q = diag(.1, n_beta + 1), control = list( eps = 0.1, alpha = 1, beta = 0, method = "UKF"))) # Check that both succed to fit if(is.na(fit_EKF) || is.na(fit_UKF)) stop() list(time_EKF = time_EKF, time_UKF = time_UKF) } n_for_test <- 2^(10:19) run_time <- sapply(n_for_test, run_time_func) # Plot EKF and print log-log regression slope ekf_time <- sapply(run_time["time_EKF", ], function(x) x[["user.self"]]) plot(n_for_test, ekf_time, type = "p", log = "xy", xlab = "Number of series", ylab = "Computation time for EKF") coef(lm(log(ekf_time) ~ log(n_for_test))) # log-log slope is roughly one # Plot UKF and print log-log regression slope ukf_time <- sapply(run_time["time_UKF", ], function(x) x[["user.self"]]) plot(n_for_test, ukf_time, type = "p", log = "xy", xlab = "Number of series", ylab = "Computation time for UKF") coef(lm(log(ukf_time) ~ log(n_for_test))) # log-log slope is roughly one
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.