ddhazard Diagnostics

hook_output <- knitr::knit_hooks$get("output")
knitr::knit_hooks$set(output = function(x, options) {
   lines <- options$output.lines
   if (is.null(lines)) {
     return(hook_output(x, options))  # pass to default hook
   }
   x <- unlist(strsplit(x, "\n"))
   more <- "... output abbreviated ..."
   if(is.list(lines)){
      x <- lapply(lines, function(z) x[z])
      for(i in 1:(length(x) - 1))
        x[[i]] <- c(x[[i]], more)
      x <- unlist(x)
   }
   else if (length(lines)==1) {        # first n lines
     if (length(x) > lines) {
       # truncate the output, but add ....
       x <- c(head(x, lines), more)
     }
   } else {
     x <- c(more, x[lines], more)
   }
   # paste these lines together
   x <- paste(c(x, ""), collapse = "\n")
   hook_output(x, options)
 })

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, 
                      dpi = 100, cache = FALSE, dev = "png")
knitr::opts_knit$set(warning = FALSE, message = FALSE)

knit_print.data.frame <- function(
  x, ..., row.names = FALSE, digits = getOption("digits")){
  res <- paste(c(
    "", "", knitr::kable(x, ..., row.names = row.names, digits = digits)), 
    collapse = "\n")
  knitr::asis_output(res)
}
knit_print.matrix <- function(x, ..., digits = getOption("digits"), row.names = FALSE, escape = TRUE){
  col.names <- if(is.null(colnames(x)))
    rep("", ncol(x)) else colnames(x)
  # The long table does not do well with the ( or [ symbol 
  # See http://tex.stackexchange.com/questions/228755/undefined-control-sequence-on-left-parenthesis-after-midrule-in-longtable
  col.names[1] <- gsub("^\\(", "\\\\relax(", col.names[1], perl = TRUE)
  col.names[1] <- gsub("^\\[", "\\\\relax[", col.names[1], perl = TRUE)
  res <- paste(c(
    "", "", knitr::kable(x, ..., row.names = row.names, col.names = col.names, 
                         escape = escape, digits = digits)),
    collapse = "\n")
  knitr::asis_output(res)
}

\renewcommand{\vec}[1]{\bm{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\Lparen}[1]{\left( #1\right)} \newcommand{\Lbrack}[1]{\left[ #1\right]} \newcommand{\Lbrace}[1]{\left \{ #1\right \}} \newcommand{\Lceil}[1]{\left \lceil #1\right \rceil} \newcommand{\Lfloor}[1]{\left \lfloor #1\right \rfloor} \newcommand{\propp}[1]{P\Lparen{#1}} \newcommand{\proppCond}[2]{P\Lparen{\left. #1 \right\vert #2}} \newcommand{\expecp}[1]{E\Lparen{#1}} \newcommand{\expecpCond}[2]{E\Lparen{\left. #1 \right\vert #2}} \newcommand{\varp}[1]{\textrm{Var}\Lparen{#1}} \newcommand{\varpCond}[2]{\textrm{Var} \Lparen{\left. #1 \right\vert #2}} \newcommand{\corpCond}[2]{\textrm{Cor} \Lparen{\left. #1 \right\vert #2}} \newcommand{\covp}[1]{\textrm{Cov} \Lparen{#1}} \newcommand{\covpCond}[2]{\textrm{Cov} \Lparen{\left. #1 \right\vert #2}} \newcommand{\emNotee}[3]{#1_{\left. #2 \right\vert #3}} \newcommand{\emNote}[4]{#1_{\left. #2 \right\vert #3}^{(#4)}} \newcommand{\ukfNote}[2]{\mat{P}{\vec{#1}, \vec{#2}}} \newcommand{\ukfNotee}[3]{\mat{P}{\vec{#1}{#3}, \vec{#2}{#3}}} \newcommand{\diag}[1]{\text{diag}{\Lparen{#1}}} \newcommand{\wvec}[1]{\widehat{\vec{#1}}} \newcommand{\wmat}[1]{\widehat{\mat{#1}}} \newcommand{\wtvec}[1]{\widetilde{\vec{#1}}} \newcommand{\bvec}[1]{\bar{\vec{#1}}} \newcommand{\deter}[1]{\left| #1 \right|} \newcommand{\MyInd}[2]{\Lparen{#1}_{#2}} \newcommand{\xyzp}[2]{#1\Lparen{#2}} \newcommand{\xyzpCond}[3]{#1\Lparen{\left. #2 \right\vert #3}}

Introduction

This vignette shows examples of how the residuals and hatvalues functions can be used for an object returned by ddhazard. See vignette("ddhazard", "dynamichazard") for details of the computations. First, we present the tools that turned out to be useful for the data sets. This is in the section called "First round of checks". Then we return to the data sets using the other methods; this latter part is included to illustrate how the functions in this package work. It is in the section called "Second round of checks".

Has to be done

CRAN requires that the par settings and options are reset at some point. Thus, we store the old settings here and restore them at the end of the vignette.

old_par <- par(no.readonly = TRUE)

# set digits
old_options <- options(digits = 3)

First round of checks

Data set 1: Prisoner recidivism

The first data set we will look at is an experimental study of recidivism of 432 male prisoners in the year after their release from prison. The details of the data set are in [@rossi80]. The study involved randomly giving financial aid to the prisoners when they were released to see if this affected recidivism. The variables we will look at are financial aid (fin), age at release (age), the number of prior convictions (prio) and the cumulative number of weeks employed since release (employed.cumsum).

A .pdf file called Appendix-Cox-Regression.pdf was previously available on CRAN in which this data set is analyzed with the Cox regression model. They found:

Loading the data set

We load the data set with the next line. The details of how the .RData file is made are given on the GitHub site in the vignettes/Diagnostics/ folder.

load("Diagnostics/Rossi.RData")

# We only keep some of the columns
Rossi <- Rossi[
  , c("id","start","stop","event", "fin", "age", "prio", "employed.cumsum")]

The data is in the typical start-stop form used in survival analysis. We print the number of individuals and show an example of how the data looks for one of the individuals:

# Number of unique individuals
length(unique(Rossi$id)) 

# Show example for a given individual
Rossi[Rossi$id == 2, ]

Next, we illustrate which of the variables are and which are not time-varying:

# See the varying and non-varying covariates
# The output shows the mean number of unique values for each individual
tmp <- 
  by(Rossi[, ], Rossi$id, function(dat) 
    apply(dat, 2, function(x) sum(!duplicated(x))))
colMeans(do.call(rbind, as.list(tmp)))
# Individuals are observed for at most one year
stopifnot(setequal(unique(Rossi$stop), 1:52))

The events happen more or less evenly across time after the first 8 weeks, as the next plot shows. Hence, we may see an increasing intercept later on since all individuals start at time zero: the risk sets only get smaller as time progresses while roughly the same number of people get rearrested:

# Individuals have events throughout the period
plot(xtabs(~ Rossi$stop[Rossi$event == 1]), xlab = "Week number", 
     ylab = "Number of recidivism")
# All individuals have gaps of one week 
stopifnot(all(
  unlist(tapply(Rossi$stop, Rossi$id, diff)) == 1))

# People have at most one event
is_event <- Rossi[Rossi$id %in% Rossi$id[Rossi$event == 1], ]
stopifnot(all(tapply(is_event$event, is_event$id, sum) == 1))

# People always get arrested on the last record
stopifnot(all(
  tapply(1:nrow(is_event), is_event$id, function(is){
    rows <- is_event[is, ]
    max(rows$stop) == rows$stop[rows$event == 1]
  })))

Estimation

We estimate the model with 4 terms and an intercept as follows:

library(dynamichazard)
dd_rossi <- ddhazard(
  Surv(start, stop, event) ~ fin + age + prio + employed.cumsum, 
  data = Rossi, id = Rossi$id, by = 1, max_T = 52, 
  Q_0 = diag(10000, 5), Q = diag(.01, 5),
  control = ddhazard_control(eps = .001, n_max = 250))

Then we plot the predicted coefficients:

plot(dd_rossi)

The dashed lines are 95% pointwise confidence bounds from the smoothed covariance matrices. Both the intercept and the age coefficient seem to be time-varying.
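
As a side note, similar bounds can be re-computed from the smoothed output. Below is a minimal sketch for the age coefficient; it assumes that the fit object stores the smoothed means in state_vecs (as used later in this vignette) and the smoothed covariance matrices in a 3D array called state_vars:

# sketch: pointwise 95% bounds for the age coefficient
# `state_vars` is assumed to hold the smoothed covariance matrices
i <- which(colnames(dd_rossi$state_vecs) == "age")
est <- dd_rossi$state_vecs[, i]
sds <- sqrt(apply(dd_rossi$state_vars, 3, function(V) V[i, i]))

plot(est, type = "l", xlab = "Interval number", ylab = "age coefficient")
lines(est + 1.96 * sds, lty = 2)
lines(est - 1.96 * sds, lty = 2)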

Hat values

We start by looking at the "hat-like" values which are suggested in the ddhazard vignette. These are computed by calling the hatvalues function as follows:

hats <- hatvalues(dd_rossi)

The returned object is a list with a matrix for each interval. Each matrix has a column for the hat values, the row number in the original data set, and the id of the individual that the hat values belong to:

str(hats[1:3]) # str of first three elements 
head(hats[[1]], 10) # Print the head of first matrix
stack_hats <- function(hats){
  # Stack
  resids_hats <- data.frame(do.call(rbind, hats), row.names = NULL)

  # Add the interval number
  n_rows <- unlist(lapply(hats, nrow))
  interval_n <- unlist(sapply(1:length(n_rows), function(i) rep(i, n_rows[i])))

  resids_hats <- cbind(resids_hats, interval_n = interval_n)

  # Sort by id and interval number
  resids_hats <- 
    resids_hats[order(resids_hats$id, resids_hats$interval_n), ]

  resids_hats
}

We have defined a function called stack_hats to stack the hat values for each individual such that we get a single data frame instead of a list of matrices. Further, the function adds an extra column for the interval number. The definition of the function is printed at the end of this vignette. The stacked values look as follows:

hats_stacked <- stack_hats(hats)

head(hats_stacked)

The hat values are skewed with a few large values as shown in the histogram below:

# Draw histogram of hat values
hist(log10(hats_stacked$hat_value), main = "",
     xlab = "Histogram of log10 of Hat values")

We find the maximum hat value for each individual and print the largest of them:

# Print the largest values 
max_hat <- tapply(hats_stacked$hat_value, hats_stacked$id, max)
head(sort(max_hat, decreasing = TRUE), 5)

The names of the returned vector are the ids of the individuals. One individual seems to stand out. Next, we plot the hat values against time and highlight the individuals with the largest maximum hat values by giving them a red color and adding a number for the rank of their maximum hat value:

# We will highlight the individuals with the highest hatvalues
is_large <- 
  names(head(sort(max_hat, decreasing = TRUE), 5))

# Plot hat values
plot(range(hats_stacked$interval_n), c(0, 0.03), type = "n",
     xlab = "Interval number", ylab = "Hat value")

invisible(
  by(hats_stacked, hats_stacked$id, function(rows){
    has_large <- rows$id[1] %in% is_large
    col <- if(has_large) "Red" else "Black"
    lines(rows$interval_n, rows$hat_value, lty = 2, 
          col = col)

    if(has_large){
      pch <- as.character(which(rows$id[1] == is_large))
      points(rows$interval_n, rows$hat_value, pch = pch, col = col)
    }
  }))

We print the last record for each of the individuals shown in red above to get an idea of why their hat values are large:

# These are the individuals id
is_large

# We print the last record for each of these
Rossi_subset <- Rossi[
  unlist(sapply(is_large, function(x) which(x == Rossi$id))), ]
Rossi_subset <- Rossi_subset[nrow(Rossi_subset):1, ]
Rossi_subset[!duplicated(Rossi_subset$id), ]

Some of them have a large number of prior convictions. Moreover, as the next plot shows, the distribution of the number of prior convictions is skewed:

tmp <- xtabs(~Rossi$prio[!duplicated(Rossi$id)]) 
plot(as.numeric(names(tmp)), c(tmp), ylab = "frequency", type = "h", 
     xlab = "Number of prior convictions")

This could suggest a transformation of the variable. Thus, we try the logarithm of the value plus one, where one is added to avoid $\log(0)$ (and is otherwise arbitrarily chosen):

tmp <- xtabs(~log(Rossi$prio[!duplicated(Rossi$id)] + 1))
plot(as.numeric(names(tmp)), c(tmp), ylab = "frequency", type = "h", 
     xlab = "Log(Number of prior convictions + 1)")

Below, we make a fit where we use this transformation of prior convictions instead:

dd_rossi_trans <- ddhazard(
  Surv(start, stop, event) ~ fin + age + log(prio + 1) + employed.cumsum, 
  data = Rossi, id = Rossi$id, by = 1, max_T = 52, 
  Q_0 = diag(10000, 5), Q = diag(.01, 5), 
  control = ddhazard_control(eps = .001, n_max = 250))

plot(dd_rossi_trans)

Computing the hat values and making a plot similar to the one before (the plotting code is repeated below) shows that these individuals are much less influential.

hats <- hatvalues(dd_rossi_trans)

hats_stacked <- stack_hats(hats)
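
The plot below re-uses the plotting code from before with the new hat values; the y-axis range is kept the same so the two plots are directly comparable (the rank labels are omitted):

# Find the individuals with the largest hat values in the new fit
max_hat <- tapply(hats_stacked$hat_value, hats_stacked$id, max)
is_large <- names(head(sort(max_hat, decreasing = TRUE), 5))

# Plot hat values against time as before
plot(range(hats_stacked$interval_n), c(0, 0.03), type = "n",
     xlab = "Interval number", ylab = "Hat value")
invisible(
  by(hats_stacked, hats_stacked$id, function(rows){
    col <- if(rows$id[1] %in% is_large) "Red" else "Black"
    lines(rows$interval_n, rows$hat_value, lty = 2, col = col)
  }))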

A question is whether the new model fits better. Thus, we compare the mean of the in-sample log-likelihood terms (the negative of the logistic loss) of the two models:

# Fit the two models
f1 <- ddhazard(
  Surv(start, stop, event) ~ fin + age + prio + employed.cumsum, 
  data = Rossi, id = Rossi$id, by = 1, max_T = 52, 
  Q_0 = diag(10000, 5), Q = diag(.01, 5), 
  control = ddhazard_control(eps = .001, n_max = 250))

f2 <- ddhazard(
  Surv(start, stop, event) ~ fin + age + log(prio + 1) + employed.cumsum , 
  data = Rossi, id = Rossi$id, by = 1, max_T = 52, 
  Q_0 = diag(10000, 5), Q = diag(.01, 5), 
  control = ddhazard_control(eps = .001, n_max = 250))

# Compute residuals
res1 <- residuals(f1, type = "pearson")
res2 <- residuals(f2, type = "pearson")

# Compute logistic loss 
log_error1 <- unlist(
  lapply(res1$residuals, function(x) 
    ifelse(x[, "Y"] == 1, log(x[, "p_est"]), log(1 - x[, "p_est"]))))
log_error2 <- unlist(
  lapply(res2$residuals, function(x) 
    ifelse(x[, "Y"] == 1, log(x[, "p_est"]), log(1 - x[, "p_est"]))))

# Compare mean
print(c(res1 = mean(log_error1), res2 = mean(log_error2)),
      digits = 8)

The difference is very small.
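
For reference, the difference between the two means can also be printed directly:

# difference in mean in-sample log-likelihood between the two fits (f2 minus f1)
print(mean(log_error2) - mean(log_error1), digits = 8)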

Data set 2: Worcester Heart Attack Study

Next, we will look at the Worcester Heart Attack Study. The data set contains individuals who had a heart attack and are then followed up to see if they die within the following days. Below, we load the data and plot the times of death:

load("Diagnostics/whas500.RData")

hist(whas500$lenfol[whas500$fstat == 1], breaks = 20, 
     xlab = "Time of death", main = "")

The large peak at the start is due to a lot of individuals who die during the hospital stay shortly after their first heart attack. All covariates in the data set are time-invariant records from the day the individual had the first heart attack. We will look at gender, age, BMI, a binary indicator for whether the individual has a history of cardiovascular disease (cvd), and heart rate (hr). The first entries and summary statistics are printed below:

# We only keep some of the columns
whas500 <- whas500[
  , c("id", "lenfol", "fstat", "gender",  "age", "bmi", "hr", "cvd")]

# First rows
head(whas500, 10)

# Summary stats
summary(whas500[, c("age", "bmi", "hr", "gender",  "cvd")])

Estimation

We estimate the model as follows:

dd_whas <- ddhazard(
  Surv(lenfol, fstat) ~ gender + age + bmi + hr + cvd,
  data = whas500, by = 100, max_T = 2000, 
  Q_0 = diag(10000, 6), Q = diag(.1, 6),
  control = ddhazard_control(eps = .001))

plot(dd_whas)

The intercept drops in the first period, which possibly is due to the initial high number of deaths right after the first heart attack. We further simplify the model by removing the gender variable whose coefficient seems to be zero:

dd_whas <- ddhazard(
  Surv(lenfol, fstat) ~ age + bmi + hr + cvd,
  data = whas500, by = 100, max_T = 2000, 
  Q_0 = diag(10000, 5), Q = diag(.1, 5), 
  control = ddhazard_control(eps = .001))

plot(dd_whas)

Leaving out a covariate

Suppose that we had not included the age variable to start with:

dd_whas_no_age <- ddhazard(
  Surv(lenfol, fstat) ~ bmi + hr + cvd,   # No age
  data = whas500, by = 100, max_T = 1700, 
  Q_0 = diag(10000, 4), Q = diag(.1, 4), 
  control = ddhazard_control(eps = .001))

plot(dd_whas_no_age)

We might then think that we would be able to see that the covariate was left out by using the Pearson residuals, as in a regular logistic regression. We compute the Pearson residuals to see if this works:

obs_res <- residuals(dd_whas_no_age, type = "pearson")

The returned object is a list with two elements: one element denotes the type of the residuals and the other contains the residuals. The latter is a list with a matrix for each interval. Each matrix has four columns: the residuals, the estimated probability of an event, the outcome, and the row number in the initial data set for those that were at risk in the interval. This is illustrated in the next lines:

# We have a matrix for each interval
length(obs_res$residuals)    

# Shows the structure of the matrices. We only print the first 5 matrices
str(obs_res$residuals[1:5])  

# Print the first entries of the first interval
head(obs_res$residuals[[1]]) 

The list of matrices is unwieldy, so we have defined a function called stack_residuals to stack the matrices and add the interval number and the id that the residuals belong to. The definition of the function is given at the end of this vignette.

stack_residuals <- function(fit, resids){
  if(!inherits(resids, "ddhazard_residual"))
    stop("Residuals must have class 'ddhazard_residual'")
  if(!inherits(fit, "ddhazard"))
    stop("fit must have class 'ddhazard'")

  # Stack the residuals
  resids_stacked <- 
    data.frame(do.call(rbind, resids[[1]]), row.names = NULL)

  # Add the interval number and id
  n_rows <- unlist(lapply(resids$residuals, nrow))
  interval_n <- unlist(sapply(1:length(n_rows), function(i) rep(i, n_rows[i])))

  resids_stacked <- cbind(
    resids_stacked, 
    interval_n = interval_n,
    id = fit$id[resids_stacked$row_num])

  # Sort by id and interval number
  resids_stacked <- 
    resids_stacked[order(resids_stacked$id, resids_stacked$interval_n), ]

  resids_stacked
}
resids_stacked <- stack_residuals(fit = dd_whas_no_age, resids = obs_res)

# print the first entries
head(resids_stacked, 10)

Next, we add the age variable to the stacked residuals, stratify the age variable, compute the mean within each stratum in each interval, and plot the cumulated means against time:

# Add age variable
resids_stacked$age <- 
  whas500$age[resids_stacked$row_num]

# Stratify
age_levels <- quantile(whas500$age, seq(0, 1, by = .2))
age_levels
resids_stacked$age_cut <- cut(resids_stacked$age, age_levels)

# Compute the means 
cut_means <- 
  tapply(resids_stacked$residuals, 
         list(resids_stacked$interval_n, resids_stacked$age_cut), 
         mean)

head(cut_means)

# Plot against time
colfunc <- colorRampPalette(c("Black", "Blue"))
cols <- colfunc(ncol(cut_means))

matplot(dd_whas_no_age$times[-1], apply(cut_means, 2, cumsum), 
        type = "l", col = cols, xlab = "Time", lwd = 2,
        lty = 1, ylab = "Cumulated mean Pearson residuals")
abline(h = 0, lty = 2)

legend("topleft", bty = "n", 
       lty = rep(1, ncol(cut_means)),
       legend = colnames(cut_means), 
       col = cols, lwd = 2,
       cex = par()$cex * .8)

We see that the oldest and youngest strata stand out and deviate from zero, suggesting that the age variable should have been included in the model.
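
As a sanity check, the same computation can be repeated with the fit that does include age; one would expect the strata to deviate less in that case. A sketch re-using the objects defined above:

# repeat the check with the fit that includes the age variable
obs_res_age <- residuals(dd_whas, type = "pearson")
resids_stacked_age <- stack_residuals(fit = dd_whas, resids = obs_res_age)

resids_stacked_age$age_cut <- cut(
  whas500$age[resids_stacked_age$row_num], age_levels)

cut_means_age <- tapply(
  resids_stacked_age$residuals,
  list(resids_stacked_age$interval_n, resids_stacked_age$age_cut),
  mean)

matplot(dd_whas$times[-1], apply(cut_means_age, 2, cumsum),
        type = "l", col = cols, lwd = 2, lty = 1, xlab = "Time",
        ylab = "Cumulated mean Pearson residuals")
abline(h = 0, lty = 2)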

Second round of checks

We will illustrate some other uses of the residuals function in this section. This part is included mainly to show how the function works, as these checks do not "discover" anything new about the data sets.

Residuals from state space vector

We start by looking at the standardized state space errors as explained in the ddhazard vignette. We may expect these to be approximately iid standard normally distributed. First, we compute them with the residuals function by passing type = "std_space_error" along with the first fit we made to the Rossi data set:

stat_res <- residuals(dd_rossi, type = "std_space_error")

str(stat_res)

The output is a list with the residuals, the smoothed covariance matrices for the errors, and a binary variable indicating whether or not the residuals are standardized. Next, we can plot the residuals as follows:

stopifnot(all(abs(stat_res$residuals) < 1.96))
plot(stat_res, mod = dd_rossi, p_cex = .75, ylim = c(-2, 2))

The variables appear to be nothing like standard normal (we have $52 \cdot 5$ residuals with none outside $\pm 1.96$). Another idea is to only standardize marginally (that is, not rotate the residuals). This can be done as follows:

# Get non-standardized residuals
stat_res <- residuals(dd_rossi, type = "space_error")

# Standardize marginally
for(i in 1:nrow(stat_res$residuals))
  stat_res$residuals[i, ] <- stat_res$residuals[i, ] /
    sqrt(diag(stat_res$Covariances[, , i]))

# Plot
par(mfcol = c(2, 3))
for(i in 1:ncol(stat_res$residuals))
  plot(stat_res, mod = dd_rossi, p_cex = .75, cov_index = i, 
       ylab = colnames(stat_res$residuals)[i], 
       ylim = c(-2, 2))

which I again find hard to draw any conclusions from. It seems like there is structure in the errors for the intercept and fin. However, this might be what we expected. For instance, we may expect the intercept to increase through time (i.e., not be random as assumed by the model). Further, assuming that the covariance estimates are conservative, the error for prio in interval 20-25 and the error for age in interval 10-12 seem extreme. We can do the same thing for the fit to the Worcester Heart Attack Study data set:

stat_res <- residuals(dd_whas, type = "std_space_error")

plot(stat_res, mod = dd_whas, ylim = c(-4, 4), p_cex = .8)

Again, the errors seem to have too low variance to be standard normal, apart from one which is more than three standard deviations away. I am not sure what to conclude from this. A question is whether we would see the same for a simulated data set where the true coefficients follow a random walk. We check this in the next paragraph.

Simulation

if(requireNamespace("dichromat", quietly = TRUE, warn.conflicts = FALSE)){
  cols <- c("#000000", dichromat::colorschemes$Categorical.12[c(6, 8, 10)])
} else
  cols <- c("#000000", "#009E73", "#e79f00", "#9ad0f3")

We start by getting the definition of the test_sim_func_logit function in the dynamichazard package, which is not exported. We will use it to simulate the individuals' series:

sim_func <- with(environment(ddhazard), test_sim_func_logit)

Then we simulate the coefficients and plot them:

# Simulate the coefficients
set.seed(556189)
betas <- matrix(rnorm(21 * 4), ncol = 4)
betas[, 1] <- betas[, 1] * 0.25  # reduce the variance of the intercept
betas <- apply(betas, 2, cumsum) # accumulate the innovations
betas[, 1] <- betas[, 1] - 4     # we reduce the intercept

# Plot the simulated coefficients
matplot(betas, col = cols, lty = 1, type = "l")

We reduce the variance of the intercept and decrease the intercept to yield a lower baseline risk of an event. The individuals and their outcomes are simulated in the following call:

# Simulate series
sim_dat <- sim_func(
  n_series = 500,      # number of individuals
  t_max = 20,          # the last stop time
  x_range = 2,         # what is the uniform range to draw from
  x_mean = 0,          # the mean of the uniform covariates
  n_vars = 3,          # 4 - 1 for the intercept
  lambda = 1/10,       # lambda in the exponential distribution for time 
                       # between updates of covariate vectors
  betas = betas)

The first rows of the simulated data look as follows:

head(sim_dat$res, 10)

Next, we estimate the model and compare the true coefficients with the estimates from the fit:

f1 <- ddhazard(
  Surv(tstart, tstop, event) ~ x1 + x2 + x3,
  sim_dat$res, by = 1, max_T = 20, id = sim_dat$res$id, 
  Q_0 = diag(10000, 4), Q = diag(.1, 4), 
  control = ddhazard_control(eps = .001))

matplot(betas, col = cols, lty = 1, type = "l")
matplot(f1$state_vecs, col = cols, lty = 2, type = "l", add = TRUE)

The full lines are the true coefficients and the dashed lines are the estimates. The question is then how the standardized state space errors look. We plot these below:

stat_res <- residuals(f1, type = "std_space_error")

plot(stat_res, mod = f1, p_cex = .8)

The errors seem more variable than before. We make a QQ-plot below. Apart from one error, it seems quite close to a normal distribution.

qqnorm(c(stat_res$residuals), pch = 16, cex = .8)
qqline(c(stat_res$residuals))

Re-running the above three times gives the following plots:

# The arguments below match those used above
par(mfcol = c(3, 3))

set.seed(189780)
for(i in 1:3){
  # Simulate the coefficients
  betas <- matrix(rnorm(21 * 4), ncol = 4)
  betas[, 1] <- betas[, 1] * 0.25  
  betas <- apply(betas, 2, cumsum) 
  betas[, 1] <- betas[, 1] - 4

  # Simulate series
  sim_dat <- sim_func(
    n_series = 500,      
    t_max = 20,          
    x_range = 2,         
    x_mean = 0,          
    n_vars = 3,          
    lambda = 1/10,        

    betas = betas)

  # Fit
  f1 <- ddhazard(
    Surv(tstart, tstop, event) ~ x1 + x2 + x3,
    sim_dat$res, by = 1, max_T = 20, id = sim_dat$res$id, 
    Q_0 = diag(10000, 4), Q = diag(.1, 4), 
    control = ddhazard_control(eps = .001))

  # Plot coefficients
  matplot(betas, col = cols, lty = 1, type = "l",
          ylim = range(betas, f1$state_vecs))
  matplot(f1$state_vecs, col = cols, lty = 2, type = "l", add = TRUE)

  # Plot errors
  stat_res <- residuals(f1, type = "std_space_error")
  plot(stat_res, mod = f1, p_cex = .8)

  # QQ-plots
  qqnorm(c(stat_res$residuals), pch = 16, cex = .8)
  qqline(c(stat_res$residuals))
}

It is worth stressing that neither the initial seed nor the parameters have been selected to yield a particular result. The takeaway for me is that the previous findings with the Rossi data set and the WHAS data set seem to be artifacts of the data sets and model specifications, and not something we would see with an actual random walk model.

Hat values for Worcester Heart Attack Study

In the following paragraphs, we check the hat values for the Worcester Heart Attack Study data set. First, we compute them:

hats <- hatvalues(dd_whas)
hats_stacked <- stack_hats(hats)

Then we compute the cumulated hat values for each individual and plot them against time:

# Compute cumulated hat values
hats_stacked$hats_cum <- unlist(tapply(
  hats_stacked$hat_value, hats_stacked$id, cumsum))

# Plot the cumulated residuals for each individual
plot(c(1, 20), range(hats_stacked$hats_cum), type = "n", 
     xlab = "Interval number", ylab = "Cumulated hat values")
invisible(
  tapply(hats_stacked$hats_cum, hats_stacked$id, lines, 
         col = gray(0, alpha = .2)))

Three individuals seem to stand out. We look at these in the next lines:

max_cum <- tapply(hats_stacked$hats_cum, hats_stacked$id, max)
is_max <- order(max_cum, decreasing = TRUE)[1:3]
is_max

# The records for these
whas500[is_max, ]

One of them has a high heart rate while the two others have a low BMI (below 18.5 is underweight). Another idea is to look at the average of the hat values up to the given time. The motivation is the two lines that end around the 5th interval, either because the individuals die or are right censored. Hence, we normalize by the interval number and plot against time:

# Averages of hat values
hats_stacked$hats_avg <- hats_stacked$hats_cum / hats_stacked$interval_n

# Plot against time
plot(c(1, 20), range(hats_stacked$hats_avg), type = "n", 
     xlab = "Interval number", ylab = "Avg. hat values")
invisible(
  tapply(hats_stacked$hats_avg, hats_stacked$id, lines, 
         col = gray(0, alpha = .2)))

Indeed, the two stand out. Hence, we look further at the five largest values:

max_avg <- tapply(hats_stacked$hats_avg, hats_stacked$id, max)
is_max_avg <- order(max_avg, decreasing = TRUE)[1:5]
is_max_avg

# The records for these
whas500[is_max_avg, ]

The two new ones with the highest values are one who is old and another with an extreme heart rate (a typical rule of thumb is that the maximum heart rate is $220$ minus your age!). In order to show this, we make the following plots:

# Setup parameters for the plot
cols <- rep("Black", 500)
cols[1:500 %in% is_max_avg] <- "Blue"
cols[1:500 %in% is_max] <- "Red"

cexs <- ifelse(cols != "Black", par()$cex * 1.25, par()$cex * .75)
pchs <- ifelse(whas500$fstat == 1 & whas500$lenfol <= 2000, 16, 1)

plot(whas500[, c("age", "hr", "bmi")], pch = pchs, cex = cexs, col = cols)

Filled circles are cases and non-filled circles are censored. The blue dots are the ones with a high maximum average hat value while the red ones have both a high maximum average and high cumulated hat values. Plotting against time shows that the censoring is clustered in time, as shown in the next two plots:

plot(whas500$lenfol, whas500$hr, col = cols, pch = pchs, 
     xlab = "Follow-up time", ylab = "Heart rate")
plot(whas500$lenfol, whas500$age, col = cols, pch = pchs, 
     xlab = "Follow-up time", ylab = "Age")

This could motivate us to stop slightly before the last cluster of censored individuals occurs. Thus, we set the final time (the max_T argument to ddhazard) to 1700 instead of 2000 in the next code block where we re-estimate the model. Further, we make a fit without the "extreme" individuals:

dd_whas <- ddhazard(
  Surv(lenfol, fstat) ~ age + bmi + hr + cvd,
  data = whas500, by = 100, max_T = 1700, 
  Q_0 = diag(10000, 5), Q = diag(.1, 5), 
  control = ddhazard_control(eps = .001))

dd_whas_no_extreme <- ddhazard(
  Surv(lenfol, fstat) ~ age + bmi + hr + cvd,

  data = whas500[-c(is_max, is_max_avg), ], # we exclude the "extreme" persons

  by = 100, max_T = 1700, 
  Q_0 = diag(10000, 5), Q = diag(.1, 5))

We plot the two sets of predicted coefficients next:

par(mfcol = c(2,3))
for(i in 1:5){
  plot(dd_whas, cov_index = i)
  plot(dd_whas_no_extreme, cov_index = i, add = TRUE, col = "DarkBlue")
}

The blue lines are the predicted coefficients without the "extreme" individuals. The difference seems minor.
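
A rough numeric summary of the discrepancy can be obtained by comparing the smoothed coefficient paths directly; a minimal sketch, assuming both fit objects store the smoothed means in state_vecs as used earlier (the two fits use the same by and max_T, so the paths have matching dimensions):

# largest absolute difference between the two sets of predicted coefficients
max(abs(dd_whas$state_vecs - dd_whas_no_extreme$state_vecs))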

Function definitions

The functions used in this vignette that are not included in the package are defined below:



Has to be done

We reset par and options here as per CRAN policy.

par(old_par)
options(old_options)

References

Rossi, P. H., Berk, R. A., and Lenihan, K. J. (1980). Money, Work, and Crime: Some Experimental Results. New York: Academic Press.
