knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  fig.width = 4,
  fig.height = 4,
  warning = FALSE,
  message = FALSE
)
library(rightTruncation)

Likelihood fitting of a Weibull delay distribution to right-truncated data

This vignette is intended to be self-contained (without going into all the calculations) so that the functions can be applied to other data. See the README at https://github.com/andrew-edwards/rightTruncation for details, including the link to the manuscript that contains the mathematical background. In particular, pages 8-12 of the Supplement give full details and derivations.

The data are counts $h_{nr}$ of the number of individuals whose case was reported (test was positive) at the end of day $r$ and whose symptoms are estimated to have started on day $n$. For example, $h_{3,5} = 2$ would mean that two individuals had symptom onset on day 3 and were reported at the end of day 5.

Simple matrix example

An example simulated data set gives the values of $h_{nr}$ as a matrix:

h_nr_simulated <- h_nr_simulate(N = 10)
h_nr_simulated

The matrix is upper triangular (all the entries below the diagonal are zero) because $h_{nr} = 0$ for $r < n$, since a case cannot be reported before the start of symptoms. Day $N$ is the final day of the data ($N = 10$ in the example), such that the non-zero values are $\{ h_{nr} \}_{n = 0, 1, 2, \ldots, N;\; n \leq r \leq N}$.
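
A quick check of the upper-triangular structure (a minimal sketch in base R; lower.tri() picks out the entries below the diagonal):

all(h_nr_simulated[lower.tri(h_nr_simulated)] == 0)  # TRUE: no reports before symptom onset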

The counts are right-truncated on day $N$ -- there are individuals whose symptoms started on day $n$ who will be reported in the future (after day $N$), but we do not yet know when. For example, for people whose symptoms started on day $n = 8$, we only know about those whose cases were reported on days 8, 9 and 10. Anyone with a delay between symptom onset and reporting of longer than 2 days is not in our data set and, crucially, we do not even know about them yet. Cases are considered to be reported at the end of day $r$ because there are values of $h_{nn} > 0$. The example matrix is $11 \times 11$ since days start at 0 and end at $N = 10$.
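
To see this truncation in the simulated matrix, a minimal sketch (assuming rows and columns correspond to days $0, 1, \ldots, 10$, so R index 9 is onset day $n = 8$):

h_nr_simulated[9, ]  # only the entries for r = 8, 9, 10 can be non-zero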

The maximum likelihood estimates are calculated by minimising the negative log-likelihood function.
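
Writing $F(\cdot\,; k, \lambda)$ for the cumulative distribution function of a Weibull distribution with shape $k$ and scale $\lambda$, each count $h_{nr}$ contributes the probability of a delay of $r - n$ days conditional on the case being reported by day $N$. A sketch of the resulting function (assuming the discretisation in the Supplement, in which a case reported at the end of day $r$ has a delay in $[r - n,\, r - n + 1)$) is

$$-\log L(k, \lambda) = -\sum_{n=0}^{N} \sum_{r=n}^{N} h_{nr} \log \left( \frac{F(r - n + 1; k, \lambda) - F(r - n; k, \lambda)}{F(N - n + 1; k, \lambda)} \right),$$

which is minimised numerically using nlm(), with starting values for $(k, \lambda)$: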

MLE_res <- nlm(f = negLL_Weibull_counts_matrix,
               p = c(3, 15),          # starting values for (k, lambda)
               h_nr = h_nr_simulated)

(Note that such code usually gives warnings about NaNs being produced during the optimisation; these have been suppressed in this vignette.) The main results of interest are the MLEs for $k$ and $\lambda$, and also the resulting mean and median.
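
For reference, a Weibull distribution with shape $k$ and scale $\lambda$ has mean $\lambda \, \Gamma(1 + 1/k)$ and median $\lambda (\ln 2)^{1/k}$; these are what the following code computes: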

k_MLE <- MLE_res$estimate[1]
lambda_MLE <- MLE_res$estimate[2]
k_MLE      # shape
lambda_MLE # scale
mean_using_MLEs <- lambda_MLE * gamma(1 + 1/k_MLE)
mean_using_MLEs
median_using_MLEs <- lambda_MLE * (log(2))^(1/k_MLE)
median_using_MLEs

A wrapper function simulates the same data set and fits it; here we just show the fitted parameters (the random number seed is fixed in the simulation function, hence the result is the same as above):

h_nr_one_sim_fit(N = 10)$estimate

Real data, analyse as a data frame

Once we have much larger values of $N$, it makes more sense to use a long data-frame format for the data, particularly since we know the matrix is always upper triangular.
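
For example, a minimal base-R sketch converting the earlier simulated matrix to such a long format (assuming, as above, that rows and columns correspond to days $0, \ldots, N$):

idx <- which(h_nr_simulated > 0, arr.ind = TRUE)
h_nr_long <- data.frame(n = idx[, "row"] - 1,   # convert 1-based indices to days
                        r = idx[, "col"] - 1,
                        h_nr = h_nr_simulated[idx])
head(h_nr_long)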

The actual data in British Columbia look like this, with a simple histogram of the delay times between symptom onset and reporting of a case:

delay_data
time_report_vec <- 0:(as.numeric(max(delay_data$time_to_report)) + 2)
hist(as.numeric(delay_data$time_to_report),
     breaks = time_report_vec,
     right = FALSE,
     xlab = "Time from symptom onset to reported case (days)",
     main = "",
     col = "lightgrey")

To see the individual cases as points, with barplots that summarise the values for each day:

plotdelay1 <- plot_time_to_report(delay_data, x_axis = "onset",
                                  xLim = c(lubridate::ymd("2020-02-28"),
                                           max(delay_data$reported_date) + 2))
plotdelay2 <- plot_time_to_report(delay_data,
                                  xLim = c(lubridate::ymd("2020-02-28"),
                                           max(delay_data$reported_date) + 2))
gridExtra::grid.arrange(
             plotdelay1,
             plotdelay2,
             ncol = 1)

By definition there can be no values above the dashed 1:1 line in the top panel, because such delays to reporting are too long (any such cases will only be reported in the future, with data only available up to max(delay_data$reported_date)).

The actual values we need for the likelihood calculations are just counts $h_{nr}$ of the number of cases that had symptom onset on day $n$ and were reported on day $r$, where $n$ and $r$ take values from $0$ to $N$. This function converts the above tibble into the required form:

h_nr_tibble <- make_h_nr_tibble(delay_data)
h_nr_tibble
# Show how many of each h_nr value there are:
summary(as.factor(h_nr_tibble$h_nr))

To calculate the maximum likelihood estimates for $k$ and $\lambda$:

MLE <- nlm(f = negLL_Weibull_counts_tibble,
           p = c(3, 15),
           h_nr_tibble = h_nr_tibble)

k_MLE <- MLE$estimate[1]
lambda_MLE <- MLE$estimate[2]
k_MLE      # shape
lambda_MLE # scale
mean_using_MLEs <- lambda_MLE * gamma(1 + 1/k_MLE)
mean_using_MLEs
median_using_MLEs <- lambda_MLE * (log(2))^(1/k_MLE)
median_using_MLEs

Same analysis from early May

When run in early May, the results were:

# k_MLE      # shape
#  1.655612
# lambda_MLE # scale
#  9.901742
# mean_using_MLEs
#  8.851888
# median_using_MLEs
#  7.935406

Same analysis from early April

For the first manuscript, we used the data between February 29 and April 2:

h_nr_manuscript <- make_h_nr_tibble(delay_data,
                                    day_0 = lubridate::ymd("2020-02-29"),
                                    day_N = lubridate::ymd("2020-04-02"))
MLE_man <- nlm(f = negLL_Weibull_counts_tibble,
               p = c(3, 15),
               h_nr_tibble = h_nr_manuscript)
k_MLE_man <- MLE_man$estimate[1]
lambda_MLE_man <- MLE_man$estimate[2]
k_MLE_man      # shape
lambda_MLE_man # scale
mean_using_MLEs_man <- lambda_MLE_man * gamma(1 + 1/k_MLE_man)
mean_using_MLEs_man
median_using_MLEs_man <- lambda_MLE_man * (log(2))^(1/k_MLE_man)
median_using_MLEs_man

There are sum(h_nr_manuscript$h_nr) individual cases there. In the first manuscript we reported only 535, the difference being due to the data having been updated since the code was run for the manuscript; the difference was small at one point, but now (early June) it is a bit larger ($k_{MLE} = 1.73$ and $\lambda_{MLE} = 9.85$ in the manuscript). Note that we cannot simply use the current data up to, say, May 1 to reproduce what we actually knew then.

The following code plots results for British Columbia and New Zealand, but is not evaluated here because the New Zealand data cannot be made public. TODO: functionalise the plotting and use BC as the example; a possible sketch follows the code.

png(filename = "delay-hist-BC-NZ.png",
    width = 6.8*240,
    height = 4.5*240,
    res = 240)
# Load the New Zealand histogram first, since it is needed to set max_delay:
h_NZ <- readRDS(paste0(here::here(), "/data-raw/NZ_histogram.rds"))

max_delay <- max(c(h_nr_manuscript$r - h_nr_manuscript$n, h_NZ$breaks) + 2)
time_report_vec <- seq(0, max_delay, by = 0.1)

time_report_BC <- dweibull(time_report_vec,
                           shape = k_MLE_man,
                           scale = lambda_MLE_man) * sum(h_nr_manuscript$h_nr)
time_report_NZ <- dweibull(time_report_vec,
                           shape = 1.53289,      # New Zealand fitted values
                           scale = 7.818021) * sum(h_NZ$counts)
par(mfrow = c(1, 2))
hist(rep(h_nr_manuscript$r - h_nr_manuscript$n, h_nr_manuscript$h_nr),
     breaks = 0:max_delay,
     right = FALSE,
     xlab = "Days from symptom onset to reporting",
     main = "British Columbia",
     col = "lightgrey")
lines(time_report_vec,
      time_report_BC,
      col = "red",
      lwd = 2)
plot(h_NZ,
     xlab = "Days from symptom onset to reporting",
     main = "New Zealand",
     col = "lightgrey")
lines(time_report_vec,
      time_report_NZ,
      col = "red",
      lwd = 2)
dev.off()
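
A possible first step towards that TODO (a hypothetical function, not in the package, shown here as a sketch with British Columbia as the example):

plot_delay_fit <- function(h_nr_tib, k, lambda, main = "") {
  # Expand the counts into one delay value per individual case
  delays <- rep(h_nr_tib$r - h_nr_tib$n, h_nr_tib$h_nr)
  max_delay <- max(delays) + 2
  hist(delays,
       breaks = 0:max_delay,
       right = FALSE,
       xlab = "Days from symptom onset to reporting",
       main = main,
       col = "lightgrey")
  # Overlay the fitted Weibull density, scaled by the total number of cases
  x <- seq(0, max_delay, by = 0.1)
  lines(x,
        dweibull(x, shape = k, scale = lambda) * sum(h_nr_tib$h_nr),
        col = "red",
        lwd = 2)
}
# For example:
# plot_delay_fit(h_nr_manuscript, k_MLE_man, lambda_MLE_man, main = "British Columbia")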

Confidence intervals for the main results

Do univariate confidence intervals (Ben Bolker's book implies they are often close to bivariate ones): fix one parameter at its MLE and get the interval for the other.
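
The 95% interval returned by sizeSpectra::profLike() consists of the parameter values whose negative log-likelihood lies within $\chi^2_{1, 0.95}/2 \approx 1.92$ of the minimum (the usual profile-likelihood criterion); vecInc sets the increment of the grid of values tested, and vecDiff the range scanned either side of the MLE: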

k_confint <- sizeSpectra::profLike(negLL_Weibull_counts_tibble,
                                   MLE = k_MLE,
                                   minNegLL = MLE$minimum,
                                   vecInc = 0.001,
                                   h_nr_tibble = h_nr_tibble,
                                   lambda_MLE = lambda_MLE)
k_confint  #  In early May results were: 1.60-1.86

lambda_confint <- sizeSpectra::profLike(negLL_Weibull_counts_tibble,
                                        MLE = lambda_MLE,
                                        minNegLL = MLE$minimum,
                                        vecInc = 0.001,
                                        vecDiff = 0.65,
                                        h_nr_tibble = h_nr_tibble,
                                        k_MLE = k_MLE)
lambda_confint  # In early May results were: 9.30-10.46

New Zealand data

This section summarises, from the above, what is needed to apply the method to other jurisdictions, and was successfully used by colleagues in New Zealand.

We would like to fit a Weibull distribution to data on the delay between a person having the onset of symptoms and them being reported as a confirmed case of COVID-19 in New Zealand. The data we have in British Columbia are a tibble with these (self-explanatory) headings (the final column is just the difference between the first two):

delay_data

This gets converted into the desired form for the likelihood function:

h_nr_tibble <- make_h_nr_tibble(delay_data)
h_nr_tibble

and the maximum likelihood calculation is then run using:

MLE <- nlm(f = negLL_Weibull_counts_tibble,
           p = c(3,15),
           h_nr_tibble = h_nr_tibble)
MLE

If you have data in other forms (or are not too familiar with tibbles) then please let me know (open an Issue) and I can write a wrapper function to enable you to run the likelihood calculation.
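
In the meantime, a minimal sketch of building a suitable tibble from raw dates (reported_date and time_to_report match the columns used above; the onset-date column name here is hypothetical, so check what make_h_nr_tibble() expects):

my_data <- tibble::tibble(
  onset_date    = lubridate::ymd(c("2020-03-01", "2020-03-01", "2020-03-03")),  # hypothetical column name
  reported_date = lubridate::ymd(c("2020-03-04", "2020-03-06", "2020-03-05")))
my_data$time_to_report <- my_data$reported_date - my_data$onset_date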


