```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```r
library(dplyr)
library(tidyr)
devtools::load_all()
```
```r
library(EpiModel)
library(dplyr)
library(tidyr)
library(EpiCompare)
```
Creating prediction regions and quantifying uncertainty in epidemiological function space is a complicated endeavor. In this vignette, we provide the reader some background for the approach presented in this package. Although our work is based on work in (Dalmasso, Dunn, LeRoy^[Ben is a coauthor of this paper.], Schafer 2019), we provide more statistical grounding for our shared approaches as well as improved approaches for simulation-based prediction regions for multivariate functional data (functions with output in multi-dimensional space).
The first section of this vignette reviews useful tools and related work for creating prediction regions in Euclidean space. We focus on creating rudimentary level sets for conditional density estimates.
The second part of this vignette demonstrates how we leverage level set estimates to define prediction regions for simulated epidemics. We apply these regions to a mixture of SIR epidemic models and examine different ranking/quantile functions for constructing them.
There are many ways to create prediction regions. The most famous prediction region (interval) comes from linear regression. Under assumptions about the linearity of $y|X$ and the normality of the residuals, we can define a prediction interval for a new point $y_*|X_*$ as
$$X_*^T\hat{\beta} \pm z_{\alpha/2} \cdot \hat{\sigma} \sqrt{1 + X_*^T \left( X^TX \right)^{-1}X_*} \;,$$
where $X_*$ is a vector of $p$ predictors for the new point, $X$ is the design matrix for the training data, $\hat{\beta} = \left(X^TX\right)^{-1}X^Ty$, and $\hat{\sigma} = \sqrt{SSE/(n-(p+1))}$. This interval should contain $y_*$ $(1-\alpha)100\%$ of the time.
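As a quick aside, this interval can be computed directly with base R's `lm` and `predict` (a minimal sketch with simulated data; the variable names below are made up for illustration, and `predict.lm` uses a $t$ quantile in place of $z_{\alpha/2}$, which only matters for small $n$):

```r
# Minimal sketch (base R only): a 95% prediction interval from a linear model.
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100, sd = .5)

fit <- lm(y ~ x1 + x2, data = dat)

# prediction interval at a new point
new_point <- data.frame(x1 = .3, x2 = -1)
predict(fit, newdata = new_point, interval = "prediction", level = .95)
```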
Such a prediction region relies heavily on the strict assumptions of the linear model. It is often useful to place weaker assumptions on the generative distribution, potentially going as far as fitting black-box models that return an estimate of the conditional density of $y|x$ to define prediction regions [e.g. @Izbicki2017].
For models that can estimate conditional densities, density level sets provide a great tool to create a prediction region. Level sets are known to be the best (i.e. most efficient) prediction regions when the density is known (mentioned in passing in [@Lei2013]). Although level sets are often visualized by simply evaluating the density over a grid, there are many ways to estimate density level sets, especially when one has a method of sampling from the underlying distribution. [@Grenander1981a; @Walther1997; @Jiang2017] all present ways to use a union of balls to represent level sets for densities estimated from simulations/random samples. @Walther1997's more efficient approach is illustrated in the figures below. Our approach extends these level-set-based prediction regions from low-dimensional Euclidean space to multivariate functional representations of epidemics.
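To make this concrete, here is a minimal sketch (using only base R and `MASS`, not this package's implementation) of a plug-in level-set region: estimate the density at each sample point, choose the cutoff $\lambda$ as a quantile of those values, and keep the points whose estimated density is at least $\lambda$; the union-of-balls approaches then wrap the retained points in balls of radius $r_n$.

```r
# Minimal sketch (assumes only MASS): a plug-in KDE level-set "prediction
# region" for a 2-d sample. lambda is set so that roughly conf_level of the
# sample points fall inside the region.
library(MASS)

set.seed(2)
samples <- data.frame(x = rnorm(500), y = rnorm(500))
conf_level <- .5

# kernel density estimate on a grid, then read off the estimate at each point
kde <- kde2d(samples$x, samples$y, n = 100)
dens_at_points <- apply(samples, 1, function(pt) {
  kde$z[which.min(abs(kde$x - pt["x"])), which.min(abs(kde$y - pt["y"]))]
})

# cutoff lambda: the (1 - conf_level) quantile of the densities at the points
lambda <- quantile(dens_at_points, probs = 1 - conf_level)

# points inside the estimated level set; a union of balls of radius r_n around
# these points gives the Walther-style region shown in the figures below
in_region <- samples[dens_at_points >= lambda, ]
nrow(in_region) / nrow(samples)  # approximately conf_level
```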
The visuals below show how the prediction regions change when varying either the level set cutoff value $\lambda$ (top figures) or the radius of the delta balls $r_n$ (bottom figures). In each, the middle figure shows a delta ball approach to creating a prediction region, a process dependent on both the radius $r_n$ and the cutoff value $\lambda$. The figure on the right shows a prediction region determined by kernel density estimates, a process dependent only on $\lambda$.
```r
library(tidyverse)
library(ks)
library(latex2exp)

# data ------
set.seed(1)
data <- data.frame(x = runif(10000), y = runif(10000))
#plot(data$x, data$y)
#id_vals <- identify(data$x, data$y)
id_vals <- c(545, 800, 1090, 1659, 1750, 1886, 2374, 2692, 2901, 3411, 3839,
             4018, 5023, 5341, 5505, 5523, 7983, 8273, 8706, 9025, 9401)
data_select <- data[id_vals,]
data_select$x[data_select$y < .55] <- data_select$x[data_select$y < .55] + .05
data_select$y[data_select$y > .5 & data_select$y < .65] <-
  data_select$y[data_select$y > .5 & data_select$y < .65] + .04
data_select$x[data_select$x < .3] <- data_select$x[data_select$x < .3] + .05

# kernel density estimate evaluated at the observed points
# (mirrors MASS::kde2d's bandwidth handling)
my_kde_similar_mass_kde2 <- function(data, h){
  x <- data$x
  y <- data$y
  nx <- length(x)
  h <- rep(h, 2L)
  gx <- x
  gy <- y
  if (any(h <= 0)) {
    stop("bandwidths must be strictly positive")
  }
  h <- h/4
  ax <- outer(gx, x, "-")/h[1L]
  ay <- outer(gy, y, "-")/h[2L]
  z <- diag(tcrossprod(matrix(dnorm(ax), , nx),
                       matrix(dnorm(ay), , nx))/(nx * h[1L] * h[2L]))
  return(z)
}

dist_x <- dist(data_select %>% select(x, y)) %>% as.matrix()
rank_prob_range <- (1:nrow(data_select))/nrow(data_select) + .5/nrow(data_select)
```
```r
# label points +1/-1 based on whether their density rank exceeds `rank_prob`
# (or relative to a raw `threshold`)
split_threshold_sims <- function(probs, threshold = NA, rank_prob = NA){
  if (is.na(threshold) & is.na(rank_prob)){
    stop("either 'threshold' or 'rank_prob' must be specified")
  }
  if (is.na(threshold)){
    proportion_rank <- rank(probs) / length(probs)
    id <- 2*(proportion_rank > rank_prob) - 1
  } else {
    id <- 2*(probs < threshold) - 1
  }
  return(id)
}

# demote +1 points that lie within r_n of a -1 point
plus_update <- function(dist_x, group_id, r_n = 1){
  group_id_inner <- group_id
  plus_dist_to_minus <- dist_x[group_id == 1, ][, group_id == -1]
  group_id_inner[group_id == 1] <- 2*(apply(plus_dist_to_minus, 1, min) > r_n) - 1
  return(group_id_inner)
}

contour_vis_df <- function(correct_plus_data, r_n = 1, over_delta = .1,
                           grid_size = c(200, 200)){
  # code similar to EpiCompare::StatPredBandDeltaBall's compute_group function
  data_deep_points <- correct_plus_data
  delta_info <- EpiCompare::delta_structure(data_deep_points,
                                            xy_columns = c("x", "y"))
  structure <- delta_info$structure
  inner_df <- dplyr::setdiff(data_deep_points %>% dplyr::select(x, y),
                             structure %>% dplyr::select(x, y))
  border_points <- structure %>% dplyr::select(x, y)
  inner_points <- inner_df
  xrange <- seq(min(border_points$x) - over_delta,
                max(border_points$x) + over_delta, length.out = grid_size[1])
  yrange <- seq(min(border_points$y) - over_delta,
                max(border_points$y) + over_delta, length.out = grid_size[2])
  updated_gridpoints <- EpiCompare:::get_closest(border_points, inner_points, r_n,
                                                 xrange = xrange, yrange = yrange,
                                                 gridbreaks = NULL)
  update_gridpoints_mat <- tidyr::pivot_wider(updated_gridpoints,
                                              names_from = "y",
                                              values_from = "z") %>%
    dplyr::select(-x) %>% as.matrix
  cl <- grDevices::contourLines(x = xrange, y = yrange,
                                z = update_gridpoints_mat, levels = c(2))
  lengths <- vapply(cl, function(x) length(x$x), integer(1))
  xs <- unlist(lapply(cl, "[[", "x"), use.names = FALSE)
  ys <- unlist(lapply(cl, "[[", "y"), use.names = FALSE)
  pieces <- rep(seq_along(cl), lengths)
  vis_df <- data.frame(x = xs, y = ys, piece = pieces)
  return(vis_df)
}

contour_density_vis_df <- function(data, h = .1, over_delta = .1,
                                   grid_size = c(200, 200), lambda = 1){
  kde_estimate <- MASS::kde2d(x = data$x, y = data$y, h = h, n = grid_size,
                              lims = c(min(data$x) - over_delta,
                                       max(data$x) + over_delta,
                                       min(data$y) - over_delta,
                                       max(data$y) + over_delta))
  cl <- grDevices::contourLines(x = kde_estimate$x, y = kde_estimate$y,
                                z = kde_estimate$z, levels = c(lambda))
  lengths <- vapply(cl, function(x) length(x$x), integer(1))
  xs <- unlist(lapply(cl, "[[", "x"), use.names = FALSE)
  ys <- unlist(lapply(cl, "[[", "y"), use.names = FALSE)
  pieces <- rep(seq_along(cl), lengths)
  vis_df <- data.frame(x = xs, y = ys, piece = pieces)
  return(vis_df)
}
```
```r
arrangeGrob2 <- gridExtra::arrangeGrob

h <- .1
n <- nrow(data_select)

prop_cutoff <- c(seq(.5/n, 1 + .5/n, by = 1/n)[4:18],
                 seq(.5/n, 1 + .5/n, by = 1/n)[18:8],
                 rep(seq(.5/n, 1 + .5/n, by = 1/n)[8], 23),
                 seq(.5/n, 1 + .5/n, by = 1/n)[8:4])
# come back to .35
radius <- c(rep(.035, 26),
            seq(.035, .02, by = -.005),
            seq(.02, .1, by = .005),
            seq(.1, .035, by = -.05),
            rep(.035, 5))

prop_cutoff_level <- c(seq(.5/n, 1 + .5/n, by = 1/n)[4:18],
                       seq(.5/n, 1 + .5/n, by = 1/n)[18:4])
radius_level <- rep(.035, length(prop_cutoff_level))

radius_rad <- c(seq(.1, .02, by = -.005), seq(.02, .1, by = .005))
prop_cutoff_rad <- rep(seq(.5/n, 1 + .5/n, by = 1/n)[8], length(radius_rad))

parameters_list_level <- lapply(1:length(radius_level),
                                function(idx) list("r_n" = radius_level[idx],
                                                   "rank_prob" = prop_cutoff_level[idx]))
parameters_list_rad <- lapply(1:length(radius_rad),
                              function(idx) list("r_n" = radius_rad[idx],
                                                 "rank_prob" = prop_cutoff_rad[idx]))

walther_plot_func <- function(input, range_name = "r_n"){
  my_data_select <- data_select
  my_prob <- my_kde_similar_mass_kde2(my_data_select, h = h)
  my_data_select$my_prob <- my_prob
  my_data_select$rank <- rank(my_data_select$my_prob)

  rank_prob <- input$rank_prob
  r_n <- input$r_n

  probs <- my_data_select$my_prob
  group_id <- split_threshold_sims(my_data_select$my_prob, rank_prob = rank_prob)
  group_id_clean <- plus_update(dist_x, group_id, r_n = r_n)

  data_select_inner <- my_data_select %>%
    mutate(raw_id = group_id, cleaned_id = group_id_clean)

  lambda <- min(probs[group_id == 1])
  lambda_range <- range(probs)
  r_n_range <- c(.02, .1)

  vis0 <- ggplot(data.frame(lambda = lambda, r_n = r_n)) +
    geom_point(aes(y = r_n, x = lambda)) +
    ggrepel::geom_label_repel(aes(y = r_n, x = lambda,
                                  label = paste0("(", round(r_n, 4), ",",
                                                 round(lambda, 4), ")")),
                              nudge_x = 3, nudge_y = .003) +
    labs(x = TeX("$\\lambda$, threshold cutoff"),
         y = TeX("$r_n$, radius")) +
    ylim(r_n_range) + xlim(lambda_range) +
    theme_minimal()

  vis_df <- contour_vis_df(data_select_inner[group_id_clean == 1,], r_n = r_n)

  vis1 <- vis_df %>% ggplot() +
    geom_polygon(aes(x = x, y = y, group = piece), fill = NA, color = "red") +
    geom_point(data = data_select_inner,
               aes(x = x, y = y, color = factor(cleaned_id),
                   shape = factor(raw_id))) +
    theme_minimal() + theme(aspect.ratio = 1) +
    labs(title = "Data Visualization",
         subtitle = "ranked by estimated probability value",
         y = TeX("$y_2$"), x = TeX("$y_1$"),
         color = TeX("In defining set"),
         shape = TeX("$\\hat{f}(x) \\geq \\lambda$")) +
    xlim(c(.15, .65)) + ylim(c(.3, .9)) +
    scale_shape_manual(values = c(16, 1))

  vis_df2 <- contour_density_vis_df(data = data_select_inner %>% dplyr::select(x, y),
                                    h = h, grid_size = c(200, 200), lambda = lambda)

  vis2 <- vis_df2 %>% ggplot() +
    geom_density_2d(data = data_select_inner, aes(x = x, y = y), h = h, alpha = .3) +
    geom_polygon(aes(x = x, y = y, group = piece), fill = NA, color = "red") +
    geom_point(data = data_select_inner,
               aes(x = x, y = y, color = factor(cleaned_id),
                   shape = factor(raw_id))) +
    theme_minimal() + theme(aspect.ratio = 1) +
    labs(title = "Data Visualization",
         subtitle = "Kernel Density Estimates",
         y = TeX("$y_2$"), x = TeX("$y_1$"),
         color = TeX("In defining set"),
         shape = TeX("$\\hat{f}(x) \\geq \\lambda$")) +
    xlim(c(.15, .65)) + ylim(c(.3, .9)) +
    scale_shape_manual(values = c(16, 1))

  layout_matrix <- matrix(c(1, 2, 2, 2, 3, 3, 3,
                            1, 2, 2, 2, 3, 3, 3,
                            1, 2, 2, 2, 3, 3, 3),
                          nrow = 3, ncol = 7, byrow = T)

  return(arrangeGrob2(grobs = list(vis0, vis1, vis2),
                      layout_matrix = layout_matrix))
}

walther_plots_level <- lapply(parameters_list_level,
                              walther_plot_func, range_name = "level")
walther_plots_rad <- lapply(parameters_list_rad, walther_plot_func)
```
```r
# walk(walther_plots_level, plot)
# note: the resulting gif needs to be manually copied to `man/figures/` to be
# used; run this file in a location other than vignettes/ - it will not save
# the gifs otherwise
```
```r
# walk(walther_plots_rad, plot)
# note: the resulting gif needs to be manually copied to `man/figures/` to be
# used; run this file in a location other than vignettes/ - it will not save
# the gifs otherwise
```
<center>(Animated GIF: prediction regions as the level set cutoff $\lambda$ varies, with the radius $r_n$ fixed.)</center> <br>
<center>(Animated GIF: prediction regions as the delta ball radius $r_n$ varies, with the cutoff $\lambda$ fixed.)</center>
To demonstrate our approaches to building prediction regions for epidemics, we start with a data example. We first examine the structure of the simulated epidemics and then look at different approaches for capturing a desired region of the space.
To understand how we create prediction regions (and a few options we provide), we will examine a generative model where the true epidemic comes from a mixture of two SIR models with different infection parameters $\beta$ and recovery parameters $\gamma$. Specifically,
$$\text{Epidemic} \sim .7 \cdot \text{Model}_1 + .3 \cdot \text{Model}_2$$
where
$$\text{Model}_1 = \text{SIR}(\beta = .1, \gamma = .03), \quad \text{Model}_2 = \text{SIR}(\beta = .15, \gamma = .05).$$
```r
set.seed(11)

## first group -----------
beta <- .1
gamma <- .03
n <- 100
prop_class_1 <- .7

out <- simulate_SIR_agents(n_sims = round(.7*n),
                           n_time_steps = 100,
                           beta = beta, gamma = gamma,
                           init_SIR = c(950, 50, 0))

df_group <- out %>%
  group_by(sim) %>%
  agents_to_aggregate(states = c("tI", "tR"), min_max_time = c(0, 100)) %>%
  rename(S = "X0", I = "X1", R = "X2")

## second group ----------
beta <- .15
gamma <- .05

out2 <- simulate_SIR_agents(n_sims = round((1-.7)*n),
                            n_time_steps = 100,
                            beta = beta, gamma = gamma,
                            init_SIR = c(950, 50, 0))

df_group2 <- out2 %>%
  group_by(sim) %>%
  agents_to_aggregate(states = c("tI", "tR"), min_max_time = c(0, 100)) %>%
  rename(S = "X0", I = "X1", R = "X2")

df_all <- rbind(df_group %>% mutate(id = "Model 1"),
                df_group2 %>% mutate(id = "Model 2"))

df_all_vis <- df_all %>%
  mutate(id2 = factor(paste0(id, sim),
                      levels = paste0(rep(c("Model 1", "Model 2"), each = 100),
                                      rep(1:100, length = 200)))) %>%
  mutate(id2 = as.numeric(id2))
```
The following figure shows these curves on the unit simplex. Notice that it is hard for the eye to tell that the second group contains only 30\% of the simulations. This observation will be important later.
```r
all_curves <- ggplot(df_all_vis) +
  geom_path(aes(x = S, y = I, z = R, color = id, group = id2), alpha = .2) +
  coord_tern() +
  labs(color = "Group") +
  theme_sir()

all_curves
```
If we imagine each epidemic simulation as its own object and treat these simulations like functional data, it is natural to compare simulations with a distance metric ($d_\mathcal{F}$). The metric $d_\mathcal{F}$ can be defined in different ways [e.g. @Chiou2014; @Chen2012; @Buchman2011]. For concreteness, one potential distance between multivariate random functions from @Buchman2011 is defined as
$$d_\mathcal{F}(f_1, f_2) = \int_{c \in C} d_Y(\mathbf{f}_1(c), \mathbf{f}_2(c)) \, dc \;.$$
This distance (if $C$ indexes time) doesn't really align with our goal of "time-invariant" analysis, so we provide other distances that try to be more "time-invariant", including
$$d_\mathcal{F}^\text{equa}(f_1, f_2, n) = \frac{1}{n} \sum_{i=1}^n d_Y(f_1(c_{1(i)}), f_2(c_{2(i)}))\;,$$
where $c_{j(i)}$ corresponds to equally spaced, ordered points along the function^[In line with treating the function as a filament.].
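As a rough illustration of this style of distance (a hedged sketch in base R, not the package's implementation), one can re-parameterize each curve at $n$ points equally spaced in cumulative arc length and then average the pointwise Euclidean distances; in the package itself this is handled by `filament_compression()` followed by `dist_matrix_innersq_direction()`, used later in this vignette.

```r
# Minimal sketch: a "time-invariant" distance between two curves stored as
# matrices whose rows are ordered (S, I, R) coordinates. Illustration only.
equal_arclength_points <- function(curve, n = 20) {
  steps <- sqrt(rowSums(diff(curve)^2))   # segment lengths between rows
  cum_len <- c(0, cumsum(steps))          # cumulative arc length
  targets <- seq(0, max(cum_len), length.out = n)
  # linearly interpolate each coordinate at the target arc lengths
  apply(curve, 2, function(coord) approx(cum_len, coord, xout = targets)$y)
}

dist_equa <- function(curve1, curve2, n = 20) {
  p1 <- equal_arclength_points(curve1, n)
  p2 <- equal_arclength_points(curve2, n)
  mean(sqrt(rowSums((p1 - p2)^2)))        # average pointwise distance
}

# toy usage with two made-up curves in (S, I, R) space
t1 <- seq(0, pi, length.out = 50)
c1 <- cbind(S = 950 - 250 * t1, I = 50 + 300 * sin(t1))
c1 <- cbind(c1, R = 1000 - rowSums(c1))
t2 <- seq(0, pi, length.out = 80)
c2 <- cbind(S = 950 - 200 * t2, I = 50 + 200 * sin(t2))
c2 <- cbind(c2, R = 1000 - rowSums(c2))
dist_equa(c1, c2, n = 20)
```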
To summarize these curves we wish to capture a representation of highly likely curves through a prediction region. Although one cannot define a density for multivariate functions, there is a long practice of defining pseudo-densities [e.g. @Ferraty2012; @Ciollaro2014] via kernels and the distance between observations, e.g.
$$\hat{f}(X) = \frac{1}{n}\sum_{i=1}^n K(\text{dist}(X, X_i)^2/\sigma) \;,$$
often with $K$ being the standard Gaussian kernel. We can calculate the pseudo-density for the set of simulations, and we visualize them below. In both figures below we can see that the estimated pseudo-density for most simulations from the first mixture component lies above the maximum pseudo-density estimated for simulations from the second component.
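The formula above is simple enough to sketch directly. Below is a minimal, hedged illustration of the computation from a precomputed distance matrix (following the displayed formula with $K$ the standard Gaussian kernel); the package's `distance_psuedo_density_function`, used later in this vignette, is the supported way to do this.

```r
# Minimal sketch: pseudo-density from a distance matrix, following the formula
# above with K = dnorm. `dmat` is an n x n matrix of pairwise distances between
# simulations and `sigma` is the bandwidth. Illustration only, not the package
# function.
pseudo_density_sketch <- function(dmat, sigma) {
  rowMeans(dnorm(dmat^2 / sigma))
}

# toy usage on random points in the plane
set.seed(3)
pts <- matrix(rnorm(40), ncol = 2)
dmat <- as.matrix(dist(pts))
round(pseudo_density_sketch(dmat, sigma = median(dmat)), 3)
```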
Additional commentary associated with the code (`tidy_dist_mat`): although not important for this vignette's message, EpiCompare introduces a new class, `tidy_dist_mat`, to keep track of the tidyverse grouping structure we have relative to the simulations. This allows us to keep track of the distances between the simulated epidemics. A `tidy_dist_mat` works very similarly to a distance matrix but also stores keys of information for each row and column of the matrix, akin to tidyverse-style grouping. This was "mostly" done tidyverse style, as opposed to fully tidyverse style, because the tidyverse is inconsistent in the ordering of its groupings (`group_by` + `nest` vs `group_nest`/`group_split`/`group_keys`). See `?tidy_dist_mat` for examples. Any function that can take in a `tidy_dist_mat` can also take in a "regular" `matrix`.
```r
# this requires keeping track of the names in this way (sadly)
compression_df <- df_all %>%
  arrange(t) %>% # just to be safe
  dplyr::select(-t) %>%
  group_by(id, sim) %>%
  filament_compression(data_columns = c(S, I, R), number_points = 20)

tidy_dm <- compression_df %>%
  dist_matrix_innersq_direction(position = 3:5,
                                verbose = TRUE,
                                tdm_out = TRUE)

sigma <- quantile(as.matrix(tidy_dm), probs = .3)
```
```r
psuedo_density <- distance_psuedo_density_function(x = tidy_dm, sigma = sigma)

group_names <- names(psuedo_density)[names(psuedo_density) != "pseudo_density"]

df_all_pd <- df_all_vis %>%
  left_join(psuedo_density, by = c("sim", "id")) %>%
  ungroup() %>%
  mutate(psuedo_density2 = cut(.data$psuedo_density, breaks = 5),
         id2 = factor(paste0(id, sim),
                      levels = paste0(rep(c("Model 1", "Model 2"), each = 100),
                                      rep(1:100, length = 200)))) %>%
  mutate(id2 = as.numeric(id2))
```
```r
psuedo_density %>%
  ggplot() +
  geom_histogram(aes(x = psuedo_density, fill = id)) +
  facet_grid(id ~ .) +
  labs(fill = "mixture membership", x = "pseudo-density")
```
```r
df_all_pd %>%
  ggplot(aes(x = S, y = I, z = R, color = psuedo_density, group = id2)) +
  geom_path() +
  coord_tern() +
  labs(title = "Pseudo Density", color = "pseudo density") +
  theme_sir()
```
```r
# not needed to be presented
df_all_pd %>%
  ggplot(aes(x = S, y = I, z = R, color = psuedo_density, group = id2)) +
  geom_path() +
  coord_tern() +
  labs(title = "Pseudo Density", color = "pseudo density") +
  theme_sir() +
  facet_wrap(~psuedo_density2)
```
By extending the union-of-balls level-set estimation techniques described in the first section, we wrap each function in a small "tube" to represent the level set^[This is approximated by what [@Dalmasso2019] describes as the $\delta$-ball approach, but with a more statistically valid approach using the pseudo-density.]. The final result with the pseudo-density estimate is presented below (with an $\alpha$ level of $.5$).
```r
psuedo_density_pb <- all_curves +
  geom_prediction_band(data = df_all_vis,
                       aes(x = S, y = I, z = R, sim_group = id2),
                       pb_type = "delta_ball", conf_level = .5,
                       dist_params = list(dist_approach = "auto",
                                          num_steps = "auto",
                                          quantile_approach = "psuedo_density",
                                          quantile_approach_params = list("sigma" = "30%"))) +
  coord_tern() +
  theme_sir()

psuedo_density_pb
```
```r
# removed as not needed
p_bound <- quantile(psuedo_density$psuedo_density, 1 - .5)

df_all_pd %>%
  filter(psuedo_density > p_bound) %>%
  ggplot() +
  geom_path(aes(x = S, y = I, z = R, group = id2), alpha = 1) +
  coord_tern() +
  theme_sir() +
  geom_prediction_band(data = df_all_vis,
                       aes(x = S, y = I, z = R, sim_group = id2),
                       pb_type = "delta_ball", conf_level = .5, color = "blue",
                       dist_params = list(dist_approach = "auto",
                                          num_steps = "auto",
                                          quantile_approach = "psuedo_density",
                                          quantile_approach_params = list("sigma" = "30%")))
```
If we look at the above prediction region, we might feel that the portion of Model 2 it captures (along with parts of Model 1) doesn't really reflect the structure of the data in the way we want. Specifically, we might imagine that a more "central" curve in the second mixture component should be highlighted first. We could change the $\sigma$ for the pseudo-density or redefine the distance function we use, but "depth" also provides a different way to think about the geometric structure of a distribution. At its core, depth focuses on defining the "centrality" of observations.
Because we have a space with a distance defined between observations, we use @Geenens2017's distance depth (directly explorable with the function `distance_depth_function`). @Geenens2017's distance depth is a global measure, and depth in general doesn't capture density structure. This means that level sets based on this distance depth aren't the best method to capture a true density level set^[See Figure 2 in @Lei2013 for a Euclidean comparison of global depth vs density]. Nonetheless, depth can be used to define an ordering / ranking of the curves to create a prediction region, and depth is a very common tool in functional data analysis. Below we present the estimated "depth" of all the simulated curves using @Geenens2017's distance depth. Note that a simulation that had a similar shape and ended with $\approx 83\%$ in the recovery state would actually be seen as the deepest curve (and would be included in this prediction region).
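For intuition, one distance-based notion of depth scores a curve $x$ by the proportion of pairs of other curves $(i, j)$ whose mutual distance exceeds both $d(x, i)$ and $d(x, j)$, so central curves get high scores. The sketch below is a hedged illustration of that idea computed from a plain distance matrix; the package's `distance_depth_function` is the supported implementation and may differ in details.

```r
# Illustration only: a distance-based depth computed from a distance matrix.
# depth(x) = proportion of pairs (i, j), neither equal to x, with
#            d(i, j) > max(d(x, i), d(x, j)).
depth_sketch <- function(dmat) {
  n <- nrow(dmat)
  vapply(seq_len(n), function(x) {
    others <- setdiff(seq_len(n), x)
    prs <- combn(others, 2)                  # all pairs (i, j) excluding x
    d_ij <- dmat[cbind(prs[1, ], prs[2, ])]  # d(i, j)
    d_xi <- dmat[x, prs[1, ]]                # d(x, i)
    d_xj <- dmat[x, prs[2, ]]                # d(x, j)
    mean(d_ij > pmax(d_xi, d_xj))
  }, numeric(1))
}

# toy usage: central points in a cloud get larger depth values
set.seed(4)
pts <- rbind(matrix(rnorm(30), ncol = 2), c(5, 5))  # one clear outlier
dmat <- as.matrix(dist(pts))
round(depth_sketch(dmat), 2)                        # outlier gets depth near 0
```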
```r
global_depth <- distance_depth_function(tidy_dm)

df_all_gd <- df_all_vis %>%
  left_join(global_depth, by = c("sim", "id")) %>%
  rename(global_depth = depth) %>%
  mutate(global_depth2 = cut(global_depth, breaks = 5))
```
```r
df_all_gd %>%
  ggplot(aes(x = S, y = I, z = R, color = global_depth, group = id2)) +
  geom_path() +
  coord_tern() +
  labs(title = "Global Depth", color = "global depth") +
  theme_sir()
```
Below is a prediction region defined by the global depth. The resulting prediction band captures a different structure of the simulations - specifically, it favors the interior between the two groups a bit more than we might want. This is largely a consequence of the global nature of depth.
```r
global_depth_pb <- all_curves +
  geom_prediction_band(data = df_all_vis,
                       aes(x = S, y = I, z = R, sim_group = id2),
                       pb_type = "delta_ball", conf_level = .5,
                       dist_params = list(dist_approach = "auto",
                                          num_steps = "auto",
                                          quantile_approach = "depth",
                                          quantile_approach_params = list())) +
  coord_tern() +
  theme_sir()

global_depth_pb
```
If we are interested in highlighting central curves in a more local way, we can use local depth. Agostinelli \& Romanazzi (2011) proposed an approach for creating local depth, and we have extended their ideas to @Geenens2017's distance depth to obtain a local distance-based depth (directly explorable with the function `local_distance_depth_function`) [@Agostinelli2011]. @Agostinelli2011 recommended a localizing parameter $\tau$ around the 20-30% quantile of the distance distribution, which we explore here.
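One natural way to localize the depth sketch shown earlier (presented purely as an illustration of the idea; we make no claim that `local_distance_depth_function` uses exactly this rule) is to only count pairs of curves that both lie within $\tau$ of the curve being scored:

```r
# Illustration only: a localized version of depth_sketch() that restricts the
# comparison pairs (i, j) to curves within distance tau of x. This is one
# plausible localization, not necessarily the package's exact definition.
local_depth_sketch <- function(dmat, tau) {
  n <- nrow(dmat)
  vapply(seq_len(n), function(x) {
    others <- setdiff(which(dmat[x, ] <= tau), x)  # only "local" neighbors of x
    if (length(others) < 2) return(0)
    prs <- combn(others, 2)
    d_ij <- dmat[cbind(prs[1, ], prs[2, ])]
    d_xi <- dmat[x, prs[1, ]]
    d_xj <- dmat[x, prs[2, ]]
    mean(d_ij > pmax(d_xi, d_xj))
  }, numeric(1))
}

# toy usage, with tau set to the 30% quantile of all pairwise distances
set.seed(5)
pts <- rbind(matrix(rnorm(30), ncol = 2), matrix(rnorm(30, mean = 6), ncol = 2))
dmat <- as.matrix(dist(pts))
tau <- quantile(dmat[upper.tri(dmat)], probs = .3)
round(local_depth_sketch(dmat, tau), 2)
```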
```r
quantile(as.matrix(tidy_dm), probs = 0:10/10) %>%
  round(2) %>% t %>%
  data.frame(check.names = F) %>%
  DT::datatable(options = list(dom = 't'))

tau <- quantile(as.matrix(tidy_dm), probs = .30)["30%"]

tidy_dm %>% c() %>%
  data.frame(dist = .) %>%
  ggplot(aes(x = dist)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = tau)
```
Unlike the global depth, local depth "localizes" what it means to be central. It is not the same as a density estimate (see our vignette on density estimates versus depth for Euclidean examples). The visual below of this local depth emphasizes that we are not focused as much on the probabilistic level, but, given the local nature of the approach, we might expect the union of balls to better capture the level-set structure of the local depth.
```r
local_depth <- local_distance_depth_function(tidy_dm, tau = tau)
```
```r
df_all_ld <- df_all_vis %>%
  left_join(local_depth, by = c("sim", "id"))
```
```r
all_vis <- df_all_ld %>%
  ggplot(aes(x = S, y = I, z = R, color = local_depth, group = id2)) +
  geom_path() +
  coord_tern() +
  labs(title = "Local Depth", color = "local depth") +
  theme_sir()

all_vis_minus1 <- df_all_ld %>%
  filter(local_depth != 1) %>%
  ggplot(aes(x = S, y = I, z = R, color = local_depth, group = id2)) +
  geom_path() +
  coord_tern() +
  labs(title = "Local Depth", color = "local depth",
       subtitle = "removed extreme point") +
  theme_sir()

grid.arrange(all_vis, all_vis_minus1, nrow = 1)
```
We can then create prediction regions using this local depth; the visuals below come from such an approach. Notice that it doesn't agree with the pseudo-density approach, but it better captures the central portion of each mixture component.
```r
local_depth_pb <- all_curves +
  geom_prediction_band(data = df_all_vis,
                       aes(x = S, y = I, z = R, sim_group = id2),
                       pb_type = "delta_ball", conf_level = .5,
                       dist_params = list(dist_approach = "auto",
                                          num_steps = "auto",
                                          quantile_approach = "local_depth",
                                          quantile_approach_params = list(tau = "30%"))) +
  coord_tern() +
  theme_sir()

local_depth_pb
```
Overall, we have presented three ways to use different rankings / orderings of epidemic simulations to create prediction regions. Although each captures a different geometric structure, we hope they illustrate different ways to represent a set of simulations and our uncertainty about the complex epidemic systems they attempt to describe. The figure below shows these different approaches for our rather contrived confidence level of $1-\alpha = .5$.
```r
layout_mat <- matrix(c(rep(c(4, 4, rep(3, 4), 4, 4), 4),
                       rep(c(rep(1, 4), rep(2, 4)), 4)),
                     nrow = 8, byrow = T)

grid.arrange(global_depth_pb + labs(title = "global depth based prediction band"),
             local_depth_pb + labs(title = "local depth based prediction band"),
             psuedo_density_pb + labs(title = "pseudo-density based prediction band"),
             ggplot() + theme_minimal(),
             layout_matrix = layout_mat)
```
For examples of tools in our package that work for higher-dimensional epidemic models, please see Comparing Models (to Models) and Epidemics (to Models). Our vignette on the differences between density estimates and depth (local and global) may also aid your understanding of the differences between these ordering approaches.