knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 7, 
  fig.height = 5,
  out.width = "80%", 
  fig.path = "rabies/"
)

This vignette provides a demonstration of the vimes package and more specifically how it can be used to perform analyses such as those presented in @vimes-paper. Here we use a simulated dataset of rabies transmitted among dogs, chosen among all the baseline simulations described in @vimes-paper as the only simulated dataset with 151 observed dogs infected with rabies, i.e. exactly the same number as in the real dataset analysed in this paper.

Loading and examining the data

The simulated dataset is available as sim_rabies in the vimes package:

library(vimes)
data(sim_rabies)
sim_rabies$n

We can plot the epidemic curve using the incidence package:

library(incidence) 
## weekly
plot(incidence(sim_rabies$onset,7))

## monthly
plot(incidence(sim_rabies$onset,365/12))

## bimonthly
plot(incidence(sim_rabies$onset,365/6))

And we can plot the locations of the cases:

locations <- t(sapply(1:sim_rabies$n, function(i) sim_rabies$cases[[i]]$location))
plot(locations, xlab = "", ylab = "")

DNA sequences have an unsual format in this simulation, as only mutated sites from an hypothetical reference sequence are stored:

## gather DNA info from cases
dna <- lapply(1:sim_rabies$n, function(i) sim_rabies$cases[[i]]$dna)

## have a look
class(dna)
length(dna)
head(dna)

In the next section, we see how to compute genetic distances and reconstruct a phylogenetic tree from this data.

Computing pairwise distances between cases in time, space, and genetics

Distances between dates are computed as numbers of days:

head(sim_rabies$onset, 5)
D_dates <- dist(sim_rabies$onset)

Distances between locations are computed using the great circle distance:

library(fields)
head(locations)
D_geo <- rdist(locations)

Distances between sequences are computed as simple Hamming distances using function dist_dna in the quicksim package. As this package is in developement, you will first need to install it using (requires devtools):

devtools::install_github("thibautjombart/quicksim")

Hamming distances (i.e. number of nucleotide differences between sequences) can be computed using:

library(quicksim)
D_dna <- dist_dna(dna)

We derive a phylogenetic tree from these distances using the ape package:

library(ape)
tree <- ladderize(root(nj(D_dna), 1))
plot(tree)
axisPhylo()
mtext(side = 1, text = "Number of mutations", line = 2.5)

We format, match and plot the distance data using \emph{vimes}:

D_all <- vimes_data(dates = D_dates, geo = D_geo, dna = D_dna)
plot(D_all, nclass = 60)

Defining cutoff distances above which cases are considered not linked by transmission

Distributions of expected distances between cases for rabies

We retrieved preexisting information about the serial interval, spatial kernel, and mutation rate for rabies, and used this information to define the distribution of expected distances between cases in the temporal, spatial, and genetic spaces.

The serial interval and spatial kernel distributions are taken from @Hampson2009-wk:

## serial interval distribution parameters
gamma_mean <- 23.55
gamma_std <- 20.85

## convert into shape and scale 
gamma_shape <- gamma_mean^2 / (gamma_std^2)
gamma_scale <- gamma_std^2 / gamma_mean

## spatial kernel parameters
rayleigh_mean <- 0.88

## find Rayleigh parameters to match this mean
rayleigh_scale <- rayleigh_mean / sqrt(acos(-1)/2)

The mutation rate is derived from @Bourhy2016-am:

## mutation rate
mu_year_per_site <- 5.9e-4
n_sites <- 11820

## mutation rate per day and sequence
mu_day_whole <- (mu_year_per_site * n_sites / 365) 

We use the fpaircase function to compute the distributions of expected distances (temporal, spatial and genetic respectively) between a case and their infector based on the above parameters.

## distance functions for each of the 3 types
f_temporal <- fpaircase(type = "temporal", gamma_shape = gamma_shape,
                        gamma_scale = gamma_scale)

f_spatial <- fpaircase(type = "spatial", sd_spatial = rayleigh_scale)

f_genetic <- fpaircase(type = "genetic", poisson_rate = mu_day_whole,
                       gamma_shape = gamma_shape, 
                       gamma_scale = gamma_scale)

## Plotting these
par(mfrow=c(3,1))
plot(f_temporal, xlim = c(0,365))
plot(f_spatial, xlim = c(0,5))
plot(f_genetic, xlim = c(0,5))

Cutoff distances defined as the quantiles of the distributions of expected pairwise distances

We will be using the above distributions of expected distances between a case and their infector to define cutoff distances above which pairs of cases are considered not linked by transmission.

First, we assume a certain level of reporting. Here we assume only $20\%$ of cases are observed.

## reporting rate
pi <- 0.2

Then, we define the quantile we want to use to define the cutoff distances. Here, we consider many different quantiles to assess sensitivity of our results to this choice.

## quantiles
q <- c(.50, .75, .90, .95, .95^(1/3), .99, .999, .9995, .9999)

## colours used to plot these
cols <- rainbow(length(q))

## our main results are with the cutoff corresponding to the 95% quantile
cutoff_choice <- q[4] 

We now plot the distributions of expected distances (temporal, spatial and genetic respectively) between a case and their closest observed ancestor, assuming that only $20\%$ of cases are reported. We also show the cutoffs corresponding to the quantiles defined above.

## distance functions for each of the 3 types, accounting for reporting
## probability pi with quantiles q overlayed on the graphs

par(mfrow=c(3,1))
plot(f_temporal, q, xlim = c(0,365*3), pi = pi,
     lines_arg = list(col=cols, lwd=2))
plot(f_spatial, q, xlim = c(0,15), pi = pi,
     lines_arg = list(col=cols, lwd=2))
plot(f_genetic, q, xlim = c(0,25), pi = pi,
     lines_arg = list(col=cols, lwd=2))

We can overlay the graph above on top of the histogram of observed distances to see where these cutoffs fall with respect to our observations.

### function used to generate a plot with distribution of observed and expected distances

plot_overlay <- function(dist, f, q, pi, xlab, breaks, resol = 1,
                         q_color = cols, hist_bordercolor = "grey",
                         hist_color = "lightgrey"){

    ## dist contains the observed distances
    ## f is the distribution of expected distances between a case and their infector
    ## q is the quantile or vector of quantiles of interest
    ## pi is the reporting rate
    ## xlab is the x axis label
    ## breaks is the breaks used for plotting the histogram of dist
    ## resol is the resolution used for plotting f
    ## q_color is the colour or vector of colours used to show the cutofs associated with the quantile(s) q
    ## hist_bordercolor is the colour used for the border of the histogram of dist
    ## hist_color is the colour used for the filling of the histogram of dist

    qtl <- get_quantiles(f, q, pi = pi)
    hist(dist, col = hist_color, border = hist_bordercolor, 
         main = "", xlab = xlab, 
         breaks = breaks)
    par(new = TRUE)
    x <- seq(min(breaks), max(breaks), resol)
    y <- f(x, pi = pi)
    plot(x, y, type = "l", axes = FALSE, main = "", xlab = "", ylab = "")

    ## add vertical lines corresponding to quantiles
    abline(v = qtl, col = q_color, lwd = 2)
}

### use the function above to create our plot: 
par(mfrow=c(3, 1), mar=c(5, 5, 0.5, 5))

## temporal
plot_overlay(dist = as.vector(D_all$dates), 
             f = f_temporal, 
             q = q, 
             pi = pi, 
             xlab = "Pairwise distance in time (days)", 
             breaks = seq(0,2500, 50),
             resol = 1)

## spatial
plot_overlay(dist = as.vector(D_all$geo), 
             f = f_spatial, 
             q = q, 
             pi = pi, 
             xlab = "Pairwise distance in space (km)", 
             breaks = seq(0,20,0.2),
             resol = 0.1)

## genetic
plot_overlay(dist = as.vector(D_all$dna), 
             f = f_genetic, 
             q = q, 
             pi = pi, 
             xlab = "Pairwise distance in space (km)", 
             breaks = seq(0,55,1),
             resol = 1)

Running vimes to identify clusters of cases linked by transmission

We run vimes to identify clusters of cases linked by transmission, for various cutoff choices, and various reporting rates. We plot the results by colouring cases identified as belonging to the same outbreak cluster, i.e. cases identified as being linked by local transmission. Cases shown in grey are identified as singletons who are not linked by transmission to any other observed case.

### function used to get results for a certain cutoff and reporting rate

get_res <- function(D_all, q, pi, f_temporal, f_spatial, f_genetic,
                    type = c("all", "temporal","spatial", "genetic")) {

  type <- match.arg(type)

  ## get the cutoffs
  cuts <- c(temporal = get_quantiles(f_temporal, q, pi = pi), 
            spatial = get_quantiles(f_spatial, q, pi = pi),
            genetic = get_quantiles(f_genetic, q, pi = pi))

  if (type == "all") {
    ## use vimes
    out <- vimes(D_all, cutoff = cuts,
                 graph.opt = vimes.graph.opt(col.pal = funky))
  } else if (type == "temporal") {
    out <- vimes(vimes_data(dates = D_all$dates), cutoff = cuts["temporal"],
                 graph.opt = vimes.graph.opt(col.pal = funky))
  } else if (type == "spatial") {
    out <- vimes(vimes_data(geo = D_all$geo), cutoff = cuts["spatial"],
                 graph.opt = vimes.graph.opt(col.pal = funky))
  } else if (type == "genetic") {
    out <- vimes(vimes_data(dna = D_all$dna), cutoff = cuts["genetic"],
                 graph.opt = vimes.graph.opt(col.pal = funky))
  }

  return(out)

}

### use the function above to generate results for several combinations of p and
### pi We assume, as in Bourhy et al., a reporting rate of 20% in our main
### analyses, and reporting rates of 10 and 50% respectively in two extreme
### scenarios considered in sensitivity analyses.

combi <- expand.grid(p = q,
                     pi = pi)
combi

quantile_pretty <- signif(combi$p*100, 4)
quantile_pretty <- paste0(quantile_pretty, "%")

res <- vector(9L, mode = "list")

for (i in 1:nrow(combi)) {
    res[[i]] <- get_res(D_all, combi[i, 1],
                        combi[i, 2], f_temporal,
                        f_spatial, f_genetic)
}

### visualise the output
par(mfrow = c(3, 3), mar=c(1,1,3,1))

for (i in 1:length(res)) {
  plot(res[[i]]$graph, vertex.label = "",
       main = paste("cutoff:", quantile_pretty[i]))
}

Compare vimes results to 'truth'

Since we are using simulated data, we can actually compare the results obtained using vimes to the truth, i.e. check whether cases identified as belonging to the same outbreak cluster were indeed part of the same transmission chain in the simulation

i <- which(combi$p %in% 0.95) # taking this is our main result

## cluster membership as identified by vimes
est_clust_membership <- as.vector(res[[i]]$clusters$membership)

## cluster membership as simulated
sim_clust_membership <- sim_rabies$clust_renumbered

## contingency table to compare estimated cluster (row) vs actual one (columns)
table(est_clust_membership, sim_clust_membership)

For a formal comparison, we can measure the proportion of pairs of individuals belonging to the same outbreak cluster which were indeed clustered together by the method (True Positive Rate), as well as the proportion of pairs of individuals which did not belong to the same outbreak cluster and were adequately placed in different groups by the method (True Negative Rate).

# function to compute the true positive and true negative rates
get_TPR_TNR <- function(est_clust_membership, sim_clust_membership) {

### look at pairs of individuals and whether they are in the same cluster or not
### in simulation and estimation

    sim_same_clust <- as.matrix(dist(sim_clust_membership, method="manhattan")) == 0
    est_same_clust <- as.matrix(dist(est_clust_membership, method="manhattan")) == 0

    sim_same_clust[lower.tri(sim_same_clust, diag = TRUE)] <- NA
    est_same_clust[lower.tri(est_same_clust, diag = TRUE)] <- NA

    sim_same_clust_vect <- na.omit(as.vector(sim_same_clust))
    est_same_clust_vect <- na.omit(as.vector(est_same_clust))

    ## see
    ## https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12968
    ## for the following param definitions

    ## The true positive rate (TPR) defined as the proportion of individuals
    ## belonging to the same population which were indeed clustered together by
    ## the method.

    TPR <- sum(est_same_clust_vect[which(sim_same_clust_vect)]) / sum(sim_same_clust_vect)

    ## The true negative rate (TNR) defined as the proportion of individuals
    ## which did not belong to the same population and were adequately placed in
    ## different groups by the method
    TNR <- sum(!est_same_clust_vect[which(!sim_same_clust_vect)]) / sum(!sim_same_clust_vect)

    return(c(TPR = TPR, TNR = TNR))
}

## use this function to compute the TPR and TNR for all the cutoffs we
## considered
all_TPR_TNR <- as.data.frame(t(sapply(1:length(res),
                                      function(i) get_TPR_TNR(as.vector(res[[i]]$clusters$membership),
                                                              sim_clust_membership))))

all_TPR_TNR

all_TPR_TNR$quantile <- combi$p
all_TPR_TNR$SumTPRTNR <- all_TPR_TNR$TPR + all_TPR_TNR$TNR

## plot results as a ROC curve
plot(1 - all_TPR_TNR$TNR, all_TPR_TNR$TPR, xlab = "1 - TNR", ylab = "TPR",
     xlim = c(0, 1), ylim = c(0, 1), col = cols, pch = 19, cex = 1.5)

legend("bottomright", legend = quantile_pretty,
       title = "Quantile used for cutoff",
       col = cols, pch = 19, pt.cex = 1.5,
       bty = "n", cex = 0.75)

Estimate the underlying reproduction number and number of imported cases

This part requires the package branchr, which needs to be installed using devtools by typing:

library(devtools)
install_github("reconhub/branchr")
library(branchr)

compute_R  <- function(cl_size, rho) {
    profile <- profile_likelihood(y_obs = cl_size, 
                                  rho = rho, 0.01, 20)
    R_estimate <- theta_max_likelihood(profile$theta,
                                       profile$Likelihood, 
                                       0.95)
    R <- c(central = R_estimate$theta_max_likelihood, 
           low = R_estimate$lower_theta, 
           up = R_estimate$upper_theta)

    import <- import(y_obs = cl_size, 
                     rho = rho, 
                     profile, 1e3, 1e3, 0.95)
    unobs <- c(central = import$theta_max_likelihood,
               low = import$lower_theta,
               up = import$upper_theta)

    return(list(R, unobs))
}

clust_size <- lapply(res, function(i) i$clusters$size)
rho <- combi[, 2]

R_estimate_and_imports <- lapply(1:length(clust_size),
                                 function(i) compute_R(clust_size[[i]], rho[i]))

R_estimates <- sapply(1:length(R_estimate_and_imports), 
                      function(i) R_estimate_and_imports[[i]][[1]])

N_unobs_estimates <- sapply(1:length(R_estimate_and_imports), 
                            function(i) R_estimate_and_imports[[i]][[2]])

N_tot_estimates <- N_unobs_estimates + 
  matrix(rep(lengths(clust_size), 3), 
         nrow = 3, byrow = TRUE)

n_days <- diff(range(sim_rabies$onset))

rate_import_tot_days <- N_tot_estimates / as.numeric(n_days)
rate_import_tot_year <- rate_import_tot_days*365


### plotting results ###

par(mfrow = c(2, 1))

## R estimates
plot(R_estimates["central",], ylim = c(0, max(c(1.2, max(R_estimates)))),
     pch = 19, xlab = "Cutoff used for pruning", ylab = "Estimated R", axes = FALSE)
axis(side = 1, at = 1:length(R_estimate_and_imports),
     labels = quantile_pretty, cex = 0.75)
axis(side = 2)
for (i in 1:length(R_estimate_and_imports)) {
  segments(i, R_estimates["low",i],i, R_estimates["up",i])
}

## adding the reproduction number used to generate the simulation
abline(h = 0.92, col = "red", lty = 2)

## importation rate estimates
plot(rate_import_tot_year["central",], ylim = c(0, max(rate_import_tot_year)),
     pch = 19, xlab = "Cutoff used for pruning",
     ylab = "Estimated rate of importation (per year)", axes = FALSE)
axis(side = 1, at = 1:length(R_estimate_and_imports),
     labels = quantile_pretty, cex = 0.75)
axis(side = 2)
for (i in 1:length(R_estimate_and_imports)) {
  segments(i, rate_import_tot_year["low",i],i, rate_import_tot_year["up",i])
}

# adding the importation rate used to generate the simulation
abline(h = 7, col = "red", lty = 2)

Visualising the clusters identified with vimes with different cutoffs

In time:

library(igraph)
library(ggplot2)
library(gridExtra)
library(ggmap)

plot_incidence_res <- function(x){
  grp <- factor(V(x$graph)$color)
  col <- levels(grp)
  names(col) <- col
  plot(incidence(sim_rabies$onset, 365/6, groups = grp),
       color = as.character(col), ylab = "Bimonthly incidence") +
      guides(fill = FALSE)
}

ts_plts <- lapply(res, plot_incidence_res)

grid.arrange(grobs = ts_plts, nrow = 3)

In space:

map_res <- function(x) {
  grp <- factor(V(x$graph)$color)
  col <- levels(grp)
  names(col) <- col
  dat <- cbind.data.frame(locations, grp)
  names(dat)[1] <- "x"
  names(dat)[2] <- "y"

  ggplot(data = dat,
               aes(x = x, y = y, color = grp),
               alpha = .8, size = 3) +  
    geom_point() +
    scale_color_manual(values = col) +
    guides(color = FALSE) +
    labs(x = 'Longitude', y = 'Latitude')
}

map_plts <- lapply(res, map_res)

grid.arrange(grobs = map_plts, nrow = 3)

In the genetic space:

## check that tip labels are ordered as vimes data
identical(attr(D_all, "labels"), tree$tip.label)

make_tree_res <- function(x) {
    grp <- factor(V(x$graph)$color)
    col <- levels(grp)
    plot(tree, show.tip.label = FALSE)
    tiplabels(text = NULL, pch = 20, col = col)
}

par(mfrow = c(3, 3), mar = c(1, 1, 0.5, 0))
invisible(lapply(res, make_tree_res))

Visualising the clusters identified with vimes with the 95% cutoff.

In time and space (for the 95% cutoff):

main_res <- which(combi$p %in% 0.95 & combi$pi %in% 0.2)

ts_plts[[main_res]]

map_plts[[main_res]]

make_tree_res(res[[main_res]])

Illustrating the added value of combining several data streams

We can illutrate the added value of combining the different types of data by comparing the final clusters to the graphs that would be obtained by each type of data separately.

res_all_data_streams <- res[[4]]

res_temporal_data_only <- get_res(D_all, combi[4, 1], combi[4, 2], f_temporal,
                                  f_spatial, f_genetic, type = "temporal")

res_spatial_data_only <- get_res(D_all, combi[4, 1], combi[4, 2], f_temporal,
                                 f_spatial, f_genetic, type="spatial")

res_genetic_data_only <- get_res(D_all, combi[4, 1], combi[4, 2], f_temporal,
                                 f_spatial, f_genetic, type = "genetic")


## function to get the correct amount of transparency: more transparency as more
## edges to be drawn:

transp_vertex <- function(gr) {
  (1 - ecount(gr) / (gorder(gr)^2/2) )^1.75
}

par(mfrow = c(2, 2), mar = c(.2, .2, 4, .2))

# temporal only

color <- c("lightgrey", "red", "orange", "forestgreen", "pink", "blue", "brown",
           "black", "purple","darkgrey","yellow","green", "cyan","deepskyblue")

gr <- res_all_data_streams$separate_graphs$dates$graph

mbrshp <- res_temporal_data_only$clusters$membership
tab <- table(mbrshp)
singletons_names <- tab %in% 1
singletons <- mbrshp %in% as.numeric(names(tab)[singletons_names])
names <- unique(mbrshp[!singletons])
mbrshp[!singletons] <- 1 + match(mbrshp[!singletons], names)
mbrshp[singletons] <- 1
col <- color[mbrshp]

plot(gr, main = "Temporal data only", vertex.label = "",
     edge.color = alpha("darkgrey", transp_vertex(gr)),
     vertex.frame.color = col)

## spatial only

gr <- res_all_data_streams$separate_graphs$geo$graph
mbrshp <- res_spatial_data_only$clusters$membership
tab <- table(mbrshp)
singletons_names <- tab %in% 1
singletons <- mbrshp %in% as.numeric(names(tab)[singletons_names])
names <- unique(mbrshp[!singletons])
mbrshp[!singletons] <- 1 + match(mbrshp[!singletons], names)
mbrshp[singletons] <- 1
col <- color[mbrshp]
plot(gr, main = "Spatial data only", vertex.label = "",
     edge.color = alpha("darkgrey", transp_vertex(gr)),
     vertex.frame.color = col)

## genetic only

gr <- res_all_data_streams$separate_graphs$dna$graph 
mbrshp <- res_genetic_data_only$clusters$membership
tab <- table(mbrshp)
singletons_names <- tab %in% 1
singletons <- mbrshp %in% as.numeric(names(tab)[singletons_names])
names <- unique(mbrshp[!singletons])
mbrshp[!singletons] <- 1 + match(mbrshp[!singletons], names)
mbrshp[singletons] <- 1
col <- color[mbrshp]
plot(gr, main = "Genetic data only", vertex.label = "",
     edge.color = alpha("darkgrey", transp_vertex(gr)),
     vertex.frame.color = col)

## all together 
gr <- res_all_data_streams$graph
plot(gr, main = "All data streams together", vertex.label = "",
     edge.color = alpha("darkgrey", transp_vertex(gr)),
     vertex.frame.color = V(gr)$color)

Although the genetic data alone performed very well at reconstructing the outbreak clusters, adding the temporal and spatial information allowed to improve the reconstruction, with sensitivity remaining the same at 98.0% but specificitiy increasing from 94.4% to 100%:

get_TPR_TNR(res_genetic_data_only$clusters$membership, sim_clust_membership)
get_TPR_TNR(res_all_data_streams$clusters$membership, sim_clust_membership)

References



thibautjombart/vimes documentation built on May 31, 2019, 10:38 a.m.