knitr::opts_chunk$set( collapse = TRUE, fig.width = 7, fig.height = 5, out.width = "80%", fig.path = "rabies/" )
This vignette provides a demonstration of the vimes package and more specifically how it can be used to perform analyses such as those presented in @vimes-paper. Here we use a simulated dataset of rabies transmitted among dogs, chosen among all the baseline simulations described in @vimes-paper as the only simulated dataset with 151 observed dogs infected with rabies, i.e. exactly the same number as in the real dataset analysed in this paper.
The simulated dataset is available as sim_rabies
in the vimes package:
library(vimes) data(sim_rabies) sim_rabies$n
We can plot the epidemic curve using the incidence
package:
library(incidence) ## weekly plot(incidence(sim_rabies$onset,7)) ## monthly plot(incidence(sim_rabies$onset,365/12)) ## bimonthly plot(incidence(sim_rabies$onset,365/6))
And we can plot the locations of the cases:
locations <- t(sapply(1:sim_rabies$n, function(i) sim_rabies$cases[[i]]$location)) plot(locations, xlab = "", ylab = "")
DNA sequences have an unsual format in this simulation, as only mutated sites from an hypothetical reference sequence are stored:
## gather DNA info from cases dna <- lapply(1:sim_rabies$n, function(i) sim_rabies$cases[[i]]$dna) ## have a look class(dna) length(dna) head(dna)
In the next section, we see how to compute genetic distances and reconstruct a phylogenetic tree from this data.
Distances between dates are computed as numbers of days:
head(sim_rabies$onset, 5) D_dates <- dist(sim_rabies$onset)
Distances between locations are computed using the great circle distance:
library(fields) head(locations) D_geo <- rdist(locations)
Distances between sequences are computed as simple Hamming distances using
function dist_dna
in the
quicksim
package. As this
package is in developement, you will first need to install it using (requires
devtools):
devtools::install_github("thibautjombart/quicksim")
Hamming distances (i.e. number of nucleotide differences between sequences) can be computed using:
library(quicksim) D_dna <- dist_dna(dna)
We derive a phylogenetic tree from these distances using the ape
package:
library(ape) tree <- ladderize(root(nj(D_dna), 1)) plot(tree) axisPhylo() mtext(side = 1, text = "Number of mutations", line = 2.5)
We format, match and plot the distance data using \emph{vimes}:
D_all <- vimes_data(dates = D_dates, geo = D_geo, dna = D_dna) plot(D_all, nclass = 60)
We retrieved preexisting information about the serial interval, spatial kernel, and mutation rate for rabies, and used this information to define the distribution of expected distances between cases in the temporal, spatial, and genetic spaces.
The serial interval and spatial kernel distributions are taken from @Hampson2009-wk:
## serial interval distribution parameters gamma_mean <- 23.55 gamma_std <- 20.85 ## convert into shape and scale gamma_shape <- gamma_mean^2 / (gamma_std^2) gamma_scale <- gamma_std^2 / gamma_mean ## spatial kernel parameters rayleigh_mean <- 0.88 ## find Rayleigh parameters to match this mean rayleigh_scale <- rayleigh_mean / sqrt(acos(-1)/2)
The mutation rate is derived from @Bourhy2016-am:
## mutation rate mu_year_per_site <- 5.9e-4 n_sites <- 11820 ## mutation rate per day and sequence mu_day_whole <- (mu_year_per_site * n_sites / 365)
We use the fpaircase
function to compute the distributions of expected
distances (temporal, spatial and genetic respectively) between a case and their
infector based on the above parameters.
## distance functions for each of the 3 types f_temporal <- fpaircase(type = "temporal", gamma_shape = gamma_shape, gamma_scale = gamma_scale) f_spatial <- fpaircase(type = "spatial", sd_spatial = rayleigh_scale) f_genetic <- fpaircase(type = "genetic", poisson_rate = mu_day_whole, gamma_shape = gamma_shape, gamma_scale = gamma_scale) ## Plotting these par(mfrow=c(3,1)) plot(f_temporal, xlim = c(0,365)) plot(f_spatial, xlim = c(0,5)) plot(f_genetic, xlim = c(0,5))
We will be using the above distributions of expected distances between a case and their infector to define cutoff distances above which pairs of cases are considered not linked by transmission.
First, we assume a certain level of reporting. Here we assume only $20\%$ of cases are observed.
## reporting rate pi <- 0.2
Then, we define the quantile we want to use to define the cutoff distances. Here, we consider many different quantiles to assess sensitivity of our results to this choice.
## quantiles q <- c(.50, .75, .90, .95, .95^(1/3), .99, .999, .9995, .9999) ## colours used to plot these cols <- rainbow(length(q)) ## our main results are with the cutoff corresponding to the 95% quantile cutoff_choice <- q[4]
We now plot the distributions of expected distances (temporal, spatial and genetic respectively) between a case and their closest observed ancestor, assuming that only $20\%$ of cases are reported. We also show the cutoffs corresponding to the quantiles defined above.
## distance functions for each of the 3 types, accounting for reporting ## probability pi with quantiles q overlayed on the graphs par(mfrow=c(3,1)) plot(f_temporal, q, xlim = c(0,365*3), pi = pi, lines_arg = list(col=cols, lwd=2)) plot(f_spatial, q, xlim = c(0,15), pi = pi, lines_arg = list(col=cols, lwd=2)) plot(f_genetic, q, xlim = c(0,25), pi = pi, lines_arg = list(col=cols, lwd=2))
We can overlay the graph above on top of the histogram of observed distances to see where these cutoffs fall with respect to our observations.
### function used to generate a plot with distribution of observed and expected distances plot_overlay <- function(dist, f, q, pi, xlab, breaks, resol = 1, q_color = cols, hist_bordercolor = "grey", hist_color = "lightgrey"){ ## dist contains the observed distances ## f is the distribution of expected distances between a case and their infector ## q is the quantile or vector of quantiles of interest ## pi is the reporting rate ## xlab is the x axis label ## breaks is the breaks used for plotting the histogram of dist ## resol is the resolution used for plotting f ## q_color is the colour or vector of colours used to show the cutofs associated with the quantile(s) q ## hist_bordercolor is the colour used for the border of the histogram of dist ## hist_color is the colour used for the filling of the histogram of dist qtl <- get_quantiles(f, q, pi = pi) hist(dist, col = hist_color, border = hist_bordercolor, main = "", xlab = xlab, breaks = breaks) par(new = TRUE) x <- seq(min(breaks), max(breaks), resol) y <- f(x, pi = pi) plot(x, y, type = "l", axes = FALSE, main = "", xlab = "", ylab = "") ## add vertical lines corresponding to quantiles abline(v = qtl, col = q_color, lwd = 2) } ### use the function above to create our plot: par(mfrow=c(3, 1), mar=c(5, 5, 0.5, 5)) ## temporal plot_overlay(dist = as.vector(D_all$dates), f = f_temporal, q = q, pi = pi, xlab = "Pairwise distance in time (days)", breaks = seq(0,2500, 50), resol = 1) ## spatial plot_overlay(dist = as.vector(D_all$geo), f = f_spatial, q = q, pi = pi, xlab = "Pairwise distance in space (km)", breaks = seq(0,20,0.2), resol = 0.1) ## genetic plot_overlay(dist = as.vector(D_all$dna), f = f_genetic, q = q, pi = pi, xlab = "Pairwise distance in space (km)", breaks = seq(0,55,1), resol = 1)
We run vimes to identify clusters of cases linked by transmission, for various cutoff choices, and various reporting rates. We plot the results by colouring cases identified as belonging to the same outbreak cluster, i.e. cases identified as being linked by local transmission. Cases shown in grey are identified as singletons who are not linked by transmission to any other observed case.
### function used to get results for a certain cutoff and reporting rate get_res <- function(D_all, q, pi, f_temporal, f_spatial, f_genetic, type = c("all", "temporal","spatial", "genetic")) { type <- match.arg(type) ## get the cutoffs cuts <- c(temporal = get_quantiles(f_temporal, q, pi = pi), spatial = get_quantiles(f_spatial, q, pi = pi), genetic = get_quantiles(f_genetic, q, pi = pi)) if (type == "all") { ## use vimes out <- vimes(D_all, cutoff = cuts, graph.opt = vimes.graph.opt(col.pal = funky)) } else if (type == "temporal") { out <- vimes(vimes_data(dates = D_all$dates), cutoff = cuts["temporal"], graph.opt = vimes.graph.opt(col.pal = funky)) } else if (type == "spatial") { out <- vimes(vimes_data(geo = D_all$geo), cutoff = cuts["spatial"], graph.opt = vimes.graph.opt(col.pal = funky)) } else if (type == "genetic") { out <- vimes(vimes_data(dna = D_all$dna), cutoff = cuts["genetic"], graph.opt = vimes.graph.opt(col.pal = funky)) } return(out) } ### use the function above to generate results for several combinations of p and ### pi We assume, as in Bourhy et al., a reporting rate of 20% in our main ### analyses, and reporting rates of 10 and 50% respectively in two extreme ### scenarios considered in sensitivity analyses. combi <- expand.grid(p = q, pi = pi) combi quantile_pretty <- signif(combi$p*100, 4) quantile_pretty <- paste0(quantile_pretty, "%") res <- vector(9L, mode = "list") for (i in 1:nrow(combi)) { res[[i]] <- get_res(D_all, combi[i, 1], combi[i, 2], f_temporal, f_spatial, f_genetic) } ### visualise the output par(mfrow = c(3, 3), mar=c(1,1,3,1)) for (i in 1:length(res)) { plot(res[[i]]$graph, vertex.label = "", main = paste("cutoff:", quantile_pretty[i])) }
Since we are using simulated data, we can actually compare the results obtained
using vimes
to the truth, i.e. check whether cases identified as belonging to
the same outbreak cluster were indeed part of the same transmission chain in the
simulation
i <- which(combi$p %in% 0.95) # taking this is our main result ## cluster membership as identified by vimes est_clust_membership <- as.vector(res[[i]]$clusters$membership) ## cluster membership as simulated sim_clust_membership <- sim_rabies$clust_renumbered ## contingency table to compare estimated cluster (row) vs actual one (columns) table(est_clust_membership, sim_clust_membership)
For a formal comparison, we can measure the proportion of pairs of individuals belonging to the same outbreak cluster which were indeed clustered together by the method (True Positive Rate), as well as the proportion of pairs of individuals which did not belong to the same outbreak cluster and were adequately placed in different groups by the method (True Negative Rate).
# function to compute the true positive and true negative rates get_TPR_TNR <- function(est_clust_membership, sim_clust_membership) { ### look at pairs of individuals and whether they are in the same cluster or not ### in simulation and estimation sim_same_clust <- as.matrix(dist(sim_clust_membership, method="manhattan")) == 0 est_same_clust <- as.matrix(dist(est_clust_membership, method="manhattan")) == 0 sim_same_clust[lower.tri(sim_same_clust, diag = TRUE)] <- NA est_same_clust[lower.tri(est_same_clust, diag = TRUE)] <- NA sim_same_clust_vect <- na.omit(as.vector(sim_same_clust)) est_same_clust_vect <- na.omit(as.vector(est_same_clust)) ## see ## https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12968 ## for the following param definitions ## The true positive rate (TPR) defined as the proportion of individuals ## belonging to the same population which were indeed clustered together by ## the method. TPR <- sum(est_same_clust_vect[which(sim_same_clust_vect)]) / sum(sim_same_clust_vect) ## The true negative rate (TNR) defined as the proportion of individuals ## which did not belong to the same population and were adequately placed in ## different groups by the method TNR <- sum(!est_same_clust_vect[which(!sim_same_clust_vect)]) / sum(!sim_same_clust_vect) return(c(TPR = TPR, TNR = TNR)) } ## use this function to compute the TPR and TNR for all the cutoffs we ## considered all_TPR_TNR <- as.data.frame(t(sapply(1:length(res), function(i) get_TPR_TNR(as.vector(res[[i]]$clusters$membership), sim_clust_membership)))) all_TPR_TNR all_TPR_TNR$quantile <- combi$p all_TPR_TNR$SumTPRTNR <- all_TPR_TNR$TPR + all_TPR_TNR$TNR ## plot results as a ROC curve plot(1 - all_TPR_TNR$TNR, all_TPR_TNR$TPR, xlab = "1 - TNR", ylab = "TPR", xlim = c(0, 1), ylim = c(0, 1), col = cols, pch = 19, cex = 1.5) legend("bottomright", legend = quantile_pretty, title = "Quantile used for cutoff", col = cols, pch = 19, pt.cex = 1.5, bty = "n", cex = 0.75)
This part requires the package branchr
, which needs to be installed using
devtools
by typing:
library(devtools) install_github("reconhub/branchr")
library(branchr) compute_R <- function(cl_size, rho) { profile <- profile_likelihood(y_obs = cl_size, rho = rho, 0.01, 20) R_estimate <- theta_max_likelihood(profile$theta, profile$Likelihood, 0.95) R <- c(central = R_estimate$theta_max_likelihood, low = R_estimate$lower_theta, up = R_estimate$upper_theta) import <- import(y_obs = cl_size, rho = rho, profile, 1e3, 1e3, 0.95) unobs <- c(central = import$theta_max_likelihood, low = import$lower_theta, up = import$upper_theta) return(list(R, unobs)) } clust_size <- lapply(res, function(i) i$clusters$size) rho <- combi[, 2] R_estimate_and_imports <- lapply(1:length(clust_size), function(i) compute_R(clust_size[[i]], rho[i])) R_estimates <- sapply(1:length(R_estimate_and_imports), function(i) R_estimate_and_imports[[i]][[1]]) N_unobs_estimates <- sapply(1:length(R_estimate_and_imports), function(i) R_estimate_and_imports[[i]][[2]]) N_tot_estimates <- N_unobs_estimates + matrix(rep(lengths(clust_size), 3), nrow = 3, byrow = TRUE) n_days <- diff(range(sim_rabies$onset)) rate_import_tot_days <- N_tot_estimates / as.numeric(n_days) rate_import_tot_year <- rate_import_tot_days*365 ### plotting results ### par(mfrow = c(2, 1)) ## R estimates plot(R_estimates["central",], ylim = c(0, max(c(1.2, max(R_estimates)))), pch = 19, xlab = "Cutoff used for pruning", ylab = "Estimated R", axes = FALSE) axis(side = 1, at = 1:length(R_estimate_and_imports), labels = quantile_pretty, cex = 0.75) axis(side = 2) for (i in 1:length(R_estimate_and_imports)) { segments(i, R_estimates["low",i],i, R_estimates["up",i]) } ## adding the reproduction number used to generate the simulation abline(h = 0.92, col = "red", lty = 2) ## importation rate estimates plot(rate_import_tot_year["central",], ylim = c(0, max(rate_import_tot_year)), pch = 19, xlab = "Cutoff used for pruning", ylab = "Estimated rate of importation (per year)", axes = FALSE) axis(side = 1, at = 1:length(R_estimate_and_imports), labels = quantile_pretty, cex = 0.75) axis(side = 2) for (i in 1:length(R_estimate_and_imports)) { segments(i, rate_import_tot_year["low",i],i, rate_import_tot_year["up",i]) } # adding the importation rate used to generate the simulation abline(h = 7, col = "red", lty = 2)
In time:
library(igraph) library(ggplot2) library(gridExtra) library(ggmap) plot_incidence_res <- function(x){ grp <- factor(V(x$graph)$color) col <- levels(grp) names(col) <- col plot(incidence(sim_rabies$onset, 365/6, groups = grp), color = as.character(col), ylab = "Bimonthly incidence") + guides(fill = FALSE) } ts_plts <- lapply(res, plot_incidence_res) grid.arrange(grobs = ts_plts, nrow = 3)
In space:
map_res <- function(x) { grp <- factor(V(x$graph)$color) col <- levels(grp) names(col) <- col dat <- cbind.data.frame(locations, grp) names(dat)[1] <- "x" names(dat)[2] <- "y" ggplot(data = dat, aes(x = x, y = y, color = grp), alpha = .8, size = 3) + geom_point() + scale_color_manual(values = col) + guides(color = FALSE) + labs(x = 'Longitude', y = 'Latitude') } map_plts <- lapply(res, map_res) grid.arrange(grobs = map_plts, nrow = 3)
In the genetic space:
## check that tip labels are ordered as vimes data identical(attr(D_all, "labels"), tree$tip.label) make_tree_res <- function(x) { grp <- factor(V(x$graph)$color) col <- levels(grp) plot(tree, show.tip.label = FALSE) tiplabels(text = NULL, pch = 20, col = col) } par(mfrow = c(3, 3), mar = c(1, 1, 0.5, 0)) invisible(lapply(res, make_tree_res))
In time and space (for the 95% cutoff):
main_res <- which(combi$p %in% 0.95 & combi$pi %in% 0.2) ts_plts[[main_res]] map_plts[[main_res]] make_tree_res(res[[main_res]])
We can illutrate the added value of combining the different types of data by comparing the final clusters to the graphs that would be obtained by each type of data separately.
res_all_data_streams <- res[[4]] res_temporal_data_only <- get_res(D_all, combi[4, 1], combi[4, 2], f_temporal, f_spatial, f_genetic, type = "temporal") res_spatial_data_only <- get_res(D_all, combi[4, 1], combi[4, 2], f_temporal, f_spatial, f_genetic, type="spatial") res_genetic_data_only <- get_res(D_all, combi[4, 1], combi[4, 2], f_temporal, f_spatial, f_genetic, type = "genetic") ## function to get the correct amount of transparency: more transparency as more ## edges to be drawn: transp_vertex <- function(gr) { (1 - ecount(gr) / (gorder(gr)^2/2) )^1.75 } par(mfrow = c(2, 2), mar = c(.2, .2, 4, .2)) # temporal only color <- c("lightgrey", "red", "orange", "forestgreen", "pink", "blue", "brown", "black", "purple","darkgrey","yellow","green", "cyan","deepskyblue") gr <- res_all_data_streams$separate_graphs$dates$graph mbrshp <- res_temporal_data_only$clusters$membership tab <- table(mbrshp) singletons_names <- tab %in% 1 singletons <- mbrshp %in% as.numeric(names(tab)[singletons_names]) names <- unique(mbrshp[!singletons]) mbrshp[!singletons] <- 1 + match(mbrshp[!singletons], names) mbrshp[singletons] <- 1 col <- color[mbrshp] plot(gr, main = "Temporal data only", vertex.label = "", edge.color = alpha("darkgrey", transp_vertex(gr)), vertex.frame.color = col) ## spatial only gr <- res_all_data_streams$separate_graphs$geo$graph mbrshp <- res_spatial_data_only$clusters$membership tab <- table(mbrshp) singletons_names <- tab %in% 1 singletons <- mbrshp %in% as.numeric(names(tab)[singletons_names]) names <- unique(mbrshp[!singletons]) mbrshp[!singletons] <- 1 + match(mbrshp[!singletons], names) mbrshp[singletons] <- 1 col <- color[mbrshp] plot(gr, main = "Spatial data only", vertex.label = "", edge.color = alpha("darkgrey", transp_vertex(gr)), vertex.frame.color = col) ## genetic only gr <- res_all_data_streams$separate_graphs$dna$graph mbrshp <- res_genetic_data_only$clusters$membership tab <- table(mbrshp) singletons_names <- tab %in% 1 singletons <- mbrshp %in% as.numeric(names(tab)[singletons_names]) names <- unique(mbrshp[!singletons]) mbrshp[!singletons] <- 1 + match(mbrshp[!singletons], names) mbrshp[singletons] <- 1 col <- color[mbrshp] plot(gr, main = "Genetic data only", vertex.label = "", edge.color = alpha("darkgrey", transp_vertex(gr)), vertex.frame.color = col) ## all together gr <- res_all_data_streams$graph plot(gr, main = "All data streams together", vertex.label = "", edge.color = alpha("darkgrey", transp_vertex(gr)), vertex.frame.color = V(gr)$color)
Although the genetic data alone performed very well at reconstructing the outbreak clusters, adding the temporal and spatial information allowed to improve the reconstruction, with sensitivity remaining the same at 98.0% but specificitiy increasing from 94.4% to 100%:
get_TPR_TNR(res_genetic_data_only$clusters$membership, sim_clust_membership) get_TPR_TNR(res_all_data_streams$clusters$membership, sim_clust_membership)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.