knitr::opts_chunk$set(
  collapse = TRUE,
  quiet = TRUE,
  progress = FALSE,
  comment = "#>",
  fig.width=6,
  fig.height=4
)
options(readr.show_progress = FALSE)

Package Introduction

transnp: Takes SNP alignments for clusters, together with sampling dates and some assumptions, and models transmission events and times.

Functionality

The available functions in transnp are:

addAlignment: add to a list of alignments from fasta files

setUpDates: create list of dates to match alignments

createMLTrees: creates a list of maximum likelihood estimated trees from a list of alignments. Uses phangorn and phytools

createTimedTrees: creates a list of timed trees from a list of trees and their sample dates. Uses treedater

simulTransTrees: creates a set of transmission trees which are jointly inferred from timed trees. Uses TransPhylo

getLikelyInfectors: extract infection likelihoods for a given case

getInfectionTimes: extract infection times for a given case

Plotting and visualisation

networkTPlot: makes dynamic plot of a TransPhylo transmission tree showing unsampled cases

plot_gen_times: plot histogram of generation times for each cluster

plot_times_to_samp: plot histogram of times to sampling for each cluster

plot_unsampled_cases: plot histogram of the number of unsampled cases for each cluster

plotGenerationTimeDensity: plot generation time density for chosen case

plotInfectionDateDensity: plot infection date density for chosen case

plotSampleTimeDensity: plot time to sampling (from infection) for chosen case

plotTransTreeSummary: plot transmission network with posterior probabilities at or above given cutoff

Filtering on posterior records

filterInfectedDate: filter record on specified date of infection

filterPossibleWiw: filter record on specified allowable transmission matrix

filterTransPair: filter record on specified directed transmission pair

Package dependencies

Load libraries:

library(transnp)
library(ape)
library(phangorn)
library(TransPhylo)
library(treedater)
library(phytools)
library(gridExtra)
library(purrr)
library(ggplot2)
library(visNetwork)


Step 1: read in the data

We start with a fasta format file containing aligned sequence data.

Read the fasta files

aligns <- list()
aligns <- addAlignment(aligns, "demo_cluster_CL001.fas")
aligns <- addAlignment(aligns, "demo_cluster_CL002.fas")
aligns <- addAlignment(aligns, "demo_cluster_CL005.fas")
aligns <- addAlignment(aligns, "demo_cluster_CL007.fas")

This gives a list containing the alignments we will model:

aligns[[1]]

Step 2: create timed phylogenetic trees

2a: Create the maximum likelihood trees (in this case using TB base frequencies)

basefreqs <- c(0.1719, 0.3286, 0.3276, 0.1719)
set.seed(12345)
mltrees <- createMLTrees(aligns, bf=basefreqs)
par(mfrow=c(2,2)); lapply(mltrees,function(tree) {plot(tree); add.scale.bar()}); par(mfrow=c(1,1))
mltrees[[1]]

2b: Using sample dates, create timed trees using treedater.

allDates <- read.csv(system.file("extdata", "demo_dates.csv", package = "transnp", mustWork = TRUE))
sampledates <- setUpDates(aligns, allDates)
timedtrees <- createTimedTrees(mltrees, sampledates, aligns, strictClock = T, meanRateLimits = c(0.3, 0.7))

Note: results depend on the clock assumptions and can affect onward analysis (long time vs short time). We are working towards incorporating this uncertainty.

Step 3: transmission modelling

Infer transmission trees using TransPhylo functionality. This jointly estimates parameters over the trees, sharing parameters among the clusters. We analyse different clusters together.

nIter <- 1000
tpTimedTrees <- list()

maxDate <- max(unlist(lapply(sampledates, function(dateList) {return(max(dateList))})))
tpTimedTrees <- lapply(timedtrees, function(tree) {return(ptreeFromPhylo(tree, maxDate))}) 

#listRecords <- infer_multiTTree_shareParam(tpTimedTrees, share=c("neg","off.r","off.p","pi"), mcmcIterations=nIter, startNeg=1, startOff.p = 0.8, startOff.r = 1, startPi=0.9, updateNeg = T, updatePi = F, updateOff.p=FALSE)
listRecords <- simulTransTrees(tpTimedTrees, mean_gen=2, stddev_gen=sqrt(2), mean_sample=2, stddev_sample=sqrt(2), share=c("neg","off.r","off.p","pi"), mcmcIterations=nIter, startNeg=1, startOff.p = 0.8, startOff.r = 1, startPi=0.9, updateNeg = T, updatePi = F, updateOff.p=FALSE)

The result is a list of records, one for each cluster. The listRecords object contains the inferred transmission trees.

Step 4: Explore the results

Transmission trees

Combined tree using TransPhylo. In this tree, each colour corresponds to a host. Colours that don't end up at a tip represent inferred unsampled cases. Note that this Bayesian analysis produces a collection of possible transmission trees. Here we show the one with highest posterior probability, for cluster 4.

maptree <- selectTTree(listRecords[[4]],burnin = 0.3)
plotCTree(listRecords[[4]][[maptree]]$ctree)

Plot of the maximum posterior probability transmission tree for Cluster 7; we use visNetwork for the visualisation.

networkTPlot(listRecords[[4]], mcmcIndex=nIter-10, missLabel="Missed", colours=c('lightblue', 'orange'))

The thicker edges have higher probability - these transmission events are more certain. Here, there is a fairly high probability that c599 infected c946. We do not have a likely infector for c599.

Here is a different plot, for Cluster 2, showing different colour options.

networkTPlot(listRecords[[2]], mcmcIndex=nIter, missLabel="Missed", colours=c('blue', 'orange'))

How do the clusters compare to each other?

We can plot the distribution of generation times (times from one infection to the next) from the posterior. But note that the generation times included are those where an infected case (infectee) is itself sampled or is ancestral (in the sense of transmissions) to another case which is sampled. If an infectee is unsampled, as well as all of its descendants by transmission being unsampled, those generation times are not included.

names(listRecords) = c("Cluster1", "Cluster2", "Cluster5", "Cluster7")
plot_gen_times(listRecords)

Clusters 1 and 5 contain some long times, suggesting cases that could have been intermittently infectious or non-infectious for a longer period. Cluster 2 seems to be moving faster.

Times from getting infected to getting sampled, showing differences between clusters.

plot_times_to_samp(listRecords)

The same pattern is mirrored in the times to sampling

How many unsampled cases are estimated in the clusters?

plot_unsampled_cases(listRecords)

Higher numbers in clusters 1 and 5 reflect longer branches and more intermediate cases between sample cases.

What information can we see for a given individual?

We can extract individual-level infection times by case, for each cluster:

gt11=getInfectionTimes(listRecords[[1]][300:1000],7)
name1=listRecords[[1]][[1000]]$ctree$nam[7]
hist(gt11,breaks=15,xlab = paste("estimated time of infection for", name1),main = "")

The timing of infection depends on who infected whom and on the timed tree

maptree1=selectTTree(listRecords[[1]],burnin = 0.5)
plotCTree(listRecords[[1]][[maptree1]]$ctree)

Generation time (time from being infected to infecting next case) density for a given individual

plotGenerationTimeDensity(listRecords[[1]], index=2)

Infection-to-sample time density for a given individual

plotSampleTimeDensity(listRecords[[1]], index=2)

We can overlay specific events on individual-level infection times by case:

plotInfectionDateDensity(listRecords[[1]], index=1, overlayDate=2015)

We can extract likehoods of infector for a given case

caseName <- 'c405'
infectors <- getLikelyInfectors(listRecords[[1]], caseName)
barplot(infectors[-which(names(infectors) == caseName)])

What summary information can we show for a given cluster?

Plot of transmission network with posterior probabilities at or above given cutoff. Edges are weighted in proportion to the probability of transmission. Hover over an edge to see its probability!

plotTransTreeSummary(listRecords[[2]], cutoff=0.2, burnin=0.5) 

Refine the results using additional knowledge

We have epidemiological data which adds information but is not explicity captured in the model parameters. It might be known, for example, that patient A could not have infected patient B (even indirectly) because they were known to living in geographically isolated locations.

Filter record on a specified directed transmission pair. From the previous example, we impose the condition that c674 cannot infect c1608.

subRecord1 <- filterTransPair(listRecords[[2]], from='c674', to='c1608')
plotTransTreeSummary(subRecord1, cutoff=0.2, burnin=0.5) 

Filter record on specified date of infection (before or after, depending on the keepAfter flag for a particular case:

subRecord2 <- filterInfectedDate(listRecords[[2]], case='c1608', infDate=2013.5, keepAfter=TRUE)
plotTransTreeSummary(subRecord2, cutoff=0.2, burnin=0.5, nodeColour='orange' )

Shifting the date forward a little excludes more entries

subRecord2b <- filterInfectedDate(listRecords[[2]], case='c1608', infDate=2014.5, keepAfter=TRUE)
plotTransTreeSummary(subRecord2b, cutoff=0.2, burnin=0.5, nodeColour='green' )

Filter record on specified logical matrix, which specifies who can infect whom. Record entries are kept only where infection events correspond to a TRUE entry in the testMatrix.

name_list <- listRecords[[2]][[1]]$ctree$nam
name_dim <- length(name_list)
testMatrix <- matrix(TRUE, name_dim, name_dim)
testMatrix[10,8] <- FALSE
testMatrix[8,10] <- FALSE
name_list[[8]]
name_list[[10]]
subRecord3 <- filterPossibleWiw(listRecords[[2]], testMatrix)
plotTransTreeSummary(subRecord3, cutoff=0.2, burnin=0.5, nodeColour='pink' )

Acknowledgements

Centre for Mathematics of Precision Healthcare, Imperial College London

EPSRC Impact Acceleration award, Imperial College London

Inaki Comas and Tom Irving: TB cluster data from Valencia, Spain

Notes

Last updated June 2018.



JamesStimson/transnp documentation built on May 30, 2019, 3:51 p.m.