This package provides an analytical pipeline for rapid variant scanning based on integration of scalable phylogenetic analysis with non-genetic epidemiological data streams. The scanning tool, tfpscan(), computes a test statistic for every possible bipartition of the provided phylogeny to identify relative growth rates and relative evolutionary rates of clades within the phylogeny. For every clade in a partitioned tree, matched comparison clades are selected based on time (and, if specified, space) for further statistical analyses.
The main analyses conducted for each clade, or node with included descendants, are as follows:
Additional options in the scanning tool may be specified, including:
get_mlesky_node() function

The scanning tool creates a new output directory called "tfpscan-{Sys.Date()}", unless otherwise specified, along with an RDS file containing output descriptives (size, lineage, sample date) and statistics (clock outlier, logistic growth rate, GAM) for every node included, and an RDS file containing the environment from the scanning tool run. For each node, logistic growth rate p-values and support for the logistic model versus the GAM are also reported.
Within the output directory, the scanning tool creates a further directory for each node included in the analyses. Node-specific outputs within individual node directories include separate CSV files with the following:
Additional options for node-specific output directories may be specified, including:
These outputs are computationally expensive, substantially increasing the time and space required, so they are recommended only when needed within the main tfpscan() run. These options can alternatively be specified with the tfpscan_report() function for a single node of interest. The tfpscan_report() function outputs a summary report for a selected node, including the primary outputs for that node, as well as any additional outputs and mlesky analyses specified in the function options.
The tfpscanner package additionally offers an optional online tree viewer for the whole phylogeny, with a linked hover function for the statistics associated with each node, using the function treeview(). A user may specify particular mutations or lineages of interest for the tree. Mutations will be illustrated with a heatmap; lineages will be used to subdivide outputs in scatter plots. An example tree can be viewed at the link below, where mutations specified include S:A222V, S:N:Q9L, and S:E484K, and lineages specified include a selection of Delta lineages (AY.9, AY.43, AY.4.2).
https://www.biorxiv.org/content/10.1101/2021.01.18.427056v1.full
For a complete description of the statistical methodology underpinning this package, see our preprint:
preprint-link
In R, install the devtools package and run
devtools::install_github('mrc-ide/tfpscanner')
To run the scanning tool, the user requires a phylogeny, in ape::phylo or treeio::treedata format, and associated metadata. If the phylogeny is not rooted, the user must provide an outgroup to root on with the parameter root_on_tip, and the outgroup sample time with the parameter root_on_tip_sample_time.
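As a minimal sketch of how one might check whether an outgroup is needed, the snippet below uses the ape package on a simulated tree. The simulated tree and the choice of the first tip as outgroup are placeholders for illustration only; the manual call to ape::root() merely shows what rooting on a tip means and is not a description of how tfpscan() roots the tree internally.

```r
library(ape)

set.seed(1)
# Simulated unrooted tree, standing in for the user's own phylogeny.
tre <- rtree(8, rooted = FALSE)

# If this is TRUE, an outgroup tip (and its sample time) would be needed.
needs_outgroup <- !is.rooted(tre)

# Hypothetical outgroup choice: simply the first tip label of the toy tree.
outgroup_tip <- tre$tip.label[1]

# Root by hand on that tip, purely to illustrate rooting on an outgroup.
rooted_tre <- root(tre, outgroup = outgroup_tip, resolve.root = TRUE)
```

In practice the outgroup label would be a real sequence name present in both the tree and the metadata.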
The associated metadata should be provided in CSV format, with at minimum sequence_name, sample_date (date format), and region included for each sample (samples with NA for any of these three variables will be excluded from further analyses). Optional metadata variables include sample_time (numeric format), mutations, and other covariates of interest (e.g. age_group or vaccine_status).
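The metadata requirements above can be checked before running the scanner. The following is a rough sketch in base R, assuming a toy data frame in place of a real CSV (with a file, read.csv() would be used instead); the object names amd, required, and amd_clean are illustrative and not part of the package.

```r
# Toy metadata standing in for a real CSV file.
amd <- data.frame(
  sequence_name = c("seq1", "seq2", "seq3", "seq4"),
  sample_date   = as.Date(c("2021-10-01", "2021-10-08", NA, "2021-11-02")),
  region        = c("London", NA, "Wales", "Scotland"),
  stringsAsFactors = FALSE
)

required <- c("sequence_name", "sample_date", "region")

# Fail early if a required column is absent.
missing_cols <- setdiff(required, names(amd))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}

# Flag rows with NA in any required column; these would be excluded.
keep <- complete.cases(amd[, required])
amd_clean <- amd[keep, ]
message(sum(!keep), " of ", nrow(amd), " samples have missing required values")
```

Running a check like this first makes it clear how many samples would drop out of the analysis because of incomplete metadata.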
If additional covariates are included, a character vector of all covariate names must be passed to the parameter test_cluster_odds, and a vector of the same length to test_cluster_odds_value. For example, if vaccine_status, with values c("yes", "no"), was included as a covariate, the user would specify it in the scanning tool as follows:
tfpscan(..., test_cluster_odds = c("vaccine_status"), test_cluster_odds_value = c(1,0), ... )
If no values are provided to test_cluster_odds_value, the covariate is assumed to be continuous (e.g. age). For each included covariate, the odds of a sample belonging to each cluster given this variable will be estimated using conditional logistic regression, adjusting for time.
See the vignettes in the R package for examples of how to use the tfpscan(), treeview(), tfpscan_report(), and get_mlesky_node() functions. An example phylogeny, "tree_2021-12-30.nwk", and linked metadata, "amd_2021-12-30.nwk", are provided so the user can trial the scanning tool functions and outputs before running them on their own phylogeny and metadata. A further covariate example is included with vaccine_breakthrough and age_group.
N.B. The more options are included in tfpscan(), the more computational power and time are required to run the scanning tool. In particular, outputting tree figures and geo figures for every node is computationally expensive and not recommended unless required. An alternative is to output tree figures and geo figures with the tfpscan_report() function for a selected node, rather than for all nodes within a tree in the tfpscan() function.