knitr::opts_chunk$set(echo = TRUE)
THD is an R package for estimating and comparing the evolutionary success of individuals in a population based on genetic distances.
The primary application of THD is the quantitative analysis and identification of factors that influence the epidemic success of pathogens. The approach assigns a index of epidemic success to each individual in a population. This index is then typically used as the response variable in multivariate models including potential predictors of success. See the publication list below for applications.
Make sure the devtools
package is installed then run
devtools::install_github("rasigadelab/thd")
THD analysis requires a matrix of genetic distances and three user-defined parameters:
t
is the timescale parameter that controls the length of time considered in analysis. Shorter t
estimates success in recent times while longer t
considers long-term evolution. The choice depends on the evolutionary dynamics of the population. t
can be as small as 1 month for a fast-spreading viral epidemic or several hundred years for historical diseases such as tuberculosis.
m
is the number of markers used for computing the genetic distances. These markers can be anything from minisatellites to nucleotides as long as they convey a phylogenetic signal. For typical whole genome-based applications, m
is the effective genome size, i.e. the pathogen's genome size minus the size of regions excluded from analysis (typically repeated or low-quality regions).
mu
is the mutation (or substitution) rate for each marker per unit of time. This unit of time in mu
must be the same as the one used in t
.
We compute and plot THD values for a synthetic, minimalistic example of 6 individuals.
Load the package and simulate a toy matrix of genetic distances with 3 very similar individuals (A to C) and 3 more distant individuals (D to E).
library(thd) H <- matrix(c( 00, 00, 00, 00, 00, 00, 01, 00, 00, 00, 00, 00, 03, 01, 00, 00, 00, 00, 21, 18, 28, 00, 00, 00, 23, 25, 31, 07, 00, 00, 19, 27, 23, 09, 11, 00 ), ncol = 6, byrow = TRUE)
Make the matrix symmetric and assign names to individuals (THD keeps track of column names in its output).
H <- H + t(H) colnames(H) <- LETTERS[1:ncol(H)]
Select parameters. Suppose we used 64 markers, each with a substitution rate of 0.001 per marker per year, and we are interested in a timescale of 10 years.
m <- 64 mu <- 0.001 t <- 10
Use the thd
function to compute indices. The scale = time
option scales the output relative to the timescale (use ?thd
for details on scaling options).
x <- thd(H, t, m, mu, scale = "time")
Visualize the results along with a dendrogram of the isolates.
hc <- hclust(as.dist(H)) par(mfrow = c(2, 1)) par(mar = c(0, 6, 2, 6)) plot(hc, hang = -1, xlab = "", yaxt = "n", ylab = "", main = "", labels = FALSE) par(mar = c(3, 4, 0, 4)) barplot(x[hc$order], ylim = c(0.5, 0), ylab = "THD")
This (more involved) example illustrates computation of THD values with different timescales and identification of factors (metadata) associated with epidemic success. The tutorial also provides an introduction to model adjustment for population structure.
The dataset is derived from Rasigade et al. 2017 (open access).
We use a collection of 1,641 isolates of Mycobacterium tuberculosis complex from France.
The genotypes are 15-loci minisatellite markers, technically called mycobacterial interspersed repetitive units (MIRU).
The metadata are related to disease transmissibility, including the pulmonary localization of disease (lung
) and
the presence of acid-fast bacilli in the sputum (afb
), associated with high transmission risk.
Load the thd
package and the example dataset \code{tb}.
library(thd) data(tb)
The dataset contains the matrix of genotypes miru
and the metadata table meta
.
attach(tb) class(miru); dim(miru) class(meta); dim(meta); names(meta)
The next computations require a matrix of genetic distances. The package provides the convenience function hamming
to compute this matrix. (Note that this function is written in R to avoid compilation and thus is not very fast; consider using specialized packages such as ape
for larger datasets (> 10,000 individuals)).
H <- hamming(miru)
The THD timescale allows to focus the analysis on recent epidemic success (with a short timescale) or long-term success. Long-term success is more likely to reflect features of the pathogen rather than the host, because longer time scales (i.e. orders of magnitude above the usual infectivity period) consider success across numerous hosts, which is expected to average out the host's influence. (An obvious exception is pathogen-host coevolution, however it is not expected to weigh much in tuberculosis-human interactions over moderately short timescales)
The 3 parameters of a THD analysis are the timescale, the no. of markers m
and the substitution rate per marker mu
.
We use two timescales, a short-term timescale of 20y and a long-term timescale of 200y. Parameters m
and mu
are adapted
to the minisatellites used for genotyping.
tshort <- 20 tlong <- 200 m <- 15 mu <- 5e-4
Compute the THD values with short and long timescales using the thd
function and examine their distribution.
thd.short <- thd(H, tshort, m, mu) thd.long <- thd(H, tlong, m, mu) par(mfrow = c(1, 2)) hist(thd.short, xlab = "Short-term THD", main = "") hist(thd.long, xlab = "Long-term THD", main = "")
The distribution of short-term THDs is highly skewed. Most individuals have values near zero, denoting sporadic cases, while higher values denote isolates in recent clusters. Before we move on to the linear modelling step, it can be helpful to apply a transformation to reduce skewness. A simple but effective transformation is to obtain Z-scores after taking logs (log-Z transform). (This transformation is not strictly necessary for the weakly-skewed long-term THD but we apply anyway for consistency with short-term THD)
thd.short.lz <- scale(log(thd.short)) thd.long.lz <- scale(log(thd.long))
We compare THDs depending on two suspected drivers of success, sputum smear positivity and pulmonary localization. We examine two working hypothese to illustrate THD modelling:
sputum smear positivity (afb
) drives short-term success independent of TB population structure, because this
feature depends more on patient status (delayed treatment, immune impairment, etc) than pathogen features;
pulmonary infection (lung
) also drives success because extrapulmonary disease is less transmissible, but this feature depends more on the population structure because various TB lineages exhibit differences in their ability to cause active pulmonary disease.
Under these hypotheses, the association of afb
with THD should be stronger using the shorter 20y timescale compared with the longer 200y timescale. This association should not be solely explained by population structure if it is host-related. Conversely, the association of lung
with THD, which is expected to be pathogen-dependent and vertically inherited, should be stronger using the 200y timescale and should also depend on population structure.
Examine the distribution of THDs depending on the timescale, sputum smear positivity and pulmonary infection.
bp <- function(f, xlab, ylab) { boxplot(f, meta, ylim = c(-3,3), xlab = xlab, ylab = ylab) } par(mfrow = c(2,2)) bp(thd.short.lz ~ afb, "AFB smear positivity", "Short-term THD") bp(thd.short.lz ~ lung, "Pulmonary infection", "Short-term THD") bp(thd.long.lz ~ afb, "AFB smear positivity", "Long-term THD") bp(thd.long.lz ~ lung, "Pulmonary infection", "Long-term THD")
Visual inspection suggests that, as expected, both pulmonary disease and sputum smear positivity correlate with success. The amplitudes of difference are weak, with the smallest difference found for smear positivity and a long-term THD. This observation supports the working hypotheses because if smear positivity depends more on the host than on the pathogen, then its association with long-term THD should be weak when the influence of the host is diluted over long time frames.
We will now use linear modelling to examine the strength of associations and their dependency on population structure. Recall that by design, THD is highly dependent on population structure because it directly reflects this structure, so it is desirable to control for this effect.
The thd
package provides the thd.adjust
method to adjust for population structure.
This function computes a set of principal coordinates (PCs) significantly associated with the provided THD values.
Including these PCs as covariates of a linear model provides a means to control for population structure using the same data that were used to compute THD.
(More sophisticated methods exist to adjust for population structure, including phylogenetic generalized least squares and mixed-effect models. The thd.adjust
method, in line with
the philosophy of the thd
approach, puts emphasis on simplicity; it does not require phylogenetic reconstruction nor additional packages. See Price et al. Nat Genet 2006 and Li & Yu. Genet Epidemiol 2008 for
details on PC-based correction, and Revell. Meth Ecol Evol 2010 and Hoffman. PLoS One 2013 for alternatives.)
Compute the PC adjustment sets for short- and long-term THDs.
adj.short <- thd.adjust(thd.short.lz, H, m) adj.long <- thd.adjust(thd.long.lz, H, m) ncol(adj.short); ncol(adj.long)
Eight and 12 PCs were retained for short and long timescales, respectively. These adjustement sets will help determine whether associations with THD are structure-dependent:
An association is structure-dependent if its significance decreases after inclusion of the adjustment set in the model. Structure dependency indicates that the association is explained by the population structure; in the case of a pathogen such as TB, this suggests that the predictor depends on the pathogen itself and evolves with the pathogen with little evolutionary convergence.
A structure-independent association retains its significance after correction for population structure. This suggests either that the predictor depends on the host rather than on the pathogen or that the predictor is a pathogen feature that evolves independent of the pathogen structure - this latter case may indicate evolutionary convergence of the feature, which supports a causal effect.
We use analysis of variance on linear models to examine structure dependency of the association of smear positivity and short-term THD.
anova(lm(thd.short.lz ~ afb, meta)) anova(lm(thd.short.lz ~ adj.short + afb, meta))
The significance of the afb
predictor was retained (even reinforced by a factor ~10) after controlling for population structure.
This supports the hypothesis that afb
depends on the host rather than the pathogen (the alternative explanation, namely
that a pathogen feature under convergent evolution enhances smear positivity, is much less likely given current knowledge on TB).
Using the same approach with the lung
predictor yields the following models.
anova(lm(thd.long.lz ~ lung, meta)) anova(lm(thd.long.lz ~ adj.long + lung, meta))
The significance of the lung
predictor decreases strongly (by a factor ~40) after controlling for population structure.
This supports the hypothesis that an active pulmonary infection mostly depends on the pathogen itself, as supported by varying rates of active
pulmonary diseases in different TB lineages.
We can verify this point by modelling lung
as a function of population structure using logistic regression.
anova(glm(lung ~ adj.long, meta, family = binomial), test = "Chisq")
This model confirms a strong dependency of pulmonary infection and population structure. Performing the same analysis for sputum smear positivity and the appropriate adjustment set confirms that afb
, contrary to pulmonary infection, is not explained by population structure:
anova(glm(afb ~ adj.short, meta, family = binomial), test = "Chisq")
In this tutorial, we have seen how to use different timescales to examine hypotheses pertaining to different epidemic dynamics (here, AFB positivity of sputum smear as a host-dependent feature influencing short-term success; and a pulmonary disease as a pathogen-dependent feature influencing long-term success).
The tutorial illustrates how to use THD values (after transformation if required) in linear models, and the use of the thd.adjust
method to
examine dependendy on population structure.
Rasigade JP, Barbier M, Dumitrescu O, Pichat C, Carret G, Ronnaux-Baron AS, Blasquez G, Godin-Benhaim C, Boisset S, Carricajo A, Jacomo V, Fredenucci I, Pérouse de Montclos M, Flandrois JP, Ader F, Supply P, Lina G, Wirth T. Strain-specific estimation of epidemic success provides insights into the transmission dynamics of tuberculosis. Sci Rep. 2017 Mar 28;7:45326. doi: 10.1038/srep45326. PMID: 28349973
Barbier M, Dumitrescu O, Pichat C, Carret G, Ronnaux-Baron AS, Blasquez G, Godin-Benhaim C, Boisset S, Carricajo A, Jacomo V, Fredenucci I, Pérouse de Montclos M, Genestet C, Flandrois JP, Ader F, Supply P, Lina G, Wirth T, Rasigade JP. Changing patterns of human migrations shaped the global population structure of Mycobacterium tuberculosis in France. Sci Rep. 2018 Apr 11;8(1):5855. doi: 10.1038/s41598-018-24034-6. PMID: 29643428
Thierry Wirth, Marine Bergot, Jean-Philippe Rasigade, Bruno Pichon, Maxime Barbier, Patricia Martins-Simoes, Laurent Jacob, Rachel Pike, Pierre Tissieres, Jean-Charles Picaud, Angela Kearns, Philip Supply, Marine Butin, Frédéric Laurent, on behalf of the International Consortium for Staphylococcus capitis neonatal sepsis and the ESGS group of ESCMID. Niche specialization and spread of a multidrug-resistant Staphylococcus capitis clone involved in neonatal sepsis. Nat. Microbiol. 2020 (in press)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.