This tutorial illustrates computation of THD values with different timescales and identification of factors (metadata) associated with epidemic success. The tutorial also provides an introduction to model adjustment for population structure.
The dataset is derived from Rasigade et al. 2017 (open access).
We use a collection of 1,641 isolates of Mycobacterium tuberculosis complex from France.
The genotypes are 15-loci minisatellite markers, technically called mycobacterial interspersed repetitive units (MIRU).
The metadata are related to disease transmissibility, including the pulmonary localization of disease (lung
) and
the presence of acid-fast bacilly in the sputum (afb
), associated with high transmission risk.
Load the thd
package and the example dataset \code{tb}.
library(thd) data(tb)
The dataset contains the matrix of genotypes miru
and the metadata table meta
.
attach(tb) class(miru); dim(miru) class(meta); dim(meta); names(meta)
The next computations require a matrix of genetic distances. The package provides the convenience function hamming
to compute this matrix. (Note that this function is written in R to avoid compilation and thus is not very fast; consider using specialized packages such as ape
for larger datasets (> 10,000 individuals)).
H <- hamming(miru)
The THD timescale allows to focus the analysis on recent epidemic success (with a short timescale) or long-term success. Long-term success is more likely to reflect features of the pathogen rather than the host, because longer time scales (i.e. orders of magnitude above the usual infectivity period) consider success across numerous hosts, which is expected to average out the host's influence. (An obvious exception is pathogen-host coevolution, however it is not expected to weigh much in tuberculosis-human interactions over moderately short timescales)
The 3 parameters of a THD analysis are the timescale, the no. of markers m
and the substitution rate per marker mu
.
We use two timescales, a short-term timescale of 20y and a long-term timescale of 200y. Parameters m
and mu
are adapted
to the minisatellites used for genotyping.
tshort <- 20 tlong <- 200 m <- 15 mu <- 5e-4
Compute the THD values with short and long timescales using the thd
function and examine their distribution.
thd.short <- thd(H, tshort, m, mu) thd.long <- thd(H, tlong, m, mu) par(mfrow = c(1, 2)) hist(thd.short, xlab = "Short-term THD", main = "") hist(thd.long, xlab = "Long-term THD", main = "")
The distribution of short-term THDs is highly skewed. Most individuals have values near zero, denoting sporadic cases, while higher values denote isolates in recent clusters. Before we move on to the linear modelling step, it can be helpful to apply a transformation to reduce skewness. A simple but effective transformation is to obtain Z-scores after taking logs (log-Z transform). (This transformation is not strictly necessary for the weakly-skewed long-term THD but we apply anyway for consistency with short-term THD)
thd.short.lz <- scale(log(thd.short)) thd.long.lz <- scale(log(thd.long))
We compare THDs depending on two suspected drivers of success, sputum smear positivity and pulmonary localization. The working hypotheses that we examine in this section to illustrate THD modelling are as follows:
sputum smear positivity drives short-term success independent of TB population structure, because this feature depends more on patient status (delayed treatment, immune impairments, etc) than pathogen features;
pulmonary infection also drives success (as extrapulmonary disease is less transmissible) but depends more on the population structure because various TB lineages exhibit differences in their ability to cause active pulmonary disease.
Examine the distribution of THDs depending on the timescale, sputum smear positivity and pulmonary infection.
bp <- function(f, xlab, ylab) { boxplot(f, meta, ylim = c(-3,3), xlab = xlab, ylab = ylab) } par(mfrow = c(2,2)) bp(thd.short.lz ~ afb, "AFB smear positivity", "Short-term THD") bp(thd.short.lz ~ lung, "Pulmonary infection", "Short-term THD") bp(thd.long.lz ~ afb, "AFB smear positivity", "Long-term THD") bp(thd.long.lz ~ lung, "Pulmonary infection", "Long-term THD")
Visual inspection suggests that, as expected, both pulmonary disease and sputum smear positivity correlate with success. The amplitudes of difference are weak, with the smallest difference found for smear positivity and a long-term THD. This observation supports the working hypotheses because if smear positivity depends more on the host than on the pathogen, then its association with long-term THD should be weak because the effect of the host is diluted over long time frames.
We can now use linear models to examine the strength of associations and their dependency on population structure. Recall that by design, THD is highly dependent on population structure because it directly reflects this structure, so it is desirable to adjust for this effect.
The thd
package provides the thd.adjust
method to adjust for population structure.
This function computes a set of principal coordinates (PCs) significantly associated with the provided THD values.
Including these PCs as covariates of a linear model provides a means to control for population structure using the same data that were used to compute THD.
(More sophisticated methods exist to adjust for population structure, including mixed-effect models. The thd.adjust
method, in line with
the philosophy of the thd
approach, puts emphasis on simplicity. See Price et al. Nat Genet 2006 for
details on PC-based correction, and Hoffman. PLoS One 2013 for alternatives.)
Compute the PC adjustment sets for short- and long-term THDs.
adj.short <- thd.adjust(thd.short.lz, H, m) adj.long <- thd.adjust(thd.long.lz, H, m) ncol(adj.short); ncol(adj.long)
Eight and 12 PCs were retained for short and long timescales, respectively. These adjustement sets will help determine whether associations with THD are structure-dependent:
An association is structure-dependent if its significance decreases after inclusion of the adjustment set in the model. Structure dependency indicates that the predictor is tied to the population structure; in the case of a pathogen such as TB, this suggests that the predictor depends on the pathogen itself and evolves with the pathogen with little evolutionary convergence.
A structure-independent association retains its significance after correction for population structure. This suggests either that the predictor depends on the host rather than on the pathogen or that the predictor is a pathogen feature that evolves independent of the pathogen structure - this latter case strongly suggest evolutionary convergence of the feature, which supports a causal effect.
We use analysis of variance on linear models to examine structure dependency of the association of smear positivity and short-term THD.
anova(lm(thd.short.lz ~ afb, meta)) anova(lm(thd.short.lz ~ adj.short + afb, meta))
The significance of the afb
predictor was retained (even reinforced by a factor ~10) after controlling for population structure.
This supports the hypothesis that afb
depends on the host rather than the pathogen (the alternative explanation, namely
that a pathogen feature under convergent evolution enhances smear positivity, is much less likely given current knowledge on TB).
Using the same approach with the lung
predictor yields the following models.
anova(lm(thd.long.lz ~ lung, meta)) anova(lm(thd.long.lz ~ adj.long + lung, meta))
The significance of the lung
predictor decreases strongly (by a factor ~40) after controlling for population structure.
This supports the hypothesis that an active pulmonary infection mostly depends on the pathogen itself, as supported by varying rates of active
pulmonary diseases in different TB lineages.
We can verify this point by modelling lung
as a function of population structure using logistic regression.
anova(glm(lung ~ adj.long, meta, family = binomial), test = "Chisq")
This model confirms a strong dependency of pulmonary infection and population structure. Performing the same analysis for sputum smear positivity and the appropriate adjustment set confirms that afb
, contrary to pulmonary infection, is not explained by population structure:
anova(glm(afb ~ adj.short, meta, family = binomial), test = "Chisq")
In this tutorial, we have seen how to use different timescales to examine hypotheses pertaining to different epidemic dynamics (here, AFB positivity of sputum smear as a host-dependent feature influencing short-term success; and a pulmonary disease as a pathogen-dependent feature influencing long-term success).
The tutorial illustrates how to use THD values (after transformation if required) in linear models, and the use of the thd.adjust
method to
examine dependendy on population structure.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.