Tuberculosis minisatellite example

This tutorial illustrates computation of THD values with different timescales and identification of factors (metadata) associated with epidemic success. The tutorial also provides an introduction to model adjustment for population structure.

Introduction

The dataset is derived from Rasigade et al. 2017 (open access). We use a collection of 1,641 isolates of Mycobacterium tuberculosis complex from France. The genotypes are 15-loci minisatellite markers, technically called mycobacterial interspersed repetitive units (MIRU). The metadata are related to disease transmissibility, including the pulmonary localization of disease (lung) and the presence of acid-fast bacilly in the sputum (afb), associated with high transmission risk.

Load the thd package and the example dataset \code{tb}.

library(thd)
data(tb)

The dataset contains the matrix of genotypes miru and the metadata table meta.

attach(tb)
class(miru); dim(miru)
class(meta); dim(meta); names(meta)

The next computations require a matrix of genetic distances. The package provides the convenience function hamming to compute this matrix. (Note that this function is written in R to avoid compilation and thus is not very fast; consider using specialized packages such as ape for larger datasets (> 10,000 individuals)).

H <- hamming(miru)

Computing THD with short- and long-term timescales

The THD timescale allows to focus the analysis on recent epidemic success (with a short timescale) or long-term success. Long-term success is more likely to reflect features of the pathogen rather than the host, because longer time scales (i.e. orders of magnitude above the usual infectivity period) consider success across numerous hosts, which is expected to average out the host's influence. (An obvious exception is pathogen-host coevolution, however it is not expected to weigh much in tuberculosis-human interactions over moderately short timescales)

The 3 parameters of a THD analysis are the timescale, the no. of markers m and the substitution rate per marker mu. We use two timescales, a short-term timescale of 20y and a long-term timescale of 200y. Parameters m and mu are adapted to the minisatellites used for genotyping.

tshort <- 20
tlong <- 200
m <- 15
mu <- 5e-4

Compute the THD values with short and long timescales using the thd function and examine their distribution.

thd.short <- thd(H, tshort, m, mu)
thd.long  <- thd(H, tlong, m, mu)
par(mfrow = c(1, 2))
hist(thd.short, xlab = "Short-term THD", main = "")
hist(thd.long, xlab = "Long-term THD", main = "")

The distribution of short-term THDs is highly skewed. Most individuals have values near zero, denoting sporadic cases, while higher values denote isolates in recent clusters. Before we move on to the linear modelling step, it can be helpful to apply a transformation to reduce skewness. A simple but effective transformation is to obtain Z-scores after taking logs (log-Z transform). (This transformation is not strictly necessary for the weakly-skewed long-term THD but we apply anyway for consistency with short-term THD)

thd.short.lz <- scale(log(thd.short))
thd.long.lz <- scale(log(thd.long))

Linear modelling of THD to characterize drivers of epidemic success

We compare THDs depending on two suspected drivers of success, sputum smear positivity and pulmonary localization. The working hypotheses that we examine in this section to illustrate THD modelling are as follows:

Examine the distribution of THDs depending on the timescale, sputum smear positivity and pulmonary infection.

bp <- function(f, xlab, ylab) {
  boxplot(f, meta, ylim = c(-3,3), xlab = xlab, ylab = ylab)
}
par(mfrow = c(2,2))
bp(thd.short.lz ~ afb, "AFB smear positivity", "Short-term THD")
bp(thd.short.lz ~ lung, "Pulmonary infection", "Short-term THD")
bp(thd.long.lz ~ afb, "AFB smear positivity", "Long-term THD")
bp(thd.long.lz ~ lung, "Pulmonary infection", "Long-term THD")

Visual inspection suggests that, as expected, both pulmonary disease and sputum smear positivity correlate with success. The amplitudes of difference are weak, with the smallest difference found for smear positivity and a long-term THD. This observation supports the working hypotheses because if smear positivity depends more on the host than on the pathogen, then its association with long-term THD should be weak because the effect of the host is diluted over long time frames.

We can now use linear models to examine the strength of associations and their dependency on population structure. Recall that by design, THD is highly dependent on population structure because it directly reflects this structure, so it is desirable to adjust for this effect.

Adjusting for population structure

The thd package provides the thd.adjust method to adjust for population structure. This function computes a set of principal coordinates (PCs) significantly associated with the provided THD values. Including these PCs as covariates of a linear model provides a means to control for population structure using the same data that were used to compute THD.

(More sophisticated methods exist to adjust for population structure, including mixed-effect models. The thd.adjust method, in line with the philosophy of the thd approach, puts emphasis on simplicity. See Price et al. Nat Genet 2006 for details on PC-based correction, and Hoffman. PLoS One 2013 for alternatives.)

Compute the PC adjustment sets for short- and long-term THDs.

adj.short <- thd.adjust(thd.short.lz, H, m)
adj.long <- thd.adjust(thd.long.lz, H, m)
ncol(adj.short); ncol(adj.long)

Eight and 12 PCs were retained for short and long timescales, respectively. These adjustement sets will help determine whether associations with THD are structure-dependent:

AFB smear sputum positivity drives success independent of population structure

We use analysis of variance on linear models to examine structure dependency of the association of smear positivity and short-term THD.

anova(lm(thd.short.lz ~ afb, meta))
anova(lm(thd.short.lz ~ adj.short + afb, meta))

The significance of the afb predictor was retained (even reinforced by a factor ~10) after controlling for population structure. This supports the hypothesis that afb depends on the host rather than the pathogen (the alternative explanation, namely that a pathogen feature under convergent evolution enhances smear positivity, is much less likely given current knowledge on TB).

Pulmonary infection is a pathogen feature that depends on population structure

Using the same approach with the lung predictor yields the following models.

anova(lm(thd.long.lz ~ lung, meta))
anova(lm(thd.long.lz ~ adj.long + lung, meta))

The significance of the lung predictor decreases strongly (by a factor ~40) after controlling for population structure. This supports the hypothesis that an active pulmonary infection mostly depends on the pathogen itself, as supported by varying rates of active pulmonary diseases in different TB lineages.

We can verify this point by modelling lung as a function of population structure using logistic regression.

anova(glm(lung ~ adj.long, meta, family = binomial), test = "Chisq")

This model confirms a strong dependency of pulmonary infection and population structure. Performing the same analysis for sputum smear positivity and the appropriate adjustment set confirms that afb, contrary to pulmonary infection, is not explained by population structure:

anova(glm(afb ~ adj.short, meta, family = binomial), test = "Chisq")

Conclusions

In this tutorial, we have seen how to use different timescales to examine hypotheses pertaining to different epidemic dynamics (here, AFB positivity of sputum smear as a host-dependent feature influencing short-term success; and a pulmonary disease as a pathogen-dependent feature influencing long-term success).

The tutorial illustrates how to use THD values (after transformation if required) in linear models, and the use of the thd.adjust method to examine dependendy on population structure.



rasigadelab/thd documentation built on April 11, 2020, 7:27 a.m.