treeWAS: Phylogenetic tree-based GWAS for microbes.
In caitiecollins/treeWAS: Phylogenetic tree-based microbial GWAS

treeWAS

R Documentation

Phylogenetic tree-based GWAS for microbes.

Description

This function implements a phylogenetic approach to genome-wide association studies (GWAS) designed for use in bacteria and viruses. The treeWAS approach measures the statistical association between a phenotype of interest and the genotype at all loci, seeking to identify significant associations, while accounting for the confounding effects of clonal population structure and homologous recombination.

Usage

treeWAS(
  snps,
  phen,
  tree = c("BIONJ", "NJ", "parsimony", "BIONJ*", "NJ*"),
  phen.type = NULL,
  n.subs = NULL,
  n.snps.sim = ncol(snps) * 10,
  chunk.size = ncol(snps),
  mem.lim = FALSE,
  test = c("terminal", "simultaneous", "subsequent"),
  correct.prop = FALSE,
  snps.reconstruction = "parsimony",
  snps.sim.reconstruction = "parsimony",
  phen.reconstruction = "parsimony",
  na.rm = TRUE,
  p.value = 0.01,
  p.value.correct = c("bonf", "fdr", FALSE),
  p.value.by = c("count", "density"),
  dist.dna.model = "JC69",
  plot.tree = TRUE,
  plot.manhattan = TRUE,
  plot.null.dist = TRUE,
  plot.dist = FALSE,
  plot.null.dist.pairs = "grid",
  snps.assoc = NULL,
  filename.plot = NULL,
  seed = NULL
)

Arguments

`snps`	A matrix containing binary genetic data, with individuals in the rows and genetic loci in the columns and both rows and columns labelled.
`phen`	A vector containing the phenotypic state of each individual, whose length is equal to the number of rows in `snps` and which is named with the same set of labels. The phenotype can be either binary (character or numeric), categorical (character or numeric), discrete or continuous (numeric). You must specify which type it is via the `phen.type` argument (otherwise treeWAS will assume binary or continuous).
`tree`	A `phylo` object containing the phylogenetic tree; or, a character string, one of `"NJ"`, `"BIONJ"` (the default), or `"parsimony"`; or, if NAs are present in the distance matrix, one of: `"NJ"` or `"BIONJ"`, specifying the method of phylogenetic reconstruction.
`phen.type`	An optional character string specifying whether the phenotypic variable should be treated as either `"categorical"`, `"discrete"` or `"continuous"`. If `phen.type` is `NULL` (the default), ancestral state reconstructions performed via ML will treat any binary phenotype as discrete and any non-binary phenotype as continuous. If `phen.type` is `"categorical"`, ML reconstructions and association tests will treat values as nominal (not ordered) levels and not as meaningful numbers. Categorical phenotypes must have >= 3 unique values (<= 5 recommended). If `phen.type` is `"continuous"`, ML reconstructions will treat values as meaningful numbers and may infer intermediate values.
`n.subs`	A numeric vector containing the homoplasy distribution (if known, see details), or `NULL` (the default).
`n.snps.sim`	An integer specifying the number of loci to be simulated for estimating the null distribution (by default `10*ncol(snps)`). If memory errors arise during the analyis of a large dataset, it may be necesary to reduce `n.snps.sim` from a multiple of 10 to, for example, 5x the number of loci.
`chunk.size`	An integer indicating the number of `snps` loci to be analysed at one time. This may be needed for large datasets or machines with insufficient memory. Smaller values of `chunk.size` will increase the computational time required (e.g., if `chunk.size = ncol(snps)/2`, treeWAS will take twice as long to complete).
`mem.lim`	A logical or numeric value to set a memory limit for large datasets. If `FALSE` (the default), no limit is estimated and `chunk.size` is not changed. If `TRUE`, available memory is estimated with `memfree()` and `chunk.size` is reduced. A single numeric value can be used to set a memory limit (in GB) and `chunk.size` will be reduced accordingly.
`test`	A character string or vector containing one or more of the following available tests of association: `"terminal"`, `"simultaneous"`, `"subsequent"`, `"cor"`, `"fisher"`. By default, the first three tests are run (see details).
`correct.prop`	A logical indicating whether the `"terminal"` and `"subsequent"` tests will be corrected for phenotypic class imbalance. Recommended if the proportion of individuals varies significantly across the levels of the phenotype (if binary) or if the phenotype is skewed (if continuous). If `correct.prop` is `FALSE` (the default), the original version of each test is run. If `TRUE`, an alternate association metric based on the phi correlation coefficient is calculated across the terminal and all (internal and terminal) nodes, respectively.
`snps.reconstruction`	Either a character string specifying `"parsimony"` (the default) or `"ML"` (maximum likelihood) for the ancestral state reconstruction of the genetic dataset, or a matrix containing this reconstruction if it has been performed elsewhere and you provide the tree.
`snps.sim.reconstruction`	A character string specifying `"parsimony"` (the default) or `"ML"` (maximum likelihood) for the ancestral state reconstruction of the simulated null genetic dataset.
`phen.reconstruction`	Either a character string specifying `"parsimony"` (the default) or `"ML"` (maximum likelihood) for the ancestral state reconstruction of the phenotypic variable, or a vector containing this reconstruction if it has been performed elsewhere.
`na.rm`	A logical indicating whether columns in `snps` containing more than 75% `NA`s should be removed at the outset (TRUE, the default) or not (FALSE).
`p.value`	A number specifying the base p-value to be set as the threshold of significance (by default, `0.01`).
`p.value.correct`	A character string, either `"bonf"` (the default) or `"fdr"`, specifying whether correction for multiple testing should be performed by Bonferonni correction (recommended) or the False Discovery Rate.
`p.value.by`	A character string specifying how the upper tail of the p-value distribution is to be identified. Either `"count"` (the default, recommended) for a simple count-based approach or `"density"` for a kernel-density based approximation.
`dist.dna.model`	A character string specifying the type of model to use in reconstructing the phylogenetic tree for calculating the genetic distance between individual genomes, only used if `tree` is a character string (see ?dist.dna).
`plot.tree`	A logical indicating whether to generate a plot of the phylogenetic tree (`TRUE`, the default) or not (`FALSE`).
`plot.manhattan`	A logical indicating whether to generate a manhattan plot for each association score (`TRUE`, the default) or not (`FALSE`).
`plot.null.dist`	A logical indicating whether to plot the null distribution of association score statistics (`TRUE`, the default) or not (`FALSE`).
`plot.dist`	A logical indicating whether to plot the true distribution of association score statistics (`TRUE`) or not (`FALSE`, the default).
`plot.null.dist.pairs`	Either a logical indicating, for categorical phenotypes only, whether to plot additional null distributions of association score statistics for each pairwise comparison of phenotype levels (`TRUE`) or not (`FALSE`), or the character string `"grid"` (the default) which will print all of these plots in one grid (N pairs x N tests).
`snps.assoc`	An optional character string or vector specifying known associated loci to be demarked in results plots (e.g., from previous studies or if data is simulated); else `NULL`.
`filename.plot`	An optional character string denoting the file location for saving any plots produced (eg. "C:/Home/treeWAS_plots.pdf"); else `NULL`.
`seed`	An optional integer to control the pseudo-randomisation process and allow for identical repeat runs of the function; else `NULL`.

Details

The treeWAS Approach

For a description of the approach adopted in our method, please see either the treeWAS vignette or the treeWAS GitHub Wiki.

A more detailed description of our method can also be found in our paper, available in PLOS Computational Biology.

Data Cleaning

The genetic data matrix, phenotype, and phylogenetic tree (terminal nodes) should all be labelled with corresponding sets of names for the individuals. The order of individuals does not matter, as they will be rearranged to match within the function.

Any individual that is not present in one or more of the genetic data matrix, the phenotypic variable, and/or the phylogenetic tree must be removed.

If, in the genetic data matrix, redundant columns are present for binary loci (i.e., Denoting the state of the second allele as the inverse of the previous colummn), these should be removed. For loci with more than two alleles, all columns should be retained. The removal of redundant binary columns can be done by hand; but, the function get.binary.snps may also be used. This function requires column names to have a two-character suffix and unique locus identifiers. The function expects the suffixes ".a", ".c", ".g", ".t" (e.g., "Locus_123243.a", "Locus_123243.g"), though alternative two-character suffixes can be used (e.g., "Locus_123243_1", "Locus_123243_2") by setting the argument force = TRUE. Please also be careful not to accidentally remove any purposeful duplications with repeated names; for example, if you have deliberately duplicated columns without subsequently generating unique column names (e.g., by expanding unique columns according to an index returned by ClonalFrameML).

Missing data is permitted (denoted by NA values only) in the genetic data matrix. However, any row or column that is entirely missing will be automatically removed within treeWAS. In addition, any column that is more than 75 (though, if you do not wish this to be the case, you can set na.rm to FALSE).

The phylogenetic tree, if provided by the user, should contain only the terminal nodes corresponding to the individuals under analysis. Any additional individuals, including the outgroup, if not under analysis should be removed prior to running treeWAS. The tree can be either rooted or unrooted.

Homoplasy Distribution

The homoplasy distribution contains the number of substitutions per site.

If this information is not known, it will be reconstructed within treeWAS using Fitch's parsimony.

If this information is known (i.e., it has been estimated elsewhere through a parsimonious reconstruction), it can be provided in the n.subs argument. It must be in the form of a named vector (or table), or a vector in which the i'th element contains the number of loci that have been estimated to undergo i substitutions on the tree. The vector must be of length max n.subs, and "empty" indices must contain zeros. For example: the vector n.subs = c(1833, 642, 17, 6, 1, 0, 0, 1), could be used to define the homoplasy distribution for a dataset with 2500 loci, where the maximum number of substitutions to be undergone on the tree by any locus is 8, and no loci undergo either 6 or 7 substitutions.

Ancestral State Reconstrution

If ancestral state reconstruction has been performed outside of treeWAS for the snps and/or phen variable, these reconstructions can be submitted in the form of a matrix and a vector, respectively, to the snps.reconstruction and phen.reconstruction arguments. Please note the formatting requirements.

If provided by the user, snps.reconstruction should contain snps in its first nrow(snps) rows, and then have the reconstructed states in rows nrow(snps)+1 to max(tree$edge)

If provided by the user, phen.reconstruction should contain phen in its first length(phen) elements, and then have the reconstructed states in elements length(phen)+1 to max(tree$edge)

If created externally, the snps.reconstruction must have been generated through either parsimony or ML, and the snps.sim.reconstruction argument should be set to match (as either "parsimony" or "ML"), so that a direct comparison can be made. At least for small datasets, it may be worth (re-)running this reconstruction within treeWAS instead, in case any inconsistencies exist between the external and internal methods of reconstruction.

In addition, if either snps.reconstruction or phen.reconstruction is being provided as input, the user must ensure that tree$node.label contains labels for all internal nodes and that this same set of names is used to label the rows or indices correspondng to the internal nodes in snps.reconstruction and/or phen.reconstruction.

Tests of Association

Notation used in the equations below:

G = Genotypic state...

P = Phenotypic state...

t = ... at terminal nodes

a = ... at ancestral nodes

d = ... at descendant nodes

Nterm = Number of terminal nodes

terminal

The terminal test solves the following equation, for each genetic locus, at the terminal nodes of the tree only:

Terminal = | (1/Nterm)*(Pt*Gt - (1 - Pt)*Gt - Pt*(1 - Gt) + (1 - Pt)*(1 - Gt)) |

The terminal test is a sample-wide test of association that seeks to identify broad patterns of correlation between genetic loci and the phenotype, without relying on inferences drawn from reconstructions of the ancestral states.

simultaneous

The simultaneous test solves the following equation, for each genetic locus, across each branch in the tree:

Simultaneous = | (Pa - Pd)*(Ga - Gd) |

This allows for the identification of simultaneous substitutions (i.e., substitutions occuring in both the genetic locus and phenotypic variable on the same branch of the phylogenetic tree, or parallel change on the same branch if non-binary data). Simultaneous substitutions are an indicator of a deterministic relationship between genotype and phenotype. Moreover, because this score measures only simultaneous change, and is not negatively impacted by the lack of association on other branches, it may be able to detect associations occurring within some clades but not others and, therefore, to identify loci giving rise to the phenotype through complementary pathways.

subsequent

The subsequent test solves the following equation, for each genetic locus, across each branch in the tree:

Subsequent = | 4/3(Pa*Ga) + 2/3(Pa*Gd) + 2/3(Pd*Ga) + 4/3(Pd*Gd) - Pa - Pd - Ga - Gd + 1 |

Calculating this metric across all branches of the tree allows us to measure in what proportion of tree branches we expect the genotype and phenotype to be in the same state.

Value

The output of 'treeWAS' contains the set of significant loci identified as well as all relevant information used by or generated within treeWAS. To examine the significant findings alone, we recommend using the print function.

The treeWAS function returns a list object, which takes on the following general structure:

$treeWAS.combined

The first element is a list of length two containing the identities of significant findings:

$treeWAS.combined The pooled set of significant loci identified by any association score.
$treeWAS A list with the sets of significant loci identified by each association score individually.

$[SCORE]

There are list elements for each association score, containing the original score values for each locus and additional information for significant loci. By default, there will be three such $[SCORE]-type elements called $terminal, $simultaneous, and $subsequent, each of which will have the following elements:

$corr.dat The association score values for loci in the empirical genetic dataset.
$corr.sim The association score values for loci in the simulated genetic dataset.
$p.vals The p-values associated with the loci in the empirical genetic dataset for this association score.
$sig.thresh The significance threshold for this association score.
$sig.snps A data frame describing the genetic loci identified as significant. The last four columns will only be present if the data is binary, in which case they will contain the cell counts of a 2x2 table of genotypic and phenotypic states for each significant locus.
- row.names: The column names of significant loci.
- $SNP.locus: The column positions of significant loci in dat$snps (see below).
- $p.value: The p-values for significant loci.
- $score: The association score values for significant loci.
- $G1P1: N.individuals with genotype = 1 and phenotype = 1 at this locus.
- $G0P0: N.individuals with genotype = 0 and phenotype = 0 at this locus.
- $G1P0: N.individuals with genotype = 1 and phenotype = 0 at this locus.
- $G0P1: N.individuals with genotype = 0 and phenotype = 1 at this locus.
$min.p.value The minimum p-value. P-values listed as zero can only truly be defined as being below this value.

$dat

The final element contains all of the data either used by or generated within treeWAS. Objects that were provided as inputs to the treeWAS function will be returned here in the form in which they were analysed (i.e., after data cleaning within treeWAS).

$snps The empirical genetic data matrix.
$snps.reconstruction The ancestral state reconstruction of the empirical genetic data matrix.
$snps.sim The simulated genetic data matrix.
$snps.sim.reconstruction The ancestral state reconstruction of the simulated genetic data matrix.
$phen The phenotypic variable.
$phen.reconstruction The ancestral state reconstruction of the phenotype.
$tree The phylogenetic tree.
$n.subs The homoplasy distribution. Each element represents a number of substitutions (from 1 to length(n.subs)) and contains the number of loci that have been inferred to undergo that many substitutions.

Author(s)

Caitlin Collins caitiecollins@gmail.com

References

Collins, C. and Didelot, X. "A phylogenetic method to perform genome-wide association studies in microbes that accounts for population structure and recombination." PLoS Comput. Biol., vol. 14, p. e1005958, Feb. 2018.

Examples


## Not run: 
## load example homoplasy distribution
data(dist_0)
str(dist_0)


## simulate a tree, phenotype, and genetic
## data matrix with 10 associated loci:
dat <- coalescent.sim(n.ind = 100,
                        n.snps = 1000,
                        n.subs = dist_0,
                        n.snps.assoc = 10,
                        assoc.prob = 90,
                        n.phen.subs = 15,
                        phen = NULL,
                        plot = TRUE,
                        heatmap = FALSE,
                        reconstruct = FALSE,
                        dist.dna.model = "JC69",
                        grp.min = 0.25,
                        row.names = NULL,
                        coaltree = TRUE,
                        s = NULL,
                        af = NULL,
                        filename = NULL,
                        set = 1,
                        seed = 1)

## isolate elements of output:
snps <- dat$snps
phen <- dat$phen
snps.assoc <- dat$snps.assoc
tree <- dat$tree

## run treeWAS:
out <- treeWAS(snps = snps,
                phen = phen,
                tree = tree,
                seed = 1)

## examine output:
print(out)


## End(Not run)

caitiecollins/treeWAS documentation built on Feb. 11, 2025, 11:25 a.m.