treeWAS | R Documentation |
This function implements a phylogenetic approach to genome-wide association studies (GWAS)
designed for use in bacteria and viruses.
The treeWAS
approach measures the statistical association
between a phenotype of interest and the genotype at all loci,
seeking to identify significant associations, while accounting for the
confounding effects of clonal population structure and homologous recombination.
treeWAS(
snps,
phen,
tree = c("BIONJ", "NJ", "parsimony", "BIONJ*", "NJ*"),
phen.type = NULL,
n.subs = NULL,
n.snps.sim = ncol(snps) * 10,
chunk.size = ncol(snps),
mem.lim = FALSE,
test = c("terminal", "simultaneous", "subsequent"),
correct.prop = FALSE,
snps.reconstruction = "parsimony",
snps.sim.reconstruction = "parsimony",
phen.reconstruction = "parsimony",
na.rm = TRUE,
p.value = 0.01,
p.value.correct = c("bonf", "fdr", FALSE),
p.value.by = c("count", "density"),
dist.dna.model = "JC69",
plot.tree = TRUE,
plot.manhattan = TRUE,
plot.null.dist = TRUE,
plot.dist = FALSE,
plot.null.dist.pairs = "grid",
snps.assoc = NULL,
filename.plot = NULL,
seed = NULL
)
snps |
A matrix containing binary genetic data, with individuals in the rows and genetic loci in the columns and both rows and columns labelled. |
phen |
A vector containing the phenotypic state of each individual, whose length is equal to the number of rows in
|
tree |
A |
phen.type |
An optional character string specifying whether the phenotypic variable should be treated
as either |
n.subs |
A numeric vector containing the homoplasy distribution (if known, see details), or |
n.snps.sim |
An integer specifying the number of loci to be simulated for estimating the null distribution
(by default |
chunk.size |
An integer indicating the number of |
mem.lim |
A logical or numeric value to set a memory limit for large datasets.
If |
test |
A character string or vector containing one or more of the following available tests of association:
|
correct.prop |
A logical indicating whether the |
snps.reconstruction |
Either a character string specifying |
snps.sim.reconstruction |
A character string specifying |
phen.reconstruction |
Either a character string specifying |
na.rm |
A logical indicating whether columns in |
p.value |
A number specifying the base p-value to be set as the threshold of significance (by default, |
p.value.correct |
A character string, either |
p.value.by |
A character string specifying how the upper tail of the p-value distribution is to be identified.
Either |
dist.dna.model |
A character string specifying the type of model to use in reconstructing the phylogenetic tree for
calculating the genetic distance between individual genomes,
only used if |
plot.tree |
A logical indicating whether to generate a plot of the phylogenetic tree
( |
plot.manhattan |
A logical indicating whether to generate a manhattan plot for each association score
( |
plot.null.dist |
A logical indicating whether to plot the null distribution of association score statistics
( |
plot.dist |
A logical indicating whether to plot the true distribution of association score statistics
( |
plot.null.dist.pairs |
Either a logical indicating, for categorical phenotypes only, whether to plot
additional null distributions of association score statistics for each
pairwise comparison of phenotype levels ( |
snps.assoc |
An optional character string or vector specifying known associated loci to be demarked in
results plots (e.g., from previous studies or if data is simulated); else |
filename.plot |
An optional character string denoting the file location for
saving any plots produced (eg. "C:/Home/treeWAS_plots.pdf"); else |
seed |
An optional integer to control the pseudo-randomisation process and allow
for identical repeat runs of the function; else |
The treeWAS Approach
For a description of the approach adopted in our method,
please see either the treeWAS
vignette
or the treeWAS
GitHub Wiki.
A more detailed description of our method can also be found in our paper, available in PLOS Computational Biology.
Data Cleaning
The genetic data matrix, phenotype, and phylogenetic tree (terminal nodes) should all be labelled with corresponding sets of names for the individuals. The order of individuals does not matter, as they will be rearranged to match within the function.
Any individual that is not present in one or more of the genetic data matrix, the phenotypic variable, and/or the phylogenetic tree must be removed.
If, in the genetic data matrix, redundant columns are present for binary loci
(i.e., Denoting the state of the second allele as the inverse of the previous colummn), these should be removed.
For loci with more than two alleles, all columns should be retained. The removal of redundant binary columns
can be done by hand; but, the function get.binary.snps
may also be used. This function requires
column names to have a two-character suffix and unique locus identifiers. The function expects the suffixes
".a", ".c", ".g", ".t" (e.g., "Locus_123243.a", "Locus_123243.g"), though alternative two-character suffixes can be
used (e.g., "Locus_123243_1", "Locus_123243_2") by setting the argument force = TRUE
.
Please also be careful not to accidentally remove any purposeful duplications with repeated names;
for example, if you have deliberately duplicated columns without subsequently generating unique column names
(e.g., by expanding unique columns according to an index returned by ClonalFrameML).
Missing data is permitted (denoted by NA values only) in the genetic data matrix.
However, any row or column that is entirely missing will be automatically removed within treeWAS
.
In addition, any column that is more than 75
(though, if you do not wish this to be the case, you can set na.rm
to FALSE
).
The phylogenetic tree, if provided by the user, should contain only the terminal nodes corresponding to the individuals under analysis.
Any additional individuals, including the outgroup, if not under analysis should be removed prior to running treeWAS
.
The tree can be either rooted or unrooted.
Homoplasy Distribution
The homoplasy distribution contains the number of substitutions per site.
If this information is not known, it will be reconstructed within treeWAS
using Fitch's parsimony.
If this information is known (i.e., it has been estimated elsewhere through a parsimonious reconstruction),
it can be provided in the n.subs
argument.
It must be in the form of a named vector (or table), or a vector in which the i'th element
contains the number of loci that have been estimated to undergo i substitutions on the tree.
The vector must be of length max n.subs, and "empty" indices must contain zeros.
For example: the vector n.subs = c(1833, 642, 17, 6, 1, 0, 0, 1)
,
could be used to define the homoplasy distribution for a dataset with 2500 loci,
where the maximum number of substitutions to be undergone on the tree
by any locus is 8, and no loci undergo either 6 or 7 substitutions.
Ancestral State Reconstrution
If ancestral state reconstruction has been performed
outside of treeWAS
for the snps
and/or phen
variable,
these reconstructions can be submitted in the form of a matrix and a vector, respectively, to the
snps.reconstruction
and phen.reconstruction
arguments.
Please note the formatting requirements.
If provided by the user, snps.reconstruction
should contain snps
in its first nrow(snps)
rows,
and then have the reconstructed states in rows nrow(snps)+1
to max(tree$edge)
If provided by the user, phen.reconstruction
should contain phen
in its first length(phen)
elements,
and then have the reconstructed states in elements length(phen)+1
to max(tree$edge)
If created externally, the snps.reconstruction
must have
been generated through either parsimony or ML, and the
snps.sim.reconstruction
argument should be set to match
(as either "parsimony"
or "ML"
), so that a direct comparison can be made.
At least for small datasets, it may be worth (re-)running this
reconstruction within treeWAS
instead, in case any inconsistencies exist between the
external and internal methods of reconstruction.
In addition, if either snps.reconstruction
or phen.reconstruction
is being provided as input,
the user must ensure that tree$node.label
contains labels for all internal nodes and
that this same set of names is used to label the rows or indices correspondng
to the internal nodes in snps.reconstruction
and/or phen.reconstruction
.
Tests of Association
Notation used in the equations below:
G = Genotypic state...
P = Phenotypic state...
t = ... at terminal nodes
a = ... at ancestral nodes
d = ... at descendant nodes
Nterm = Number of terminal nodes
terminal
The terminal
test solves the following equation, for each genetic locus,
at the terminal nodes of the tree only:
Terminal = | (1/Nterm)*(Pt*Gt - (1 - Pt)*Gt - Pt*(1 - Gt) + (1 - Pt)*(1 - Gt)) |
The terminal
test is a sample-wide test of association that
seeks to identify broad patterns of correlation between genetic loci and the phenotype,
without relying on inferences drawn from reconstructions of the ancestral states.
simultaneous
The simultaneous
test solves the following equation, for each genetic locus,
across each branch in the tree:
Simultaneous = | (Pa - Pd)*(Ga - Gd) |
This allows for the identification of simultaneous substitutions (i.e., substitutions occuring in both the genetic locus and phenotypic variable on the same branch of the phylogenetic tree, or parallel change on the same branch if non-binary data). Simultaneous substitutions are an indicator of a deterministic relationship between genotype and phenotype. Moreover, because this score measures only simultaneous change, and is not negatively impacted by the lack of association on other branches, it may be able to detect associations occurring within some clades but not others and, therefore, to identify loci giving rise to the phenotype through complementary pathways.
subsequent
The subsequent
test solves the following equation, for each genetic locus,
across each branch in the tree:
Subsequent = | 4/3(Pa*Ga) + 2/3(Pa*Gd) + 2/3(Pd*Ga) + 4/3(Pd*Gd) - Pa - Pd - Ga - Gd + 1 |
Calculating this metric across all branches of the tree allows us to measure in what proportion of tree branches we expect the genotype and phenotype to be in the same state.
The output of 'treeWAS' contains the set of significant loci identified
as well as all relevant information used by or generated within treeWAS
.
To examine the significant findings alone, we recommend using the print
function.
The treeWAS
function returns a list object, which takes on the following general structure:
The first element is a list of length two containing the identities of significant findings:
$treeWAS.combined
The pooled set of significant loci identified by any association score.
$treeWAS
A list with the sets of significant loci identified by each association score individually.
There are list elements for each association score, containing the original score values
for each locus and additional information for significant loci.
By default, there will be three such $[SCORE]
-type elements called
$terminal
, $simultaneous
, and $subsequent
,
each of which will have the following elements:
$corr.dat
The association score values for loci in the empirical genetic dataset.
$corr.sim
The association score values for loci in the simulated genetic dataset.
$p.vals
The p-values associated with the loci in the empirical genetic dataset for this association score.
$sig.thresh
The significance threshold for this association score.
$sig.snps
A data frame describing the genetic loci identified as significant.
The last four columns will only be present if the data is binary, in which case they will
contain the cell counts of a 2x2 table of genotypic and phenotypic states for each significant locus.
row.names
: The column names of significant loci.
$SNP.locus
: The column positions of significant loci in dat$snps
(see below).
$p.value
: The p-values for significant loci.
$score
: The association score values for significant loci.
$G1P1
: N.individuals with genotype = 1 and phenotype = 1 at this locus.
$G0P0
: N.individuals with genotype = 0 and phenotype = 0 at this locus.
$G1P0
: N.individuals with genotype = 1 and phenotype = 0 at this locus.
$G0P1
: N.individuals with genotype = 0 and phenotype = 1 at this locus.
$min.p.value
The minimum p-value. P-values listed as zero can only truly be defined as being below this value.
The final element contains all of the data either used by or generated within treeWAS
.
Objects that were provided as inputs to the treeWAS
function will be returned here
in the form in which they were analysed (i.e., after data cleaning within treeWAS
).
$snps
The empirical genetic data matrix.
$snps.reconstruction
The ancestral state reconstruction of the empirical genetic data matrix.
$snps.sim
The simulated genetic data matrix.
$snps.sim.reconstruction
The ancestral state reconstruction of the simulated genetic data matrix.
$phen
The phenotypic variable.
$phen.reconstruction
The ancestral state reconstruction of the phenotype.
$tree
The phylogenetic tree.
$n.subs
The homoplasy distribution. Each element represents a number of substitutions (from 1 to length(n.subs)
)
and contains the number of loci that have been inferred to undergo that many substitutions.
Caitlin Collins caitiecollins@gmail.com
Collins, C. and Didelot, X. "A phylogenetic method to perform genome-wide association studies in microbes that accounts for population structure and recombination." PLoS Comput. Biol., vol. 14, p. e1005958, Feb. 2018.
## Not run:
## load example homoplasy distribution
data(dist_0)
str(dist_0)
## simulate a tree, phenotype, and genetic
## data matrix with 10 associated loci:
dat <- coalescent.sim(n.ind = 100,
n.snps = 1000,
n.subs = dist_0,
n.snps.assoc = 10,
assoc.prob = 90,
n.phen.subs = 15,
phen = NULL,
plot = TRUE,
heatmap = FALSE,
reconstruct = FALSE,
dist.dna.model = "JC69",
grp.min = 0.25,
row.names = NULL,
coaltree = TRUE,
s = NULL,
af = NULL,
filename = NULL,
set = 1,
seed = 1)
## isolate elements of output:
snps <- dat$snps
phen <- dat$phen
snps.assoc <- dat$snps.assoc
tree <- dat$tree
## run treeWAS:
out <- treeWAS(snps = snps,
phen = phen,
tree = tree,
seed = 1)
## examine output:
print(out)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.