In caitiecollins/treeWAS: Phylogenetic tree-based microbial GWAS

Introduction

The treeWAS R package allows users to apply our phylogenetic tree-based appraoch to Genome-Wide Association Studies (GWAS) to microbial genetic and phenotypic data. In short, treeWAS measures the statistical association between a phenotype of interest and the genotype at all loci, with the aim of identifying significant associations, while correcting for the confounding effects of clonal population structure and homologous recombination. treeWAS is applicable to both bacterial and viral genetic data from both the core and accessory genomes, and to both binary and continuous phenotypes.

Installation

treeWAS is currently being hosted on GitHub at https://github.com/caitiecollins/treeWAS.

The most up-to-date version of treeWAS can be easily installed directly within R, using the devtools package.

#  library(knitr)
# opts_chunk$set(fig.path='figs', fig.keep='high', dev='pdf', fig.width=7, fig.height=7,
# tidy=FALSE, warning=FALSE, fig.show="asis", fig.align='center', out.width=".8\\textwidth")
## set syntax-highlighting color theme for R code?
# opts_knit$set(out.format = "latex")
# thm <- knit_theme$get("acid")  # parse the theme to a list
# knit_theme$set(thm)

## install devtools, if necessary:
install.packages("devtools", dep=TRUE)
library(devtools)

## install treeWAS from github:
install_github("caitiecollins/treeWAS", build_vignettes = TRUE, force=TRUE)
library(treeWAS)

library(treeWAS, quietly = TRUE, verbose=FALSE)

To open the vignette from within R (recommended if any formatted elements are not rendering properly where you are currently reading this), run browseVignettes and click on the HTML hyperlink:

browseVignettes("treeWAS")

The treeWAS Approach

The approach adopted within treeWAS is described fully in our paper, available in PLOS Computational Biology.

As a GWAS approach, treeWAS performs an unbiased search for statistically significant associations between a phenotype of interest and the genotype at all loci in a genetic dataset. No prior hypotheses about potential associations at candidate loci are required. Instead, a statistical approach is used to compare the genomes of individuals and to identify systematic differences in genotype that correspond to differences in the phenotype.

In addition to measuring associations and identifying statistically-significant findings, a central aim of treeWAS is to control for the confounding effects of clonal population structure and population stratification (overlap between the genetic ancestry and phenotypic states of individuals that gives rise to spurious associations) and homologous recombination. Our approach uses data simulation to disentangle genuine associations, with statistical significance and evolutionary support, from the noisy background of spurious associations arising by chance and from confounding factors. treeWAS simulates a "null" genetic dataset in such a way as to maintain several features of the empirical dataset, namely: its clonal genealogy, terminal phenotype, genetic composition, and homoplasy distribution (the number of substitutions per site due to both mutation and recombination). The simulated dataset is therefore able to capture these potentially-confounding features of the dataset under analysis, but without recreating any of the "true" associations, beyond those expected to arise by chance or as a result of these confounding factors.

Once the "null" genetic dataset has been generated, treeWAS can calculate association scores for loci in both the real and simulated datasets and compare the two. The association between each simulated locus and the empirical phenotype is measured and, collectively, these values form a null distribution of association score statistics. The degree of association between each locus in the empirical genetic dataset and the empirical phenotype is measured using the same association scores. At the upper tail of the null distribution, a threshold of significance is drawn at the quantile corresponding to: 1 - (a base p-value corrected for multiple testing). Any loci in the real dataset that have association score values lying above this threshold are deemed to be significantly associated to the phenotype.

Tests of Association

When measuring association between the phenotype of interest and the genotype, by default, three separate association scores are calculated for each locus in the genetic dataset.

Equations below use the following notation:

$g$ = Genotypic state...
$p$ = Phenotypic state...
$anc$ = ... at ancestral nodes
$des$ = ... at descendant nodes
$n_{term}$ = Number of terminal nodes
$n_{branch}$ = Number of branches

Terminal : The terminal test solves the following equation, for each genetic locus, at the terminal nodes of the tree only: $$ \begin{align} Terminal = | \sum_{i = 1}^{n_{term}}\frac{1}{n_{term}}(&p_{i}^{des}g_{i}^{des}\ -\ (1 - p_{i}^{des})g_{i}^{des}\ -\ \ &p_{i}^{des}(1 - g_{i}^{des})\ +\ (1 - p_{i}^{des})(1 - g_{i}^{des}))\ | \end{align} $$ : The terminal test is a sample-wide test of association that seeks to identify broad patterns of correlation between genetic loci and the phenotype, without relying on inferences drawn from reconstructions of the ancestral states.

Simultaneous : The simultaneous test solves the following equation, for each genetic locus, across each branch in the tree. $$ \begin{align} Simultaneous = |\sum_{i = 1}^{n_{branch}}(p_i^{anc} - p_i^{des})(g_i^{anc} - g_i^{des})\ | \end{align} $$ : This allows for the identification of simultaneous substitutions in both the genetic locus and phenotypic variable on the same branch of the phylogenetic tree (or parallel change in non-binary data). Simultaneous substitutions are an indicator of a deterministic relationship between genotype and phenotype. Moreover, because this score is not negatively impacted by the lack of association on other branches, it may be able to detect associations occurring through complementary pathways (i.e., in some clades but not others).

Subsequent : The subsequent test solves the following equation, for each genetic locus, across each branch in the tree: $$ \begin{align} Subsequent = |\sum_{i = 1}^{n_{branch}}\ \ (&\frac{4}{3}(p_i^{anc}g_i^{anc})\ +\ \frac{2}{3}(p_i^{anc}g_i^{des})\ +\ \ & \frac{2}{3}(p_i^{des}g_i^{anc})\ +\ \frac{4}{3}(p_i^{des}g_i^{des})\ -\ \ & p_i^{anc}\ -\ p_i^{des}\ -\ g_i^{anc}\ -\ g_i^{des}\ +\ 1)\ | \end{align} $$ : Calculating this metric across all branches of the tree allows us to measure in what proportion of tree branches we expect the genotype and phenotype to be in the same state. By drawing on inferences from the ancestral state reconstructions as well as the terminal states, this score may allows us to identify broad, if imperfect, patterns of association.

Using the treeWAS R Package

Here we will go over the functions and arguments used within the treeWAS package. We will use a simple example dataset available within treeWAS to illustrate the code involved. Integration with the ClonalFrameML software will be discussed in a subsequent section.

Data

Required Elements

To carry out a GWAS using treeWAS, the following data is required:

A genetic dataset : A matrix containing binary genetic data (whether this encodes SNPs, gene presence/absence, etc. is up to you). Individuals should be in the rows, and genetic variables in the columns. Both rows and columns must be appropriately labelled with unique names.

A phenotypic variable : A vector or factor containing either a binary or continuous variable encoding the phenotype for each individual. Each element should have a name that corresponds to the row labels of the genetic data matrix (order does not matter).

Optional Elements

The following optional elements can also be provided by the user or, alternatively, these can be generated automatically by the treeWAS function:

A phylogenetic tree : An object of class phylo (from the ape package), containing a phylogenetic tree. Tip labels are required and must correspond to both the row labels of the genetic data matrix and the names of the phenotypic variable. Any additional individuals, including the outgroup if not under analysis, should be removed prior to running treeWAS. The tree can be either rooted or unrooted.

An ancestral state reconstruction of the genotype : A matrix containing a reconstruction of the ancestral states of all loci in the genetic dataset, at all internal nodes. This should include the original states at the terminal nodes and have the same number of columns as the input genetic data matrix. The number of rows should be equal to the total number of nodes in the tree, including the terminal nodes (in rows 1:N). It must have been reconstructed with either parsimony or ML (see ?treeWAS), and you must provide the tree, for consistency with the ancestral state reconstruction of the simulated genetic dataset that will be performed within treeWAS.

An ancestral state reconstruction of the phenotype : A vector or factor containing a reconstruction of the ancestral states of the phenotypic variable at all nodes in the tree, including the original states at the terminal nodes (at positions 1:N).

Example Data

We will use the data stored within treeWAS as an example throughout this section of the vignette. We load the data using the data function and examine its structure below:

## Load example data:
data(snps)
data(phen)
data(tree)
data(snps.reconstruction)

## Examine data:
## genetic data
str(snps)

## phenotype
str(phen)
table(phen)

## tree
str(tree)

## genetic data reconstruction:
str(snps.reconstruction)

## Load colours:
data(phen.plot.col)

## Plot tree showing phenotype:
p <- plot_phen(tree, phen.nodes=phen.plot.col$all.nodes)

library(png)
library(grid)
img <- readPNG("figs/tree_phen_eg.png")
grid.raster(img)

Data Cleaning

Before running the treeWAS function, please ensure that your data is in the appropriate format.

Loading data:

The ape and adegenet packages loaded with treeWAS contain a number of functions useful for reading in data from a file and converting it. Some useful commands are included below.

For genetic data, the read.dna function can read in sequences from "sequential", "CLUSTAL", or "FASTA" formatted files. One way to convert the resulting DNAbin class object to a binary matrix is by way of the DNAbin2genind function.

## (Template code, don't run)
## Read data from file:
dna <- read.dna(file = "filename", format = "fasta")

## Convert: 
mat <- DNAbin2genind(dna)@tab

Then, see "Biallelic loci" below to remove redundant columns.

Phylogenetic trees can also be imported. Good tree-building programs exist outside of R, and we encourage users to provide a tree that has been built with external software, especially if the organism under analysis undergoes recombination (see the section below on "Integration with ClonalFrameML"). To read trees from file, read.tree can be used for trees stored in Newick/New Hampshire format and read.nexus for trees of Nexus format. Both return a phylo object, so no further conversion is required.

## (Template code, don't run)
## Read tree from a Newick/New Hampshire file:
tree <- read.tree(file = "filename")

## Or from a Nexus file:
tree <- read.nexus(file = "filename")

## Convert: 
mat <- DNAbin2genind(dna)@tab

Phenotypic data can be read in from a text or CSV file with read.table. Then, simply convert to a vector and retain the names.

## (Template code, don't run)
## Read file:
df <- read.table(file="filename")

## Convert:
phen <- as.vector(unlist(df))
names(phen) <- rownames(df)

Labels:

The genetic data matrix, phenotype, and phylogenetic tree (terminal nodes) should all be labelled with corresponding sets of names for the individuals.

Any individual that is not present in one or more of the genetic data matrix, the phenotypic variable, and/or the phylogenetic tree must be removed.

## Check that labels are present:
is.null(tree$tip.label)
is.null(rownames(snps))
is.null(names(phen))

## Cross-check labels with each other:
all(tree$tip.label %in% rownames(snps))
all(rownames(snps) %in% tree$tip.label)
all(tree$tip.label %in% names(phen))
all(names(phen) %in% tree$tip.label)
all(names(phen) %in% rownames(snps))
all(rownames(snps) %in% names(phen))

The above checks are run within treeWAS, but it may be worthwhile to confirm for yourself that all labels are present and look correct.

In addition, if either snps.reconstruction or phen.reconstruction is being provided as input, the user must ensure that tree$node.label contains labels for all internal nodes and that this same set of names is used to label the rows or indices correspondng to the internal nodes in snps.reconstruction and/or phen.reconstruction.

Biallelic loci:

If, in the genetic data matrix, redundant columns are present for binary loci (i.e., Denoting the state of the second allele as the inverse of the previous column), these should be removed. For loci with more than two alleles, all columns should be retained.

The removal of redundant binary columns can be done by hand; but, the function get.binary.snps may also be used. This function requires column names to have a two-character suffix and unique locus identifiers. The function expects the suffixes ".a", ".c", ".g", ".t" (e.g., "Locus_123243.a", "Locus_123243.g"), though alternative two-character suffixes can be used (e.g., "Locus_123243_1", "Locus_123243_2") by setting the argument force = TRUE.

Please also be careful not to accidentally remove any purposeful duplications with repeated names, for example, if you have deliberately duplicated columns (e.g., by expanding unique columns according to an index returned by ClonalFrameML), assign unique column names before removing unwanted columns.

Before running the get.binary.snps function, you can check the suffixes of the genetic data matrix's column names using another small function in treeWAS, keepLastN, which allows you to keep the last N characters of a variable or vector:

suffixes <- keepLastN(colnames(snps), n = 2)
suffixes <- unique(suffixes)

if(all(suffixes %in% c(".a", ".c", ".g", ".t"))){
  ## SNPs:
  snps <- get.binary.snps(snps)

  ## ... Optional step (for snps.reconstruction, if present)
  ## (Example only, if needed for user data)
  if(is.matrix(snps.reconstruction)){
    snps.reconstruction <- snps.reconstruction[, which(colnames(snps.reconstruction) %in% colnames(snps))]
  }
}

Missing data:

Missing data is permitted (denoted by NA values only) in the genetic data matrix. Any column that is entirely missing will be automatically removed within treeWAS. Any column composed of more than 75% NAs will also be removed; however, if for some reason you do not wish this to be the case, set the argument na.rm to FALSE.

Missing data is also permitted in the phenotype, although any missing individuals will be automatically removed from the analysis.

Trees:

The phylogenetic tree, if provided by the user, should contain only the terminal nodes corresponding to the individuals under analysis. Any additional individuals, including the outgroup, if not under analysis should be removed prior to running treeWAS. The tree can be either rooted or unrooted.

Continuous phenotypes:

If you are working with a continuous phenotype, you may be able to apply treeWAS to the phenotype directly.

But, if you are working with a particularly skewed phenotype or one with serious outliers, you may want to modify your data prior to running treeWAS. It is important to examine the distribution of your phenotype.

## Load eg. variable: 
data(phen.cont)

## Examine phen distribution:
hist(phen.cont)

library(png)
library(grid)
img <- readPNG("figs/plot_hist_phen.png")
grid.raster(img)

A skewed phenotypic distribution, like this one, can have a negative impact on the performance of treeWAS (causing left-skewed null distributions in Scores 1 and 3, and false positives). It may be more appropriate to first transform the phenotype by rank, and analyse the relative values (ranks) rather than the original values.

Identify the ranks of your phenotypic values and examine the distribution of the phenotypic ranks.

## Get ranks of phen:
phen.cont.rank <- rank(phen.cont, ties.method = "average")
names(phen.cont.rank) <- names(phen.cont)

## Examine distribution of phen ranks:
hist(phen.cont.rank)

library(png)
library(grid)
img <- readPNG("figs/plot_hist_phen_rank.png")
grid.raster(img)

The phenotypic ranks are likely to have a more uniform distribution than the original values. Greater uniformity in this distribution will improve the performance of treeWAS. Consider replacing the phenotype with the ranks thereof.

## Store original phen: 
phen.cont.ori <- phen.cont

## Replace phen with ranks of phen:
phen.cont <- phen.cont.rank

If you choose not to transform the phenotype by rank, you may want to consider removing outliers if any are present, as these may similarly distort the analysis of continuous phenotypes.

Arguments

The treeWAS function takes the following arguments:

## Don't run this:
out <- treeWAS(snps,
                phen,
                tree = c("BIONJ", "NJ", "BIONJ*", "NJ*"),
                n.subs = NULL,
                n.snps.sim = ncol(snps)*10,
                chunk.size = ncol(snps),
                mem.im = FALSE,
                test = c("terminal", "simultaneous", "subsequent"),
                snps.reconstruction = "parsimony",
                snps.sim.reconstruction = "parsimony",
                phen.reconstruction = "parsimony",
                phen.type = NULL,
                na.rm = TRUE,
                p.value = 0.01,
                p.value.correct = c("bonf", "fdr", FALSE),
                p.value.by = c("count", "density"),
                dist.dna.model = "JC69",
                plot.tree = TRUE,
                plot.manhattan = TRUE,
                plot.null.dist = TRUE,
                plot.dist = FALSE,
                snps.assoc = NULL, 
                filename.plot = NULL,
                seed = NULL)

snps : A matrix containing binary genetic data, with individuals in the rows and genetic loci in the columns and both rows and columns labelled.

phen : A vector containing the phenotypic state of each individual, whose length is equal to the number of rows in snps and which is named with the same set of labels. The phenotype can be either binary (character or numeric) or continuous (numeric).

tree : A phylo object containing the phylogenetic tree; or, a character string, one of "NJ", "BIONJ" (the default), or "parsimony"; or, if NAs are present in the distance matrix, one of: "NJ*" or "BIONJ*", specifying the method of phylogenetic reconstruction.

n.subs : A numeric vector containing the homoplasy distribution (if known, see details), or NULL (the default).

n.snps.sim : An integer specifying the number of loci to be simulated for estimating the null distribution (by default 10*ncol(snps)). Note that 10x is the recommended minimum: where possible (i.e., for datasets that are not very large),
simulating more loci (e.g., 100*ncol(snps)) may further improve results.

chunk.size : An integer indicating the number of snps loci to be analysed at one time. This provides a solution for machines with insufficient memory to analyse the dataset at hand. Note that smaller values of chunk.size will increase the computational time required (e.g., for chunk.size = ncol(snps)/2, treeWAS will take twice as long to complete).

mem.lim : Either a number or a logical value to establish a memory limit (in GB) that will be used to automatically update the chunk.size argument if there is not enough available memory to run treeWAS in one chunk. If FALSE (the default), no limit is estimated and chunk.size is not changed. If TRUE, the amount of memory currently available is estimated with memfree() and chunk.size is scaled back to account for the amount of memory estimated to be needed by treeWAS for this dataset. If a single numeric value, this is taken to be the amount of memory (in GB) available/designated for use by treeWAS and chunk.size is updated to reflect this.

test : A character string or vector containing one or more of the following available tests of association: "terminal", "simultaneous", "subsequent", "cor", "fisher". By default, the first three tests are run (see details).

snps.reconstruction : Either a character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the genetic dataset, or a matrix containing this reconstruction if it has been performed elsewhere and you provide the tree.

snps.sim.reconstruction : A character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the simulated null genetic dataset.

phen.reconstruction : Either a character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the phenotypic variable, or a vector containing this reconstruction if it has been performed elsewhere.

phen.type : An optional character string specifying whether the ancestral state reconstruction of the phenotypic variable, if performed via ML, should treat the phenotype as either "discrete" or "continuous". By default, phen.type is NULL, in which case ML reconstructions will be "continuous" for any non-binary phenotypes.

na.rm : A logical indicating whether to remove snps columns if they contain more than 75% NAs (by default, TRUE).

p.value : A number specifying the base p-value to be set the threshold of significance (by default, 0.01).

p.value.correct : A character string, either "bonf" (the default) or "fdr", specifying whether correction for multiple testing should be performed by Bonferonni correction (recommended) or the False Discovery Rate.

p.value.by : A character string specifying how the upper tail of the p-value distribution is to be identified. Either "count" (the default, recommended) for a simple count-based approach or "density" for a kernel-density based approximation.

dist.dna.model : A character string specifying the type of model to use in reconstructing the phylogenetic tree for calculating the genetic distance between individual genomes, only used if tree is a character string (see ?dist.dna).

plot.tree : A logical indicating whether to generate a plot of the phylogenetic tree (TRUE, the default) or not (FALSE).

plot.manhattan : A logical indicating whether to generate a manhattan plot for each association score (TRUE, the default) or not (FALSE).

plot.null.dist : A logical indicating whether to plot the null distribution of association score statistics (TRUE, the default) or not (FALSE).

plot.dist : A logical indicating whether to plot the true distribution of association score statistics (TRUE) or not (FALSE, the default).

snps.assoc : An optional character string or vector specifying known associated loci to be demarked in results plots (e.g., from previous studies or if data is simulated); else NULL.

filename.plot : An optional character string denoting the file location for saving any plots produced (eg. "C:/Home/treeWAS_plots.pdf"); else NULL.

seed : An optional integer to control the pseudo-randomisation process and allow for identical repeat runs of the function; else NULL.

Running treeWAS

Running treeWAS takes only one function. It should finish within a few minutes, depending on the size of the dataset.

out <- treeWAS(snps = snps,
                phen = phen,
                tree = tree,
                seed = 1)

Interpreting Output

Plots

If plot.tree is set to TRUE, the tree will be plotted, with the terminal phenotype indicated by tip colour. If plot.null.dist is set to TRUE, the null distributions and findings will be plotted for each association score. If plot.dist is TRUE, a distribution of the empirical association score values will be plotted for each association score. And if plot.manhattan is set to TRUE, a manhattan plot will show the values for each each association score, with a threshold delineating the significant findings from the insignificant.

By default, plot.tree is set to TRUE.

library(png)
library(grid)
img <- readPNG("figs/plot_tree.png")
grid.raster(img)

By default, plot.null.dist is set to TRUE.

library(png)
library(grid)
img <- readPNG("figs/plot_hist_terminal.png")
grid.raster(img)

library(png)
library(grid)
img <- readPNG("figs/plot_hist_simultaneous.png")
grid.raster(img)

library(png)
library(grid)
img <- readPNG("figs/plot_hist_subsequent.png")
grid.raster(img)

By default, plot.manhattan is also set to TRUE.

library(png)
library(grid)
img <- readPNG("figs/plot_manhattan_terminal.png")
grid.raster(img)

library(png)
library(grid)
img <- readPNG("figs/plot_manhattan_simultaneous.png")
grid.raster(img)

library(png)
library(grid)
img <- readPNG("figs/plot_manhattan_subsequent.png")
grid.raster(img)

Printing Output

In the example above, the output of the treeWAS function was assigned to an object called out. It is a large object. As such, we recommend that the print.treeWAS function be used to examine the set of results identified (NB: print.treeWAS is just the print function for an object of class treeWAS):

## Example output:
data(treeWAS.example.out)
out <- treeWAS.example.out
class(out) # treeWAS

print(out, sort.by.p=FALSE)

Writing Output to CSV

The results contained in the output of treeWAS can also be saved in a CSV file. The write.treeWAS function will write a summary table to CSV, containing the names and loci of all significant findings, as well as their scores and p-values, as well as the cells of a 2x2 contingency table of genotype and phenotype if the phenotype is binary.

All that is required is to run the write.treeWAS function and specify a file name to which the ".csv" file will be saved. By default, the filename argument is "./treeWAS_results". The "./" prefix means that by default the file will be saved to the current working directory, "~/" will save to the home directory, or you can specify a path explicitly ahead of the desired file name. You can see an example by running the following command on the output from the example above (the object called out):

write.treeWAS(out, filename="./treeWAS_results")

Output Returned

The output of treeWAS contains the set of significant loci identified as well as all relevant information used by or generated within treeWAS. The treeWAS function returns a list object (in our example, called out), which takes on the following general structure (run str(out) to examine the structure of our example output):

$treeWAS.combined, the first element, is a list of length two containing the identities of significant findings:

$treeWAS.combined : The pooled set of significant loci identified by any association score.

$treeWAS : A list with the sets of significant loci identified by each association score individually.

\$[SCORE], list elements for each association score, contain the original score values for each locus and additional information for significant loci. By default, there will be three such $[SCORE]-type elements called $terminal, $simultaneous, and $subsequent, each of which will have the following elements:

$corr.dat : The association score values for loci in the empirical genetic dataset.

$corr.sim : The association score values for loci in the simulated genetic dataset.

$p.vals : The p-values associated with the loci in the empirical genetic dataset for this association score.

$sig.thresh : The significance threshold for this association score.

$sig.snps : A data frame describing the genetic loci identified as significant. The last four columns will only be present if the data is binary, in which case they will contain the cell counts of a 2x2 table of genotypic and phenotypic states for each significant locus. : row.names: The column names of significant loci. : $SNP.locus: The column positions of significant loci in dat$snps (see below). : $p.value: The p-values for significant loci. : $score: The association score values for significant loci. : $G1P1: The number of individuals with genotype = 1 and phenotype = 1 at this locus. : $G0P0: The number of individuals with genotype = 0 and phenotype = 0 at this locus. : $G1P0: The number of individuals with genotype = 1 and phenotype = 0 at this locus. : $G0P1: The number of individuals with genotype = 0 and phenotype = 1 at this locus.

$min.p.value : The minimum p-value. P-values listed as zero can only truly be defined as below this value.

$dat, the final element, contains all of the data either used by or generated within treeWAS. Objects that were provided as inputs to the treeWAS function will be returned here in the form in which they were analysed (i.e., after data cleaning within treeWAS).

$snps : The empirical genetic data matrix.

$snps.reconstruction : The ancestral state reconstruction of the empirical genetic data matrix.

$snps.sim : The simulated genetic data matrix.

$snps.sim.reconstruction : The ancestral state reconstruction of the simulated genetic data matrix.

$phen : The phenotypic variable.

$phen.reconstruction : The ancestral state reconstruction of the phenotype.

$tree : The phylogenetic tree.

$n.subs : The homoplasy distribution. Each element represents a number of substitutions (from 1 to length(n.subs)) and contains the number of loci that have been inferred to undergo that many substitutions.

Integration with ClonalFrameML

The treeWAS R package has also been designed to work with the output of ClonalFrameML. By using the simple read.CFML function, you can convert the output of ClonalFrameML into a form suitable for input into treeWAS.

The Data

For practice using ClonalFrameML, download the ClonalFrameML example data and follow the steps on the wiki to arrive at the example.output dataset. To run the example below, replace the prefix with the path to the example.output dataset on your computer or to another set of output from ClonalFrameML containing:

"prefix.labelled_tree.newick"
"prefix.ML_sequence.fasta"
"prefix.position_cross_reference.txt"

Convert the Data

To convert the data, use the read.CFML function:

prefix <- "./example.output"
dat <- read.CFML(prefix=prefix)

Isolate the elements of the output of read.CFML:

## Required input into treeWAS:
snps <- dat$snps

## Recommended input into treeWAS:
tree <- dat$tree

## Optional input into treeWAS:
n.subs <- dat$n.subs
snps.rec <- dat$snps.rec

Input & Arguments

You can now run treeWAS using this snps matrix, and the tree returned from read.CFML.

Instead of supplying snps.rec to the argument snps.reconstruction, it may be worth (re-)creating the ancestral state reconstruction within treeWAS. This is recommended, provided the dataset you are working with is not extremely large, in case any inconsistencies exist between the reconstruction performed by ClonalFrameML and that run within treeWAS (which will also be applied to the simulated null dataset).

The dist object can be provided in the n.subs argument, though you may also leave this NULL and have treeWAS recreate n.subs internally, again in case there is any discrepancy between the internal and external methods or reconstruction.

All you would need to run treeWAS with this example dataset is a phenotype. We can make a toy phenotype for this purpose, although no significant findings should result:

set.seed(1)
phen <- sample(c(0,1), nrow(snps), replace=TRUE)
phen <- as.factor(phen)
names(phen) <- rownames(snps)

## Exmine toy phenotype
str(phen)
table(phen)

Running treeWAS

You can now run treeWAS using the converted ClonalFrameML output:

## Example not run... 
out <- treeWAS(snps = snps,
                phen = phen,
                tree = tree,
                n.subs = NULL,  ## (optional input: n.subs)
                n.snps.sim = ncol(snps)*10,
                chunk.size = ncol(snps),
                test = c("terminal", "simultaneous", "subsequent"),
                snps.reconstruction = "parsimony",  ## (optional input: snps.rec)
                snps.sim.reconstruction = "parsimony",
                phen.reconstruction = "parsimony",
                phen.type = NULL,
                na.rm = TRUE,
                p.value = 0.01,
                p.value.correct = "bonf",
                p.value.by = "count",
                dist.dna.model = "JC69",
                plot.tree = TRUE,
                plot.manhattan = TRUE,
                plot.null.dist = TRUE,
                plot.dist = FALSE,
                snps.assoc = NULL, 
                filename.plot = NULL,
                seed = 1)

Bugs & Features

Report a Bug

The treeWAS R package is still in its early days. As such, you may encounter an error before we do. If you believe you have identified a bug, or you are unable to overcome an error after consulting the documentation, we encourage you to visit our Issues Page. If your issue has not been documented by another user, please Post A New Issue.

Request a Feature

There may be a feature absent from treeWAS that you wish could be performed by the R package. We welcome your suggestions and invite you to Place a Feature Request on our Issues Page.

We are working towards a number of features in the near future, for example, allowing treeWAS to handle non-binary categorical phenotypes.

caitiecollins/treeWAS documentation built on March 9, 2024, 3:15 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

caitiecollins/treeWAS
Phylogenetic tree-based microbial GWAS

In caitiecollins/treeWAS: Phylogenetic tree-based microbial GWAS

Introduction

Installation

The treeWAS Approach

Tests of Association

Using the treeWAS R Package

Data

Required Elements

Optional Elements

Example Data

Data Cleaning

Loading data:

Labels:

Biallelic loci:

Missing data:

Trees:

Continuous phenotypes:

Arguments

Running treeWAS

Interpreting Output

Plots

Printing Output

Writing Output to CSV

Output Returned

Integration with ClonalFrameML

The Data

Convert the Data

Input & Arguments

Running treeWAS

Bugs & Features

Report a Bug

Request a Feature

R Package Documentation

Browse R Packages

We want your feedback!

caitiecollins/treeWAS Phylogenetic tree-based microbial GWAS

In caitiecollins/treeWAS: Phylogenetic tree-based microbial GWAS

Introduction

Installation

The treeWAS Approach

Tests of Association

Using the treeWAS R Package

Data

Required Elements

Optional Elements

Example Data

Data Cleaning

Loading data:

Labels:

Biallelic loci:

Missing data:

Trees:

Continuous phenotypes:

Arguments

Running treeWAS

Interpreting Output

Plots

Printing Output

Writing Output to CSV

Output Returned

Integration with ClonalFrameML

The Data

Convert the Data

Input & Arguments

Running treeWAS

Bugs & Features

Report a Bug

Request a Feature

R Package Documentation

Browse R Packages

We want your feedback!

caitiecollins/treeWAS
Phylogenetic tree-based microbial GWAS