knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Building phylogenetic trees usually require multiple genes and the supermatrix approach is one of the popular ways to incorporate the information. However, the preparation of supermatrix is not straightforward and error-prone. The phylotools package aims to simplify the preparation of supermatrix. It automatically concatenated different gene markers by their names, i.e., building supermatrices based on the aligned Fasta or Phylip files.

The Relaxed Phylip Format

The resulted supermatrices files are in the Relaxed Phylip format, a format that could be used in RAxML and IQ-Tree. The Relaxed Phylip format is an extension of the strict Phylip format, i.e., the two numbers in the first line show the number of sequences and the number of base pairs (bp), respectively, with a white-space in between. The second line starts with the sequence name, and between the sequence name and the nucleotide sequences, there is a white-space. In the strict Phylip format, each sequence name only allows 10 characters, software such as Phylip (https://evolution.genetics.washington.edu/phylip.html) will truncate the name and only take the first 10 characters if there are more than 10. The Relaxed Phylip format does not have this limitation and the names can contain as long as 128 characters.

Why building supermatrix?

Why is it necessary to use multiple gene markers to build an evolutionary tree? This is mainly because the individual genes markers are generally too short to fully reflect the relationships of the species, and building more accurate evolutionary trees generally requires longer sequences. Obtaining DNA barcoding markers became relatively easy from the year 2005 on wards, and these markers are suitable for building large phylogenies.

Commonly used DNA barcoding markers

DNA barcoding sequences are generally obtained by the traditional Sanger sequencing method based on PCR products. rbcLa (Ribulose bisphosphate carboxylase large chain), matK (Megakaryocyte-Associated Tyrosine Kinase) and trnH-psbA (intergenic spacer region) are among the most commonly used DNA barcoding markers, and they all derived from the chloroplast genome. Other markers, such as ITS (Internal transcribed spacer), are also commonly used.

The gene marker's mutation rate

The mutation rates could vary greatly among different genes. Some genes are more stable and the differences between families and genera could be very minor, while others mutate rapidly and could be very different even among populations/individuals of the same species. For example, rbcLa is commonly used to locate the family a taxon belongs to, while matK could locate the genus or even species. Non-coding genes with very fast mutation rates, such as the trnH-psbA, are often used to determine species or subspecies. Using partitioned model during phylogenetic inferences is recommended.

Missing Data

Ideally, all the markers of a species should be obtained. However, some sequences of some species may not always be amplified or sequenced successfully, resulting in missing data. These missing loci will be denoted as ? in the supermatrix.

Similar software

In addition to phylotools, the following software can also be used to build supermatrix.

  1. Biopython (https://biopython.org/wiki/Concatenate_nexus)
  2. Utensils (https://github.com/ballesterus/Utensils)
  3. Geneious (https://assets.geneious.com/manual/11.1/GeneiousManualsu53.html)
  4. catfasta2phyml (https://github.com/nylander/catfasta2phyml)
  5. SEDA (http://www.sing-group.org/seda/ , https://www.biostars.org/p/332853/)
  6. SuperMatrix: (http://coleoguy.blogspot.com/2015/04/r-concatenate-alignments-into.html)
  7. concat (https://figshare.com/articles/concatenation_R/3839538)
  8. AMAS (https://www.ncbi.nlm.nih.gov/pmc/articl::es/PMC4734057/)

In which Genious has a graphical interface.

Features of Phylotools

The Installation

The latest version of phylotools is stored on github and has not been uploaded to CRAN. Use the following command to install.

devtools::install_github("helixcn/phylotools")

If devtools has not been installed, you can use install.packages("devtools") to install.

Functions in phylotools

Creating supermatrix with phylip files

Suppose there are three Phylip files encoding different genes, create the supermatrix of the three genes and save them in Relaxed Phylip format, and generate RAxML Partition File to record the start and end loci of each gene.

The corresponding codes are as follows.

library(phylotools) 
cat("6 22",
  "seq_1    --TTACAAATTGACTTATTATA",
  "seq_2    GATTACAAATTGACTTATTATA",
  "seq_3    GATTACAAATTGACTTATTATA",
  "seq_5    GATTACAAATTGACTTATTATA",
  "seq_8    GATTACAAATTGACTTATTATA",
  "seq_10   ---TACAAATTGAATTATTATA",
  file = "matk.phy", sep = "\n")

  cat("5 15",
  "seq_1     GATTACAAATTGACT",
  "seq_3     GATTACAAATTGACT",
  "seq_4     GATTACAAATTGACT",
  "seq_5     GATTACAAATTGACT",
  "seq_8     GATTACAAATTGACT",
  file = "rbcla.phy", sep = "\n")

  cat("5 50",
  "seq_2          GTCTTATAAGAAAGAATAAGAAAG--AAATACAAA-------AAAAAAGA",
  "seq_3          GTCTTATAAGAAAGAAATAGAAAAGTAAAAAAAAA-------AAAAAAAG",
  "seq_5          GACATAAGACATAAAATAGAATACTCAATCAGAAACCAACCCATAAAAAC",
  "seq_8          ATTCCAAAATAAAATACAAAAAGAAAAAACTAGAAAGTTTTTTTTCTTTG",
  "seq_9          ATTCTTTGTTCTTTTTTTTCTTTAATCTTTAAATAAACCTTTTTTTTTTA",
  file = "trn1.phy", sep = "\n")

supermat(infiles = c("matk.phy", "rbcla.phy", "trn1.phy"))

Generate supermatrix with fasta files

To generate a supermatrix using a matched fasta file, the command is changed to the corresponding fasta file name.

library(phylotools)

cat(
  ">seq_1",  "--TTACAAATTGACTTATTATA",
  ">seq_2",  "GATTACAAATTGACTTATTATA",
  ">seq_3",  "GATTACAAATTGACTTATTATA",
  ">seq_5",  "GATTACAAATTGACTTATTATA",
  ">seq_8",  "GATTACAAATTGACTTATTATA",
  ">seq_10", "---TACAAATTGAATTATTATA",
  file = "matk.fasta", sep = "\n")

cat(
  ">seq_1", "GATTACAAATTGACT",
  ">seq_3", "GATTACAAATTGACT",
  ">seq_4", "GATTACAAATTGACT",
  ">seq_5", "GATTACAAATTGACT",
  ">seq_8", "GATTACAAATTGACT",
  file = "rbcla.fasta", sep = "\n")

cat(
  ">seq_2", "GTCTTATAAGAAAGAATAAGAAAG--AAATACAAA-------AAAAAAGA",
  ">seq_3", "GTCTTATAAGAAAGAAATAGAAAAGTAAAAAAAAA-------AAAAAAAG",
  ">seq_5", "GACATAAGACATAAAATAGAATACTCAATCAGAAACCAACCCATAAAAAC",
  ">seq_8", "ATTCCAAAATAAAATACAAAAAGAAAAAACTAGAAAGTTTTTTTTCTTTG",
  ">seq_9", "ATTCTTTGTTCTTTTTTTTCTTTAATCTTTAAATAAACCTTTTTTTTTTA",
  file = "trn1.fasta", sep = "\n")

supermat(infiles = c("matk.fasta", "rbcla.fasta", "trn1.fasta"))

Citation

citation("phylotools")

Further reading



helixcn/phylotools documentation built on Oct. 11, 2024, 4:11 a.m.