In heibl/megaptera: MEGAPhylogeny Techniques in R

To show how to use the megaptera package, we are going to construct a phylogenetic dataset of the mammal order Cetacea (whales and dolphins). With 80-90 extant species, Cetacea present an ideal group for demonstration purposes. And maybe a motivating one, as I hope that many users will share my fascination for this mysterious creatures. Last not least the monotypic genus of humpback whales served as an eponym of this package.

Introduction

This package provides tools for the semi-automated generation of large DNA sequences datasets from internet-accessible depositories, currently the Nucleotide database on GenBank (http://www.ncbi.nlm.nih.gov/nuccore) and BOLD SYSTEMS. It uses a number of successive steps to query, download, check, sort and assemble DNA sequences. The minimum input of information required by the algorithm is the name of a DNA sequence marker (e.g.~rbcL, trnS, cytB, ...) and one or more taxon names. First, a taxonomic classification for the taxa of interest is retrieved from the Taxonomy database on GenBank. The taxonomy will be used for reference sequence calculation and profile alignment. In the second step, GenBank is searched for all available sequences of a marker for the given set of species. The sequences are downloaded and, if more than one accession is available per species, species alignments are build. In a fourth step, these species alignments are searched for those alignments that contain totally identical 'zero-distance' sequences and a consensus sequence, hereafter termed 'reference sequence', is build from the zero-distance alignments. The algorithm considers the reference sequence as some sort of 'idealized' sequence of the genetic marker, to which, in a fifth step, all sequences are compared and their inclusion into the final alignment is decided upon. Knowing that there is no guarantee that all sequence information stored in the database is correct and lacking a priori measures to tell the right from the wrong sequences, the rationale behind this process is that we might be especially confident in the correctness of a sequence, if it is shared by all conspecifics in that database and our confidence might be the greater the more sequences are available for this comparison. Next, all included sequences are aligned and the pairwise-distances of the sequences are calculated. If the maximum genetic distance exceeds a certain threshold, the alignment is iteratively broken into smaller alignment blocks until the condition of the maximum distance threshold is satisfied. The resulting blocks are cleaned and concatenated into the given marker's alignment. Several such alignments can then be concatenated into a supermatrix or analyzed separately with supertree methods.

Required software

megaptera is exclusively written in R and makes extensive use of the existing capabilities of R to deal with phylogenetic data provided by the packages ape, seqinr, and ips.

For sequence alignment and for alignment masking, megaptera uses external software packages, which are usually much faster than pure R code, and we profit in using stable and millionfold-tested software package. Currently, there is only one option for both tasks: sequence alignment is done with mafft^[http://align.bmr.kyushu-u.ac.jp/mafft/software/] [@katohmiyata2002; @katohmiyata2005; @katohstandley2013] and the masking of doubtful alignment position with gblocks^[http://molevol.cmima.csic.es/castresana/Gblocks.html] [@castresana2000; @talaveracastresana2007], but further programs might be included in the future. See the respective websites for installation.

Data management relies on postgreSQL, a popular open relational database, for which an excellent interface with R is available (packages DBI and RPostgreSQL). See the postgreSQL website^[http://www.postgresql.org] for installation and and a basic introduction to SQL. Using a relational database system to store taxonomies, molecular sequences and their metadata might at first seem unnecessarily complicated for users who are used to handling sequence data files in formats like NEXUS, PHYLIP, or FASTA, but the benefits are clear: A database is much easier to maintain and evolve in a consistent fashion due to the powerful standard query language (SQL). In addition, database operations are usually much quicker compared to file-based input/output, a difference that is increasingly important when it comes to build larger and larger datasets.

Define a database connection

As all data storage in megaptera relies on a SQL database, we first have to define a database connection. This is done by creating an object of class dbPars. See ?dbPars for the arguments and their defaults. Having in mind our goal to create a phylogeny of whales and dolphins, we choose to name our database "cetacea". If you decided to use a password when setting up PostgreSQL, you will also have to specify the password argument.

db <- dbPars(dbname = "cetacea")

Download the NCBI Taxonomy

The next step is to download the current snapshot of the Taxonomy database on GenBank via FTP and store it in our postgreSQL database under the name ncbitaxonomy. This is convieniently done by calling ncbiTaxonomy giving our database connection as an argument.

ncbiTaxonomy(db)

You only have to use ncbiTaxonomy once even if you work on several megaptera projects. Then for any specific megaptera project a subset of this taxonomy will be extracted (by stepA) and used for reference sequence calculation and profile alignment downstream the pipeline. It is still a good idea, though, to rerun ncbiTaxonomy after a certain time to make sure your taxonomic classification stays up to date and you do not miss any organims that have newly entered the NCBI databases.