# set global chunk options: images will be 7x5 inches knitr::opts_chunk$set(fig.width=7, fig.height=10, fig.path="figs/") old <- options(digits = 4)
apex implements new classes and methods for analysing DNA sequences from multiple genes. It implements new classes extending object classes from ape and phangorn to store multiple gene data, and some useful wrappers mimicking existing functionalities of these packages for multiple genes. This document provides an overview of the package's content.
To install the development version from github:
library(devtools) install_github("thibautjombart/apex")
The stable version can be installed from CRAN using:
install.packages("apex")
Then, to load the package, use:
library("apex")
Two simple functions permit to import data from multiple alignements into multidna
objects:
read.multidna: reads multiple DNA alignments with various formats
read.multiFASTA: same for FASTA files
Both functions rely on the single-gene counterparts in ape and accept the same arguments. Each file should contain data from a given gene, where sequences should be named after individual labels only. Here is an example using a dataset from apex:
## get address of the file within apex files <- dir(system.file(package="apex"),patter="patr", full=TRUE) ## read these files x <- read.multiFASTA(files) x names(x@dna) # names of the genes oldpar <-par(mar=c(6,11,4,1)) plot(x) par(oldpar)
In addition to the above functions for importing data:
read.multiphyDat: reads multiple DNA alignments with various formats.
The arguments are the same as the single-gene read.phyDat
in phangorn*:
z <- read.multiphyDat(files, format="fasta") z
Two new classes of object extend existing data structures for multiple genes:
multidna: based on ape's DNAbin
class, useful for distance-based trees.
multiphyDat: based on phangorn's phyDat
class, useful for likelihood-based and parsimony trees.
Conversion between these classes can be done using multidna2multiPhydat
and multiPhydat2multidna
.
This formal (S4) class can be seen as a multi-gene extension of ape's DNAbin
class.
Data is stored as a list of DNAbin
objects, with additional slots for extra information.
The class definition can be obtained by:
getClassDef("multidna")
DNAbin
matrices, each corresponding to a given gene/locus, with matching rows (individuals)@dna
)Any of these slots can be accessed using @
, however accessor functions are available for most and are preferred (see examples below).
New multidna
objects can be created via different ways:
new("multidna", ...)
multiphyDat
object using multidna2multiphyDat
We illustrate the use of the constructor below (see ?new.multidna
) for more information.
We use ape's dataset woodmouse, which we artificially split in two 'genes', keeping the first 500 nucleotides for the first gene, and using the rest as second gene. Note that the individuals need not match across different genes: matching is handled by the constructor.
## empty object new("multidna") ## using a list of genes as input data(woodmouse) woodmouse genes <- list(gene1=woodmouse[,1:500], gene2=woodmouse[,501:965]) x <- new("multidna", genes) x ## access the various slots getNumInd(x) # The number of individuals getNumLoci(x) # The number of loci getLocusNames(x) # The names of the loci getSequenceNames(x) # A list of the names of the sequences at each locus getSequences(x) # A list of all loci getSequences(x, loci = 2, simplify = FALSE) # Just the second locus (a single element list) getSequences(x, loci = "gene1") # Just the first locus as a DNAbin object ## compare the input dataset and the new multidna oldpar <- par(mfrow=c(3,1), mar=c(6,6,2,1)) image(woodmouse) image(as.matrix(getSequences(x, 1))) image(as.matrix(getSequences(x, 2))) par(oldpar) ## same but with missing sequences and wrong order genes <- list(gene1=woodmouse[,1:500], gene2=woodmouse[c(5:1,14:15),501:965]) x <- new("multidna", genes) x oldpar <- par(mar=c(6,6,2,1)) plot(x) par(oldpar)
Like multidna
and ape's DNAbin
, the formal (S4) class multiphyDat
is a multi-gene extension of phangorn's phyDat
class.
Data is stored as a list of phyDat
objects, with additional slots for extra information.
The class definition can be obtained by:
getClassDef("multiphyDat")
phyDat
objects, each corresponding to a given gene/locus, with matching rows (individuals); unlike multidna
which is retrained to DNA sequences, this class can store any characters, including amino-acid sequences @dna
)Any of these slots can be accessed using @
(see example below).
As for multidna
, multiphyDat
objects can be created via different ways:
new("multiphyDat", ...)
multidna
object using multiphyDat2multidna
As before, we illustrate the use of the constructor below (see ?new.multiphyDat
) for more information.
data(Laurasiatherian) Laurasiatherian ## empty object new("multiphyDat") ## simple conversion after artificially splitting data into 2 genes genes <- list(gene1=Laurasiatherian[,1:1600], gene2=Laurasiatherian[,1601:3179]) x <- new("multiphyDat", genes, type="DNA") x
Several functions facilitate data handling:
concatenate: concatenate several genes into a single DNAbin
or phyDat
matrix
x[i,j]: subset x by individuals (i) and/or genes (j)
multidna2multiphyDat: converts from multidna
to multiphyDat
multiphyDat2multidna: converts from multiphyDat
to multidna
Example code:
files <- dir(system.file(package="apex"),patter="patr", full=TRUE) ## read files x <- read.multiFASTA(files) x oldpar <- par(mar=c(6,11,4,1)) plot(x) ## subset plot(x[1:3,2:4]) par(oldpar)
## concatenate y <- concatenate(x) y oldpar <- par(mar=c(5,8,4,1)) image(y) par(oldpar) ## concatenate multiphyDat object z <- multidna2multiphyDat(x) u <- concatenate(z) u tree <- pratchet(u, trace=0, all = FALSE) oldpar <- par(mar=c(1,1,1,1)) plot(tree, "u") par(oldpar)
Distance-based trees (e.g. Neighbor Joining) can be obtained for each gene in a multidna
object using getTree
## make trees, default parameters trees <- getTree(x) trees
plot(trees, 4, type="unrooted")
oldpar <- par(mfrow=c(2,2)) for(i in 1:length(trees))plot(trees[[i]], type="unr") par(oldpar)
As an alternative, all genes can be pooled into a single alignment to obtain a single tree using:
## make one single tree based on concatenated genes tree <- getTree(x, pool=TRUE) tree plot(tree, type="unrooted")
It is also possible to use functions from phangorn
to estimate with maximum likelihood trees.
Here is an example using the multiphyDat
object z
created in the previous section:
## input object z ## build trees pp <- pmlPart(bf ~ edge + nni, z, control = pml.control(trace = 0)) pp ## convert trees for plotting trees <- pmlPart2multiPhylo(pp)
plot(trees, 4)
par(mfrow=c(2,2)); for(i in 1:length(trees))plot(trees[[i]])
The following functions enable the export from apex to other packages:
multidna2genind: concatenates genes and export SNPs into a genind
object; alternatively, Multi-Locus Sequence Type (MLST) can be used to treat genes as separate locus and unique sequences as alleles.
multiphyDat2genind: does the same for multiphyDat object
This is illustrated below:
## find source files in apex library(adegenet) files <- dir(system.file(package="apex"),patter="patr", full=TRUE) ## import data x <- read.multiFASTA(files) x ## extract SNPs and export to genind obj1 <- multidna2genind(x) obj1
The MLST option can be useful for a quick diagnostic of diversity amongst individuals. While it is best suited to clonal organisms, we illustrate this procedure using our toy dataset:
obj3 <- multidna2genind(x, mlst=TRUE) obj3 ## alleles of the first locus (=sequences) alleles(obj3)[[1]]
options(old)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.