aphid (R Documentation)
aphid is an R package for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. Functions are included for multiple and pairwise sequence alignment, model construction and parameter optimization, calculation of conditional probabilities (using the forward, backward and Viterbi algorithms), tree-based sequence weighting, sequence simulation, and file import/export compatible with the HMMER software package. The package has a wide variety of uses including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification.
The aphid package is based on the algorithms outlined in the book 'Biological sequence analysis: probabilistic models of proteins and nucleic acids' by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison. This book is highly recommended for those wishing to develop a better understanding of HMMs and PHMMs, regardless of prior experience. Many of the examples in the function help pages are taken directly from the book, so that readers can learn to use the package as they work through the chapters.
There are also excellent resources available for those wishing to use profile hidden Markov models outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM.
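As a minimal sketch of this round trip, the following writes a profile HMM to a temporary file in HMMER3 text format and reads it back. It assumes the aphid package is installed and uses the globins alignment bundled with it; the temporary file name is arbitrary and chosen only for illustration.

```r
library(aphid)

# Small globin alignment shipped with the package (Durbin et al 1998).
data(globins)

# Build a profile HMM from the amino acid alignment.
model <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)

# Write the model to a temporary file in HMMER3 text format...
hmm_file <- tempfile(fileext = ".hmm")
writePHMM(model, file = hmm_file)

# ...and parse the file back into an object of class "PHMM".
model2 <- readPHMM(hmm_file)
class(model2)
```

A file written this way should also be readable by the HMMER command-line tools, although that is not checked here.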
The aphid package is designed to work in conjunction with the "DNAbin" and "AAbin" object types produced by the ape package (Paradis et al 2004; Paradis 2012). This is an essential piece of software for those using R for biological sequence analysis, and provides a binary coding format for nucleotides and amino acids that maximizes memory and speed efficiency. While aphid also works with standard character vectors and matrices, it may not recognize the DNA and amino acid ambiguity codes and therefore is not guaranteed to treat them appropriately.
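To illustrate the binary format, here is a small sketch that converts a character matrix to a "DNAbin" object with ape. The toy alignment and its sequence names are invented for this example.

```r
library(ape)

# A toy alignment as a character matrix (rows are sequences, columns are sites).
# "r" is the IUPAC ambiguity code for a purine (A or G); "-" is a gap.
aln <- rbind(
  seq1 = c("a", "c", "g", "t", "a"),
  seq2 = c("a", "c", "r", "t", "-"),
  seq3 = c("a", "t", "g", "t", "a")
)

# Convert to ape's binary format; each residue is stored as a single raw byte.
dna <- as.DNAbin(aln)
class(dna)  # "DNAbin"
dim(dna)    # 3 sequences by 5 sites

# Convert back to characters to confirm that the ambiguity code is preserved.
as.character(dna)["seq2", 3]
```

An object built this way carries the ambiguity information explicitly, which is what lets downstream functions handle codes such as "r" rather than treating them as unknown symbols.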
To maximize speed, the low-level dynamic programming functions such as Viterbi, forward and backward are written in C++ with the help of the Rcpp package (Eddelbuettel & Francois 2011). Note that R versions of these functions are also maintained for the purposes of debugging, experimentation and code interpretation.
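The sketch below compares the two implementations on the dishonest casino data. It assumes that the switch between the compiled and pure-R code paths is the logical `cpp` argument of Viterbi; check the function's help page if your installed version differs.

```r
library(aphid)

# Labelled die rolls; the names of the vector give the true hidden states.
data(casino)

# Estimate an HMM from the labelled sequence (probabilities, not logs).
model <- deriveHMM(list(casino), logspace = FALSE)

# Fill the dynamic programming array with the compiled C++ code...
vit_cpp <- Viterbi(model, casino, cpp = TRUE)

# ...and again with the pure-R implementation (slower, easier to step through).
vit_r <- Viterbi(model, casino, cpp = FALSE)

# The two code paths should agree on the score of the optimal path.
all.equal(vit_cpp$score, vit_r$score)
```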
The aphid package creates two primary object classes, "HMM" (hidden Markov models) and "PHMM" (profile hidden Markov models), with the functions deriveHMM and derivePHMM, respectively. These objects are lists consisting of emission and transition probability matrices (denoted E and A), vectors of non-position-specific background emission and transition probabilities (denoted qe and qa) and other model metadata. Objects of class "DPA" (dynamic programming array) are also generated by the Viterbi and forward/backward functions. These are primarily created for succinct console printing.
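A short sketch of how these three classes might be inspected, using the casino and globins datasets bundled with the package. The exact set of list elements beyond A, E, qa and qe is not spelled out here, so the code simply prints whatever names it finds.

```r
library(aphid)
data(casino)
data(globins)

# An "HMM" object derived from a labelled training sequence.
hmm <- deriveHMM(list(casino))
class(hmm)
names(hmm)   # includes the transition (A) and emission (E) matrices

# A "PHMM" object derived from a multiple sequence alignment.
phmm <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)
class(phmm)
dim(phmm$A)  # transition probability matrix
dim(phmm$E)  # emission probability matrix
phmm$qe      # background emission probabilities
phmm$qa      # background transition probabilities

# A "DPA" object returned by one of the dynamic programming functions.
dpa <- Viterbi(hmm, casino)
class(dpa)
dpa$score    # score of the optimal path
```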
A brief description of each of the primary aphid functions and datasets is provided below.
readPHMM: parses a HMMER text file into R and creates an object of class "PHMM"
writePHMM: writes a "PHMM" object to a text file in HMMER v3 format
plot.HMM: plots an "HMM" object as a cyclic directed graph
plot.PHMM: plots a "PHMM" object as a directed graph with sequential modules consisting of match, insert and delete states
deriveHMM: builds an "HMM" object from a list of training sequences
derivePHMM: builds a "PHMM" object from a multiple sequence alignment or a list of non-aligned sequences
map: optimizes profile hidden Markov model construction using the maximum a posteriori algorithm
train: optimizes the parameters of an "HMM" or "PHMM" object using a list of training sequences
align: performs a multiple sequence alignment
weight: assigns weights to sequences
Viterbi: finds the optimal path of a sequence through an HMM or PHMM, and returns its log odds or probability given the model
forward: finds the full probability of a sequence given an HMM or PHMM using the forward algorithm
backward: finds the full probability of a sequence given an HMM or PHMM using the backward algorithm
posterior: finds the position-specific posterior probability of a sequence given an HMM or PHMM
generate.HMM: simulates a random sequence from an HMM
generate.PHMM: simulates a random sequence from a PHMM
substitution: a collection of DNA and amino acid substitution matrices from NCBI, including the PAM, BLOSUM, GONNET, DAYHOFF and NUC matrices
casino: data from the dishonest casino example of Durbin et al (1998) chapter 3.2
globins: small globin alignment data from Durbin et al (1998) Figure 5.3
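The following sketch strings several of these functions together on the dishonest casino example. The model is assembled by hand as a list with class "HMM"; the emission and between-state transition probabilities follow the Durbin et al (1998) example, while the Begin-state probabilities (0.99 and 0.01) are assumptions made here for illustration.

```r
library(aphid)
data(casino)

states <- c("Begin", "Fair", "Loaded")
residues <- paste(1:6)

# Transition matrix A (rows: from, columns: to).
# The Begin row is an illustrative assumption; no state returns to Begin.
A <- matrix(c(0, 0.99, 0.01,
              0, 0.95, 0.05,
              0, 0.10, 0.90),
            nrow = 3, byrow = TRUE,
            dimnames = list(from = states, to = states))

# Emission matrix E: the fair die is uniform, the loaded die favours six.
E <- matrix(c(rep(1/6, 6),
              rep(1/10, 5), 1/2),
            nrow = 2, byrow = TRUE,
            dimnames = list(states = states[-1], residues = residues))

model <- structure(list(A = A, E = E), class = "HMM")

# Most probable state path (path indices start at zero, hence the + 1).
vit <- Viterbi(model, casino)
predicted <- c("Fair", "Loaded")[vit$path + 1]
mean(predicted == names(casino))  # proportion of rolls labelled correctly

# Full probability of the observed rolls, summed over all state paths.
fwd <- forward(model, casino)
fwd$score

# Position-specific posterior state probabilities (one row per state).
post <- posterior(model, casino)
dim(post)
```

The same calls apply to a "PHMM" object, with derivePHMM in place of the hand-built model.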
Shaun Wilkinson
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Eddelbuettel D, Francois R (2011) Rcpp: seamless R and C++ integration. Journal of Statistical Software 40, 1-18.
Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 39, W29-W37. http://hmmer.org/.
HMMER: biosequence analysis using profile hidden Markov models. http://www.hmmer.org.
NCBI index of substitution matrices. ftp://ftp.ncbi.nih.gov/blast/matrices/.
Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290.
Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.