aphid    R Documentation

The aphid package for analysis with profile hidden Markov models.

Description

aphid is an R package for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. Functions are included for multiple and pairwise sequence alignment, model construction and parameter optimization, calculation of conditional probabilities (using the forward, backward and Viterbi algorithms), tree-based sequence weighting, sequence simulation, and file import/export compatible with the HMMER software package. The package has a wide variety of uses including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification.

Details

The aphid package is based on the algorithms outlined in the book 'Biological sequence analysis: probabilistic models of proteins and nucleic acids' by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison. This book is highly recommended for those wishing to develop a better understanding of HMMs and PHMMs, regardless of prior experience. Many of the examples in the function help pages are taken directly from the book, so that readers can learn to use the package as they work through the chapters.

There are also excellent resources available for those wishing to use profile hidden Markov models outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM.

The aphid package is designed to work in conjunction with the "DNAbin" and "AAbin" object types produced by the ape package (Paradis et al 2004, 2012). The ape package is an essential piece of software for those using R for biological sequence analysis, and provides a binary coding format for nucleotides and amino acids that maximizes memory and speed efficiency. While aphid also works with standard character vectors and matrices, it may not recognize the DNA and amino acid ambiguity codes and therefore is not guaranteed to treat them appropriately.

To maximize speed, the low-level dynamic programming functions such as Viterbi, forward and backward are written in C++ with the help of the Rcpp package (Eddelbuettel & Francois 2011). Note that R versions of these functions are also maintained for the purposes of debugging, experimentation and code interpretation.

Classes

The aphid package creates two primary object classes, "HMM" (hidden Markov models) and "PHMM" (profile hidden Markov models), with the functions deriveHMM and derivePHMM, respectively. These objects are lists consisting of emission and transition probability matrices (denoted E and A), vectors of non-position-specific background emission and transition probabilities (denoted qe and qa) and other model metadata. Objects of class "DPA" (dynamic programming array) are also generated by the Viterbi and forward/backward functions. These are primarily created for succinct console printing.
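
As a rough illustration of this layout, the sketch below stores the dishonest casino model of Durbin et al (1998) as a plain Python dictionary with elements named A and E. It is not aphid code and does not reproduce the package's internal representation; the Begin-state probabilities and the check_model helper are assumptions made for the example.

```python
import math

residues = ["1", "2", "3", "4", "5", "6"]

# Hypothetical stand-in for an "HMM" object: a dict holding A and E.
hmm = {
    # A[i][j]: probability of moving from state i to state j.
    "A": {
        "Begin":  {"Begin": 0.0, "Fair": 0.99, "Loaded": 0.01},  # assumed start probabilities
        "Fair":   {"Begin": 0.0, "Fair": 0.95, "Loaded": 0.05},
        "Loaded": {"Begin": 0.0, "Fair": 0.10, "Loaded": 0.90},
    },
    # E[k][b]: probability that state k emits residue b (Begin is silent).
    "E": {
        "Fair":   {r: 1 / 6 for r in residues},
        "Loaded": {**{r: 0.1 for r in residues[:5]}, "6": 0.5},
    },
}


def check_model(model, tol=1e-9):
    """Return True if every row of A and E sums to 1 within tol."""
    rows = list(model["A"].values()) + list(model["E"].values())
    return all(math.isclose(sum(row.values()), 1.0, abs_tol=tol) for row in rows)


print(check_model(hmm))
```

Each row of A and each row of E is a probability distribution, so a row-sum check of this kind is a cheap sanity test before a model is passed to any dynamic programming routine.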

Functions

Brief descriptions of the primary aphid functions are provided below, with links to their help pages.

File import and export

  • readPHMM parses a HMMER text file into R and creates an object of class "PHMM"

  • writePHMM writes a "PHMM" object to a text file in HMMER v3 format

Visualization

  • plot.HMM plots a "HMM" object as a cyclic directed graph

  • plot.PHMM plots a "PHMM" object as a directed graph with sequential modules consisting of match, insert and delete states

Model building and training

  • deriveHMM builds a "HMM" object from a list of training sequences

  • derivePHMM builds a "PHMM" object from a multiple sequence alignment or a list of non-aligned sequences

  • map optimizes profile hidden Markov model construction using the maximum a posteriori algorithm

  • train optimizes the parameters of a "HMM" or "PHMM" object using a list of training sequences
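
The core idea behind building a profile HMM from an alignment can be sketched in a few lines of Python. This is an illustration only, not the derivePHMM implementation: the toy alignment is made up, the 50% gap rule for choosing match columns is the simple heuristic described by Durbin et al (1998), and add-one (Laplace) pseudocounts are an assumption chosen for brevity. The function names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"
# Made-up toy alignment; "-" marks a gap.
alignment = ["ACG-T", "ACGAT", "A-G-T", "TCG-T"]


def match_columns(aln, max_gap_fraction=0.5):
    """Return indices of columns treated as match states.

    A column is a match column if its fraction of gaps does not
    exceed max_gap_fraction; other columns are treated as inserts.
    """
    n = len(aln)
    return [j for j in range(len(aln[0]))
            if sum(seq[j] == "-" for seq in aln) / n <= max_gap_fraction]


def emission_probs(aln, cols, pseudocount=1.0):
    """Estimate one emission distribution per match column."""
    probs = []
    for j in cols:
        counts = Counter(seq[j] for seq in aln if seq[j] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        probs.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return probs


cols = match_columns(alignment)
for j, dist in zip(cols, emission_probs(alignment, cols)):
    print(j, {a: round(p, 3) for a, p in dist.items()})
```

Pseudocounts keep residues that were never observed in a column from receiving zero probability, which would otherwise make any sequence containing them impossible under the model.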

Sequence alignment and weighting

  • align performs a multiple sequence alignment

  • weight assigns weights to sequences

Conditional probabilities

  • Viterbi finds the optimal path of a sequence through a HMM or PHMM, and returns its log odds or probability given the model

  • forward finds the full probability of a sequence given a HMM or PHMM using the forward algorithm

  • backward finds the full probability of a sequence given a HMM or PHMM using the backward algorithm

  • posterior finds the position-specific posterior probability of a sequence given a HMM or PHMM
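
For readers who want to see what these four calculations involve, the following is a minimal log-space sketch in Python for a standard (non-profile) HMM, using the dishonest casino parameters of Durbin et al (1998). It is not the package's C++ or R code. The start probabilities, the absence of an explicit end state, the example roll sequence and all function signatures are assumptions made for the illustration.

```python
import math

STATES = ["Fair", "Loaded"]
START = {"Fair": 0.99, "Loaded": 0.01}  # assumed start probabilities
TRANS = {"Fair": {"Fair": 0.95, "Loaded": 0.05},
         "Loaded": {"Fair": 0.10, "Loaded": 0.90}}
EMIT = {"Fair": {str(i): 1 / 6 for i in range(1, 7)},
        "Loaded": {**{str(i): 0.1 for i in range(1, 6)}, "6": 0.5}}


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def viterbi(seq):
    """Return (most probable state path, joint log probability of path and seq)."""
    v = {k: math.log(START[k]) + math.log(EMIT[k][seq[0]]) for k in STATES}
    pointers = []
    for sym in seq[1:]:
        prev, v, ptr = v, {}, {}
        for k in STATES:
            best = max(STATES, key=lambda j: prev[j] + math.log(TRANS[j][k]))
            ptr[k] = best
            v[k] = prev[best] + math.log(TRANS[best][k]) + math.log(EMIT[k][sym])
        pointers.append(ptr)
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(pointers):  # trace back from the final state
        path.append(ptr[path[-1]])
    return path[::-1], v[last]


def forward(seq):
    """Return (forward table of log values, log P(seq))."""
    f = [{k: math.log(START[k]) + math.log(EMIT[k][seq[0]]) for k in STATES}]
    for sym in seq[1:]:
        f.append({k: math.log(EMIT[k][sym])
                  + logsumexp([f[-1][j] + math.log(TRANS[j][k]) for j in STATES])
                  for k in STATES})
    return f, logsumexp(list(f[-1].values()))


def backward(seq):
    """Return (backward table of log values, log P(seq))."""
    b = [{k: 0.0 for k in STATES}]  # log(1) at the final position (no end state)
    for sym in reversed(seq[1:]):
        b.insert(0, {k: logsumexp([math.log(TRANS[k][j]) + math.log(EMIT[j][sym])
                                   + b[0][j] for j in STATES])
                     for k in STATES})
    total = logsumexp([math.log(START[k]) + math.log(EMIT[k][seq[0]]) + b[0][k]
                       for k in STATES])
    return b, total


def posterior(seq):
    """Return P(state at position i | seq) for every position i."""
    f, total = forward(seq)
    b, _ = backward(seq)
    return [{k: math.exp(f[i][k] + b[i][k] - total) for k in STATES}
            for i in range(len(seq))]


rolls = list("1234566666666666")  # arbitrary example sequence
path, viterbi_logp = viterbi(rolls)
print(path)
print(viterbi_logp, forward(rolls)[1], backward(rolls)[1])
```

The forward and backward totals should agree up to rounding, since both sum over every possible state path, while the Viterbi value can never exceed them because it scores only the single best path. Working in log space avoids the numerical underflow that products of many small probabilities would otherwise cause on long sequences.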

Sequence simulation

  • generate.HMM simulates a random sequence from an HMM

  • generate.PHMM simulates a random sequence from a PHMM
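
Simulating from a standard HMM amounts to alternately sampling a state and a residue. The Python sketch below shows the idea for the dishonest casino model of Durbin et al (1998); it is not the generate.HMM implementation, and the simulate function, its arguments and the start probabilities are assumptions made for the example.

```python
import random

STATES = ["Fair", "Loaded"]
RESIDUES = ["1", "2", "3", "4", "5", "6"]
START = {"Fair": 0.99, "Loaded": 0.01}  # assumed start probabilities
TRANS = {"Fair": {"Fair": 0.95, "Loaded": 0.05},
         "Loaded": {"Fair": 0.10, "Loaded": 0.90}}
EMIT = {"Fair": {r: 1 / 6 for r in RESIDUES},
        "Loaded": {**{r: 0.1 for r in RESIDUES[:5]}, "6": 0.5}}


def simulate(length, seed=None):
    """Return (residues, hidden state path), each of the given length."""
    rng = random.Random(seed)  # private generator so runs are reproducible
    weights = START            # distribution over the next state
    residues, path = [], []
    for _ in range(length):
        state = rng.choices(STATES, weights=[weights[k] for k in STATES])[0]
        residue = rng.choices(RESIDUES,
                              weights=[EMIT[state][r] for r in RESIDUES])[0]
        path.append(state)
        residues.append(residue)
        weights = TRANS[state]  # next state depends only on the current one
    return residues, path


rolls, states = simulate(30, seed=1)
print("".join(rolls))
print("".join(s[0] for s in states))  # F = Fair, L = Loaded
```

A profile HMM differs in that delete states emit nothing and insert states can repeat, so the length of a simulated sequence is not fixed in advance.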

Datasets

  • substitution contains a collection of DNA and amino acid substitution matrices from NCBI, including the PAM, BLOSUM, GONNET, DAYHOFF and NUC matrices

  • casino contains data from the dishonest casino example of Durbin et al (1998), chapter 3.2

  • globins contains the small globin alignment from Durbin et al (1998), Figure 5.3

Author(s)

Shaun Wilkinson

References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Eddelbuettel D, Francois R (2011) Rcpp: seamless R and C++ integration. Journal of Statistical Software 40, 1-18.

Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 39, W29-W37. http://hmmer.org/.

HMMER: biosequence analysis using profile hidden Markov models. http://www.hmmer.org.

NCBI index of substitution matrices. ftp://ftp.ncbi.nih.gov/blast/matrices/.

Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290.

Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.


aphid documentation built on Dec. 5, 2022, 9:06 a.m.