weight: Sequence weighting.

weight    R Documentation

Sequence weighting.

Description

Weighting schemes for DNA and amino acid sequences.

Usage

weight(x, ...)

## S3 method for class 'DNAbin'
weight(x, method = "Henikoff", k = 5, ...)

## S3 method for class 'AAbin'
weight(x, method = "Henikoff", k = 5, ...)

## S3 method for class 'list'
weight(x, method = "Henikoff", k = 5, residues = NULL, gap = "-", ...)

## S3 method for class 'dendrogram'
weight(x, method = "Gerstein", ...)

## Default S3 method:
weight(x, method = "Henikoff", k = 5, residues = NULL, gap = "-", ...)

Arguments

x

a list or matrix of sequences (usually a "DNAbin" or "AAbin" object). Alternatively, x can be an object of class "dendrogram" for tree-based weighting.

...

additional arguments to be passed between methods.

method

a character string indicating the weighting method to be used. Two methods are currently available: a modified version of the maximum entropy weighting scheme proposed by Henikoff and Henikoff (1994) (method = "Henikoff"), and the tree-based weighting scheme of Gerstein et al. (1994) (method = "Gerstein").

k

integer representing the k-mer size to be used. Defaults to 5. Note that higher values of k may be slow to compute and use excessive memory, because the number of possible k-mers grows exponentially with k.

residues

either NULL (default; emitted residues are automatically detected from the sequences), a case-sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Furthermore, the default setting residues = NULL cannot detect residues that are absent from the sequences, and thus will not assign them emission probabilities in the model. Specifying the residue alphabet is therefore recommended unless x is a "DNAbin" or "AAbin" object.

gap

the character used to represent gaps in the alignment matrix (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.
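
The note on k above can be made concrete with a little arithmetic. The sketch below only counts how many distinct k-mers are possible for a 4-letter (DNA) and a 20-letter (amino acid) alphabet. It assumes that cost scales with the size of this k-mer space; the package may store k-mer information more compactly, and kmer_space is a hypothetical helper, not part of aphid.

```r
# Hypothetical helper (not part of aphid): number of distinct k-mers
# that can be formed from an alphabet of the given size.
kmer_space <- function(k, alphabet_size) alphabet_size^k

k <- 3:7
growth <- data.frame(
  k   = k,
  DNA = kmer_space(k, 4),   # nucleotide alphabet
  AA  = kmer_space(k, 20)   # amino acid alphabet
)
print(growth)
```

At the default k = 5 there are 1024 possible DNA k-mers but 3.2 million possible amino acid k-mers, which is why large k is far more demanding for "AAbin" input.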

Details

This is a generic function. Methods are available for "dendrogram" objects, "DNAbin" and "AAbin" sequence objects (as lists or matrices), and sequences in standard character format provided as either lists or matrices.

If method = "Henikoff", the sequences are weighted using a modified version of the maximum entropy method proposed by Henikoff and Henikoff (1994). The modification is that the weights are calculated from a k-mer presence/absence matrix rather than from an alignment, as in the original description.

If method = "Gerstein", the agglomerative method of Gerstein et al. (1994) is used to weight sequences based on their relatedness as derived from a phylogenetic tree. Unless x is already a "dendrogram" object, a dendrogram is first derived using the cluster function in the kmer package.
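
To illustrate the idea behind method = "Henikoff", here is a minimal base-R sketch that applies the position-based rule of Henikoff and Henikoff (1994) to the columns of a k-mer presence/absence matrix. It is not the aphid implementation, whose exact modification is not spelled out here. kmer_presence and henikoff_sketch are hypothetical helpers, sequences are assumed to be plain character strings at least k characters long, and k = 2 is used only to keep the toy example small.

```r
# Hypothetical helper: binary matrix with one row per sequence and one
# column per k-mer observed anywhere in the input.
kmer_presence <- function(seqs, k = 2) {
  kmers <- lapply(seqs, function(s) {
    starts <- seq_len(max(0, nchar(s) - k + 1))
    unique(substring(s, starts, starts + k - 1))
  })
  all_kmers <- sort(unique(unlist(kmers)))
  m <- do.call(rbind, lapply(kmers, function(x) as.integer(all_kmers %in% x)))
  dimnames(m) <- list(names(seqs), all_kmers)
  m
}

# Hypothetical helper: in each column a sequence earns 1 / (r * s), where
# r is the number of distinct states in the column and s is the number of
# sequences sharing this sequence's state. Scores are summed over columns.
henikoff_sketch <- function(m) {
  n <- nrow(m)
  w <- numeric(n)
  for (j in seq_len(ncol(m))) {
    counts <- table(m[, j])
    r <- length(counts)
    s <- as.numeric(counts[as.character(m[, j])])
    w <- w + 1 / (r * s)
  }
  w <- w * n / sum(w)  # rescale so that the average weight is 1
  names(w) <- rownames(m)
  w
}

seqs <- c(a = "ACGT", b = "ACGT", c = "ACGA")
w <- henikoff_sketch(kmer_presence(seqs, k = 2))
print(w)
```

The two identical sequences share their score in every column and so end up with a lower weight than the sequence carrying a unique k-mer, which is the behaviour a weighting scheme is meant to produce.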

For further details on sequence weighting schemes see Durbin et al (1998) chapter 5.8.
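
The tree-based scheme can be sketched in the same spirit. The code below follows the description in Durbin et al. (1998): each leaf starts with the length of its own edge, and moving up the tree, the length of the edge above each internal node is shared among the leaves below it in proportion to their current weights. gerstein_sketch is a hypothetical helper, not the aphid method. It assumes a "dendrogram" whose leaves sit at height 0 and whose branch lengths are strictly positive, and the tree is built here with hclust purely for illustration.

```r
# Hypothetical helper: returns unnormalized weights for the leaves
# below `node`. `parent_height` is the height of the node's parent
# (for the root it defaults to the root's own height, so the edge is 0).
gerstein_sketch <- function(node, parent_height = attr(node, "height")) {
  edge <- parent_height - attr(node, "height")
  if (is.leaf(node)) {
    return(setNames(edge, as.character(attr(node, "label"))))
  }
  w <- numeric(0)
  for (i in seq_along(node)) {
    w <- c(w, gerstein_sketch(node[[i]], parent_height = attr(node, "height")))
  }
  # share the edge above this node among its leaves, pro rata
  w + edge * w / sum(w)
}

# Toy example: a and b are close together, c is distant from both.
d <- dist(c(a = 0, b = 1, c = 10))
tree <- as.dendrogram(hclust(d, method = "average"))
w <- gerstein_sketch(tree)
w <- w * length(w) / sum(w)  # rescale so that the average weight is 1
print(w[c("a", "b", "c")])
```

Identical sequences would give zero-length edges and a division by zero in this sketch, so a real implementation needs to handle that case explicitly.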

Value

a named vector of weights, the sum of which is equal to the total number of sequences (average weight = 1).

Author(s)

Shaun Wilkinson

References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Gerstein M, Sonnhammer ELL, Chothia C (1994) Volume changes in protein evolution. Journal of Molecular Biology, 236, 1067-1078.

Henikoff S, Henikoff JG (1994) Position-based sequence weights. Journal of Molecular Biology, 243, 574-578.

Examples

  ## weight the sequences in the woodmouse dataset from the ape package
  library(ape)
  library(aphid)
  data(woodmouse)
  woodmouse.weights <- weight(woodmouse)
  ## named numeric vector with one weight per sequence
  woodmouse.weights

aphid documentation built on Dec. 5, 2022, 9:06 a.m.