Sequences: Class "Sequences"
In peplib: Peptide Library Analysis Methods


This is a small extension of a matrix representation of the sequences. The sequences are represented as integers, where each integer corresponds to a character type from the alphabet. For example, if the sequence is ADC, and the alphabet is ['A', 'B', 'C', 'D'], the sequence will be [1,4,3]. The matrix itself has each sequence as a row, and the alphabet slot contains the key that shows how the integers correspond to the characters in the sequence.
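The integer encoding described above can be sketched in base R. This is only an illustration and does not use peplib. The helper names encode_sequence and decode_sequence and the sample sequences are hypothetical. Only the alphabet and the ADC example come from the text.

```r
# Hypothetical helpers (not part of peplib) illustrating the encoding.
alphabet <- c("A", "B", "C", "D")

# Split a string into single characters and look up each one's
# position in the alphabet.
encode_sequence <- function(s, alphabet) {
  match(strsplit(s, "")[[1]], alphabet)
}

# Map integer codes back to characters and join them into one string.
decode_sequence <- function(codes, alphabet) {
  paste(alphabet[codes], collapse = "")
}

# Made-up sequences of equal length; "ADC" is the example from the text.
seqs <- c("ADC", "BBA", "DCA")

# Build a matrix with one sequence per row, mirroring the layout
# described for the .Data slot.
encoded <- t(vapply(seqs, encode_sequence, integer(3), alphabet = alphabet))

print(encoded)
print(decode_sequence(encoded[1, ], alphabet))
```

Keeping the alphabet next to the matrix is what makes the integers interpretable, which is why the class stores it in its own slot.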

Objects can be created by calls of the form new("Sequences", data, nrow, ncol, byrow, dimnames, ...).

.Data: Object of class "matrix". The sequences, where each row is a sequence and each sequence is a series of integers.
alphabet: Object of class "character". The character representation of each integer.
nseqs: The number of unique sequences in the class.

Class "matrix", from data part. Class "array", by class "matrix", distance 2. Class "structure", by class "matrix", distance 3. Class "vector", by class "matrix", distance 4, with explicit coerce.

dist: dist(seqs, method="substitution", params=default.MetricParams,...): Calculates the sequence-to-sequence distance matrix. Use method="substitution" to weight sequence mutations with a substitution matrix, or method="hamming" to weight all mutations equally. Pass params=aMetricParams to use a substitution matrix other than the default. Each substitution is given a weight of 1 under the hamming method, or the score from the corresponding substitution matrix under the substitution method. The resulting matrix is converted to a dissimilarity matrix by negating every element and adding the maximum score/weight.
plot: plot(seqs, clusterNumber=4, params=default.MetricParams, distanceMatrix=dist.Sequences(seqs, params=params), clusters=aclust(dmat, clusterNumber)): Draws a summary plot of the sequences. Each point represents a sequence, and the points are plotted as a projection onto their two principal components, as found from the distance matrix. The points are also colored by cluster, using the given cluster number and the kmeans algorithm from the stats package. This method provides a quick way to estimate the number of clusters in the sequences and to look for simple patterns in the data. It can also be used to compare substitution matrices and see which one best separates the data. For example, a BLOSUM90 substitution matrix may work well for very similar sequences, whereas a BLOSUM50 substitution matrix will work better for very different sequences. Both the distance matrix and the clusters may be supplied explicitly.
rbind: rbind(seq1, seq2): Overrides the standard rbind method so that the alphabet is passed along. Note that most matrix editing methods do not return a Sequences object by default; this rbind method is an exception.
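The dissimilarity conversion described for dist and the projection-plus-clustering idea described for plot can be sketched with base R and the stats package alone. This is an illustration under stated assumptions, not peplib's implementation. The score matrix is made up, classical multidimensional scaling (cmdscale) stands in for the projection onto two principal components, and two clusters are used because the toy data has only five points.

```r
# Hypothetical pairwise similarity scores for five sequences (made-up
# numbers; higher means more similar). How peplib accumulates these
# scores is not shown here.
scores <- matrix(c(
  10,  8,  2,  2,  1,
   8, 10,  3,  3,  2,
   2,  3, 10,  9,  7,
   2,  3,  9, 10,  8,
   1,  2,  7,  8, 10
), nrow = 5, byrow = TRUE,
dimnames = list(paste0("seq", 1:5), paste0("seq", 1:5)))

# Convert scores to dissimilarities as described in the text:
# negate every element, then add the maximum score.
dissimilarity <- -scores + max(scores)
d <- as.dist(dissimilarity)

# Assumption: classical MDS is used here as the two-dimensional
# projection derived from the distance matrix.
coords <- cmdscale(d, k = 2)

# Cluster the projected points with kmeans from the stats package.
set.seed(1)
clusters <- kmeans(coords, centers = 2, nstart = 10)$cluster

# One point per sequence, colored by cluster.
plot(coords, col = clusters, pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
```

After the conversion, identical sequences sit at dissimilarity 0 and the least similar pair sits at the largest value, which is the form that as.dist, cmdscale, and hclust expect.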

The gap character is always assumed to be the last character in the alphabet slot. Do not change this convention, since the distance method relies on it. Not all data.frame manipulation methods have been overridden, so you may get a data.frame back instead of a Sequences object. Only rbind and array access have been overridden.

Andrew White

read.sequences, which allows you to create Sequences objects from a file; descriptors, which creates a Descriptors object for a Sequences object.

##load example data and plot it
data(TULASequences)
plot(TULASequences)

## Access all sequences which have a 4 in position 1
print(TULASequences[TULASequences[,1] == 4,])

## Access all sequences which have a tyrosine residue in position 1 and
## plot them with their clusters

TULASequences.subset <- TULASequences[TULASequences[,1] == which(TULASequences@alphabet == 'Y'),]
plot(TULASequences.subset)

## Calculate the distance matrix on this subset and use agglomerative
##  clustering to plot it

TULA.dmatrix <- dist(TULASequences.subset)
TULA.hclusters <- hclust(TULA.dmatrix)
plot(TULA.hclusters)