Descriptors-class: Class "Descriptors"
In peplib: Peptide Library Analysis Methods


The descriptors class extends the data.frame class. In addition to the descriptors themselves, it stores any response data and the p-values that describe how the sequences differ from the space of possible sequences. Objects should be created by a call to descriptors (see arguments and details below) or simpleDescriptors.

Objects can be created by calls of the form descriptors(seqs, response=numeric(0), base.frame=NA, do.var=TRUE, alags=c(1,2,3), do.mean=TRUE, do.counts=TRUE, do.position=TRUE, alphabet=seqs@alphabet, include.statistics=TRUE, accuracy=0.01)

seqs: A Sequences object
response: An optional array containing responses for each sequence. nrow(seqs) should be equal to length(response).
base.frame: A data.frame containing descriptors calculated on each amino acid. See details.
do.var: Calculate the additional descriptors which are the variance of single residue descriptors along the sequence
alags: The lags at which the autocorrelation of the single residue descriptors along the sequence is calculated. See details.
do.mean: Calculate the mean of the single residue descriptors along the sequence
do.counts: Provide descriptors of various counts, like the number of each residue type
do.position: Provide position specific descriptors
alphabet: The alphabet to use for calculating counts.
include.statistics: If TRUE, the function will calculate the p-values of the descriptors. See details.
accuracy: The accuracy of the computed statistics on the descriptors

The descriptor calculation methods used here are not as sophisticated as those provided in some of the more complete QSAR packages. Instead, they rely on combining descriptors calculated on single amino acids in various ways. There are two reasons for this. First, the descriptors can be calculated quickly, without relying on another program. Second, it is easier to calculate the distribution of the descriptors over the sequence space. Whether that distribution can be calculated in practice depends on the number of descriptors and the chain length of the sequences. The advantage of knowing the descriptors on the whole sequence space is that it is easy to determine whether a descriptor on the sequences is significant. For example, if the number of hydrogen bond donors is three standard deviations above the mean number of hydrogen bond donors over all of sequence space, then that is a significant descriptor. This is expressed as a p-value, which is calculated with wilcox.test (the Wilcoxon rank-sum test), a non-parametric alternative to Student's t-test.

The calculations are based on the given base.frame parameter. Given that table, which contains the descriptors calculated on all the individual amino acids, it is possible to calculate many sequence level descriptors. If the means are being calculated (do.mean=TRUE), then the mean of each descriptor along each sequence is calculated. This doubles the number of descriptors. The same is true of do.var, which uses the variance along the sequence. The autocorrelation function can also be calculated along the chain (see alags), again increasing the number of resulting descriptors. This may be interesting for describing alternating patterns. The position specific descriptors are simply the individual descriptors at a certain position, for example the number of hydrogen bond donors at position 2.
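The combinations described above can be sketched in base R. This is only an illustration and not the code used by descriptors: the table base, its values, and the helper sequence_descriptors are hypothetical, and the autocorrelation is assumed to be a simple average of lagged products, which may differ from the definition used in the package.

```r
# Hypothetical per-residue table: one row per amino acid, one column per
# descriptor. The values are invented and are not those of defaultBaseMatrix.
base <- data.frame(
  hbond.donors = c(0, 1, 1, 0),
  charge       = c(0, 1, 0, -1),
  row.names    = c("A", "K", "S", "D")
)

# Combine per-residue descriptors into sequence level descriptors.
sequence_descriptors <- function(seq, base, lags = c(1, 2), position = 2) {
  residues <- strsplit(seq, "")[[1]]
  # One row per position in the sequence, one column per descriptor
  x <- as.matrix(base[residues, , drop = FALSE])
  n <- nrow(x)

  # Mean and variance of each descriptor along the sequence
  out <- c(
    setNames(colMeans(x), paste0("mean.", colnames(x))),
    setNames(apply(x, 2, var), paste0("var.", colnames(x)))
  )

  # Assumed autocorrelation: average product of values `lag` positions apart
  for (lag in lags) {
    ac <- colMeans(x[1:(n - lag), , drop = FALSE] *
                   x[(1 + lag):n, , drop = FALSE])
    out <- c(out, setNames(ac, paste0("ac", lag, ".", colnames(x))))
  }

  # Position specific descriptors: the per-residue values at one position
  c(out, setNames(x[position, ], paste0("pos", position, ".", colnames(x))))
}

d <- sequence_descriptors("KSKA", base)
round(d, 3)
```

In this toy sequence the charge alternates between 1 and 0, so the lag 2 term for charge is larger than the lag 1 term, which is the kind of alternating pattern the autocorrelation descriptors are meant to pick up.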

Often one is more interested in understanding what is common amongst the active sequences. This may be done by comparing a descriptor on the active sequences to the same descriptor on the inactive sequences. Since inactive sequences are rarely collected in peptide libraries, we may approximate the inactive sequences by all sequences. This assumption only holds if the number of active sequences is low relative to the sequence diversity. This is often the case but must be verified during the experiment. With this assumption, p-values may be calculated for each descriptor. These p-values do not assume normality and are a measure of the overlap between the active sequences and the inactive sequences. They are calculated using a Wilcoxon rank-sum test (wilcox.test). A low p-value is considered significant, and such a descriptor may be considered to be related to activity. Remember that a descriptor may be important in connection with a motif, so it is important to do both descriptor analysis and motif discovery. Setting include.statistics=TRUE will calculate the p-values for each of the descriptors. This is only practical for shorter sequences, with lengths less than 10.
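The comparison between active sequences and the whole sequence space can be sketched with wilcox.test from base R. Everything in this example is hypothetical: the four letter alphabet, the donor counts, and the list of active sequences are invented, and the way descriptors computes its own p-values may differ in detail.

```r
# Hypothetical alphabet and per-residue hydrogen bond donor counts
alphabet <- c("A", "K", "S", "D")
donors <- c(A = 0, K = 1, S = 1, D = 0)

# Sequence level descriptor: mean number of donors along the sequence
mean_donors <- function(seq) mean(donors[strsplit(seq, "")[[1]]])

# Invented "active" sequences, enriched in the donor residues K and S
active <- c("KSKS", "KKSA", "SKSK", "KSSK", "SKKD", "KSKA")

# Every sequence of length 4 over the alphabet (4^4 = 256 sequences),
# used as the approximation of the inactive sequences
grid <- expand.grid(rep(list(alphabet), 4), stringsAsFactors = FALSE)
space <- apply(grid, 1, paste, collapse = "")

active.values <- vapply(active, mean_donors, numeric(1))
space.values <- vapply(space, mean_donors, numeric(1))

# Wilcoxon rank-sum test; exact = FALSE requests the normal approximation,
# since the descriptor values contain many ties
test <- wilcox.test(active.values, space.values, exact = FALSE)
test$p.value
```

Enumerating the whole space is only feasible here because the alphabet and the sequences are tiny. With 20 amino acids the space grows as 20^length, which illustrates why the statistics are only practical for short sequences.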

If base.frame is NA, then the default will be used, defaultBaseMatrix. See the documentation on that dataset for more information.

.Data: Object of class "list". The descriptors as a data.frame. Each row is the descriptor set for a single sequence.
response: Object of class "numeric". An optional numeric array containing responses for the sequences.
names: Object of class "character". The descriptor names (inherited from data.frame).
row.names: Object of class "data.frameRowLabels".
.S3Class: Object of class "character".
pvalues: Object of class "numeric". An optional array containing estimated p-values for each descriptor. The p-value represents how different the descriptor set is as compared to a set of random peptides of the same length WITHOUT GAPS.

Class "data.frame", directly. Class "list", by class "data.frame", distance 2. Class "oldClass", by class "data.frame", distance 2. Class "vector", by class "data.frame", distance 3.

Andrew White

Sequences, defaultBaseMatrix, wilcox.test

library(peplib)

# calculate some descriptors
data(SHP2Sequences)

# turn off most of the descriptors so it goes fast
SHP2desc <- descriptors(SHP2Sequences, do.var=FALSE,
                        alags=c(), do.mean=TRUE, do.counts=FALSE,
                        do.position=FALSE, include.statistics=FALSE)

# get some descriptors and response sets
data(AMPSequences)
data(AMPSequences.response)

# use the first column of the response data as the response
AMPdesc <- descriptors(AMPSequences, response=AMPSequences.response[,1],
                       do.var=FALSE, alags=c(), do.mean=TRUE, do.counts=FALSE,
                       do.position=FALSE, include.statistics=FALSE)