dis: Dissimilarity score
In bios2mds: From Biological Sequences to Multidimensional Scaling

Description Usage Arguments Details Value Author(s) References Examples

Computes the dissimilarity score between two aligned amino acid sequences, based on substitution matrices.

1	dis(seq1, seq2, sub.mat.id = "PAM250", gap = NULL)

`seq1`	a character vector representing a first amino acid sequence.
`seq2`	a character vector representing a second amino acid sequence.
`sub.mat.id`	a string of characters indicating the amino acid substitution matrix to be taken into account for calculation. This should be one of "PAM40", "PAM80", "PAM120", "PAM160", "PAM250", "BLOSUM30", "BLOSUM45", "BLOSUM62", "BLOSUM80", "GONNET", "JTT", "JTT_TM" and "PHAT". Default is PAM250. See `sub.mat`.
`gap`	a numeric vector of length 2, indicating the penalty for initiating and extending a strand of two gaps (gap ignored, default).

Grishin and Grishin (2002) developped a method to calculate the similarity score with amino acid substitution matrices.

Let s be an amino acid substitution matrix with elements s(a, b), let A be an alignment of n sequences, A_{ik} is a symbol (amino acid or gap: '-') in the site k of the sequence i. For each pair of sequences i and j from A, the following equations(1), (2), (3) and (4) are calculated as follows:

(1)

S_{ij}, called score per site, is obtained as:

S_{ij} = ∑_{k \in K_{ij}}s(A_{ik}, A_{jk})/l(K_{ij})

where K_{ik} is the set of sites k such that A_{ik} != '-' and A_{jk} != '-' and l(K_{ij}) is the number of elements in K_{ij}.

(2)

T_{ij}, called average upper limit of the score per site, is obtained as:

T_{ij} = 0.5 ∑_{k \in K_{ij}}(s(A_{ik}, A_{ik}) + s(A_{jk}, A_{jk}))/l(K_{ij})

(3)

S^{rand}_{ij}, called score per site expected from random sequences, is obtained as:

S^{rand}_{ij} = ∑^{20}_{a=1}∑^{20}_{b=1}f^i_j(a)f^j_i(b)s(a, b)

where f^i_j(a) is the frequency of amino acid 'a' in i^{th} protein sequence of A over all sites in K_{ij}.

(4)

V_{ij}, called normalized score (Feng and Doolittle, 1997), is obtained as:

V_{ij} = \frac{S_{ij} - S^{rand}_{ij}}{T_{ij} - S^{rand}_{ij}}

The normalized score V_{ij} ranges from 0 (for random sequences) to 1 (for identical sequences). However, for very divergent sequences, V_{ij} can become negative due to statistical errors. In this case, dis attributes 0 to negative scores.

The dissimilarity score D_{ij} between sequences i and j is obtained from the similarity score as:

D_{ij} = V_{ij} - 1

A single numeric value representing the dissimilarity score.

Julien Pele

Grishin VN and Grishin NV (2002) Euclidian space and grouping of biological objects. Bioinformatics 18:1523-1534.

Feng DF and Doolittle RF (1997) Converting amino acid alignment scores into measures of evolutionary time: a simulation study of various relationships. J Mol Evol 44:361-370.

# calculating dis between the sequences of CLTR1_HUMAN and CLTR2_HUMAN:
aln <- import.fasta(system.file("msa/human_gpcr.fa", package = "bios2mds"))
dis <- dis(aln$CLTR1_HUMAN, aln$CLTR2_HUMAN)
dis

Loading required package: amap
Loading required package: e1071
Loading required package: scales
Loading required package: cluster
Loading required package: rgl
Warning messages:
1: In rgl.init(initValue, onlyNULL) : RGL: unable to open X11 display
2: 'rgl_init' failed, running with rgl.useNULL = TRUE 
3: .onUnload failed in unloadNamespace() for 'rgl', details:
  call: fun(...)
  error: object 'rgl_quit' not found 
[1] 0.423