Scoring matrices
Description
Predefined substitution matrices for nucleotide and amino acid alignments.
Usage
1 2 3 4 5 6 7 8 9 10 11 12 13  data(BLOSUM45)
data(BLOSUM50)
data(BLOSUM62)
data(BLOSUM80)
data(BLOSUM100)
data(PAM30)
data(PAM40)
data(PAM70)
data(PAM120)
data(PAM250)
nucleotideSubstitutionMatrix(match = 1, mismatch = 0, baseOnly = FALSE, type = "DNA")
qualitySubstitutionMatrices(fuzzyMatch = c(0, 1), alphabetLength = 4L, qualityClass = "PhredQuality", bitScale = 1)
errorSubstitutionMatrices(errorProbability, fuzzyMatch = c(0, 1), alphabetLength = 4L, bitScale = 1)

Arguments
match 
the scoring for a nucleotide match. 
mismatch 
the scoring for a nucleotide mismatch. 
baseOnly 

type 
either "DNA" or "RNA". 
fuzzyMatch 
a named or unnamed numeric vector representing the base match probability. 
errorProbability 
a named or unnamed numeric vector representing the error probability. 
alphabetLength 
an integer representing the number of letters in the underlying string alphabet. For DNA and RNA, this would be 4L. For Amino Acids, this could be 20L. 
qualityClass 
a character string of 
bitScale 
a numeric value to scale the qualitybased substitution matrices. By default, this is 1, representing bitscale scoring. 
Format
The BLOSUM and PAM matrices are square symmetric matrices with integer coefficients, whose row and column names are identical and unique: each name is a single letter representing a nucleotide or an amino acid.
nucleotideSubstitutionMatrix
produces a substitution matrix for all IUPAC
nucleic acid codes based upon match and mismatch parameters.
errorSubstitutionMatrices
produces a two element list of numeric
square symmetric matrices, one for matches and one for mismatches.
qualitySubstitutionMatrices
produces the substitution matrices
for Phred or Solexa qualitybased reads.
Details
The BLOSUM and PAM matrices are not unique. For example, the definition of the widely used BLOSUM62 matrix varies depending on the source, and even a given source can provide different versions of "BLOSUM62" without keeping track of the changes over time. NCBI provides many matrices here ftp://ftp.ncbi.nih.gov/blast/matrices/ but their definitions don't match those of the matrices bundled with their standalone BLAST software available here ftp://ftp.ncbi.nih.gov/blast/
The BLOSUM45, BLOSUM62, BLOSUM80, PAM30 and PAM70 matrices were taken from NCBI standalone BLAST software.
The BLOSUM50, BLOSUM100, PAM40, PAM120 and PAM250 matrices were taken from ftp://ftp.ncbi.nih.gov/blast/matrices/
The quality matrices computed in qualitySubstitutionMatrices
are
based on the paper by Ketil Malde. Let ε_i be the probability
of an error in the base read. For "Phred"
quality measures Q
in [0, 99], these error probabilities are given by
ε_i = 10^{Q/10}. For "Solexa"
quality measures
Q in [5, 99], they are given by
ε_i = 1  1/(1 + 10^{Q/10}).
Assuming independence within and between base reads, the combined error
probability of a mismatch when the underlying bases do match is
ε_c = ε_1 + ε_2  (n/(n1)) * ε_1 * ε_2,
where n is the number of letters in the underlying alphabet. Using
ε_c, the substitution score is given by when two bases match is given by
b * \log_2(γ_{x,y} * (1  ε_c) * n + (1  γ_{x,y}) * ε_c * (n/(n1))),
where b is the bitscaling for the scoring and γ_{x,y} is the
probability that characters x and y represents the same underlying
information (e.g. using IUPAC, γ_{A,A} = 1 and γ_{A,N} = 1/4.
In the arguments listed above fuzzyMatch
represents γ_{x,y}
and errorProbability
represents ε_i.
Author(s)
H. Pagès and P. Aboyoun
References
K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics, Feb 23, 2008.
See Also
pairwiseAlignment
,
PairwiseAlignmentsclass,
DNAStringclass,
AAStringclass,
PhredQualityclass,
SolexaQualityclass,
IlluminaQualityclass
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46  s1 <
DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
s2 <
DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")
## Fit a global pairwise alignment using edit distance scoring
pairwiseAlignment(s1, s2,
substitutionMatrix = nucleotideSubstitutionMatrix(0, 1, TRUE),
gapOpening = 0, gapExtension = 1)
## Examine qualitybased match and mismatch bit scores for DNA/RNA
## strings in pairwiseAlignment.
## By default patternQuality and subjectQuality are PhredQuality(22L).
qualityMatrices < qualitySubstitutionMatrices()
qualityMatrices["22", "22", "1"]
qualityMatrices["22", "22", "0"]
pairwiseAlignment(s1, s2)
## Get the substitution scores when the error probability is 0.1
subscores < errorSubstitutionMatrices(errorProbability = 0.1)
submat < matrix(subscores[,,"0"], 4, 4)
diag(submat) < subscores[,,"1"]
dimnames(submat) < list(DNA_ALPHABET[1:4], DNA_ALPHABET[1:4])
submat
pairwiseAlignment(s1, s2, substitutionMatrix = submat)
## Align two amino acid sequences with the BLOSUM62 matrix
aa1 < AAString("HXBLVYMGCHFDCXVBEHIKQZ")
aa2 < AAString("QRNYMYCFQCISGNEYKQN")
pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 3, gapExtension = 1)
## See how the gap penalty influences the alignment
pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 6, gapExtension = 2)
## See how the substitution matrix influences the alignment
pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM50", gapOpening = 3, gapExtension = 1)
if (interactive()) {
## Compare our BLOSUM62 with BLOSUM62 from ftp://ftp.ncbi.nih.gov/blast/matrices/
data(BLOSUM62)
BLOSUM62["Q", "Z"]
file < "ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62"
b62 < as.matrix(read.table(file, check.names=FALSE))
b62["Q", "Z"]
}
