substitution_matrices: Scoring matrices

Description Usage Arguments Format Details Author(s) References See Also Examples

Description

Predefined substitution matrices for nucleotide and amino acid alignments.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
data(BLOSUM45)
data(BLOSUM50)
data(BLOSUM62)
data(BLOSUM80)
data(BLOSUM100)
data(PAM30)
data(PAM40)
data(PAM70)
data(PAM120)
data(PAM250)
nucleotideSubstitutionMatrix(match = 1, mismatch = 0, baseOnly = FALSE, type = "DNA")
qualitySubstitutionMatrices(fuzzyMatch = c(0, 1), alphabetLength = 4L, qualityClass = "PhredQuality", bitScale = 1)
errorSubstitutionMatrices(errorProbability, fuzzyMatch = c(0, 1), alphabetLength = 4L, bitScale = 1)

Arguments

match

the scoring for a nucleotide match.

mismatch

the scoring for a nucleotide mismatch.

baseOnly

TRUE or FALSE. If TRUE, only uses the letters in the "base" alphabet i.e. "A", "C", "G", "T".

type

either "DNA" or "RNA".

fuzzyMatch

a named or unnamed numeric vector representing the base match probability.

errorProbability

a named or unnamed numeric vector representing the error probability.

alphabetLength

an integer representing the number of letters in the underlying string alphabet. For DNA and RNA, this would be 4L. For Amino Acids, this could be 20L.

qualityClass

a character string of "PhredQuality", "SolexaQuality", or "IlluminaQuality".

bitScale

a numeric value to scale the quality-based substitution matrices. By default, this is 1, representing bit-scale scoring.

Format

The BLOSUM and PAM matrices are square symmetric matrices with integer coefficients, whose row and column names are identical and unique: each name is a single letter representing a nucleotide or an amino acid.

nucleotideSubstitutionMatrix produces a substitution matrix for all IUPAC nucleic acid codes based upon match and mismatch parameters.

errorSubstitutionMatrices produces a two element list of numeric square symmetric matrices, one for matches and one for mismatches.

qualitySubstitutionMatrices produces the substitution matrices for Phred or Solexa quality-based reads.

Details

The BLOSUM and PAM matrices are not unique. For example, the definition of the widely used BLOSUM62 matrix varies depending on the source, and even a given source can provide different versions of "BLOSUM62" without keeping track of the changes over time. NCBI provides many matrices here ftp://ftp.ncbi.nih.gov/blast/matrices/ but their definitions don't match those of the matrices bundled with their stand-alone BLAST software available here ftp://ftp.ncbi.nih.gov/blast/

The BLOSUM45, BLOSUM62, BLOSUM80, PAM30 and PAM70 matrices were taken from NCBI stand-alone BLAST software.

The BLOSUM50, BLOSUM100, PAM40, PAM120 and PAM250 matrices were taken from ftp://ftp.ncbi.nih.gov/blast/matrices/

The quality matrices computed in qualitySubstitutionMatrices are based on the paper by Ketil Malde. Let ε_i be the probability of an error in the base read. For "Phred" quality measures Q in [0, 99], these error probabilities are given by ε_i = 10^{-Q/10}. For "Solexa" quality measures Q in [-5, 99], they are given by ε_i = 1 - 1/(1 + 10^{-Q/10}). Assuming independence within and between base reads, the combined error probability of a mismatch when the underlying bases do match is ε_c = ε_1 + ε_2 - (n/(n-1)) * ε_1 * ε_2, where n is the number of letters in the underlying alphabet. Using ε_c, the substitution score is given by when two bases match is given by b * \log_2(γ_{x,y} * (1 - ε_c) * n + (1 - γ_{x,y}) * ε_c * (n/(n-1))), where b is the bit-scaling for the scoring and γ_{x,y} is the probability that characters x and y represents the same underlying information (e.g. using IUPAC, γ_{A,A} = 1 and γ_{A,N} = 1/4. In the arguments listed above fuzzyMatch represents γ_{x,y} and errorProbability represents ε_i.

Author(s)

H. Pag<c3><a8>s and P. Aboyoun

References

K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics, Feb 23, 2008.

See Also

pairwiseAlignment, PairwiseAlignments-class, DNAString-class, AAString-class, PhredQuality-class, SolexaQuality-class, IlluminaQuality-class

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
  s1 <- 
    DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
  s2 <-
    DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")

  ## Fit a global pairwise alignment using edit distance scoring
  pairwiseAlignment(s1, s2,
                    substitutionMatrix = nucleotideSubstitutionMatrix(0, -1, TRUE),
                    gapOpening = 0, gapExtension = 1)

  ## Examine quality-based match and mismatch bit scores for DNA/RNA
  ## strings in pairwiseAlignment.
  ## By default patternQuality and subjectQuality are PhredQuality(22L).
  qualityMatrices <- qualitySubstitutionMatrices()
  qualityMatrices["22", "22", "1"]
  qualityMatrices["22", "22", "0"]

  pairwiseAlignment(s1, s2)

  ## Get the substitution scores when the error probability is 0.1
  subscores <- errorSubstitutionMatrices(errorProbability = 0.1)
  submat <- matrix(subscores[,,"0"], 4, 4)
  diag(submat) <- subscores[,,"1"]
  dimnames(submat) <- list(DNA_ALPHABET[1:4], DNA_ALPHABET[1:4])
  submat
  pairwiseAlignment(s1, s2, substitutionMatrix = submat)

  ## Align two amino acid sequences with the BLOSUM62 matrix
  aa1 <- AAString("HXBLVYMGCHFDCXVBEHIKQZ")
  aa2 <- AAString("QRNYMYCFQCISGNEYKQN")
  pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 3, gapExtension = 1)

  ## See how the gap penalty influences the alignment
  pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM62", gapOpening = 6, gapExtension = 2)

  ## See how the substitution matrix influences the alignment
  pairwiseAlignment(aa1, aa2, substitutionMatrix = "BLOSUM50", gapOpening = 3, gapExtension = 1)

  if (interactive()) {
    ## Compare our BLOSUM62 with BLOSUM62 from ftp://ftp.ncbi.nih.gov/blast/matrices/
    data(BLOSUM62)
    BLOSUM62["Q", "Z"]
    file <- "ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62"
    b62 <- as.matrix(read.table(file, check.names=FALSE))
    b62["Q", "Z"]
  }

Example output

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colMeans, colSums, colnames, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, lengths, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, rank, rbind, rowMeans, rowSums, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which,
    which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit

Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] ACTTCACCAGCTCCCTGGCGG-TAAGTTGATC-A-AAGGA-A-ACGCA-A-AGTTTTCAAG 
subject: [1] GTTTCACTA-CTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC 
score: -25 
[1] 1.981756
[1] -5.899286
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] ACTTCACCAGCTCCCTGGCGGTAAGTTGATC---AAAGG---AAACGCAAAGTTTTCAAG 
subject: [1] GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC 
score: -102.487 
          A         C         G         T
A  1.701919 -2.006426 -2.006426 -2.006426
C -2.006426  1.701919 -2.006426 -2.006426
G -2.006426 -2.006426  1.701919 -2.006426
T -2.006426 -2.006426 -2.006426  1.701919
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] ACTTCACCAGCTCCCTGGCGGTAAGTTGAT------CAAAGGAAACGCAAAGTTTTCAAG 
subject: [1] GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC 
score: -27.38831 
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] HXBLVYMGCHFDCXV--BEHIKQZ 
subject: [1] QRN--YMYC-FQC-ISGNEY-KQN 
score: 38 
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] HXBLVYMGCHFDCXVBEHIKQZ 
subject: [1] QRN--YMYC-FQCISGNEYKQN 
score: 23 
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] HXBLVYMGCHFDCXV--BEHIKQZ 
subject: [1] QRN--YMYC-FQC-ISGNEY-KQN 
score: 57 

Biostrings documentation built on Nov. 8, 2020, 11:12 p.m.