Disordered Matrices Vignette

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Substitution Matrices for Intrinsically Disordered Proteins

Quick Intro

Intrinsically Disordered Proteins (IDPs) have many distinct properties compared to structured proteins, such as amino acid composition and rate of evolution. The most commonly used amino acid substitution matrices (PAM and BLOSUM) are derived from or favor structured proteins and therefore are not the most appropriate to use when analyzing IDPs. The following amino acid substitution matrices are derived from sets of IDPs and thus, are more appropriate.

A detailed comparison is available at Trivedi & Nagarajaram, 2019.
Additionally, see the documentation for each matrix for quick information about the derivation and use of each matrix.

The following matrices are available within idpr:

Bolded Matrices are the best-preforming


Installation

The idpr package can be installed from Bioconductor with the following line of code. It requires the BiocManager package to be installed

#BiocManager::install("idpr") 

The most recent version of the package can be installed with the following line of code. This requires the devtools package to be installed.

#devtools::install_github("wmm27/idpr")

To access the matrices, the idpr package needs to be loaded

library(idpr)

Background and Motivation

It has been shown that solvent-exposed amino acids experience higher levels of evolution compared to the internal residues of a protein ( Franzosa & Xia, 2009). Intrinsically Disordered Proteins (IDPs) and Intrinsically Disordered Regions (IDRs) are, by definition, more solvent exposed. One can predict that these proteins have accelerated evolution, and this has been observed ( Brown et al., 2011).

Additionally, a study that compared proteins with both ordered and disordered regions showed the IDRs, on average, evolved at faster rates than the ordered regions ( Brown et al., 2002).

The most commonly used amino acid substitution matrices are BLOSUM and PAM. These are currently the only accepted matrices for NCBI BLAST ( Johnson et al., 2008), as well as EMBOSS Needle and EMBOSS Water web serveries ( Madeira et al., 2019).

The sequences used to calculate the values for both of these matrices are from highly conserved protein families (Dayhoff et al. 1978; Henikoff & Henikoff, 1992). While they do not explicitly exclude IDPs, the sequences used for analysis are enriched in highly-ordered proteins. Structured proteins have served as the basis for sequence-level comparisons due to their residue and position conservation. IDPs do not experience the same level of residue-identity conservation that has been observed in proteins with a specific, well-folded structure. IDPs often conserve the overall chemistry and the disordered state of the protein, but are less conserved at the residue and position level. Therefore using BLOSUM or PAM for studies comparing IDP evolution is not the most appropriate method, since these matrices account for the evolutionary observations of ordered proteins ( Brown et al., 2011).

Additionally, the amino acid composition and accepted point mutations for IDPs differ compared to structured proteins. IDPs and IDRs tend to be enriched in hydrophillic and charged residues ( Uversky, 2019). In addition to a distinct composition, IDPs tend to accept indels at higher rates ( Brown et al., 2011).

Therefore, there are multiple reasons as to why BLOSUM and PAM matrices do not accurately assess the evolutionary history of IDPs. The works of Trivedi & Nagarajaram, 2019, Brown et al., 2009, and Radivojac et al., 2001 can serve as references for IDP-based substitution matrices. Each use experimentally confirmed or computationally predicted IDPs to create the data set of protein sequences and protein families to calculate the accepted point mutations specific to disordered sequences. These are the EDSSMat set of matrices, the "Disorder" set of matrices, and the DUNMat matrix.

Unlike most commonly used web-based tools, many sequence comparison functions within other R packages allow for custom substitution matrices. idpr has added 11 disorder-based substitution matrices for IDP sequence conservation studies for user convenience.

Trivedi & Nagarajaram, 2019 provides many details for comparing the EDSSMat matrices to the other disorder matrices and other matrices like BLOSUM and PAM within sequence alignments. The paper also calculated the optimal gap parameters for sequence alignments depending on how disordered the aligned sequences were predicted to be. Protein families were sorted into 3 categories: Proteins containing Less Disorder (LD), defined as [0-20%] disorder, Moderate Disorder (MD), defined as (20-40%] disorder, and High Disorder (HD), defined as (40-100%] disorder.

Additionally, EDSSMat62 was shown to identify both close and distant homologs of a specific IDP while other matrices could only identify some close homologs. Please see Trivedi & Nagarajaram, 2019
for additional information and for comparisons to other matrices.

Matrices

EDSS Matrices

EDSSMat62, like the other EDSS matrices, is derived from alignment blocks of computationally predicted IDPs. Trivedi & Nagarajaram, 2019
go into detail of methods and comparisons. EDSSMat62 is the best preforming matrix in their study and was shown to identify both close and distant homologs of Secretogranin-1, an IDP, while other matrices could only identify some close homologs. On average, EDSSMat62 attained smaller E-values when aligning Highly Disordered (HD) proteins compared to BLOSUM and PAM matrices, and against some other disorder-based substitution matrices.

EDSS Matrices are symmetric, 24x24 matrices. They contain the 20 standard amino acids and 4 ambiguous residues: B, Asparagine or Aspartic Acid (Asx); Z, Glutamine or Glutamic Acid (Glx); X, Unspecified or unknown amino acid; and *, Stop.

EDSSMat62

"Disorder" Matrices

In their study, Trivedi & Nagarajaram, 2019 show that the Disorder40 and Disorder85 matrices, on average, score lower E-values for the HD proteins. EDSSMat62 on average attains lower E-values for the Low and Medium disordered proteins. The "Disorder" matrices are from Brown et al., 2009, which identified protein families based on homology to experimentally confirmed IDPs. The "Disorder" matrices were not designed with the goal of improving sequence alignments. The authors note that the purpose of these matrices were to compare evolutionary characteristics of disordered and ordered proteins. Therefore information for using these in sequence alignments, like gap scores and entropy, are not provided by the original authors.
Trivedi & Nagarajaram, 2019 determined the optimal gap parameters for these matrices to compare against EDSSMat62.

Disorder40, Disorder60, and Disorder85 are symmetric, 24x24 matrices. They contain the 20 standard amino acids and 4 ambiguous residues: B, Asparagine or Aspartic Acid (Asx); Z, Glutamine or Glutamic Acid (Glx); X, Unspecified or unknown amino acid; and *, Stop.

Disorder40
Disorder85

DUNMat

DUNMat was described in Radivojac et al., 2001. This was an early attempt at improving alignments for IDPs. It is included in idpr due to its significance in developing substitution matrices for IDPs. It is a symmetric 20x20 matrix. Trivedi & Nagarajaram, 2019 show that EDSSMat62, on average, attains lower E-values for HD proteins.

DUNMat

Examples

Pairwise Sequence Alignments (PSAs)

P53 is a well-known IDP that has been experimentally determined. Information was accessed by DisProt, a database for IDPs ( Hatos et al., 2019). (https://www.disprot.org/)

These examples are to showcase the differences of BLOSUM62, a commonly used matrix that can be loaded into R via the Biostrings package, and EDSSMat62, a matrix comes loaded with idpr specific for IDPs. EDSSMat62 can be used with any function that accepts custom substitution matrices, and these examples show some uses of it.

Cellular tumor antigen p53

The amino acid sequences were acquired from UniProt (UniProt Consortium, 2019) and stored within the idpr package for examples.

P53_MOUSE <- TP53Sequences[1]
print(P53_MOUSE)
P53_HUMAN <- TP53Sequences[2]
print(P53_HUMAN)
P53_GORILLA <- GorillaTP53
print(P53_GORILLA)

PSA of Human TP53 to Gorilla and Mouse P53 BLOSUM62 ( Henikoff & Henikoff, 1992) with default gap opening/extension from EMBOSS Needle ( Madeira et al., 2019). pairwiseAlignment function and BLOSUM62 matrix are from Biostrings

library(Biostrings)
data("BLOSUM62") #loads the matrix from the Biostrings package
HUMAN_MOUSE_BLOSUM_PSA <- pairwiseAlignment(P53_MOUSE, P53_HUMAN,
                                            substitutionMatrix = BLOSUM62,
                                            gapOpening = 10,
                                            gapExtension = 0.5)
print(HUMAN_MOUSE_BLOSUM_PSA)

HUMAN_GORILLA_BLOSUM_PSA <- pairwiseAlignment(P53_GORILLA, P53_HUMAN,
                                             substitutionMatrix = BLOSUM62,
                                             gapOpening = 10,
                                             gapExtension = 0.5)
print(HUMAN_GORILLA_BLOSUM_PSA)

PSA of Human TP53 to Gorilla and Mouse P53 using EDSSMat62 with gap costs recommended for IDPs in ( Trivedi & Nagarajaram, 2019).

HUMAN_MOUSE_EDSS_PSA <- pairwiseAlignment(P53_MOUSE, P53_HUMAN,
                                            substitutionMatrix = EDSSMat62,
                                            gapOpening = 19,
                                            gapExtension = 2)
print(HUMAN_MOUSE_EDSS_PSA)

HUMAN_GORILLA_EDSS_PSA <- pairwiseAlignment(P53_GORILLA, P53_HUMAN,
                                             substitutionMatrix = EDSSMat62,
                                             gapOpening = 19,
                                             gapExtension = 2)
print(HUMAN_GORILLA_EDSS_PSA)

Comparing the printed alignment for each matrix comparing Human and Mouse P53, we can see that the alignments differ depending on the matrix used to align. This is also seen in the second alignments for each matrix that compares Human and Gorilla P53.

Using BLOSUM, the scores between each PSA differ greatly by over 100 points. Using EDSSMat62 we can see almost identical scores between the alignment comparing Human p53 to Mouse p53 and in the alignment comparing Human p53 to Gorilla p53. Scores are not directly comparable between matrices, but this shows relative to each other, the scores with EDSSMat62 are similar for this specific case. There will be some instances where BLOSUM62 may appear to have better scores, but as previously stated, Trivedi & Nagarajaram, 2019 show that on average EDSSMat62 attains lower E-values for alignments of IDPs.

Multiple Sequence Alignments (MSAs)

In addition to PSAs, Multiple Sequence Alignments (MSAs) offer additional analysis for protein families.

First I will load several P53 sequences stored within idpr. These sequences were selected due to their highly similar identity on UniProt( The UniProt Consortium, 2019).

TP53_Sequences <- TP53Sequences
print(TP53_Sequences)

MSA of P53 proteins using the msa package

library(msa)
BLOSUM_MSA <- msa(TP53_Sequences,
                 type = "protein",
                 substitutionMatrix = BLOSUM62,
                 gapOpening = 10,
                 gapExtension = 0.5)

print(BLOSUM_MSA, show="complete")
EDSS_MSA <- msa(TP53_Sequences,
                type = "protein",
                substitutionMatrix = EDSSMat62,
                gapOpening = 19,
                gapExtension = 2)

print(EDSS_MSA, show="complete")

Distance Trees

The user guide to msa shows an example of converting the sequence alignment (like the one generated above) into a phylogenetic tree of sequence distances. Therefore, the IDP-specific matrices can be used for this type of analysis. The conversion uses both the ape and seqinr packages.

EDSS_MSA_Tree <- msa::msaConvert(EDSS_MSA, type="seqinr::alignment")
d <- seqinr::dist.alignment(EDSS_MSA_Tree, "identity")
p53Tree <- ape::nj(d)
plot(p53Tree, main="Phylogenetic Tree of p53 Sequences\nAligned with EDSSMat62")

References

Packages

citation("Biostrings")
citation("msa")
citation("ape")
citation("seqinr")

Citations

Brown, C. J., Johnson, A. K., & Daughdrill, G. W. (2009). Comparing Models of Evolution for Ordered and Disordered Proteins. Molecular Biology and Evolution, 27(3), 609-621. doi:10.1093/molbev/msp277

Brown, C. J., Johnson, A. K., Dunker, A. K., & Daughdrill, G. W. (2011). Evolution and disorder. Current Opinion in Structural Biology, 21(3), 441-446. doi:https://doi.org/10.1016/j.sbi.2011.02.005

Brown, C. J., Takayama, S., Campen, A. M., Vise, P., Marshall, T. W., Oldfield, C. J., . . . Dunker, A. K. (2002). Evolutionary rate heterogeneity in proteins with long disordered regions. Journal of molecular evolution, 55(1), 104.

Dayhoff, M., Schwartz, R., & Orcutt, B. (1978). 22 a model of evolutionary change in proteins. In Atlas of protein sequence and structure (Vol. 5, pp. 345-352): National Biomedical Research Foundation Silver Spring MD.

Franzosa, E. A., & Xia, Y. (2009). Structural Determinants of Protein Evolution Are Context-Sensitive at the Residue Level. Molecular Biology and Evolution, 26(10), 2387-2395. doi:10.1093/molbev/msp146

Hatos, A., Hajdu-Soltész, B., Monzon, A. M., Palopoli, N., Álvarez, L., Aykac-Fas, B., . . . Piovesan, D. (2019). DisProt: intrinsic protein disorder annotation in 2020. Nucleic acids research, 48(D1), D269-D276. doi:10.1093/nar/gkz975

Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919. doi:10.1073/pnas.89.22.10915

Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S., & Madden, T. L. (2008). NCBI BLAST: a better web interface. Nucleic acids research, 36(suppl_2), W5-W9. doi:10.1093/nar/gkn201

Madeira, F., Park, Y. M., Lee, J., Buso, N., Gur, T., Madhusoodanan, N., . . . Finn, R. D. (2019). The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic acids research, 47(W1), W636-W641.

Radivojac, P., Obradovic, Z., Brown, C. J., & Dunker, A. K. (2001). Improving sequence alignments for intrinsically disordered proteins. In Biocomputing 2002 (pp. 589-600): World Scientific.

Trivedi, R., & Nagarajaram, H. A. (2019). Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Scientific Reports, 9(1), 16380. doi:10.1038/s41598-019-52532-8

UniProt Consortium. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515.

Uversky, V. N. (2019). Intrinsically Disordered Proteins and Their “Mysterious” (Meta)Physics. Frontiers in Physics, 7(10). doi:10.3389/fphy.2019.00010

Additional Information

R Version

R.version.string

System Infomation

as.data.frame(Sys.info())
citation()


Try the idpr package in your browser

Any scripts or data that you put into this service are public.

idpr documentation built on Dec. 26, 2020, 6 p.m.