knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 )
library(idpr)
The composition of amino acids and the overall chemistry of Intrinsically
Disordered Proteins (IDPs) are distinctly different from that of ordered
proteins. Each amino acid has unique chemical characteristics that can either
favor a compact or an extended structure (Uversky, 2019) . Disorder-promoting
residues, those enriched in IDPs, are typically hydrophilic, charged, or small
residues. Order promoting residues, those enriched in structured proteins,
tend to be aliphatic, hydrophobic, aromatic, or form tertiary structures
(Uversky, 2013).
Therefore, there is a distinct difference of biochemistry
between IDPs and ordered proteins.
It was shown in Uversky, Gillespie, & Fink (2000) that both high net charge and
low mean hydropathy are properties of IDPs. One explanation is that a high net
charge leads to increased repulsion of residues causing an extended structure
and low hydrophobicity will reduce the hydrophobic interactions causing reduced
protein packing. When both average net charge and mean scaled hydropathy are
plotted, IDPs occupy a unique area on the plot. The barrier between unfolded
and compact proteins is:
$$
An alternative version of the Charge Hydropathy plot mentioned was shown
in Uversky (2016), where the average net charge is shown rather than the
absolute value. This creates two cutoff lines. One for positively charged
peptides:
$$
This plot allows a distinction between negative and positive proteins while preserving the information of the charge-hydropathy plot.
The idpr package can be installed from Bioconductor with the following line of code. It requires the BiocManager package to be installed
#BiocManager::install("idpr")
The most recent version of the package can be installed with the following line of code. This requires the devtools package to be installed.
#devtools::install_github("wmm27/idpr")
Methods are originally described in Uversky, Gillespie, & Fink (2000). The calculations shown here are to explain how chargeHydropathyPlot() obtains the values for graphing, and are done automatically within the function.
Mean scaled hydropathy is calculated by normalizing the Kyte and Doolittle scale on a scale of 0 to 1, with Arg having a hydropathy of 0 and Ile having a hydropathy of 1, averaged by the sequence length (Kyte & Doolittle, 1982; Uversky, 2016).
Net charge is calculated with the Henderson-Hasselbalch equation (Po & Senozan, 2001). While there is not one agreed upon set of pKa values, the Isoelectric Point Calculator does a great job at unifying these data sets (Kozlowski, 2016). The net charge is then averaged over the sequence length.
Both values are plotted in the Charge-Hydropathy plot to identify a protein or proteins location in that space.
Examples will use the H. sapiens TP53 sequence, acquired from UniProt (UniProt Consortium 2019) and stored within the idpr package for examples.
HUMAN_P53 <- TP53Sequences[2] print(HUMAN_P53)
The first portion is calculating the mean scaled hydropathy of the sequence. This will correspond to the value on the X-axis
P53_msh <- meanScaledHydropathy(HUMAN_P53) print(P53_msh)
The second value is the net charge. Setting averaged = TRUE will average the calculated net charge by the sequence length.This will correspond to the value on the Y-axis. Total Net charge can be calculated by setting averaged to FALSE.
P53_nc <- netCharge(HUMAN_P53, averaged = TRUE) print(P53_nc)
The chargeHydropathyFunction will automatically calculate these values and plot it as a point within the charge-hydropathy space according to Uversky, Gillespie, & Fink (2000) / Uversky (2016)
Since a ggplot is returned, the user can assign the plot to an object to modify aesthetics and/or add annotations. The example shown here will label the cooridnates on the returned plot.
P53_msh <- round(P53_msh, 3) P53_nc <- round(P53_nc, 3)
chargeHydropathyPlot(HUMAN_P53) + ggplot2::annotate("text", x = P53_msh, y = P53_nc + 0.1, #offset from point label = paste("(", P53_msh, ", ", P53_nc, ")", collapse = "", sep = ""))
The plot above shows calculations for one protein, but the function can accept multiple sequences.
The following examples will use highly similar TP53 sequences, acquired from UniProt (UniProt Consortium 2019) that were stored within the idpr package for examples.
TP53_Sequences <- TP53Sequences print(TP53_Sequences)
gg <- chargeHydropathyPlot( sequence = TP53_Sequences, pKaSet = "IPC_protein") plot(gg)
Since it is a ggplot, the user can add additional parameters to the plot like labels, custom themes, and other supported features. An example of this is shown in "Methods".
If you do not wish to use the IPC_protein pKa set, you can change this to any accepted by netCharge().
chargeHydropathyPlot( sequence = TP53_Sequences, pKaSet = "EMBOSS") #Using EMBOSS pKa set
meanScaledHydropathy is used to calculate the scaled hydropathy of a sequence. See the introduction and methods for more details.
meanScaledHydropathy(HUMAN_P53)
scaledHydropathyGlobal is used to match the scaled hydropathy of an amino acid sequence for each residue in the sequence.
The results can be a data frame of matched values. It will result in a data frame with 3 columns. The first column is "Position", which indicates the numeric position of the residue in the submitted sequence. The second column is "AA", which indicates the amino acid residue as a single letter. The third column is "Hydropathy", which indicates the value of scaled hydropathy for that residue, which was matched by the function.
P53_shg <- scaledHydropathyGlobal(HUMAN_P53) head(P53_shg)
Or results can be returned as a plot
scaledHydropathyGlobal(HUMAN_P53, plotResults = TRUE)
(This is not the most aesthetically pleasing plot, so a sequenceMap from idpr is reccomended in this case for visualizations.)
P53_shg <- scaledHydropathyGlobal(HUMAN_P53) sequenceMap(sequence = P53_shg$AA, property = P53_shg$Hydropathy, customColors = c("chocolate1", "grey65", "skyblue3"))
scaledHydropathyLocal is a function used to calculate the average hydrophobicity of a sequence using a sliding window. The results are returned as a data frame or a plot.
The scaled hydropathy is matched to each residue then the average of each window is calculated.
This can be helpful to identify regions of high or low hydrophobicity within a protein. This may help with identifying IDRs as well.
Results can be returned as a data frame of window hydropathy scores. It will result in a data frame with 3 columns. The first column is "Position", which indicates the numeric position of the residue in the submitted sequence. The second column is "Window" which indicates all the residues within the sliding window that were used for calculations. The "CenterResidue" column specifies the amino acid residue as a single letter that is located in the center of the window and is located at the number specified by "Position" And finally, the "WindowHydropathy" is the calculated average hydropathy for the residues specified in the "Window" column.
P53_shl <- scaledHydropathyLocal(HUMAN_P53, plotResults = FALSE) head(P53_shl)
The results can also return a plot that shows each window's hydropathy score along the sequence.
scaledHydropathyLocal(HUMAN_P53, plotResults = TRUE)
The window size can be specified as well by the "window" argument. This must be an odd integer
scaledHydropathyLocal(HUMAN_P53, window = 3, plotResults = TRUE)
netCharge is used to calculate the net charge of a sequence. See the introduction and methods for more details.
netCharge(HUMAN_P53)
Setting averaged = TRUE will average the calculated net charge by the sequence length (Shown before in methods).
netCharge(HUMAN_P53, averaged = TRUE)
netCharge relies on the Henderson-Hasselbalch equation (via the hendersonHasselbalch function). Therefore the pH and pKa are critical for calculations. netCharge allows for different pH values using the pH argument.
netCharge(HUMAN_P53, pH = 5.5) netCharge(HUMAN_P53, pH = 7) netCharge(HUMAN_P53, pH = 8)
There are also many pKa sets that are preloaded in idpr. pKa datasets used within this vignette are cited. See the documentation for netCharge or pKaData within idpr for additional information and citations for available pKa sets. Additionally, see Kozlowski (2016) for further details on pKa data sets.
netCharge(HUMAN_P53, pKaSet = "IPC_protein") netCharge(HUMAN_P53, pKaSet = "IPC_peptide") netCharge(HUMAN_P53, pKaSet = "EMBOSS")
Alternatively, the user may supply a custom pKa dataset. The format must be a data frame where: Column 1 must be a character vector of residues AND Column 2 must be a numeric vector of pKa values. This can be helpful if there is a data set the user prefers or if there are noncannonical amino acids. Here is an example using the pKa values from Wikipedia (Proteinogenic amino acid, n.d.). https://en.wikipedia.org/wiki/Proteinogenic_amino_acid#Chemical_properties
charged_aa <- c("C", "D", "E", "H", "K", "R", "Y", "NH3", "COO") aa_pKa <- c(8.55, 3.67, 4.25, 6.54, 10.40, 12.3, 9.84, 9.28, 1.99) custom_pKa <- data.frame(AA = charged_aa, pKa = aa_pKa)
print(custom_pKa) netCharge(HUMAN_P53, pKaSet = custom_pKa, includeTermini = FALSE)
chargeCalculationGlobal is a function used to calculate the charge of each residue, independent of other amino acids, within a sequence. The results are returned as a data frame (default) or a plot.
chargeCalculationGlobal accepts the same pKa and pH arguments as netCharge.
The results can return a data frame of charge for each residue
It will result in a data frame with 3 columns. The first column is "Position", which indicates the numeric position of the residue in the submitted sequence. The second column is "AA", which indicates the amino acid residue as a single letter. The third column is "Charge", which indicates the calculated charge for the residue at the specified pH.
P53_ccg <- chargeCalculationGlobal(HUMAN_P53) head(P53_ccg)
The results can return a ggplot visualizing the charge distribution.
chargeCalculationGlobal(HUMAN_P53, plotResults = TRUE)
(This is not the most aesthetically pleasing plot, so a sequenceMap from idpr is recommended in this case for visualizations.)
P53_ccg <- chargeCalculationGlobal(HUMAN_P53) #repeating from above sequenceMap(sequence = P53_ccg$AA, property = P53_ccg$Charge, customColors = c("red", "blue", "grey65"))
The C-terminus here has a charge of ~ -2 since the function aggregates the termini values with residue charges by default. If you wish to calculate the termini as separate values, use sumTermini = FALSE. This will add 2 residues to the data frame as "NH3" and "COO"
P53_ccg <- chargeCalculationGlobal(HUMAN_P53, sumTermini = FALSE) head(P53_ccg)
If you wish to completely ignore the termini for calculation, set includeTermini = FALSE.
P53_ccg <- chargeCalculationGlobal(HUMAN_P53, includeTermini = FALSE) head(P53_ccg)
chargeCalculationLocal is a function used to calculate the charge of a sequence using a sliding window. The results are returned as a data frame (default) or a plot.
chargeCalculationLocal accepts the same pKa and pH arguments as netCharge.
Unlike chargeCalculationGlobal, the chargeCalculationLocal function does not consider termini for calculations.
Results can be returned as a data frame of window hydropathy scores. It will result in a data frame with 4 columns. The first column is "Position", which indicates the numeric position of the residue in the submitted sequence. The "CenterResidue" column specifies the amino acid residue as a single letter that is located in the center of the window and is located at the number specified by "Position". The "Window" column contains all the residues within the sliding window that were used for calculating the charge. And finally, the "windowCharge" is the calculated average charge for the residues specified in the "Window" column.
P53_cgl <- chargeCalculationLocal(HUMAN_P53) head(P53_cgl)
Alternatively, results can be returned as a plot of each window's charge.
chargeCalculationLocal(HUMAN_P53, plotResults = TRUE)
The window size can be specified as well via the "window" argument. This must be an odd integer
chargeCalculationLocal(HUMAN_P53, window = 11, plotResults = TRUE)
citation("ggplot2")
Consortium, T. U. (2018). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515. doi:10.1093/nar/gky1049
Kozlowski, L. P. (2016). IPC – Isoelectric Point Calculator. Biology Direct, 11(1), 55. doi:10.1186/s13062-016-0159-9
Kyte, J., & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of molecular biology, 157(1), 105-132.
Po, H. N., & Senozan, N. (2001). The Henderson-Hasselbalch equation: its history and limitations. Journal of Chemical Education, 78(11), 1499.
Proteinogenic amino acid. (n.d.). In Wikipedia. Retrieved July 12th, 2020. https://en.wikipedia.org/wiki/Proteinogenic_amino_acid#Chemical_properties
Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16(6), 276-277. doi:10.1016/S0168-9525(00)02024-2
Uversky, V. N. (2013). A decade and a half of protein intrinsic disorder: Biology still waits for physics. Protein Science, 22(6), 693-724. doi:10.1002/pro.2261
Uversky, V. N. (2016). Paradoxes and wonders of intrinsic disorder: Complexity of simplicity. Intrinsically Disordered Proteins, 4(1), e1135015. doi:10.1080/21690707.2015.1135015
Uversky, V. N. (2019). Intrinsically Disordered Proteins and Their “Mysterious” (Meta)Physics. Frontiers in Physics, 7(10). doi:10.3389/fphy.2019.00010
Uversky, V. N., Gillespie, J. R., & Fink, A. L. (2000). Why are “natively unfolded” proteins unstructured under physiologic conditions?. Proteins: structure, function, and bioinformatics, 41(3), 415-427. https://doi.org/10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7
R Version
R.version.string
System Information
as.data.frame(Sys.info())
sessionInfo()
citation()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.