The PeptideRanger
package contains a random forest (RF) classifier that predicts the detectability of peptides in mass spectrometry (MS). Detectability is predicted using sequence derived physiochemical features and independent of biological abundance. The package also contains tools to retrain the RF on datasets more relevant to experimental conditions, customize peptidomes, and select the top peptides for proteins of interest based on user provided priorities.
Functions available in the PeptideRanger
package:
create_peptidome()
: Creates a list of all unique tryptic peptides for a given proteome fasta file.
peptides_inReference()
: Adds peptide observation information from a reference/database to a peptidome.
train_RFmodel()
: Trains an RF classifier for peptide detectability predictions using database observations.
prioritize_peptides()
: Selects top n peptides for a list of target protein UniProt IDs based on RF scores and/or user provided priorities.
peptide_predictions()
: Scores peptide sequences by MS detectability using input RF classifier.
Data available in the PeptideRanger
package:
CPTAC_exp_counts
: Dataframe of all peptides found in a 2021/02/18 download of the CPTAC database and the number of experiments they were observed in (column = n_obs_pep
)
ProtDB_exp_counts
: Dataframe of all peptides found in a 2020/12 download of the ProteomicsDB database and the number of experiments they were observed in (column = n_obs_pep
)
RFmodel_CPTAC
: Random forest classifier trained on the CPTAC database to predict peptide detectability.
RFmodel_ProteomicsDB
: Random forest classifier trained on the ProteomicsDB database to predict peptide detectability.
SwissProt2018_peptidome
: List of all unique peptides (peptidome) yielded from an in-silico tryptic digestion of a 2018/08/03 download of the human proteome from SwissProt. This digest includes peptides yielded from 0-2 missed cleavage (MC) events and in an amino acid range of 6-25 residues.
SwissProt2018_peptidome_synth
: Filtered version of SwissProt2018_peptidome
that does not include peptides that are challening to synthesize. This currently excludes peptides with N-term Q and E residues and with C-term P residues.
tm_peptides
: List of peptides that overlap with transmembrane domain sequences annotated in UniProt.
The PeptideRanger
package can be installed using the devtools
package.
install.packages('devtools')
Direct installation from GitHub can be performed using:
require(devtools) install_github("rr-2/PeptideRanger")
The package can be cloned from the repository and installed locally using:
require(devtools) install("./PeptideRanger")
Load the PeptideRanger
package.
library(PeptideRanger)
For these examples, a random sample of 100 protein UniProt IDs will be used to represent a list of target proteins of interest.
set.seed("3284") target_proteins <- sample(unique(PeptideRanger::SwissProt2018_peptidome$uniprot), 100)
In this example, the top 5 peptides for each of the target proteins are selected based on RF detectability prediction scores. This demonstrates minimal usage of PeptideRanger
in which a default RF classifier and peptidome provided in the pacakge are utilized and the only user input is the list of target proteins.
prioritized_peptides <- prioritize_peptides(uniprot_list = target_proteins, max_n = 5, peptidome = PeptideRanger::SwissProt2018_peptidome, prediction_model = PeptideRanger::RFmodel_ProteomicsDB, priorities = "RF_score", priority_thresholds = 0)
The output dataframe will contain the top 5 peptides for each target protein. For example, below are the top 5 peptides for UniProt ID: Q9UJX0
print(prioritized_peptides[prioritized_peptides$uniprot == "Q9UJX0",], row.names = FALSE)
In this example, a novel proteome of interest is available and it is desired to generate a list of peptides to synthesize for application in a targeted quantification strategy that will be applied to the target protein list. The novel proteome could be an updated version of the human proteome, a proteome that includes non-canonical protein isoforms, or the proteome of a different organism.
First, a peptidome must be generated from this new proteome. Given that the peptides are intended for synthesis, synth_peps
is set to TRUE
, meaning that peptides that are challenging to synthesize are filtered out of the peptidome. aa_range
sets the size range of peptides retained in the peptidome while missed_cleavages
sets the range of missed cleavage events peptides can result from.
novel_peptidome <- create_peptidome(proteome_dir = "data/uniprot_homo_20180803_swissprot.fasta", missed_cleavages = c(0,2), synth_peps = TRUE, aa_range = c(6,25))
novel_peptidome <- create_peptidome(proteome_dir = "directory/to/novel_proteome.fasta", missed_cleavages = c(0,2), synth_peps = TRUE, aa_range = c(6,25))
The resulting dataframe contains a list of all unique tryptic peptides for this proteome, the uniprot ID and gene symbol of their parent protein, their start and end locations in their parent protein, their size, and the number of proteolytic missed cleavages they resulted from. For example, below are the first 10 peptides by start location for Uniprot ID: Q3T906
novel_peptidome <- novel_peptidome[novel_peptidome$uniprot == "Q3T906",] print(novel_peptidome[1:10,], row.names = FALSE)
This peptidome can then be used directly as input to the peptidome
parameter of the prioritize_peptides()
function. In this example there is interest in selecting up to 10 peptides for each protein, and so max_n = 10
.
prioritized_peptides <- prioritize_peptides(uniprot_list = target_proteins, max_n = 10, peptidome = novel_peptidome, prediction_model = PeptideRanger::RFmodel_ProteomicsDB, priorities = "RF_score", priority_thresholds = 0)
Below are the top 10 peptides for UniProt ID: Q3T906 by RF detectability score.
print(prioritized_peptides[prioritized_peptides$uniprot == "Q3T906",], row.names = FALSE)
In this example, a new database of peptide observations in proteomics experiments with relevant technical conditions is available. Rather than use the RF trained on ProteomicsDB, which covers a diversity of mass spectrometers, labelling techniques, and experiment types, a predictor trained on this new database may be more specific to the experimental conditions being used. In this case, CPTAC is a large database of cancer proteomics experiments that all utilize tandem mass tags and orbitrap MS. Additionally, there are other priorities we would like to consider ahead of the RF detectability predictions, which include the presence of a peptide in a library of available synthetic peptides that have been previously used, and if the peptide is ubiquitous in the database (observed in >= 80% of all CPTAC experiments)
The example library of synthetic peptides will include 200 peptides randomly selected from the target proteins.
synthetic_library <- dplyr::sample_n(PeptideRanger::SwissProt2018_peptidome_synth[PeptideRanger::SwissProt2018_peptidome_synth$uniprot %in% target_proteins,], 200)
The peptides_inReference()
function is then used to merge the peptide experiment counts in CPTAC with the peptides in the peptidome. If a peptide in the peptidome is not observed in CPTAC, it is assigned an observation value of 0.
pep_reference
dataframe must be labelled "sequence"exp_counts_col
must be set to the name of the column containing the number of observations or experiment counts in the database.detection_freq = TRUE
: the frequency of appearance of each peptide will be calculated (n of experiments observed/total n database experiments).detection_ratio = TRUE
: the detectability ratio of each peptide will be calculated (n of experiments observed/n of experiments parent protein observed).reference_name
will be used as a prefix label for all columns added to the peptidome by this function. CPTAC_peptidome <- peptides_inReference(peptidome = PeptideRanger::SwissProt2018_peptidome_synth, pep_reference = PeptideRanger::CPTAC_exp_counts, reference_name = "CPTAC", exp_counts_col = "n_obs_pep", detection_freq = TRUE, detection_ratio = TRUE)
Next, all peptides in the peptidome that are also found in the library of synthetic peptides will be flagged. Given that no exp_counts_col
is specified, peptides will be labelled TRUE
in the synthetic_library
column if they are present in it, and FALSE
if they are not.
CPTAC_peptidome <- peptides_inReference(peptidome = CPTAC_peptidome, pep_reference = synthetic_library, reference_name = "synth_library")
Below are the columns added to the peptidome, with 10 peptides from UniProt ID: O96019 used as an example. CPTAC_ratio
will be used for training the new RF classifier while CPTAC_freq
and synth_library
will be used as priorities during selection of the top peptides.
output_peptidome <- CPTAC_peptidome[CPTAC_peptidome$uniprot == "O96019",] output_peptidome$symbol <- NULL output_peptidome$start <- NULL output_peptidome$end <- NULL output_peptidome$size <- NULL output_peptidome$missed_cleavages <- NULL output_peptidome$uniprot <- NULL output_peptidome$CPTAC_n_obs_prot <- NULL print(output_peptidome[1:10,], row.names = FALSE)
The train_RFmodel()
function is then used to generate the CPTAC RF.
CPTAC_RFmodel <- train_RFmodel(peptidome = CPTAC_peptidome, reference_name = "CPTAC")
Finally, the peptides are selected in order of the priorities
parameter.
For each protein:
TRUE
) in the synth_library
CPTAC_freq
for those with a freq >= 0.8RF_score
peptide_list <- prioritize_peptides(uniprot_list = target_proteins, max_n = 5, peptidome = CPTAC_peptidome, prediction_model = CPTAC_RFmodel, priorities = c("synth_library", "CPTAC_freq", "RF_score"), priority_thresholds = c(TRUE, 0.8, 0))
Below is an example output for the UniProt ID: O96019. The first peptide has synth_library = TRUE
, the next 2 peptides have CPTAC_freq >= 0.8
, and the final 2 peptides have the next highest RF_score
.
output <- peptide_list[peptide_list$uniprot == "O96019",] output$uniprot <- NULL output$missed_cleavages <- NULL output$CPTAC_n_obs_prot <- NULL output$CPTAC_n_obs_pep <- NULL output$size <- NULL output$start <- NULL output$end <- NULL output$symbol <- NULL print(output, row.names = FALSE)
All other peptidome columns such as size, start, end, and symbol are also retained in the dataframe of prioritized peptides, but are not shown here.
peptide_predictions()
The peptide_predictions()
function scores peptide detectability using the input prediction_model
for a list of peptide sequences. An example list of 10 peptide sequences (peps_ofInterest
) will be randomly selected from the SwissProt2018_peptidome
.
set.seed(7169) peps_ofInterest <- dplyr::sample_n(PeptideRanger::SwissProt2018_peptidome, 10)
peptide_predictions(peptides = "PEPTIDERANGER", prediction_model = PeptideRanger::RFmodel_ProteomicsDB)
Missed cleavages are used a feature for the RF predictor. If they are not available, the function will estimate them based on the sequence (this will not recognize protein N-term Methionine residue cleavages).
predictions <- peptide_predictions(peptides = peps_ofInterest$sequence, prediction_model = PeptideRanger::RFmodel_ProteomicsDB)
predictions
If missed cleavages are available, they can be input as a list to the missed_cleavages
parameter.
predictions <- peptide_predictions(peptides = peps_ofInterest$sequence, prediction_model = PeptideRanger::RFmodel_ProteomicsDB, missed_cleavages = peps_ofInterest$missed_cleavages)
predictions
create_peptidome()
The create_peptidome()
function takes the proteome fasta file directory directly as input and returns a dataframe of unique peptides resulting from an in-silico tryptic digestion.
fasta_directory <- "data/uniprot_homo_20180803_swissprot.fasta"
fasta_directory <- "directory/to/human_proteome.fasta"
A peptidome of unique peptides can be created by providing only a directory to the fasta file. In this case, default parameters will retain peptides resulting from 0-2 missed cleavages and in the range of 6-25 amino acids.
peptidome <- create_peptidome(proteome_dir = fasta_directory)
The resulting peptidome dataframe will appear with the below columns:
peptidome[1:10,]
If desired, the number of missed cleavages and the amino acid range of peptides resulting from in-silico digestion can be specified, which may better suit a database being compared to. With missed_cleavages = 0
only full digested peptides will be in the peptidome.
peptidome <- create_peptidome(proteome_dir = fasta_directory, missed_cleavages = 0, aa_range = c(7,25) )
If peptides are intended for synthesis, synth_peps
can be set to TRUE
, which filters out peptides with sequences that are challenging to synthesize. This currently includes peptides with N-term Q and E residues and C-term P residues.
peptidome <- create_peptidome(proteome_dir = fasta_directory, missed_cleavages = c(0,2), aa_range = c(6,20), synth_peps = TRUE)
peptides_inReference()
The peptides_inReference()
function adds information from a database or dataset to a peptidome. This is useful for flagging peptides in a list of interest, caclulating detectability ratios for training a RF, or calculating the frequency of peptide appearance in a database for use as a priority.
The peptides_inReference()
function can be used to flag peptides if they are present in a list. For example, PeptideRanger::tm_peptides
dataframe contains a list of peptides that overlap with transmembrane domains annotated in UniProt. The column of peptides in the pep_reference
dataframe must be labelled sequence
.
peptidome <- peptides_inReference(peptidome = PeptideRanger::SwissProt2018_peptidome, pep_reference = PeptideRanger::tm_peptides, reference_name = "transmembrane")
Peptides in the peptidome that are present in PeptideRanger::tm_peptides
are labelled TRUE
in the transmembrane
column.
output <- peptidome[peptidome$sequence %in% PeptideRanger::tm_peptides$sequence,] output$symbol <- NULL output$missed_cleavages <- NULL print(output[1:10,], row.names = FALSE)
The RF in PeptideRanger
is trained to predict peptide detectability ratios (n observations of peptide/n observations of parent protein). Peptide observations from a database can be merged with a peptidome, and if detection_ratio = TRUE
and the column of experiment counts (exp_counts_col
) in the pep_reference
is specified, detectability ratios and the number of parent protein observations will be calculated.
ProtDB_peptidome <- peptides_inReference(peptidome = PeptideRanger::SwissProt2018_peptidome, pep_reference = PeptideRanger::ProtDB_exp_counts, reference_name = "ProtDB", exp_counts_col = "n_obs_pep", detection_ratio = TRUE)
output <- ProtDB_peptidome[1:10,] output$missed_cleavages <- NULL output$symbol <- NULL output$start <- NULL output$end <- NULL output$size <- NULL print(output, row.names = FALSE)
If detection_freq = TRUE
and exp_counts_col
is specified, then the function will merge the number of peptide observations to peptides in the peptidome and calculate the fraction of total experiments in the database that each peptide was observed in.
CPTAC_peptidome <- peptides_inReference(peptidome = PeptideRanger::SwissProt2018_peptidome_synth, pep_reference = PeptideRanger::CPTAC_exp_counts, reference_name = "CPTAC", exp_counts_col = "n_obs_pep", detection_freq = TRUE)
output <- CPTAC_peptidome[1:10,] output$missed_cleavages <- NULL output$symbol <- NULL output$start <- NULL output$end <- NULL output$size <- NULL print(output, row.names = FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.