README.md

title: PeptideRanger author: date: output: html_document: keep_md: yes self_contained: no

Introduction

The PeptideRanger package contains a random forest (RF) classifier that predicts the detectability of peptides in mass spectrometry (MS). Detectability is predicted using sequence derived physiochemical features and independent of biological abundance. The package also contains tools to retrain the RF on datasets more relevant to experimental conditions, customize peptidomes, and select the top peptides for proteins of interest based on user provided priorities.

If you find this package helpful, please cite: https://pubs.acs.org/doi/10.1021/acs.jproteome.2c00538

Overview of Package Functions

Functions available in the PeptideRanger package:

Overview of Package Data

Data available in the PeptideRanger package:

Installation

The PeptideRanger package can be installed using the devtools package.

install.packages('devtools')

GitHub Installation

Direct installation from GitHub can be performed using:

require(devtools)
install_github("rr-2/PeptideRanger")

Local Installation

The package can be cloned from the repository and installed locally using:

require(devtools)
install("./PeptideRanger")

Workflow Examples

Load the PeptideRanger package.

library(PeptideRanger)

For these examples, a random sample of 100 protein UniProt IDs will be used to represent a list of target proteins of interest.

set.seed("3284")
target_proteins <- sample(unique(SwissProt2018_peptidome$uniprot), 100)

Ex.1: Prioritize Peptides by RF Predictions

In this example, the top 5 peptides for each of the target proteins are selected based on RF detectability prediction scores. This demonstrates minimal usage of PeptideRanger in which a default RF classifier and peptidome provided in the pacakge are utilized and the only user input is the list of target proteins.

prioritized_peptides <- prioritize_peptides(uniprot_list = target_proteins,
                                            max_n = 5,
                                            peptidome = SwissProt2018_peptidome,
                                            prediction_model = RFmodel_ProteomicsDB,
                                            priorities = "RF_score",
                                            priority_thresholds = 0)

The output dataframe will contain the top 5 peptides for each target protein. For example, below are the top 5 peptides for UniProt ID: Q9UJX0

##  symbol uniprot missed_cleavages       sequence start end size  RF_score
##  OSGIN1  Q9UJX0                0       MLYPEYHK   415 422    8 0.8095300
##  OSGIN1  Q9UJX0                0     PDTDFGGNMK   178 187   10 0.6843551
##  OSGIN1  Q9UJX0                0 NPIDVDPFTYQSTR   508 521   14 0.6659614
##  OSGIN1  Q9UJX0                0 AVDDPGLVFNQLPK   401 414   14 0.6381387
##  OSGIN1  Q9UJX0                0    EHAIPHVVLGR   197 207   11 0.6317748

Ex.2: Prioritize Peptides from a Novel Proteome

In this example, a novel proteome of interest is available and it is desired to generate a list of peptides to synthesize for application in a targeted quantification strategy that will be applied to the target protein list. The novel proteome could be an updated version of the human proteome, a proteome that includes non-canonical protein isoforms, or the proteome of a different organism.

First, a peptidome must be generated from this new proteome. Given that the peptides are intended for synthesis, synth_peps is set to TRUE, meaning that peptides that are challenging to synthesize are filtered out of the peptidome. aa_range sets the size range of peptides retained in the peptidome while missed_cleavages sets the range of missed cleavage events peptides can result from.

novel_peptidome <- create_peptidome(proteome_dir = "directory/to/novel_proteome.fasta",
                                    missed_cleavages = c(0,2),
                                    synth_peps = TRUE,
                                    aa_range = c(6,25))

The resulting dataframe contains a list of all unique tryptic peptides for this proteome, the uniprot ID and gene symbol of their parent protein, their start and end locations in their parent protein, their size, and the number of proteolytic missed cleavages they resulted from. For example, below are the first 10 peptides by start location for Uniprot ID: Q3T906

##  symbol uniprot missed_cleavages                sequence start end size
##  GNPTAB  Q3T906                0             DQYHVLFDSYR    47  57   11
##  GNPTAB  Q3T906                0                  DNIAGK    58  63    6
##  GNPTAB  Q3T906                0 LCLPMPIDVVYTWVNGTDLELLK    69  91   23
##  GNPTAB  Q3T906                0                 NTTEPTK   114 120    7
##  GNPTAB  Q3T906                0       VPMLVLDPALPANITLK   136 152   17
##  GNPTAB  Q3T906                0    DLPSLYPSFHSASDIFNVAK   153 172   20
##  GNPTAB  Q3T906                0         NPSTNVSVVVFDSTK   175 189   15
##  GNPTAB  Q3T906                0             DVEDAHSGLLK   190 200   11
##  GNPTAB  Q3T906                0                 GYLTTDK   210 216    7
##  GNPTAB  Q3T906                0                LPENLSSK   247 254    8

This peptidome can then be used directly as input to the peptidome parameter of the prioritize_peptides() function. In this example there is interest in selecting up to 10 peptides for each protein, and so max_n = 10.

prioritized_peptides <- prioritize_peptides(uniprot_list = target_proteins,
                                            max_n = 10,
                                            peptidome = novel_peptidome,
                                            prediction_model = RFmodel_ProteomicsDB,
                                            priorities = "RF_score",
                                            priority_thresholds = 0)

Below are the top 10 peptides for UniProt ID: Q3T906 by RF detectability score.

##  symbol uniprot missed_cleavages       sequence start end size  RF_score
##  GNPTAB  Q3T906                0      LTFPAVSVK   788 796    9 0.8141367
##  GNPTAB  Q3T906                0    IPLVNISLLPK   695 705   11 0.7574879
##  GNPTAB  Q3T906                0        DFQELNK   276 282    7 0.6949187
##  GNPTAB  Q3T906                0       DTFADSLR   929 936    8 0.6915563
##  GNPTAB  Q3T906                0     MQITVEVDTR   627 636   10 0.6850533
##  GNPTAB  Q3T906                0 PPSLIVPLESQMTK   835 848   14 0.6735383
##  GNPTAB  Q3T906                0     TQLAYFTDSK   912 921   10 0.6734935
##  GNPTAB  Q3T906                0   SILPNSLGVSER   773 784   12 0.6552670
##  GNPTAB  Q3T906                0    DQYHVLFDSYR    47  57   11 0.6547674
##  GNPTAB  Q3T906                0   FIYLNDDVMFGK   402 413   12 0.6410661

Ex.3: Prioritize Peptides using a Novel Database and Multiple Priorities

In this example, a new database of peptide observations in proteomics experiments with relevant technical conditions is available. Rather than use the RF trained on ProteomicsDB, which covers a diversity of mass spectrometers, labelling techniques, and experiment types, a predictor trained on this new database may be more specific to the experimental conditions being used. In this case, CPTAC is a large database of cancer proteomics experiments that all utilize tandem mass tags and orbitrap MS. Additionally, there are other priorities we would like to consider ahead of the RF detectability predictions, which include the presence of a peptide in a library of available synthetic peptides that have been previously used, and if the peptide is ubiquitous in the database (observed in >= 80% of all CPTAC experiments)

The example library of synthetic peptides will include 200 peptides randomly selected from the target proteins.

synthetic_library <- dplyr::sample_n(SwissProt2018_peptidome_synth[SwissProt2018_peptidome_synth$uniprot %in% target_proteins,], 200)

First, the peptides_inReference() function is used to merge peptide experiment counts in CPTAC with peptides in the peptidome. Second, it is used again to flag peptides from the synthetic library as TRUE in the peptidome.

peptides_inReference() Details: The column of peptides in the pep_reference dataframe must be labelled "sequence" exp_counts_col must be set to the name of the column containing the number of observations or experiment counts in the database. detection_freq = TRUE: the frequency of appearance of each peptide will be calculated (n of experiments observed/total n database experiments). detection_ratio = TRUE: the detectability ratio of each peptide will be calculated (n of experiments observed/n of experiments parent protein observed). * reference_name will be used as a prefix label for all columns added to the peptidome by this function.

Merge CPTAC Counts with Peptidome

If a peptide in the peptidome is not observed in CPTAC, it is assigned an observation value of 0.

CPTAC_peptidome <- peptides_inReference(peptidome = SwissProt2018_peptidome_synth,
                                        pep_reference = CPTAC_exp_counts,
                                        reference_name = "CPTAC",
                                        exp_counts_col = "n_obs_pep",
                                        detection_freq = TRUE,
                                        detection_ratio = TRUE)

Flag Peptides in Synthetic Library

Given that no exp_counts_col is specified, peptides will be labelled TRUE in the synthetic_library column if they are present in it, and FALSE if they are not.

CPTAC_peptidome <- peptides_inReference(peptidome = CPTAC_peptidome,
                                        pep_reference = synthetic_library,
                                        reference_name = "synth_library")

Below are the columns added to the peptidome, with 10 peptides from UniProt ID: O96019 used as an example. CPTAC_ratio will be used for training the new RF classifier while CPTAC_freq and synth_library will be used as priorities during selection of the top peptides.

##                 sequence CPTAC_n_obs_pep CPTAC_freq CPTAC_ratio synth_library
##  SGGVYGGDEVGALVFDIGSYTVR               0 0.00000000       0.000         FALSE
##            VDFPTAIGMVVER             474 0.83597884       0.948         FALSE
##            DDGSTLMEIDGDK             309 0.54497354       0.618         FALSE
##      NGMVEDWDSFQAILDHTYK             360 0.63492063       0.720         FALSE
##       SEASLHPVLMSEAPWNTR             367 0.64726631       0.734         FALSE
##       LTELMFEHYNIPAFFLCK             299 0.52733686       0.598         FALSE
##            SPLAGDFITMQCR             384 0.67724868       0.768         FALSE
##                   LPQVTR               0 0.00000000       0.000         FALSE
##   PGLYGSVIVAGGNTLIQSFTDR               6 0.01058201       0.012         FALSE
##               LIANNTTVER             500 0.88183422       1.000         FALSE

The train_RFmodel() function is then used to generate the CPTAC RF.

CPTAC_RFmodel <- train_RFmodel(peptidome = CPTAC_peptidome,
                               reference_name = "CPTAC")

Finally, the peptides are selected in order of the priorities parameter. For each protein:

  1. Peptides are selected if they are present (TRUE) in the synth_library
  2. Peptides are selected in order of CPTAC_freq for those with a freq >= 0.8
  3. Peptides are selected in order of RF_score
peptide_list <- prioritize_peptides(uniprot_list = target_proteins,
                                    max_n = 5,
                                    peptidome = CPTAC_peptidome,
                                    prediction_model = CPTAC_RFmodel,
                                    priorities = c("synth_library", "CPTAC_freq", "RF_score"),
                                    priority_thresholds = c(TRUE, 0.8, 0))

Below is an example output for the UniProt ID: O96019. The first peptide has synth_library = TRUE, the next 2 peptides have CPTAC_freq >= 0.8, and the final 2 peptides have the next highest RF_score.

##                  sequence CPTAC_freq CPTAC_ratio synth_library   RF_score
##  RFSSWIGGSILASLGTFQQMWISK  0.0000000       0.000          TRUE 0.02582461
##                LIANNTTVER  0.8818342       1.000         FALSE 0.86585046
##             VDFPTAIGMVVER  0.8359788       0.948         FALSE 0.89937822
##             SPLAGDFITMQCR  0.6772487       0.768         FALSE 0.91270409
##             DDGSTLMEIDGDK  0.5449735       0.618         FALSE 0.81960323

All other peptidome columns such as size, start, end, and symbol are also retained in the dataframe of prioritized peptides, but are not shown here.

Function Examples

peptide_predictions()

The peptide_predictions() function scores peptide detectability using the input prediction_model for a list of peptide sequences. An example list of 10 peptide sequences (peps_ofInterest) will be randomly selected from the SwissProt2018_peptidome.

set.seed(7169)
peps_ofInterest <- dplyr::sample_n(SwissProt2018_peptidome, 10)

Ex.1: Single Peptide Input
peptide_predictions(peptides = "PEPTIDERANGER",
                    prediction_model = RFmodel_ProteomicsDB)
##        sequence  RF_score
## 1 PEPTIDERANGER 0.4055608

Ex.2: Peptide List Input - Missed Cleavages Estimated

Missed cleavages are used a feature for the RF predictor. If they are not available, the function will estimate them based on the sequence (this will not recognize protein N-term Methionine residue cleavages).

predictions <- peptide_predictions(peptides = peps_ofInterest$sequence,
                                   prediction_model = RFmodel_ProteomicsDB)

##                  sequence    RF_score
## 1                 EQWNSLK 0.653723404
## 2    HRSNIHVVNEWIANDISSTK 0.063997345
## 3                ADVAQMVR 0.641083720
## 4    EPPAAPCPARCDVSRCPSPR 0.055959867
## 5  EDRGFTAVVGDFGLAEKIPVYR 0.003070638
## 6  GLELEPGAGLFVAQAGGADPDK 0.467457966
## 7             RYDTVCGFCRK 0.002640244
## 8            KLQLELNQEREK 0.057616995
## 9          MYLVVVTSLYENNK 0.555479494
## 10               QDNGGCAK 0.163569477

Ex.3: Peptide List - Missed Cleavages Provided

If missed cleavages are available, they can be input as a list to the missed_cleavages parameter.

predictions <- peptide_predictions(peptides = peps_ofInterest$sequence,
                                   prediction_model = RFmodel_ProteomicsDB,
                                   missed_cleavages = peps_ofInterest$missed_cleavages)
##                  sequence    RF_score
## 1                 EQWNSLK 0.653723404
## 2    HRSNIHVVNEWIANDISSTK 0.063997345
## 3                ADVAQMVR 0.641083720
## 4    EPPAAPCPARCDVSRCPSPR 0.055959867
## 5  EDRGFTAVVGDFGLAEKIPVYR 0.003070638
## 6  GLELEPGAGLFVAQAGGADPDK 0.467457966
## 7             RYDTVCGFCRK 0.002640244
## 8            KLQLELNQEREK 0.057616995
## 9          MYLVVVTSLYENNK 0.555479494
## 10               QDNGGCAK 0.163569477

create_peptidome()

The create_peptidome() function takes the proteome fasta file directory directly as input and returns a dataframe of unique peptides resulting from an in-silico tryptic digestion.

fasta_directory <- "directory/to/human_proteome.fasta"

Ex.1: Default Parameters

A peptidome of unique peptides can be created by providing only a directory to the fasta file. In this case, default parameters will retain peptides resulting from 0-2 missed cleavages and in the range of 6-25 amino acids.

peptidome <- create_peptidome(proteome_dir = fasta_directory)

The resulting peptidome dataframe will appear with the below columns:

##    symbol uniprot missed_cleavages             sequence start end size
## 1  NDUFB6  O95139                0             TGYTPDEK     2   9    8
## 2  NDUFB6  O95139                0              DQELSPR    25  31    7
## 3  NDUFB6  O95139                0             EPVLPPQK    32  39    8
## 4  NDUFB6  O95139                0               MGPMEK    40  45    6
## 5  NDUFB6  O95139                0              MVHGVYK    60  66    7
## 6  NDUFB6  O95139                0 SIFVFTHVLVPVWIIHYYMK    68  87   20
## 7  NDUFB6  O95139                0               YHVSEK    88  93    6
## 8  NDUFB6  O95139                0              PYGIVEK    94 100    7
## 9  NDUFB6  O95139                0   IFPGDTILETGEVIPPMK   104 121   18
## 10 NDUFB6  O95139                0              EFPDQHH   122 128    7

Ex.2: Specified Size and Missed Cleavage Ranges

If desired, the number of missed cleavages and the amino acid range of peptides resulting from in-silico digestion can be specified, which may better suit a database being compared to. With missed_cleavages = 0 only full digested peptides will be in the peptidome.

peptidome <- create_peptidome(proteome_dir = fasta_directory,
                              missed_cleavages = 0,
                              aa_range = c(7,25) )

Ex.3: Synthetic Peptides

If peptides are intended for synthesis, synth_peps can be set to TRUE, which filters out peptides with sequences that are challenging to synthesize. This currently includes peptides with N-term Q and E residues and C-term P residues.

peptidome <- create_peptidome(proteome_dir = fasta_directory,
                              missed_cleavages = c(0,2),
                              aa_range = c(6,20),
                              synth_peps = TRUE)

peptides_inReference()

The peptides_inReference() function adds information from a database or dataset to a peptidome. This is useful for flagging peptides in a list of interest, caclulating detectability ratios for training a RF, or calculating the frequency of peptide appearance in a database for use as a priority.

Ex.1: Flag Transmembrane Peptides

The peptides_inReference() function can be used to flag peptides if they are present in a list. For example, tm_peptides dataframe contains a list of peptides that overlap with transmembrane domains annotated in UniProt. The column of peptides in the pep_reference dataframe must be labelled sequence.

peptidome <- peptides_inReference(peptidome = SwissProt2018_peptidome,
                                  pep_reference = tm_peptides,
                                  reference_name = "transmembrane")

Peptides in the peptidome that are present in tm_peptides are labelled TRUE in the transmembrane column.

##  uniprot                 sequence start  end size transmembrane
##   U3KPV4   IFWRQILLTLGLLGLFLYGLPK    13   34   22          TRUE
##   Q9NPC4               VCTLFIIGFK    20   29   10          TRUE
##   Q9NPC4       FTFFVSIMIYWHVVGEPK    30   47   18          TRUE
##   Q9NPC4     FTFFVSIMIYWHVVGEPKEK    30   49   20          TRUE
##   Q9UNA3 KELQLSLSVTLLLVCGFLYQFTLK     3   26   24          TRUE
##   O95477      SMPLFMTLAWIYSVAVIIK   639  657   19          TRUE
##   O95477                   GIVYEK   658  663    6          TRUE
##   O95477             LSVALAFVGGSK  1041 1052   12          TRUE
##   O95477         VVILDEPTAGVDPYSR  1053 1068   16          TRUE
##   O95477                LNNINDILK  1796 1804    9          TRUE

Ex.2: Prepare Peptidome for RF Training

The RF in PeptideRanger is trained to predict peptide detectability ratios (n observations of peptide/n observations of parent protein). Peptide observations from a database can be merged with a peptidome, and if detection_ratio = TRUE and the column of experiment counts (exp_counts_col) in the pep_reference is specified, detectability ratios and the number of parent protein observations will be calculated.

ProtDB_peptidome <- peptides_inReference(peptidome = SwissProt2018_peptidome,
                                         pep_reference = ProtDB_exp_counts,
                                         reference_name = "ProtDB",
                                         exp_counts_col = "n_obs_pep",
                                         detection_ratio = TRUE)
##  uniprot               sequence ProtDB_n_obs_pep ProtDB_n_obs_prot ProtDB_ratio
##   P04217 PLANVTLTCQAHLETPDFQLFK                0               349   0.00000000
##   P04217       NGVAQEPVHLDSPAIK               56               349   0.16045845
##   P04217           HQFLLTGDTQGR              165               349   0.47277937
##   P04217           SGLSTGWTQLSK              107               349   0.30659026
##   P04217               LLELTGPK              319               349   0.91404011
##   P04217   SLPAPWLSMAPVSWITPGLK               51               349   0.14613181
##   P04217                 TTAVCR               20               349   0.05730659
##   P04217                GVTFLLR              349               349   1.00000000
##   P04217      VTLTCVAPLSGVDFQLR               91               349   0.26074499
##   P04217                 ELLVPR               81               349   0.23209169

Ex.3: Calculate Frequency of Appearance in a Database

If detection_freq = TRUE and exp_counts_col is specified, then the function will merge the number of peptide observations to peptides in the peptidome and calculate the fraction of total experiments in the database that each peptide was observed in.

CPTAC_peptidome <- peptides_inReference(peptidome = SwissProt2018_peptidome_synth,
                                        pep_reference = CPTAC_exp_counts,
                                        reference_name = "CPTAC",
                                        exp_counts_col = "n_obs_pep",
                                        detection_freq = TRUE)
##  uniprot             sequence CPTAC_n_obs_pep CPTAC_freq
##   O95139             TGYTPDEK             334  0.5890653
##   O95139              DQELSPR             255  0.4497354
##   O95139               MGPMEK               0  0.0000000
##   O95139              MVHGVYK             231  0.4074074
##   O95139 SIFVFTHVLVPVWIIHYYMK               0  0.0000000
##   O95139               YHVSEK               0  0.0000000
##   O95139              PYGIVEK              81  0.1428571
##   O95139   IFPGDTILETGEVIPPMK             468  0.8253968
##   O75438             VNLLQIVR             342  0.6031746
##   O75438 DHWVHVLVPMGFVIGCYLDR              98  0.1728395


rr-2/PeptideRanger documentation built on May 27, 2023, 4:43 p.m.