map_peptides: Map peptides to their locations within a protein


View source: R/map_peptides.R

Description

Takes a ThermoFisher MSF file and finds the location of each peptide within its corresponding protein sequence. In cases where a single peptide maps to multiple locations within a protein sequence, only the first location is reported. If a peptide maps ambiguously to multiple proteins, all locations are reported with data from each peptide-protein combination on a separate row.
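
The first-match rule can be illustrated with a minimal base R sketch. The helper `locate_peptide()` and the toy sequence are hypothetical, and the 1-based, inclusive coordinates are an assumption, not a statement of how `map_peptides` is implemented.

```r
# Hypothetical helper: report only the first occurrence of a peptide
# within a protein sequence (assumes 1-based, inclusive positions).
locate_peptide <- function(peptide, protein) {
  # regexpr() returns the position of the first match, or -1 if none
  start <- as.integer(regexpr(peptide, protein, fixed = TRUE))
  if (start == -1L) {
    return(c(start = NA_integer_, end = NA_integer_))
  }
  c(start = start, end = start + nchar(peptide) - 1L)
}

# Toy protein in which "LALFLK" occurs twice (positions 3-8 and 11-16)
toy_protein <- "MKLALFLKAALALFLK"
first_hit <- locate_peptide("LALFLK", toy_protein)
print(first_hit)  # start = 3, end = 8; the second occurrence is ignored
```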

Usage

map_peptides(msf_file, min_conf = "High", prot_regex = "")

Arguments

msf_file

A file path to a ThermoFisher MSF file.

min_conf

One of "High", "Medium", or "Low". The minimum peptide confidence level to retrieve from the MSF file.

prot_regex

A regular expression whose first capture group matches a protein name or ID in the protein description. The regex must contain exactly one group. The protein description is typically generated from the FASTA reference file that was used for the database search.
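
As a rough illustration of what a single-group `prot_regex` might look like, the sketch below applies one to a made-up description string using base R. The description text and the pattern are assumptions chosen for illustration; real descriptions depend on the FASTA file used in the search.

```r
# Hypothetical protein description, as it might appear in a FASTA header
protein_desc <- ">NP_041997.1 example protein description"

# Pattern with exactly one capture group: the RefSeq-style accession
prot_regex <- "^>([A-Z]{2}_[0-9]+\\.[0-9]+)"

# regexec() returns the full match first, then each capture group
matched <- regmatches(protein_desc, regexec(prot_regex, protein_desc))[[1]]
protein_name <- matched[2]
print(protein_name)  # "NP_041997.1"
```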

Value

A data frame containing start and end positions (relative to the parent protein sequence) for each peptide in the database, with the following columns:

peptide_id

a unique peptide ID

spectrum_id

a unique spectrum ID

protein_id

unique protein group ID to which this peptide maps

protein_desc

protein description from reference database used to assign peptides to protein groups, parsed according to prot_regex

peptide_sequence

amino acid sequence (does not show post-translational modifications)

pep_score

posterior error probability (PEP) score

q_value

Q-value score

protein_sequence

parent protein sequence

start

start position of peptide within protein sequence

end

end position of peptide within protein sequence
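
Because each spectrum match gets its own row, the same peptide sequence can appear several times. The sketch below shows one way such a table might be filtered and de-duplicated. It uses a small mock data frame with only three of the columns, and the 0.01 PEP cutoff is an arbitrary assumption.

```r
# Mock data frame standing in for map_peptides() output (subset of columns)
peptides <- data.frame(
  peptide_id = c(27146L, 35484L, 35511L, 50946L),
  peptide_sequence = c("AALTDQVALGK", "ANFQADQIIAK", "ANFQADQIIAK", "LALFLK"),
  pep_score = c(0.000533, 0.0116, 0.000491, 0.00199),
  stringsAsFactors = FALSE
)

# Assumed cutoff: keep rows whose PEP score is below 0.01
max_pep <- 0.01
confident <- peptides[peptides$pep_score < max_pep, ]

# Collapse repeated spectrum matches to distinct peptide sequences
distinct_sequences <- unique(confident$peptide_sequence)
print(distinct_sequences)
```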

Examples

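
A minimal sketch of a call, using the defaults from the Usage section. The path in `msf_path` is hypothetical, so the call is guarded and runs only if the parsemsf package is installed and the file exists.

```r
# Hypothetical path to an MSF file; replace with a real file
msf_path <- "path/to/search_results.msf"

if (requireNamespace("parsemsf", quietly = TRUE) && file.exists(msf_path)) {
  # Arguments mirror the defaults shown in the Usage section
  peptides <- parsemsf::map_peptides(msf_path, min_conf = "High", prot_regex = "")

  # Inspect a few of the columns described in the Value section
  print(head(peptides[, c("peptide_sequence", "protein_desc", "start", "end")]))
}
```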

Example output

Source: local data frame [28 x 10]
Groups: <by row>

# A tibble: 28 x 10
   peptide_id spectrum_id protein_id protein_desc peptide_sequence pep_score
        <int>       <int>      <int> <chr>        <chr>                <dbl>
 1      27146       15646     807657 NP_041997.1  AALTDQVALGK       0.000533
 2      27177       15663     807657 NP_041997.1  AALTDQVALGK       0.000515
 3      35484       20122     807657 NP_041997.1  ANFQADQIIAK       0.0116  
 4      35511       20136     807657 NP_041997.1  ANFQADQIIAK       0.000491
 5      37869       21360     807657 NP_041997.1  TQAAYLAPGENLDDK   0.000128
 6      37913       21384     807657 NP_041997.1  TQAAYLAPGENLDDK   0.000468
 7      38957       21935     807657 NP_041997.1  SAQFPVLGR         0.00419 
 8      40200       22580     807657 NP_041997.1  SAQFPVLGR         0.00115 
 9      50946       28239     807657 NP_041997.1  LALFLK            0.00199 
10      50972       28253     807657 NP_041997.1  LALFLK            0.00474 
# ... with 18 more rows, and 4 more variables: q_value <dbl>,
#   protein_sequence <chr>, start <int>, end <int>

parsemsf documentation built on May 2, 2019, 6:33 a.m.