Introduction to ParseMSF

References:

[@silva] Silva JC, Gorenstein MV, Li GZ, Vissers JP, Geromanos SJ. Absolute quantification of proteins by LCMSE: A Virtue of parallel MS acquisition. Mol Cell Proteomics. 2006 Jan;5(1):144-56. DOI: 10.1074/mcp.M500230-MCP200. https://doi.org/10.1074/mcp.M500230-MCP200

Loading peptide information from a ThermoFisher MSF

The ParseMSF package provides several functions for inspecting ThermoFisher MSF files. The most useful of these functions is make_area_table, which constructs a data frame containing all peptides and their corresponding peak areas. This data frame also includes protein information (protein_desc) for each peptide.

NOTE: Only ThermoFisher MSF files generated by Proteome Discoverer 1.4.x are supported. Using ParseMSF functions with a file produced by any other version of Proteome Discoverer may produce unexpected results.

library(parsemsf)

# Replace `parsemsf_example("test_db.msf")` with the path to a ThermoFisher MSF file
area_table <- make_area_table(parsemsf_example("test_db.msf"))
knitr::kable(head(area_table))

See the documentation for make_area_table for a description of each column.

Estimating protein abundances

The peak area information stored in one or more ThermoFisher MSF files can be used to estimate protein abundances. The quantitate function estimates these abundances across one or more technical replicates. Technical replicates are typically different mass spectrometry injections of the same biological sample. The quantitate function will produce more accurate protein abundance estimates if it is provided with multiple technical replicates.

# Replace each `parsemsf_example(...)` call with the path to a ThermoFisher MSF file
# (one file per technical replicate)
abundances <- quantitate(c(parsemsf_example("test_db.msf"), 
                           parsemsf_example("test_db2.msf")))

knitr::kable(head(abundances))

Abundances are estimated by averaging the peak areas of the three most abundant peptides (area_mean) [@silva]. If provided with multiple technical replicates, quantitate will, by default, estimate protein abundances by matching peptides across technical replicates. That is, it will only average areas from peptides that are present in both technical replicates. The number of unique peptides used to estimate the protein abundances is given by peps_per_rep.
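To make the top-three idea concrete, here is a minimal base R sketch on made-up data. It is not the code that quantitate runs. The column names tech_rep and area, the peptide sequences, and the order of operations (match peptides first, average each matched peptide across replicates, then average the three largest values) are all assumptions chosen for illustration.

```r
# Toy peak areas for one protein in two technical replicates.
# All values and the column names `tech_rep` and `area` are hypothetical.
areas <- data.frame(
  protein_desc = "protein A",
  peptide_sequence = c("AAK", "CCR", "DDK", "EEK", "AAK", "CCR", "DDK", "FFR"),
  tech_rep = c(1, 1, 1, 1, 2, 2, 2, 2),
  area = c(100, 80, 60, 40, 110, 70, 50, 90)
)

# Keep only peptides that were observed in every technical replicate.
n_reps <- length(unique(areas$tech_rep))
reps_per_peptide <- tapply(areas$tech_rep, areas$peptide_sequence,
                           function(x) length(unique(x)))
shared <- names(reps_per_peptide)[reps_per_peptide == n_reps]
matched <- areas[areas$peptide_sequence %in% shared, ]

# Average each shared peptide across replicates (an assumed order of
# operations), then take the mean of the three largest averaged areas.
peptide_means <- tapply(matched$area, matched$peptide_sequence, mean)
top3 <- head(sort(peptide_means, decreasing = TRUE), 3)
area_mean <- mean(top3)
area_mean
```

In this toy data, the peptides EEK and FFR each appear in only one replicate, so they are excluded before averaging even though FFR has a large area in the second replicate.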

Protein abundances can also be estimated from a single ThermoFisher MSF file.

# Replace `parsemsf_example("test_db.msf")` with the path to a ThermoFisher MSF file
abundances <- quantitate(parsemsf_example("test_db.msf"))

knitr::kable(head(abundances))

Inspecting distribution of peptides within a protein

The ParseMSF package includes a function for inspecting the distribution of peptides within a single protein. The map_peptides function produces a data frame of peptides with their respective locations within the protein sequence.

peptide_locs <- map_peptides(parsemsf_example("test_db.msf"))

# Select columns with start and end locations
peptide_locs <- peptide_locs[c("peptide_id", "protein_desc", 
                               "peptide_sequence", "start", "end")]

knitr::kable(head(peptide_locs))
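As an aside, the following base R sketch shows one way that start and end positions of this kind could be derived by exact string matching. It is only an illustration under stated assumptions: the protein sequence, the peptides, and the toy_locs data frame are hypothetical, and map_peptides may compute its locations differently.

```r
# Hypothetical protein sequence and peptides (not taken from any MSF file).
protein_sequence <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptides <- c("TAYIAK", "QISFVK", "LGLIEVQ")

# regexpr() with fixed = TRUE returns the 1-based position of the first
# exact match of each peptide, or -1 if the peptide is not found.
starts <- vapply(peptides, function(p) {
  as.integer(regexpr(p, protein_sequence, fixed = TRUE))
}, integer(1), USE.NAMES = FALSE)

# Mark unmatched peptides as missing instead of keeping the -1 sentinel.
starts[starts < 0] <- NA_integer_

toy_locs <- data.frame(
  peptide_sequence = peptides,
  start = starts,
  end = starts + nchar(peptides) - 1L
)
toy_locs
```

Note that this sketch reports only the first occurrence of each peptide, so a peptide that repeats within a sequence would need extra handling.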

We can plot the peptide locations stored in peptide_locs with the ggplot2 and dplyr packages.

library(ggplot2)
library(dplyr)

# Count how many peptide entries share each (start, end) position
peptide_summary <- peptide_locs %>% 
  group_by(start, end) %>%
  summarize(spectral_count = n())

pep_plot <- ggplot(peptide_summary,
       aes(x = start, xend = end, y = spectral_count, yend = spectral_count)) +
  geom_segment(size = 1) +
  ylim(0, 5) + 
  xlab("peptide position within protein") +
  ylab("peptide count")

pep_plot