BiocStyle::markdown()

Package: r Biocpkg("PSMatch")
Authors: r packageDescription("PSMatch")[["Author"]]
Last modified: r file.info("PSM.Rmd")$mtime
Compiled: r date()

library("PSMatch")

Installation instructions

To install the package from Bioconductor, make sure you have the BiocManager package, available from CRAN, and then run

BiocManager::install("PSMatch")

Introduction

This vignette is one among several illustrating how to use the PSMatch package, focusing on the handling and processing of proteomics identification data using the PSM class. For a general overview of the package, see the PSMatch package manual page (?PSMatch) and references therein.

Handling and processing identification data

Loading PSM data

We are going to use an mzid file from the msdata package.

f <- msdata::ident(full.names = TRUE, pattern = "TMT")
basename(f)

The PSM() function parses one or multiple mzid files and returns an object of class PSM. This class is a simple extension of the DFrame class, representing the peptide-spectrum matches in a tabular format.

library("PSMatch")
id <- PSM(f)
id
n_matches <- nrow(id)
n_scans <- length(unique(id$spectrumID))
n_seqs <- length(unique(id$sequence))
n_cols <- ncol(id)

This table contains r n_matches matches for r n_scans scans and r n_seqs peptides sequences, each annotated by r n_cols variables.

nrow(id) ## number of matches
length(unique(id$spectrumID)) ## number of scans
length(unique(id$sequence))   ## number of peptide sequences
names(id)

The PSM data are read as is, without any filtering. As we can see below, we still have all the hits from the forward and reverse (decoy) databases.

table(id$isDecoy)

Keeping all matches

The data also contains multiple matches for several spectra. The table below shows the number of individual MS scans that have 1, 2, ... up to 5 matches.

table(table(id$spectrumID))

More specifically, we can see below how scan 1774 has 4 matches, all to sequence RTRYQAEVR, which itself matches to 4 different proteins:

i <- grep("scan=1774", id$spectrumID)
id[i, ]
id[i, "DatabaseAccess"]

If the goal is to keep all the matches, but arranged by scan/spectrum, one can reduce the DataFrame object by the spectrumID variable, so that each scan correponds to a single row that still stores all values[^rownames]:

[^rownames]: The rownames aren't needed here and are removed to reduce to output in the the next code chunks displaying parts of id2.

id2 <- reducePSMs(id, id$spectrumID)
rownames(id2) <- NULL ## rownames not needed here
dim(id2)

The resulting object contains a single entrie for scan 1774 with information for the multiple matches stored as a list within the table cell.

j <- grep("scan=1774", id2$spectrumID)
id2[j, ]
id2[j, "DatabaseAccess"]

The identification data could be used to annotate an raw mass spectrometry Spectra object (see the Spectra::joinSpectraData() function for details).

Filtering data

Often, the PSM data is filtered to only retain reliable matches. The MSnID package can be used to set thresholds to attain user-defined PSM, peptide or protein-level FDRs. Here, we will simply filter out wrong or the least reliable identifications.

Remove decoy hits

id <- filterPsmDecoy(id)
id

Keep first rank matches

id <- filterPsmRank(id)
id

Remove shared peptides

The data still contains shared peptides, i.e. those that different proteins. For example QKTRCATRAFKANKGRAR matches proteins ECA2869 and ECA4278.

id[id$sequence == "QKTRCATRAFKANKGRAR", "DatabaseAccess"]

We can filter these out to retain unique peptides.

id <- filterPsmShared(id)
id

This last filtering leaves us with r nrow(id) PSMs.

Note that the ConnectedComponents approach defined in this package allows one to explore protein groups defined by such shared peptides more thoroughly and make informed decision as to which shared peptides to retain and which ones to drop. For details see the related vignette or the ConnectedComponents manual page.

All filters in one function

This can also be achieved with the filterPSMs() function:

id <- PSM(f)
filterPSMs(id)

The mzR and mzID parsers

The PSM() function can take two different values for the parser parameter, namely "mzR" (the default value) and "mzID".

system.time(id1 <- PSM(f, parser = "mzR"))
system.time(id2 <- PSM(f, parser = "mzID"))

Other differences in the two parsers include the columns that are returned, the way they name them, and, as will shown below the matches that are returned. Note for instance (and this will be important later), that there is no equivalent of "modLocation" in id2.

names(id1)
names(id2)

We also have different number of matches in the two tables:

nrow(id1)
nrow(id2)
table(id1$isDecoy)
table(id2$isdecoy)

Let's first filter the PSM tables to facilitate focus the comparison of relevant scans. Note that the default filterPSMs() arguments are set to work with both parser.

id1_filtered <- filterPSMs(id1)

id2_filtered <- filterPSMs(id2)

As can be seen, we are also left with r nrow(id1_filtered) vs r nrow(id2_filtered) PSMs after filtering.

The difference doesn't stem from different scans, given that the spectum identifiers are identical in both tables:

identical(sort(unique(id1_filtered$spectrumID)),
          sort(unique(id2_filtered$spectrumid)))

The difference is obvious when we tally a table of spectrum id occurences in the filtered tables. In id2_filtered, each scan is unique, i.e matched only once.

anyDuplicated(id2_filtered$spectrumid)

However, for id1_filtered, we see that some scans are still repeat up to 4 times in the table:

table(table(id1_filtered$spectrumID))

The example below shows that these differences stem from the modification location ("modLocation"), that is not report by the mzID parser:

k <- names(which(table(id1_filtered$spectrumID) == 4))
id1_filtered[id1_filtered$spectrumID == k, "sequence"]
id1_filtered[id1_filtered$spectrumID == k, "modLocation"]
id1_filtered[id1_filtered$spectrumID == k, "modName"]

If we remove the "modLocation" column, we recoved the same number of PSMs than with the mzID parser.

id1_filtered$modLocation <- NULL
nrow(unique(id1_filtered))
nrow(unique(id2_filtered))

Session information

sessionInfo()


rformassspectrometry/PSM documentation built on March 20, 2024, 9:16 a.m.