In protViz/prozor: Minimal Protein Set Explaining Peptide Spectrum Matches

knitr::opts_chunk$set(echo = TRUE, message = FALSE)

Introduction

This Vignette describes how the prozor::greedy function can be used to infer proteins form peptide identifications using the Occam's razor principle. The method tries to find a minimal set of porteins which can explain all the peptides identified.

library(prozor)

rm(list = ls())

file = system.file("extdata/IDResults.txt.gz" , package = "prozor")
specMeta <- readr::read_tsv(file)
head(specMeta) |> knitr::kable()

Annotate peptide sequences with protein sequences from two fasta files on with reviewed entries only (sp) and the other with reviewed and Trembl entries (sp/tr).

resAll <-
    prozor::readPeptideFasta(
        system.file("p1000_db1_example/Annotation_allSeq.fasta.gz" , package = "prozor"))

resCan <-
    prozor::readPeptideFasta(
        system.file("p1000_db1_example/Annotation_canSeq.fasta.gz" , package = "prozor"))

barplot(c(All = length(resAll), Canonical = length(resCan)))

length(unique(specMeta$peptideSeq))
upeptide <- unique(specMeta$peptideSeq)
debug( prozor::annotatePeptides)
annotAll <- prozor::annotatePeptides(upeptide, resAll)
head(subset(annotAll,select =  -proteinSequence)) |> knitr::kable()

annotCan <- prozor::annotatePeptides(upeptide, resCan)

barplot(c(All = nrow(annotAll), Canonical = nrow(annotCan)), ylab = "# peptide protein matches.")

We can see that using the larger fasta database reduces the proportion of proteotypic peptides. A proteotypic peptide is one which matches only a single protein.

PCProteotypic_all <-
    sum(table(annotAll$peptideSeq) == 1) / length(table(annotAll$peptideSeq)) * 100
PCProteotypic_canonical <-
    sum(table(annotCan$peptideSeq) == 1) / length(table(annotCan$peptideSeq)) * 100

barplot(
    c(All = PCProteotypic_all, Canonical =  PCProteotypic_canonical),
    las = 2,
    ylab = "% proteotypic"
)

Protein Inference

We can now identify a minimal set of proteins explaining all the peptides observed for both databases

library(Matrix)
precursors <-
    unique(subset(specMeta, select = c(
        peptideModSeq, precursorCharge, peptideSeq
    )))

Protein Inference for database with Trembl identifiers

library(Matrix)
annotatedPrecursors <- merge(precursors ,
                             subset(annotAll, select = c(peptideSeq, proteinID)),
                             by.x = "peptideSeq",
                             by.y = "peptideSeq")


xx <-
    prepareMatrix(annotatedPrecursors,
                  proteinID = "proteinID",
                  peptideID = "peptideSeq")

image(xx)
dim(xx)
xx[1:10, 1:100]
xxAll <- greedy_parsimony(xx)

Protein Inference for Reviewed/Canonical database

annotatedPrecursors <-
    merge(precursors ,
          subset(annotCan, select = c(peptideSeq, proteinID)),
          by.x = "peptideSeq",
          by.y = "peptideSeq")


xx <-
    prepareMatrix(annotatedPrecursors ,
                  proteinID = "proteinID",
                  peptideID = "peptideSeq")
image(xx)
xx[80:100,1:10]
xxCAN <- greedy_parsimony(xx)

Conclusion

We see that the number of proteins needed to explain all the peptides is practically identical for both databases. Also in practice using a database with more entries does not lead to more identified proteins. On the contrary, it might even reduce the number of porteins identified.

barplot(c(All_before = length(unique(annotAll$proteinID)), All_after = length(unique(unlist(
    xxAll
))) , Canonical_before =  length(unique(annotCan$proteinID)), Canonical_after = length(unique(unlist(
    xxCAN
)))))