knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE )

Here we describe how you can estimate the false discovery rate (FDR) when matching your spectra against a FASTA sequence database using spectra search engines such as Mascot or Comet. The sequence database contains forward sequences and the same number of reverse sequences.

The data needs to have two columns, one with a label, to distinguish targets and decoys, the other with a score. Then you also need to know if a larger score or a smaller score is better.
In our example data, we have two scores: `score`

and `score2`

. While the `score`

is better if it smaller (e.g, posterior erro probability), `score2`

is better if it is larger.

x <- seq(0,10,by = 0.1) y1 <- dnorm(x,mean = 3, sd = 1) y2 <- dnorm(x,mean = 5, sd = 1) plot(x, y1, type = "l", xlab = 'score') lines(x, y2, col = 3) abline(v = 3.5, col = 2) text(c(2,4.2) , y = c(.2,.2), labels = c("TN","FP")) text(c(3,7) , y = c(.1,.1), labels = c("FN","TP"), col = 3) legend("topright", legend = c("Forward", "Decoys"), lty = c(1,1), col = c(3,1))

rm(list = ls()) library(dplyr) library(prozor) data(fdrSample) x <- dplyr::arrange(fdrSample, score2) knitr::kable(head(fdrSample))

What is also required is that the number of Targets and Decoys are the same. This is given for mass spectrometry database searches. Our dataset here is already truncated at a 5% FDR. Therefore the number of decoys is much smaller.

table(grepl("REV_",fdrSample$proteinID))

In our example, we will use the package to filter the data further for a 1% FDR.

Computing the FDR can be done by calling the function `computeFDR`

. `plotFDR`

than shows the score distribution for the targets (black) and decoys (red) as well as the FDR for the two scores implemented (x axis).

fdr1 <- computeFDR(fdrSample$score, grepl("REV_",fdrSample$proteinID),larger_better = FALSE) plotFDR(fdr1) legend("topright", legend = c("FPR", "qValue/FDR"), lty = c(1,1), col = c(4,3))

The output is a named list which can be easily converted into a data frame. We next will briefly discuss the elements of the output.

knitr::kable(head(data.frame(fdr1)))

We define

* false postivies (FP) as the number of passing decoy assignments * true positives (TP) as the number of passing forward hits.

Which is the "Fraction of incorrect assignments above score threshold". The multiplier 2 is needed here since we assume that also the forward sequences have false assignments.

$$ FPR = \frac{2 \cdot FP}{TP + FP} $$

This is taken from the reference by [@Elias2007]. [@Storey2003] defines FPR differently. Kaell points out that the FPR here actually should be named FDR and that "Many proteomics papers incorrectly refer to this quantity as the *false positive rate*."

"The FDR associated with a particular score threshold is defined as the expected
percentage of accepted PSMs that are incorrect, where an
*accepted PSM* is one that scores above the threshold (Many
proteomics papers incorrectly refer to this quantity as the *false positive rate*)." ([@Kaell2007])

The Simple FDR intrudced by ([@Kaell2007]) is defined by :

$$ SimpleFDR = \frac{FP}{TP} $$

"For a given score threshold, we count the number of decoy PSMs above the threshold and the number of target PSMs above the threshold. We can now estimate the FDR by simply computing the ratio of these two values (SimpleFDR)."[@Kaell2007]

plot(fdr1$score, fdr1$SimpleFDR, type = "l", xlim = c(0,0.002), ylim = c(0,0.0005)) lines(fdr1$score, fdr1$qValue_SimpleFDR, col = 3, type = "l", xlim = c(0,0.002), ylim = c(-0.002,0)) legend("topleft", legend = c("FDR", "qValue"), lty = c(1,1), col = c(1,3))

Although the score is getting better (smaller) the FDR may increase since the number of TP in the denominator decreases while the number of FP stays the same. Therefore Storey and Tibshirani proposed the *q-value*, "which in our case is defined as the minimal FDR threshold at which a given PSM is accepted"[@Kaell2007].

Most frequently you will need to get the score for an FDR in order to filter your data. To report your data with an FDR of 1% instead of 5% you can execute this code:

(score01G <- predictScoreFDR(fdr1,qValue = 5,method = "FPR")) dim(dplyr::filter(fdrSample, score < score01G)) (score01G <- predictScoreFDR(fdr1,qValue = 1,method = "FPR")) dim(dplyr::filter(fdrSample, score < score01G)) (score01K <- predictScoreFDR(fdr1,qValue = 1,method = "SimpleFDR")) dim(dplyr::filter(fdrSample, score < score01K))

Since the scores are sorted to compute the FDR, we return also the *order* column.
This column can be used to align the ID's with the scores.

knitr::kable(head(data.frame(ID = fdrSample$proteinID[fdr1$order], fdr1)))

For convenience we provide the function `computeFDRwithID`

which integrates the reordering of the ID's.

fdr1 <- computeFDRwithID(fdrSample$score,fdrSample$proteinID, decoy = "REV_",larger_better = FALSE) knitr::kable(head(data.frame(fdr1)))

```
sessionInfo()
```

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.