Introduction

In this vignette we will demonstrate how to use the 'search all, asses subset' strategy (all-sub) to correctly calculated the FDR on a PSM subset of interest in a shotgun proteomic experiment.

In short, the basic all-sub workflow comprises the following steps:

  1. Search the experimental MS2 spectra against all proteins potentially present in the sample.
  2. Remove all PSMs that match a protein you are not interesed in.
  3. Calculate the FDR on the resulting subset of PSMs.

A preprint of the manuscript on the search-all-assess FDR procedure can be found on bioRxiv (https://doi.org/10.1101/094581).

Mass spectrometrists should search for all peptides, but assess only the ones they care about
Adriaan Sticker, Lieven Clement, Lennart Martens [-@Sticker2017]

False discovery rate (FDR) estimation

Let $x$ be the PSM score and assume that larger score values indicate a better match to the theoretical spectrum. The FDR is known to have a Bayesian interpretation upon defining a mixture model for the distribution of $x$ [@Efron2008a]. \begin{equation} f(x) = \pi_0 f_0(x) + (1-\pi_0) f_1(x), \end{equation} with $f(x)$ the target distribution of the test statistic, $f_0(x)$ the mixture component corresponding to incorrect PSMs, $f_1(x)$ the mixture component corresponding to the correct PSMs and $\pi_0$ the fraction of incorrect PSMs.

Based on the mixture distribution the FDR can be defined as the posterior probability \begin{equation} \text{FDR}(x) = Pr[\text{PSM is incorrect}\vert X\geq x]. \end{equation} Let $F(x)$, $F_0(x)$ and $F_1(x)$ indicate the complementary cumulative distribution function (CCDF) \begin{equation} F_{.}(x)=\int\limits_{X=x}^{+\infty} f_{.}(X) dx \end{equation} so \begin{equation} F(x)= \pi_0 F_0(x) + (1-\pi_0) F_1(x) \end{equation} and the FDR becomes \begin{equation} \text{FDR}(x)=\frac{\pi_0 F_0(x) }{F(x)} \end{equation}

The empirical Bayes/two-groups approach also has the great virtue that independence of the scores is not necessary as long as (1) the estimated CCDF of the marginal distribution is roughly unbiased and (2) the null component is estimated empirically [@Efron2008a].

Target-decoy approach

In a competitive target decoy search, FDRs are estimated by dividing the number of accepted decoys PSMs by the number of accepted target PSMs at a certain MS-GF+ score cutoff [@Elias2007]. For FDR estimation in the subset of interest, use only target and decoy PSMs belonging to the subset in the calculation. We assure a monotonic increase of FDR estimates with decreasing score values by replacing the FDR at a particular score $x$ by the minimum TDA FDR in the interval $[-\infty,x]$.

Note, that in a competitive target decoy search the FDR with TDA is estimated by \begin{equation} \widehat{\text{FDR}}(x) =\frac{# decoys | X\geq x}{# targets | X\geq x}, \end{equation} which can be rewritten as \small \begin{eqnarray} \widehat{\text{FDR}}(x) &=&\frac{\frac{#decoys}{#targets}\frac{# decoys | X\geq x}{# decoys}}{\frac{# targets | X\geq x}{# targets}}\ &=&\frac{\hat \pi_0 \bar F_0(x)}{\bar F(x)} \end{eqnarray} with $\bar F_0(x)$ and $\bar F(x)$ the empirical complementary cumulative distribution functions (ECCDFs) estimated using decoy and target PSMs, respectively. So, in a competitive TDA the FDR implicitly relies on the estimator of the proportion of incorrect PSMs among the targets \begin{equation} \hat \pi_0=\frac{#decoys}{#targets}. \end{equation}

Stable target-decoy approach for subsets

In small subsets we typically can expect a low number of decoy PSMs and the $\bar F_0(x)$ will be a poor estimator of $F_0(x)$. For improving the stability of the target-decoy approach in small subsets we propose to estimate the mixture component for incorrect PSMs using a large set of decoys with similar properties as the incorrect subset PSMs. Hence, the method only relies on the assumption that the incorrect target PSM scores follow the same distribution as the PSM scores of the decoy set.

This large set of decoys PSM scores can be designed by the user. However often, the set of decoys in the complete search seems a good candidate set. This allows for the estimation of the mixture component of the incorrect PSMs and the marginal distribution of all PSMs non-parametrically, i.e. based on the ECCDFs $\bar F_0(x)$ using a large decoy set (e.g. of the complete search) and $\bar F(x)$ derived from the targets in the subset, respectively.

The mixture proportion $\pi_0$ can still be estimated based on the ratio of the number of decoys on the number of targets in the subset upon a competitive target decoy search. However, the estimator $\hat \pi_0$ often becomes zero in very small subsets, inducing an FDR estimate of zero for the set of all subset target PSMs. We therefore propose to use the estimator \begin{equation} \hat \pi_0= \begin{cases} \frac{#\text{decoys}\text{subset}+1}{#\text{targets}\text{subset}}, & \text{if}\ #\text{targets} > #\text{decoys}\ 1, & \text{if}\ #\text{targets} \leq #\text{decoys}\ \end{cases} , \end{equation} which has a lower bound of $\frac{1}{\text{targets}_\text{subset}}$ and an upper bound of 1. Note that this estimator has an upward bias in small subsets, which disappears for moderate to large subsets. In the extreme case that we observe only one target PSM, $\hat \pi_0$ is 1 and the FDR estimator reduces to a regular p-value.

Note, that a conservative estimator of the FDR is also available using the link between the Bayesian and the Benjamini-Hochberg (BH) FDR procedure, i.e. by setting the mixture proportion of the mixture component for incorrect PSMs to $\pi_0=1$ [@Efron2008a].

\begin{equation} \widehat{\text{FDR}}(x)=\frac{\bar F_0(x)}{\bar F(x)}. \end{equation} In very small subsets, the latter FDR estimator is still valid as long as a good estimator of the mixture component for incorrect PSMs is available. Benjamini and Hochberg, 1995, for instance, included an example with 15 comparisons in their seminal paper [@BenjaminiHochberg1995]. Also note that their FDR estimator also reduces to a regular p-value in the extreme case that the subset of interest consists of one target PSM, i.e. we set $\pi_0=1$ and $\bar F=1$ for this extreme edge case.

The ECCDF $\bar F_0(x)$ will be zero for every target PSM with a score higher then the maximum decoy PSM score, inducing an FDR estimate of zero for these target PSMs. Decreasing the total number of decoys will lower the maximum decoy PSM score and thus lower the threshold at which the estimated FDR of the target PSMs will be zero. This means when only few decoys PSMs are observed, FDR estimates have high sample-to-sample variability and are biased to lower values. \todo{bias zie je in simulatie bij classical FDR} FDR estimates are too liberal and underestimates the true FDR. Therefore we recommend to have at least 1000 decoys to obtain sufficiently reliable FDR estimates in most situations. Lastly, we assure a monotonic increase of FDR estimates with decreasing score values by replacing the FDR at a particular score $x$ by the minimum TDA FDR in the interval $[-\infty,x]$.

Example data analysis

The data

In this vignette we use data from a Pyrocococus furiosis sample run on a LTQ-Orbitrap Velos mass spectrometer. The data can be found in the PRIDE repository with indentifier PXD001077. The Pyrocococus furiosis reference proteome fasta file was downloaded from Uniprot on April 22, 2016. In this use case we are interested in Pyrococcus proteins related to transferase activity as defined by their Gene Ontology identifier (GO:0016740). A fasta file with these proteins was also downloaded from Uniprot.

Database search

In theory, the all-sub method can be used on the results from any search engine that report peptide spectrum matches (PSMs) with a score. However, this R package only provides a parser for the MS-GF+ MZident output files. Output from other search engines should be parsed and presented in the format of the output from the 'parse_msgf_mzid()' function (see 'help(saas::parse_msgf_mzid)'in R). It should be realtively straightforward to adapt the code from 'parse_msgf_mzid()' to work with other search engines. It's very important that searches are run on a concatenated target-decoy database.

The Pyrococcus dataset was searched against all Pyrococcus proteins with the MS-GF+ Search engine (v2016.10.26). The MS-GF+ parameter settings used are:

 -t 10ppm -ti 0,1 -tda 1 -m 3 -inst 1 -e 1 -protocol 0 -ntt 2 -minLength 6 -maxLength 30 
 -minCharge 2 -maxCharge 4 -n 1

The modification file provide to MS-GF+ is:

NumMods=2

57.021464,C,fix,any,Carbamidomethyl
15.994915,M,opt,any,Oxidation
H1O3P1,STY,opt,any,Phospho

For an explanation on how to use MS-GF+ and an explanation on all parameter settings, please read the MS-GF+ documentation at:

https://omics.pnl.gov/software/ms-gf

The MS-GF+ MZident output file from this search is included in this package for use in subsequent analysis steps.

Reading a MS-GF+ MZident output file.

## Load the saas library
library(saas)
## Load some convenience functions
library(dplyr, quietly = TRUE, warn.conflicts = FALSE) 

## Location of the zipped data files
zip_file_path = system.file("extdata", "extdata.zip", package = "saas")
## Unzip and get the (temporary) location of the mzid file with the MS-GF+ search results.
mzid_file_path = unzip(zip_file_path, 'pyrococcus.mzid',exdir = tempdir())

## Read and parse the mzid file
data = parse_msgf_mzid(mzid_file_path)

glimpse(data)

Preprocessing the MS-GF+ search results.

By default, the 'preprocess()' function removes all PSMs that assigned to both a decoy and target sequence. All rows in the dataframe that belong to the same PSMs (eg. one PSMs that match multiple proteins) are collapsed into one row.

When a path to a fasta file with the protein_ids from a subset of proteins in the fasta headers is provided, a new column is added that indicates if this PSM belongs to this subset of interest. These protein_ids should match the protein_ids from orginal fasta file used in the MS-GF+ search (and as indicated in the protein_id column in the data frame). This information is needed to use the all-sub method.

Optionally, it's also possible to remove PSM's that can be assigned to multiple proteins.

## Unzip and get the (temporary) location of the file with fasta headers containing 
## the protein_ids from the protein subset of interest.
fasta_file_path = unzip(zip_file_path, 'transferase_activity_[GO:0016740].fasta', exdir = tempdir())

## Preprocess the data.
data_prep = preprocess(data, is_subset = fasta_file_path)

glimpse(data_prep)

Evaluating the decoy PSM distribution.

The implementation of the all-sub in this packages adopt the target-decoy approach for FDR estimation. A subset FDR can be obtained by applying TDA on the subset target and decoy PSMs.

However, in very small subsets with little subset target and decoy PSMs, there can be a large sample-to-sample variability on the FDR estimates. We provide extensions to the basic subset TDA FDR to obtain more stable FDR estimates. Key is the use of a large set of decoy PSMs to reliable estimate the distribution of incorrect target PSM scores. By default we use all decoys from the complete search (subset and non-subset PSMs). The user can also define another set of extra decoys and use this for FDR calculation (see example 2).

Bot the classical TDA FDR and our more stable TDA FDR relie on the assumption that the distribution of incorrect target PSM scores can be approximated by the decoy PSM score distribution.

In this package we provide diagnostic plots to verify the assumption that the decoy PSM score distribution follows the incorrect target PSM distribution. Tthe 'score_higher' parameter in 'plot_diag()' indicates if higher score values mean a more confident PSM. We take the MS-GF+ SpecEValue as the PSM score for FDR calculation and smaller SpecEValue indicate a better match. Therefore we have to set the 'score_higher' parameter to 'FALSE'. Note that in the manuscript we used $- log ($MS-GF+ SpecEValue$)$ as the score for visualisation purposes. Please consult the documentation of your search engine of choice on how to interpret the PSM score.

diagnostics = plot_diag(data_prep, score_higher = FALSE)
diagnostics$all

Panel a shows the posterior distribution of pi_0 given the observed number of target and decoy PSMs in the subset. The vertical line indicates the conservative pi_0 estimate used in the calculations. At very high $\pi_0$ uncertainty (broad peak), you can also opt to use the Benjamine Hochberg procedure to minimize sample to sample variability which would mean that pi_0 would be set to 1 (see FDR_BH in the next section). However, this will come at the expense of too conservative PSM lists.

The distributional assumption for the decoys can be verified through a PP-plot where the empirical Cumulative Distribution Function (eCDF) of the decoys is plotted against the eCDF of the subset target PSMs. The PP-plots in panel b - d display the target subset PSMs plotted against all decoy PSMs from the complete search, the decoy subset PSMs plotted against all decoy PSMs from the complete search, and the target subset PSMs plotted against the decoy PSMs from the complete search, respectively. The full line in panel b and d indicates a line with a slope of $\pi_0$. The full line in panel c indicates the identity line. When the distributional assumption holds then the first part of the plot in b and d should be linear with a slope that equals $\pi_0$. The second part of the plot deviates from the line towards higher percentiles and will ultimately become vertical (decoy percentile = 1). If we see this profile in panel b, we have a good indication that the set of decoys from the complete search is representative for the mixture component for incorrect PSMs of the target mixture distribution. Deviations from this pattern might be subtle, therefore we provide the PP plots in c and d to support the conclusion drawn from panel b. When there is high uncertainty on pi_0 as indicated by a, then the linear pattern in the data points might deviate from the drawn solid line, but should still be more or less linear. The PP-plot in panel c shows the subset decoy PSMs plotted against all decoy PSMs. The whole plot should follow the identity line, indicating that the complete set of decoys is a good representation of the subset decoys. To verify that the subset decoys are representative for the mixture component for incorrect PSMs of the target mixture distribution, we look at the PP-plot of the subset decoys against the subset targets in panel d. The profile should look as described for panel b. If the profile matches in panel d but does not for panel b, then we suggest to not use the extra decoy set and use only the subset decoys for FDR estimation. When the profile does not match in panel d, the subset decoys might not be representative for incorrect PSMs. This can indicate that $\pi_0$ is estimated incorrectly, since this is based on the subset PSM scores. In this case, the first part of the plot in panel d can deviate from the (incorrect) $\pi_0$ slope line. But if this first part is linear, it still indicates that the extra set of decoys is representative for the mixture component of incorrect target PSMs. Since $\pi_0$ is unknow in this case we set $\pi_0$ to 1 (see FDR_BH in the next section).

We can conclude from the above diagnostic plots that the large decoy set is an appropiate candiate set to estimate the incorrect subset target PSM distribution.

Estimating FDR on the subset of interest.

The output from 'preprocess()' can be immediatly used to estimate the FDR in the PSM subset of interest.

data_results = calculate_fdr(data_prep, score_higher = FALSE)
glimpse(data_results)

We can choose in the output from three different all-sub FDR estimations: The estimated stable FDR at this score cutoff for subset PSMs.

  1. FDR: Calculated according the classical TDA method. Does not work well small subsets.
  2. FDR_stable: Our improved FDR method which is more stable in small subsets. For large subsets, FDR_stable estimates will be close to FDR estimates.
  3. FDR_BH: Bejamini Hochberg FDR procedure. Use this when you have a large decoy set but no decoy information on the subset PSMs (e.g. when the search engine does not return decoy protein ids). This FDR estimate is more conservative then FDR and FDR_stable.

The 'transferase activity' subset is rather small with 104 target and 37 decoy PSMs.

count(data_results, subset, decoy)

From the diagnostic plots we concluded that the set of decoy PSMs from the complete search is a good candidate set to approximate the incorrect target PSM distribution.

So, in this case it's recommended to use the more stable FDR estimates in FDR_stable.

results_1_FDR = filter(data_results, subset, !decoy, FDR_stable >= .01) 
glimpse(results_1_FDR)

At 1\% FDR, we return 31 PSMs.

Example 2: Provide an external set of decoys {#example2}

sessionInfo()

sessionInfo()


compomics/search-all-assess-subset documentation built on May 13, 2019, 9:55 p.m.