In vladpetyuk/PlexedPiper: Pipeline for isobaric quantification.

knitr::opts_chunk$set(message=FALSE, warning=FALSE)
# knitr::opts_chunk$set(echo=T, message=F, warning=F, fig.align='center', out.width='10cm')

PlexedPiper is the main library for the pipeline. MSnID package is a flexible way of handling MS/MS identifications. It handles MS/MS data filtering part.

library(PlexedPiper)
library(MSnID)

Data source

This pipeline shows how to process TMT data that is processed outside of PNNL DMS. This implies that raw files processed by MSGF+ and MASIC tools and the processing results are in corresponding directories. For convienience the results of MSGF+ and MASIC processing are provided in a companion PlexedPiperTestData package.

Prepare MS/MS IDs

Read the MS-GF+ output

This simply reads parsed to text output of MS-GF+ search engine. The text files are collated together and the resulting data.frame used to create MSnID object.

path_to_MSGF_results <- system.file(
   "extdata/global/msgf_output",
   package = "PlexedPiperTestData")
msnid <- read_msgf_data(path_to_MSGF_results)
show(msnid)

This MSnID methods corrects for isotope selection error. Occasionally, instrument selects not the lightest or monoisotopic peak, but a peak with +1 or more C13. While MS-GF+ is still capable of correctly identifying those, the downstream calculations of mass measurement error need to be fixed. This method corrects mass measurement errors accounting for selection of other than monoisotopic peak.

msnid <- correct_peak_selection(msnid)

MS/MS ID filter and peptide level

Main MS/MS filtering step. The filter_msgf_data_peptide_level function wraps-up MSnID::optimize_filter method and uses PepQValue from MS-GF+ output and absolute deviation of mass measurement error of parent ions (in ppm).

show(msnid)
msnid <- filter_msgf_data_peptide_level(msnid, 0.01)
show(msnid)

Switching annotation from RefSeq to gene symbols

This RefSeq to gene symbols conversion step can be skipped if the downstream analysis relies on RefSeq IDs.

RefSeq is a very redundant database and contains lots of tentative splice isoforms and unconfirmed ORFs. As a result, a lot of peptides (~60%) are shared across multiple RefSeq accessions. This adds to uncertainity which exactly protein we are quantifying. One solution is to switch to less redundant annotations like gene symbol. Moreover gene symbols are more human-interpretable and quite commonly used in downstream analysis.

msnid <- remap_accessions_refseq_to_gene(msnid, 
                                         organism_name="Rattus norvegicus")

This step converts sequences names in the fasta file used for MS/MS searches from RefSeq to gene symbols. This is necessary if we are going to count the number of peptide identifications per protein or protein sequence coverage with accessions switched from RefSeq to gene symbols.

path_to_FASTA <- system.file(
   "extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz",
   package = "PlexedPiperTestData")
temp_dir <- tempdir()
file.copy(path_to_FASTA, temp_dir)
path_to_FASTA <- file.path(temp_dir, basename(path_to_FASTA))
path_to_FASTA_gene <- remap_accessions_refseq_to_gene_fasta(
   path_to_FASTA, organism_name="Rattus norvegicus")

MS/MS ID filter at protein level

A while ago proteomics field established hard-and-fast two peptide per protein rule. That is if a protein identified with one peptide it is not considered a confident identification. It works OK, but there is a room for improvement. For example this rule really penalizes short protens. Then there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in reversed sequence. Thus, here we propose to normalize number of peptides per protein length and use that as a filtering criterion.

msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA_gene)
show(msnid)
msnid <- filter_msgf_data_protein_level(msnid, 0.01)
show(msnid)

Inference of parsimonious protein set

The situation when sequence of a certain peptide matches multiple proteins adds complication to the downstream quantitative analysis as it is not clear which protein product this peptide is originating from. There are common ways for selecting a protein set. One is simply retain only uniquely matching peptides (unique_only=TRUE). Shared peptides are simply discarded in this case. Alternatively (in case of unique_only=FALSE) the shared peptides assigned to the proteins with larger number of uniquely mapping peptides. If there is a choice between multiple proteins with equal number of uniquely mapping peptides, the shared peptides is assigned to the first protein according to alphanumeric order.

msnid <- infer_parsimonious_accessions(msnid, unique_only=TRUE)
show(msnid)

Remove decoy accessions

msnid <- apply_filter(msnid, "!isDecoy")

Prepare reporter ion intensities

Read MASIC ouput

MASIC is a tool for extracting ion intensities. With proper parameter settings it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition it reporter a number of other helpful metrics. Of the most interest are interference score of the parent ion level and S/N at reporter ion level. Interference score reflects what proportion of ion population isolated for fragmentation is due to the targeted ion. The other words 1 - InterferenceScore is due to co-isolated species that have similar elution time and parent ion m/z. S/N at reporter level is the metric that computed by Thermo software.

path_to_MASIC_results <- system.file("extdata/global/masic_output", package = "PlexedPiperTestData")
masic_data <- read_masic_data(path_to_MASIC_results, interference_score=TRUE)
head(masic_data)

Filtering MASIC data

In this case (currently recommended filters) we require that at least 50% of ion population is due to the targeted ion and no filter at S/N level.

nrow(masic_data)
masic_data <- filter_masic_data(masic_data, 0.5, 0)
nrow(masic_data)

Creating cross-tab

Laying out study desing

To convert from PSMs and reporter ion intensities to meaningful quantitative data it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study desing captured by three tables - fractions, samples, references.

Fractions

This links dataset to original plex.

library(readr)
fractions <- read_tsv(system.file("extdata/study_design/fractions.txt", package = "PlexedPiperTestData"))
head(fractions)

Samples

This links plex ID, reporter channel with samples. There is a minor difference between ReporterAlias and MeasurementName. MeasurementName is the sample name that will be finally reported. ReporterAlias is an itermediate between ReporterName and MeasurementName and is used for defining what the reference is.

samples <- read_tsv(system.file("extdata/study_design/samples.txt", package = "PlexedPiperTestData"))
head(samples,10)

References

Reference can be a certain channel, average of multiple channels or 1. In general form it is an expression with ReporterAlias names as variables. It is evaluated for each PlexID/QuantBlock combination and applied to divide reporter ion intensities within corresponding PlexID/QuantBlock. QuantBlock can be though of as a way of defining sub-plex. In a typical TMT experiment QuantBlock is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1..5) with reference for each QuantBlock.

references <- read_tsv(system.file("extdata/study_design/references.txt", package = "PlexedPiperTestData"))
head(references)

Creating quantitative cross-tab

Final step when MS/MS IDs and reporter ions linked together, aggregated down to peptide or accession (i.e. protein) level. To retain protein IDs while aggregating to peptide level we suggest aggregation_level <- c("accession","peptide")

aggregation_level <- c("accession")
quant_cross_tab <- create_crosstab(msnid, 
                                   masic_data, 
                                   aggregation_level, 
                                   fractions, samples, references)
dim(quant_cross_tab)
head(quant_cross_tab)

unlink(".Rcache", recursive=TRUE)

vladpetyuk/PlexedPiper documentation built on June 24, 2021, 8:59 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

vladpetyuk/PlexedPiper
Pipeline for isobaric quantification.

In vladpetyuk/PlexedPiper: Pipeline for isobaric quantification.

Data source

Prepare MS/MS IDs

Read the MS-GF+ output

MS/MS ID filter and peptide level

Switching annotation from RefSeq to gene symbols

MS/MS ID filter at protein level

Inference of parsimonious protein set

Remove decoy accessions

Prepare reporter ion intensities

Read MASIC ouput

Filtering MASIC data

Creating cross-tab

Laying out study desing

Fractions

Samples

References

Creating quantitative cross-tab

R Package Documentation

Browse R Packages

We want your feedback!

vladpetyuk/PlexedPiper Pipeline for isobaric quantification.

In vladpetyuk/PlexedPiper: Pipeline for isobaric quantification.

Data source

Prepare MS/MS IDs

Read the MS-GF+ output

MS/MS ID filter and peptide level

Switching annotation from RefSeq to gene symbols

MS/MS ID filter at protein level

Inference of parsimonious protein set

Remove decoy accessions

Prepare reporter ion intensities

Read MASIC ouput

Filtering MASIC data

Creating cross-tab

Laying out study desing

Fractions

Samples

References

Creating quantitative cross-tab

R Package Documentation

Browse R Packages

We want your feedback!

vladpetyuk/PlexedPiper
Pipeline for isobaric quantification.