knitr::opts_chunk$set(message=FALSE, warning=FALSE) # knitr::opts_chunk$set(echo=T, message=F, warning=F, fig.align='center', out.width='10cm')
PlexedPiper
is the main library for the pipeline. MSnID
package is a
flexible way of handling MS/MS identifications. It handles MS/MS data
filtering part.
library(PlexedPiper) library(MSnID)
This pipeline shows how to process TMT data that is processed outside of PNNL DMS.
This implies that raw files processed by MSGF+ and MASIC tools and the
processing results are in corresponding directories. For convienience the results
of MSGF+ and MASIC processing are provided in a companion PlexedPiperTestData
package.
This simply reads parsed to text output of MS-GF+ search engine. The text files
are collated together and the resulting data.frame
used to create MSnID object.
path_to_MSGF_results <- system.file( "extdata/global/msgf_output", package = "PlexedPiperTestData") msnid <- read_msgf_data(path_to_MSGF_results) show(msnid)
This MSnID
methods corrects for isotope selection error. Occasionally,
instrument selects not the lightest or monoisotopic peak, but a peak
with +1 or more C13. While MS-GF+ is still capable of correctly identifying
those, the downstream calculations of mass measurement error need to be fixed.
This method corrects mass measurement errors accounting for selection of
other than monoisotopic peak.
msnid <- correct_peak_selection(msnid)
Main MS/MS filtering step. The filter_msgf_data_peptide_level
function
wraps-up MSnID::optimize_filter
method and uses PepQValue
from MS-GF+ output
and absolute deviation of mass measurement error of parent ions (in ppm).
show(msnid) msnid <- filter_msgf_data_peptide_level(msnid, 0.01) show(msnid)
This RefSeq to gene symbols conversion step can be skipped if the downstream analysis relies on RefSeq IDs.
RefSeq is a very redundant database and contains lots of tentative splice isoforms and unconfirmed ORFs. As a result, a lot of peptides (~60%) are shared across multiple RefSeq accessions. This adds to uncertainity which exactly protein we are quantifying. One solution is to switch to less redundant annotations like gene symbol. Moreover gene symbols are more human-interpretable and quite commonly used in downstream analysis.
msnid <- remap_accessions_refseq_to_gene(msnid, organism_name="Rattus norvegicus")
This step converts sequences names in the fasta file used for MS/MS searches from RefSeq to gene symbols. This is necessary if we are going to count the number of peptide identifications per protein or protein sequence coverage with accessions switched from RefSeq to gene symbols.
path_to_FASTA <- system.file( "extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz", package = "PlexedPiperTestData") temp_dir <- tempdir() file.copy(path_to_FASTA, temp_dir) path_to_FASTA <- file.path(temp_dir, basename(path_to_FASTA)) path_to_FASTA_gene <- remap_accessions_refseq_to_gene_fasta( path_to_FASTA, organism_name="Rattus norvegicus")
A while ago proteomics field established hard-and-fast two peptide per protein rule. That is if a protein identified with one peptide it is not considered a confident identification. It works OK, but there is a room for improvement. For example this rule really penalizes short protens. Then there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in reversed sequence. Thus, here we propose to normalize number of peptides per protein length and use that as a filtering criterion.
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA_gene) show(msnid) msnid <- filter_msgf_data_protein_level(msnid, 0.01) show(msnid)
The situation when sequence of a certain peptide matches multiple proteins adds
complication to the downstream quantitative analysis as it is not clear which
protein product this peptide is originating from. There are common ways for
selecting a protein set. One is simply retain only uniquely matching peptides
(unique_only=TRUE
). Shared peptides are simply discarded in this case.
Alternatively (in case of unique_only=FALSE
) the shared peptides assigned to
the proteins with larger number of uniquely mapping peptides. If there is a
choice between multiple proteins with equal number of uniquely mapping peptides,
the shared peptides is assigned to the first protein according to alphanumeric
order.
msnid <- infer_parsimonious_accessions(msnid, unique_only=TRUE) show(msnid)
msnid <- apply_filter(msnid, "!isDecoy")
MASIC is a tool for extracting ion intensities. With proper parameter settings it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition it reporter a number of other helpful metrics. Of the most interest are interference score of the parent ion level and S/N at reporter ion level. Interference score reflects what proportion of ion population isolated for fragmentation is due to the targeted ion. The other words 1 - InterferenceScore is due to co-isolated species that have similar elution time and parent ion m/z. S/N at reporter level is the metric that computed by Thermo software.
path_to_MASIC_results <- system.file("extdata/global/masic_output", package = "PlexedPiperTestData") masic_data <- read_masic_data(path_to_MASIC_results, interference_score=TRUE) head(masic_data)
In this case (currently recommended filters) we require that at least 50% of ion population is due to the targeted ion and no filter at S/N level.
nrow(masic_data) masic_data <- filter_masic_data(masic_data, 0.5, 0) nrow(masic_data)
To convert from PSMs and reporter ion intensities to meaningful quantitative data it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study desing captured by three tables - fractions, samples, references.
This links dataset to original plex.
library(readr) fractions <- read_tsv(system.file("extdata/study_design/fractions.txt", package = "PlexedPiperTestData")) head(fractions)
This links plex ID, reporter channel with samples. There is a minor difference between ReporterAlias and MeasurementName. MeasurementName is the sample name that will be finally reported. ReporterAlias is an itermediate between ReporterName and MeasurementName and is used for defining what the reference is.
samples <- read_tsv(system.file("extdata/study_design/samples.txt", package = "PlexedPiperTestData")) head(samples,10)
Reference can be a certain channel, average of multiple channels or 1. In general form it is an expression with ReporterAlias names as variables. It is evaluated for each PlexID/QuantBlock combination and applied to divide reporter ion intensities within corresponding PlexID/QuantBlock. QuantBlock can be though of as a way of defining sub-plex. In a typical TMT experiment QuantBlock is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1..5) with reference for each QuantBlock.
references <- read_tsv(system.file("extdata/study_design/references.txt", package = "PlexedPiperTestData")) head(references)
Final step when MS/MS IDs and reporter ions linked together, aggregated down to
peptide or accession (i.e. protein) level. To retain protein IDs while
aggregating to peptide level we suggest
aggregation_level <- c("accession","peptide")
aggregation_level <- c("accession") quant_cross_tab <- create_crosstab(msnid, masic_data, aggregation_level, fractions, samples, references) dim(quant_cross_tab) head(quant_cross_tab)
unlink(".Rcache", recursive=TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.