knitr::opts_chunk$set(message=FALSE, warning = FALSE) # knitr::opts_chunk$set(echo=T, message=F, warning=F, fig.align='center', out.width='10cm')
The vignette describes and reproduces all the steps that aimed to confirm frameshifts in the Euplotes crassus proteome. The global 8M urea soluble proteome was digested using conventional trypsin protocol and alternatively with Glu-C protease under high pH (7.5) conditions. The latter restricts specificity of Glu-C cleavages to C-terminal of glutamic acid (E). The peptides resulting from trypsin digest were fractionated using two different approaches: with strong cation exchange (SCX) and high pH reverse phase (HPRP) chromatographies. The peptides from Glu-C digest were fractionated using HPRP only.
The datasets were deposited to PRIDE and available by this link http://dx.doi.org/10.6019/PXD004333. Summary of the datasets shown in the table below:
library(knitr) dat_sum <- data.frame( `Dataset Prefix`=c('Euplotes_1_SCX','Euplotes_1_HPRP_1','Euplotes_1_HPRP_2'), `Digestion Enzyme`=c('trypsin','trypsin','Glu-C (pH 7.5)'), `Fractionation Chromatography Type`=c('SCX','HPRP','HPRP'), stringsAsFactors = FALSE, check.names = FALSE) kable(dat_sum)
Preprocessing of the raw
files prior MS/MS searches was done in two steps.
First, the raw files were processed with DeconMSn to correct for
wrong assignments of monoisotopic peaks. The parameters are as follows:
DeconMSN.exe -I35 -G1 -F1 -L6810 -B200 -T5000 -M3 -XCDTA
At the second step the peak files
were processed with DtaRefinery
to perform post-acquisition recalibaration of parent ion mass-to-charge ratios.
The peak lists (concatenated dta files in this case) were searched using
MS-GF+ tool against
6-frame translated Euplotes Crassus genome concatenated with
tentatively frameshifted sequences and common contaminants. The
6-frame translated FASTA file, DtaRefinery
and MS-GF+
parameter files
are available in extdata
folder of the EuplotesCrassus.proteome
package.
For example:
fpath <- system.file("extdata", "MSGFDB_GluC_StatCysAlk_10ppmParTol.txt", package="EuplotesCrassus.proteome") cat(readLines(fpath, n=12), sep = '\n')
To download the datasets we will take advantage of rpx R package. Note, this step may take awhile (10-30 min) depending on the speed of the internet connection. However, if they are downloaded the script will use the available datasets instead of downloading them again.
library(rpx) id <- "PXD004333" px <- PXDataset(id) repoFiles <- pxfiles(px) mzids <- grep('*msgfplus.mzid.gz', repoFiles, value=T) system.time(pxget(px, mzids))
The FASTA files containing 595 sequences with frameshifts availabe as a
part of this package and available as system.file("extdata", "Euplotes_Crassus_frameshifts.fasta", package="EuplotesCrassus.proteome")
.
There is an additional FASTA file with frameshift locations marked with
exclamation mark !
.
library(Biostrings) fasta_clean <- readAAStringSet( system.file("extdata", "Euplotes_Crassus_frameshifts.fasta", package="EuplotesCrassus.proteome"), format="fasta", nrec=-1L, skip=0L, use.names=TRUE) fasta_marks <- readAAStringSet( system.file("extdata", "Euplotes_Crassus_frameshifts_with_mark.fasta", package="EuplotesCrassus.proteome"), format="fasta", nrec=-1L, skip=0L, use.names=TRUE) length(fasta_clean)
For processing of MS/MS identification we will use MSnID R package. First step is to read the LC-MS/MS datasets corresponding to 25 SCX fractions.
library(MSnID) trypscx <- grep('Euplotes_1_SCX_.*msgfplus.mzid.gz', repoFiles, value=T) trypscxPrj <- MSnID() system.time(trypscxPrj <- read_mzIDs(trypscxPrj, trypscx, backend = 'mzR'))
Assess the peptide termini for their corresponding cleavage patterns. We will lleave peptides that resuted only from proper trypsin cleavave events. That is we won't allow peptide resulting from irregular clevages.
trypscxPrj <- assess_termini(trypscxPrj, validCleavagePattern="[KR]\\.[^P]") trypscxPrj <- apply_filter(trypscxPrj, "numIrregCleavages == 0")
Note, that for this project we are interested only in peptides covering the
sites of the frameshifting events. So if a peptide identification can
be explained by a regular protein sequence we are not interested in
pursuing this identification. The protein/accession names of normal
(non-frameshifted) sequences starts with Contig
or Contaminant
. If
the FASTA entry sequence is a results of the frameshift event if starts with
comp
. Therefore in the code below we retain only peptide-to-spectrum matches
that can appear only due to frameshifted sequences.
#' Rule on how to split the names. #' Contig + Contaminants - main piece #' comp - sequences with frameshifts trypscxPrj.main <- apply_filter(trypscxPrj, "!grepl('comp', accession)") trypscxPrj.fmsh <- apply_filter(trypscxPrj, "grepl('comp', accession)") #' if peptide matches to the main piece we don't care about it trypscxPrj.fmsh <- apply_filter(trypscxPrj.fmsh, "!(peptide %in% peptides(trypscxPrj.main))") show(trypscxPrj.fmsh)
Setting-up and optimizing filtering options for MS/MS identifications. Since the number of peptides mapping frameshifted sequences is rather low we will loosed up the FDR of the identification up to 5%, however, then follow-up with manual spectra validation.
trypscxPrj.fmsh$mme.ppm <- abs(mass_measurement_error(trypscxPrj.fmsh)) trypscxPrj.fmsh$score <- -log10(trypscxPrj.fmsh$`MS.GF.SpecEValue`) trypscxPrj.fmsh <- apply_filter(trypscxPrj.fmsh, "mme.ppm < 10") filtr <- MSnIDFilter(trypscxPrj.fmsh) filtr$mme.ppm <- list(comparison="<", threshold=5.0) filtr$score <- list(comparison=">", threshold=8.0) #' pre-optimization with brute-force approach filtr.grid <- optimize_filter(filtr, trypscxPrj.fmsh, fdr.max=0.05, method="Grid", level="peptide", n.iter=20000) evaluate_filter(trypscxPrj.fmsh, filtr.grid) #' fine tune with optimization using simulated annealing technique filtr.sann <- optimize_filter(filtr.grid, trypscxPrj.fmsh, fdr.max=0.05, method="SANN", level="peptide", n.iter=20000) evaluate_filter(trypscxPrj.fmsh, filtr.sann) trypscxPrj.fmsh <- apply_filter(trypscxPrj.fmsh, filtr.sann) show(trypscxPrj.fmsh)
Finally we will extract only those peptides that exactly span the frameshift
sites. That is their sequences should be present/identifiable in normal
FASTA file, however missing in the file with frameshifts masked with the
exclamation mark !
.
#' extract only those that map frameshift sites library(dplyr) pepSeq <- unique(trypscxPrj.fmsh$pepSeq) pepSeqMapped_to_clean <- pepSeq %>% sapply(grep, x=fasta_clean) %>% sapply(length) %>% subset(.>0) %>% names pepSeqMapped_to_with_marks <- pepSeq %>% sapply(grep, x=fasta_marks) %>% sapply(length) %>% subset(.>0) %>% names pepSeqFmsh_trypscx <- setdiff(pepSeqMapped_to_clean, pepSeqMapped_to_with_marks) print(pepSeqFmsh_trypscx)
Reporting extra information on the peptide sequences spanning frameshift sites: dataset, scan, charge, score, and mass measurement error.
meta_tryp_scx <- trypscxPrj.fmsh %>% apply_filter('pepSeq %in% pepSeqFmsh_trypscx') %>% psms %>% select(spectrumFile,MS.GF.SpecEValue,mme.ppm,spectrumID,chargeState,peptide) %>% rename(SpecEValue = MS.GF.SpecEValue, charge = chargeState, `MME (ppm)`=mme.ppm) %>% mutate(spectrumFile = sub('_msgfplus.mzid.gz','',spectrumFile)) library(xtable) print(xtable(meta_tryp_scx, display = c('d','s','e','f','s','d','s')), include.rownames=FALSE, comment = FALSE, size='scriptsize', floating = F)
\pagebreak
All the processing steps are conceptually the same as in the section above.
tryphprp <- grep('Euplotes_1_HPRP_1_.*msgfplus.mzid.gz', repoFiles, value=T) tryphprpPrj <- MSnID() system.time(tryphprpPrj <- read_mzIDs(tryphprpPrj, tryphprp, backend = 'mzR'))
tryphprpPrj <- assess_termini(tryphprpPrj, validCleavagePattern="[KR]\\.[^P]") tryphprpPrj <- apply_filter(tryphprpPrj, "numIrregCleavages == 0")
tryphprpPrj.main <- apply_filter(tryphprpPrj, "!grepl('comp', accession)") tryphprpPrj.fmsh <- apply_filter(tryphprpPrj, "grepl('comp', accession)") tryphprpPrj.fmsh <- apply_filter(tryphprpPrj.fmsh, "!(peptide %in% peptides(tryphprpPrj.main))") show(tryphprpPrj.fmsh)
tryphprpPrj.fmsh$mme.ppm <- abs(mass_measurement_error(tryphprpPrj.fmsh)) tryphprpPrj.fmsh$score <- -log10(tryphprpPrj.fmsh$`MS.GF.SpecEValue`) tryphprpPrj.fmsh <- apply_filter(tryphprpPrj.fmsh, "mme.ppm < 10") filtr <- MSnIDFilter(tryphprpPrj.fmsh) filtr$mme.ppm <- list(comparison="<", threshold=5.0) filtr$score <- list(comparison=">", threshold=8.0) filtr.grid <- optimize_filter(filtr, tryphprpPrj.fmsh, fdr.max=0.05, method="Grid", level="peptide", n.iter=20000) evaluate_filter(tryphprpPrj.fmsh, filtr.grid) filtr.sann <- optimize_filter(filtr.grid, tryphprpPrj.fmsh, fdr.max=0.05, method="SANN", level="peptide", n.iter=20000) evaluate_filter(tryphprpPrj.fmsh, filtr.sann) tryphprpPrj.fmsh <- apply_filter(tryphprpPrj.fmsh, filtr.sann) show(tryphprpPrj.fmsh)
library(dplyr) pepSeq <- unique(tryphprpPrj.fmsh$pepSeq) pepSeqMapped_to_clean <- pepSeq %>% sapply(grep, x=fasta_clean) %>% sapply(length) %>% subset(.>0) %>% names pepSeqMapped_to_with_marks <- pepSeq %>% sapply(grep, x=fasta_marks) %>% sapply(length) %>% subset(.>0) %>% names pepSeqFmsh_tryphprp <- setdiff(pepSeqMapped_to_clean, pepSeqMapped_to_with_marks) print(pepSeqFmsh_tryphprp)
meta_tryp_hprp <- tryphprpPrj.fmsh %>% apply_filter('pepSeq %in% pepSeqFmsh_tryphprp') %>% psms %>% select(spectrumFile,MS.GF.SpecEValue,mme.ppm,spectrumID,chargeState,peptide) %>% rename(SpecEValue = MS.GF.SpecEValue, charge = chargeState, `MME (ppm)`=mme.ppm) %>% mutate(spectrumFile = sub('_msgfplus.mzid.gz','',spectrumFile)) library(xtable) print(xtable(meta_tryp_hprp, display = c('d','s','e','f','s','d','s')), include.rownames=FALSE, comment = FALSE, size='scriptsize', floating = F)
\pagebreak
All the processing steps are conceptually the same as in the section above. The only substantial diffence is the specification of the enzyme digestion rule.
gluchprp <- grep('Euplotes_1_HPRP_2_.*msgfplus.mzid.gz', repoFiles, value=T) gluchprpPrj <- MSnID() system.time(gluchprpPrj <- read_mzIDs(gluchprpPrj, gluchprp, backend = 'mzR'))
gluchprpPrj <- assess_termini(gluchprpPrj, validCleavagePattern="E\\.[^P]") gluchprpPrj <- apply_filter(gluchprpPrj, "numIrregCleavages == 0")
gluchprpPrj.main <- apply_filter(gluchprpPrj, "!grepl('comp', accession)") gluchprpPrj.fmsh <- apply_filter(gluchprpPrj, "grepl('comp', accession)") gluchprpPrj.fmsh <- apply_filter(gluchprpPrj.fmsh, "!(peptide %in% peptides(gluchprpPrj.main))") show(gluchprpPrj.fmsh)
gluchprpPrj.fmsh$mme.ppm <- abs(mass_measurement_error(gluchprpPrj.fmsh)) gluchprpPrj.fmsh$score <- -log10(gluchprpPrj.fmsh$`MS.GF.SpecEValue`) gluchprpPrj.fmsh <- apply_filter(gluchprpPrj.fmsh, "mme.ppm < 10") filtr <- MSnIDFilter(gluchprpPrj.fmsh) filtr$mme.ppm <- list(comparison="<", threshold=5.0) filtr$score <- list(comparison=">", threshold=8.0) filtr.grid <- optimize_filter(filtr, gluchprpPrj.fmsh, fdr.max=0.05, method="Grid", level="peptide", n.iter=20000) evaluate_filter(gluchprpPrj.fmsh, filtr.grid) filtr.sann <- optimize_filter(filtr.grid, gluchprpPrj.fmsh, fdr.max=0.05, method="SANN", level="peptide", n.iter=20000) evaluate_filter(gluchprpPrj.fmsh, filtr.sann) gluchprpPrj.fmsh <- apply_filter(gluchprpPrj.fmsh, filtr.sann) show(gluchprpPrj.fmsh)
library(dplyr) pepSeq <- unique(gluchprpPrj.fmsh$pepSeq) pepSeqMapped_to_clean <- pepSeq %>% sapply(grep, x=fasta_clean) %>% sapply(length) %>% subset(.>0) %>% names pepSeqMapped_to_with_marks <- pepSeq %>% sapply(grep, x=fasta_marks) %>% sapply(length) %>% subset(.>0) %>% names pepSeqFmsh_gluchprp <- setdiff(pepSeqMapped_to_clean, pepSeqMapped_to_with_marks) print(pepSeqFmsh_gluchprp)
meta_gluc_hprp <- gluchprpPrj.fmsh %>% apply_filter('pepSeq %in% pepSeqFmsh_gluchprp') %>% psms %>% select(spectrumFile,MS.GF.SpecEValue,mme.ppm,spectrumID,chargeState,peptide) %>% rename(SpecEValue = MS.GF.SpecEValue, charge = chargeState, `MME (ppm)`=mme.ppm) %>% mutate(spectrumFile = sub('_msgfplus.mzid.gz','',spectrumFile)) library(xtable) print(xtable(meta_gluc_hprp, display = c('d','s','e','f','s','d','s')), include.rownames=FALSE, comment = FALSE, size='scriptsize', floating = F)
\pagebreak
Final set of peptides and corresponding references to LC-MS/MS datasets and
spectra. Overall, r length(pepSeqFmsh_trypscx)
,
r length(pepSeqFmsh_tryphprp)
, and r length(pepSeqFmsh_gluchprp)
unique peptide sequences spanning the frameshift sites were identified in
trypsin/SCX
, trypsin/HPRP
, and 'Glu-C/HPRP` experiments, respectively.
meta_tryp_scx$experiment <- "trypsin/SCX" meta_tryp_hprp$experiment <- "trypsin/HPRP" meta_gluc_hprp$experiment <- "Glu-C/HPRP" meta <- rbind(meta_tryp_scx, meta_tryp_hprp, meta_gluc_hprp) print(xtable(meta, display = c('d','s','e','f','s','d','s','s')), include.rownames=FALSE, comment = FALSE, size='tiny', floating = F)
\pagebreak
Manual valiation was perfomed by LCMSSpectator. The spectra that
have passed the consensus opinion of 5 independed experts are shown below.
Necessary raw and mzIdenML files to reproduce the analysis are
available at http://dx.doi.org/10.6019/PXD004333.
Note, the MS/MS scan number
is not the same identifier as spectrumID
in the
table above.
SAQEEQDDEVIIDDQNPLLEDDLQIDEPEQK
SVNRENLDNEKLINDLTNDKANLKDIVFDLMFE
\pagebreak
WTPIDLPSEEITFVQGIQTVTGAGDPSMK
\pagebreak
ESNHNNDITNKNEIAYILR
FFAAPEK
\pagebreak
IIQNFQINTVFEDLDEIMQTQVQR
LINDLTNDK
\pagebreak
LINDLTNDKANLK
IVENFNK
\pagebreak
LISELTSEK
\pagebreak
LSQEHLSYISR
NKIRFFAAPEKIFE
\pagebreak
VYLGLMEEYE
\pagebreak
All software and respective versions used in this document, as returned by sessionInfo() are detailed below.
toLatex(sessionInfo())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.