normPSM: Standardization of PSM

normPSM R Documentation

Standardization of PSM

Description

normPSM standardizes PSM results from database search engines.

Usage

normPSM(
  dat_dir = NULL,
  expt_smry = "expt_smry.xlsx",
  frac_smry = "frac_smry.xlsx",
  fasta = NULL,
  entrez = NULL,
  group_psm_by = c("pep_seq", "pep_seq_mod"),
  group_pep_by = c("prot_acc", "gene"),
  pep_unique_by = c("group", "protein", "none"),
  mc_psm_by = c("peptide", "protein", "psm"),
  scale_rptr_int = FALSE,
  rptr_intco = 0,
  rptr_intrange = c(0, 100),
  rm_craps = FALSE,
  rm_krts = FALSE,
  rm_outliers = FALSE,
  rm_allna = FALSE,
  type_sd = c("log2_R", "N_log2_R", "Z_log2_R"),
  lfq_mbr = FALSE,
  mbr_ret_tol = 60,
  purge_phosphodata = TRUE,
  annot_kinases = FALSE,
  plot_rptr_int = TRUE,
  plot_log2FC_cv = TRUE,
  use_lowercase_aa = FALSE,
  corrected_int = TRUE,
  rm_reverses = TRUE,
  ...
)

Arguments

dat_dir

A character string to the working directory. The default is to match the value under the global environment.

expt_smry

A character string to a .xlsx file containing the metadata of TMT or LFQ experiments. The default is expt_smry.xlsx.

frac_smry

A character string to a .xlsx file containing peptide fractionation summary. The default is frac_smry.xlsx.

fasta

Character string(s) to the name(s) of fasta file(s) with prepended directory path. The fasta database(s) need to match those used in MS/MS ion search. There is no default and users need to provide the correct file path(s) and name(s).

entrez

Character string(s) to the name(s) of entrez file(s) with prepended directory path. At the NULL default, a convenience lookup is available for species among c("human", "mouse", "rat"). For other species, users need to provide the file path(s) and name(s) for the lookup table(s). See also Uni2Entrez and Ref2Entrez for preparing custom entrez files.

group_psm_by

A character string specifying the method in PSM grouping. At the pep_seq default, descriptive statistics will be calculated based on the same pep_seq groups. At the pep_seq_mod alternative, peptides with different variable modifications will be treated as different species and descriptive statistics will be calculated based on the same pep_seq_mod groups.

group_pep_by

A character string specifying the method in peptide grouping. At the prot_acc default, descriptive statistics will be calculated based on the same prot_acc groups. At the gene alternative, proteins with the same gene name but different accession numbers will be treated as one group.

pep_unique_by

A character string for annotating the uniqueness of peptides. At the group default, uniqueness is assessed by protein groups, in which same-set and sub-set proteins are collapsed. At the more stringent setting of protein, uniqueness is assessed by protein entries without grouping. At the other extreme, none, all peptides are treated as unique. A new column, pep_isunique, holding the corresponding logical TRUE or FALSE will be added to the PSM reports. Note that the choice of none is only for convenience, as the same may be achieved by setting use_unique_pep = FALSE in Pep2Prn.

mc_psm_by

A character string specifying the method for the median centering of PSM log2FC across samples. At the peptide default, the medians of PSMs (grouped by pep_seq or pep_seq_mod according to group_psm_by) will first be calculated, and their offsets to zero (on the log2 scale) will be used to center PSMs across samples. At mc_psm_by = protein, the medians of PSMs (grouped by prot_acc or gene according to group_pep_by) will be calculated and the corresponding offsets to zero will be applied. At mc_psm_by = psm, PSMs will be median centered without grouping.
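
The difference between the psm and peptide settings may be easier to see on a toy table. The following is a minimal base-R sketch, not the proteoQ implementation; the data values are made up and only the column names mimic proteoQ keys.

```r
# Toy PSM table; values are invented for illustration only.
psms <- data.frame(
  pep_seq   = c("AAK", "AAK", "CCR", "DDK", "DDK", "DDK"),
  log2_R126 = c(0.4, 0.6, 1.0, 0.1, 0.3, 0.2),
  log2_R127 = c(-0.2, 0.0, 0.5, -0.4, -0.6, -0.5)
)
ratio_cols <- c("log2_R126", "log2_R127")

# In the spirit of mc_psm_by = psm: per-sample medians over all PSMs.
offsets_psm <- sapply(psms[ratio_cols], median, na.rm = TRUE)

# In the spirit of mc_psm_by = peptide: summarise PSMs to peptide medians
# first, then take the per-sample median of those peptide-level values.
pep_medians <- aggregate(psms[ratio_cols], by = list(pep_seq = psms$pep_seq),
                         FUN = median, na.rm = TRUE)
offsets_pep <- sapply(pep_medians[ratio_cols], median, na.rm = TRUE)

# Subtract the chosen offsets from every PSM, sample by sample.
centered <- sweep(as.matrix(psms[ratio_cols]), 2, offsets_pep, "-")
```

Peptides with many PSMs (DDK above) pull the PSM-level median toward themselves, whereas the peptide-level offsets weigh each peptide once.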

scale_rptr_int

Logical; if TRUE, scales (up) MS2 reporter-ion intensities by MS1 precursor intensity: I_{MS1}*(I_{x}/\sum I_{MS2}). I_{MS1}, MS1 precursor intensity; I_{MS2}, MS2 reporter-ion intensity; I_{x}, MS2 reporter-ion intensity under TMT channel x. Note that the scaling will not affect log2FC.
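
The scaling formula can be checked on a single hypothetical PSM. The sketch below uses made-up intensities and plain base R; it only illustrates the arithmetic and is not taken from the proteoQ source.

```r
# Hypothetical values for one PSM: MS1 precursor intensity and
# MS2 reporter-ion intensities for three TMT channels.
i_ms1 <- 2e6
i_ms2 <- c(I126 = 1000, I127N = 2000, I127C = 1000)

# I_MS1 * (I_x / sum(I_MS2)): each channel takes its share of the MS1 signal.
scaled <- i_ms1 * (i_ms2 / sum(i_ms2))

# Ratios between channels, and hence log2FC, are unchanged by the scaling.
log2fc_raw    <- log2(i_ms2 / i_ms2["I126"])
log2fc_scaled <- log2(scaled / scaled["I126"])
```

The scaled intensities sum to the MS1 precursor intensity, while both log2FC vectors are identical.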

rptr_intco

Numeric; the threshold of reporter-ion intensity (TMT: I126 etc.; LFQ: I000) being considered non-trivial. The default is 0 without cut-offs. The data nullification will not be applied synchronously to the precursor intensity (pep_tot_int) under the same PSM query. To guard against oddities such as MS2 reporter-ion intensities that exceed their contributing MS1 precursor intensity, employ, for example, filter_... = rlang::exprs(pep_tot_int >= my_ms1_cutoff) during PSM2Pep. The rule of thumb is that pep_tot_int is a single column, so filtration against it may be readily achieved without introducing new arguments. By contrast, rptr_intco applies to a set of columns, I126 etc., which would be more laborious to express with filter_ varargs.

rptr_intrange

Numeric vector of length two. The argument specifies the range of reporter-ion intensities (TMT: I126 etc.; LFQ: I000) being considered non-trivial. The default is between the 0th and 100th percentiles, i.e., without cut-offs. While rptr_intco applies a universal cut-off across samples by absolute values, rptr_intrange provides an alternative means of sample-specific thresholding of intensities by percentiles. The data nullification will not be applied synchronously to the precursor intensity under the same PSM query.
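
To contrast percentile-based and absolute thresholding, here is a minimal base-R sketch. The helper nullify_by_percentile is hypothetical, the intensities are made up, and whether proteoQ treats the bounds as inclusive is an assumption.

```r
# Made-up reporter-ion intensities for two samples (columns).
ints <- data.frame(
  I126 = c(10, 200, 3000, 45000, 5e5),
  I127 = c(50, 100, 150, 200, 250)
)

# Hypothetical helper: set values outside the given percentile range to NA,
# using quantiles computed separately for each sample.
nullify_by_percentile <- function(x, range = c(0, 100)) {
  lims <- quantile(x, probs = range / 100, na.rm = TRUE)
  x[!is.na(x) & (x < lims[1] | x > lims[2])] <- NA
  x
}

# Percentile range: thresholds adapt to each sample's own distribution.
trimmed <- as.data.frame(lapply(ints, nullify_by_percentile, range = c(20, 100)))

# Absolute cut-off: the same value is applied to all samples alike.
abs_cut <- ints
abs_cut[abs_cut < 100] <- NA
```

With the 20th percentile as the lower bound, the lowest value in each column is nullified regardless of the very different intensity scales of the two samples.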

rm_craps

Logical; if TRUE, cRAP proteins will be removed. The default is FALSE.

rm_krts

Logical; if TRUE, keratin entries will be removed. The default is FALSE.

rm_outliers

Logical; if TRUE, PSM outlier removal will be performed for peptides with more than two identifying PSMs. Dixon's method will be used when 2 < n <= 25 and Rosner's method will be used when n > 25. The default is FALSE.

rm_allna

Logical; if TRUE, removes data rows that are exclusively NA across ratio columns of log2_R126 etc. The setting also applies to log2_R000 in LFQ.
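
The row removal amounts to dropping rows in which every ratio column is NA. A minimal base-R sketch with made-up data (not the proteoQ code) follows.

```r
psms <- data.frame(
  pep_seq   = c("AAK", "CCR", "DDK"),
  log2_R126 = c(0.2, NA, NA),
  log2_R127 = c(NA, NA, -0.3),
  stringsAsFactors = FALSE
)

# Ratio columns are picked up by their "log2_R" prefix.
ratio_cols <- grep("^log2_R", names(psms), value = TRUE)

# A row is dropped only when every ratio column is NA.
all_na <- rowSums(!is.na(psms[ratio_cols])) == 0
kept <- psms[!all_na, ]
```

Rows with at least one non-missing ratio (AAK and DDK here) are retained.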

type_sd

Character string; the type of log2Ratios for SD calculations. The value is one of log2_R, N_log2_R or Z_log2_R.

lfq_mbr

Logical; if TRUE, performs match-between-run (MBR) with LFQ data. Both psmQ.txt and psmC.txt are required with the feature.

mbr_ret_tol

Retention time tolerance (in seconds) for LFQ-MBR.

purge_phosphodata

Logical; if TRUE and phosphorylation is present as variable modification(s), entries without phosphorylation will be removed. The default is TRUE.

annot_kinases

Logical; if TRUE, proteins of human or mouse origins will be annotated with their kinase attributes. The default is FALSE.

plot_rptr_int

Logical; if TRUE, the distributions of reporter-ion intensities will be plotted. The default is TRUE. The argument is also applicable to the precursor intensity with MaxQuant LFQ.

plot_log2FC_cv

Logical; if TRUE, the distributions of the CV of peptide log2FC will be plotted. The default is TRUE.

use_lowercase_aa

Logical; if TRUE, modifications in amino acid residues will be abbreviated with lower-case and/or ^_~. See the table below for details. The default is FALSE.

corrected_int

A logical argument for uses with MaxQuant TMT. At the TRUE default, values under columns "Reporter intensity corrected..." in MaxQuant PSM results (msms.txt) will be used. Otherwise, "Reporter intensity" values without corrections will be used.

rm_reverses

A logical argument for uses with MaxQuant TMT and LFQ. At the TRUE default, Reverse entries will be removed.

...

filter_: Variable argument statements for the filtration of data rows. Each statement contains a list of logical expression(s). The lhs needs to start with filter_. The logical condition(s) at the rhs need to be enclosed in exprs with round parentheses. For example, pep_expect is a column key present in Mascot PSM exports and filter_psms_at = exprs(pep_expect <= 0.1) will remove PSM entries with pep_expect > 0.1.
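
The mechanism behind such statements can be approximated in base R: conditions are captured unevaluated and later evaluated against the columns of the PSM table. The sketch below uses quote() as a stand-in for exprs, and apply_filters is a hypothetical helper; it is not how proteoQ implements the feature.

```r
psms <- data.frame(
  pep_seq    = c("AAK", "CCR", "DDK", "EEK"),
  pep_expect = c(0.01, 0.5, 0.05, 0.2),
  pep_rank   = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows satisfying every captured condition.
apply_filters <- function(df, conditions) {
  for (cond in conditions) {
    # Column keys in the condition are resolved against the data frame.
    keep <- eval(cond, envir = df, enclos = parent.frame())
    df <- df[!is.na(keep) & keep, , drop = FALSE]
  }
  df
}

# Analogous to filter_psms_at = exprs(pep_expect <= 0.1, pep_rank == 1).
conditions <- list(quote(pep_expect <= 0.1), quote(pep_rank == 1))
filtered <- apply_filters(psms, conditions)
```

A condition that names a column key absent from the table fails at evaluation time, which is why users need to know the column keys of the input files.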

Details

In each primary output file, "...PSM_N.txt", values under columns log2_R... are logarithmic ratios at base 2 relative to the average intensity of reference(s) within each multiplex TMT set, or to the row-mean intensity within each plex if no reference(s) are present. Values under columns N_log2_R... are log2_R... with median-centering alignment. Values under columns I... are raw reporter-ion intensities from database searches. Values under columns N_I... are normalized reporter-ion intensities. Values under columns sd_log2_R... are the standard deviations of the log2FC of peptides from ascribing PSMs. Character strings under pep_seq_mod denote peptide sequences with applicable variable modifications.
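
The two ways of forming log2_R... values can be sketched as follows. This is an illustration in base R under the description above, with made-up intensities; log2_ratios is a hypothetical helper and not a proteoQ function.

```r
# Made-up reporter-ion intensities for one 4-plex set (rows are PSMs).
ints <- matrix(
  c(100, 200, 400, 100,
    500, 500, 250, 750),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("I126", "I127", "I128", "I129"))
)

# Hypothetical helper: log2 ratios against the mean of the reference
# channel(s), or against the row mean when no reference is given.
log2_ratios <- function(x, ref_cols = NULL) {
  if (is.null(ref_cols)) {
    denom <- rowMeans(x, na.rm = TRUE)
  } else {
    denom <- rowMeans(x[, ref_cols, drop = FALSE], na.rm = TRUE)
  }
  out <- log2(x / denom)  # denom has one entry per row
  colnames(out) <- sub("^I", "log2_R", colnames(x))
  out
}

with_ref <- log2_ratios(ints, ref_cols = "I126")
no_ref   <- log2_ratios(ints)
```

With channel 126 as the reference, its own column becomes zero; without a reference, each row is expressed relative to its own mean intensity.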


Nomenclature of pep_seq_mod:

Variable modification        Abbreviation
Non-terminal                 A letter changed from upper to lower case, e.g., mtFPEADILLK
N-term                       A hat to the left of a peptide sequence, e.g., ^QDGTHVVEAVDATHIGK
C-term                       A hat to the right of a peptide sequence, e.g., DAYYNLCLPQRPnMI^
Acetyl (Protein N-term)      An underscore to the left of a peptide sequence, e.g., _mAsGVAVSDGVIK
Amidated (Protein C-term)    An underscore to the right of a peptide sequence, e.g., DAYYNLCLPQRPnMI_
Other (Protein N-term)       A tilde to the left of a peptide sequence, e.g., ~mAsGVAVSDGVIK
Other (Protein C-term)       A tilde to the right of a peptide sequence, e.g., DAYYNLCLPQRPnMI~
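
The nomenclature can be read back programmatically. The helper below, describe_pep_seq_mod, is hypothetical and based only on the table above; it reports the terminal markers, counts lower-case (modified) residues and recovers the bare sequence.

```r
# Hypothetical helper derived from the nomenclature table; not part of proteoQ.
describe_pep_seq_mod <- function(x) {
  marks <- c("^", "_", "~")
  first <- substr(x, 1, 1)
  last  <- substr(x, nchar(x), nchar(x))
  core  <- gsub("[_~^]", "", x)  # strip terminal markers
  data.frame(
    pep_seq_mod = x,
    n_term_mark = ifelse(first %in% marks, first, NA_character_),
    c_term_mark = ifelse(last %in% marks, last, NA_character_),
    n_lowercase = nchar(gsub("[^a-z]", "", core)),  # modified residues
    pep_seq     = toupper(core),
    stringsAsFactors = FALSE
  )
}

res <- describe_pep_seq_mod(
  c("mtFPEADILLK", "^QDGTHVVEAVDATHIGK", "DAYYNLCLPQRPnMI_", "_mAsGVAVSDGVIK")
)
```

The marker alone does not name the modification; it only indicates the class (peptide terminal, protein-terminal acetyl or amidated, or other protein terminal) listed in the table.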

Value

Outputs are interim and final PSM tables under the PSM sub-directory of dat_dir. Primary results are in standardized PSM tables of TMTset1_LCMSinj1_PSM_N.txt, TMTset2_LCMSinj1_PSM_N.txt, etc. The indexes of TMT experiment and LC/MS injection are indicated in the file names.
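
Because the indexes are embedded in the file names, they can be recovered with a regular expression. The sketch below is base R; the pattern is an assumption derived from the two example names and may not cover every output file.

```r
files <- c("TMTset1_LCMSinj1_PSM_N.txt", "TMTset2_LCMSinj1_PSM_N.txt")

# Assumed pattern: TMTset<index>_LCMSinj<index>_PSM_N.txt
pat <- "^TMTset([0-9]+)_LCMSinj([0-9]+)_PSM_N\\.txt$"

idx <- data.frame(
  file     = files,
  tmt_set  = as.integer(sub(pat, "\\1", files)),
  lcms_inj = as.integer(sub(pat, "\\2", files)),
  stringsAsFactors = FALSE
)
```

In practice the file names would come from something like list.files(file.path(dat_dir, "PSM"), pattern = "_PSM_N\\.txt$"), with dat_dir being the working directory.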

Mascot

Users will export PSM data from Mascot in .csv format and store them under the file folder indicated by dat_dir. The header information should be included during the .csv export. The file name(s) should start with the letter 'F' and end with a '.csv' extension (e.g., F004452.csv, F004453_this.csv etc.).

MaxQuant

Users will copy over msms.txt file(s) from MaxQuant to the dat_dir directory. The file name(s) should start with 'msms' and end with a '.txt' extension (e.g., msms.txt, msms_this.txt etc.).

MSFragger

Users will copy over psm.tsv file(s) from MSFragger to the dat_dir directory. The file name(s) should start with 'psm' and end with a '.tsv' extension (e.g., psm.tsv, psm_this.tsv etc.).

Spectrum Mill

Users will copy over PSMexport.1.ssv file(s) from Spectrum Mill to the dat_dir directory. The file name(s) should start with 'PSMexport' and end with a '.ssv' extension (e.g., PSMexport.ssv, PSMexport_this.ssv etc.).
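
The naming rules for the four search engines can be summarized as file-name patterns. The regular expressions below are one reading of the rules above, not the patterns used inside proteoQ, and guess_engine is a hypothetical helper.

```r
# Assumed patterns, one per search engine, following the naming rules above.
patterns <- c(
  Mascot       = "^F.*\\.csv$",
  MaxQuant     = "^msms.*\\.txt$",
  MSFragger    = "^psm.*\\.tsv$",
  SpectrumMill = "^PSMexport.*\\.ssv$"
)

# Hypothetical helper: name the search engine implied by each file name,
# or NA when no single pattern matches.
guess_engine <- function(files) {
  vapply(files, function(f) {
    hit <- names(patterns)[vapply(patterns, grepl, logical(1), x = f)]
    if (length(hit) == 1) hit else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

files <- c("F004452.csv", "msms_this.txt", "psm.tsv",
           "PSMexport.1.ssv", "notes.txt")
engines <- guess_engine(files)
```

A quick check of this kind before calling normPSM can reveal misnamed exports in dat_dir, for example via guess_engine(list.files(dat_dir)).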

Variable arguments and data files

Variable argument (vararg) statements of filter_ and arrange_ are available in proteoQ for flexible filtration and ordering of data rows, via functions at users' interface. To take advantage of the feature, users need to be aware of the column keys in input files. As indicated by their names, filter_ and filter2_ perform row filtration against column keys from a primary data file, df, and secondary data file(s), df2, respectively. The same correspondence is applicable for arrange_ and arrange2_ varargs.

Users will typically employ either primary or secondary vararg statements, but not both. In the more extreme case of gspaMap(...), it links prnGSPA findings in df2 to the significance pVals and abundance fold changes in df for volcano plot visualizations by gene sets. The table below summarizes the df and the df2 for varargs in proteoQ.

Utility Vararg_ df Vararg2_ df2
normPSM filter_ Mascot, F[...].csv; MaxQuant, msms[...].txt; SM, PSMexport[...].ssv NA NA
PSM2Pep NA NA NA NA
mergePep filter_ TMTset1_LCMSinj1_Peptide_N.txt NA NA
standPep slice_ Peptide.txt NA NA
Pep2Prn filter_ Peptide.txt NA NA
standPrn slice_ Protein.txt NA NA
pepHist filter_ Peptide.txt NA NA
prnHist filter_ Protein.txt NA NA
pepSig filter_ Peptide[_impNA].txt NA NA
prnSig filter_ Protein[_impNA].txt NA NA
pepMDS filter_ Peptide[_impNA][_pVal].txt NA NA
prnMDS filter_ Protein[_impNA][_pVal].txt NA NA
pepPCA filter_ Peptide[_impNA][_pVal].txt NA NA
prnPCA filter_ Protein[_impNA][_pVal].txt NA NA
pepLDA filter_ Peptide[_impNA][_pVal].txt NA NA
prnLDA filter_ Protein[_impNA][_pVal].txt NA NA
pepEucDist filter_ Peptide[_impNA][_pVal].txt NA NA
prnEucDist filter_ Protein[_impNA][_pVal].txt NA NA
pepCorr_logFC filter_ Peptide[_impNA][_pVal].txt NA NA
prnCorr_logFC filter_ Protein[_impNA][_pVal].txt NA NA
pepHM filter_, arrange_ Peptide[_impNA][_pVal].txt NA NA
prnHM filter_, arrange_ Protein[_impNA][_pVal].txt NA NA
anal_prnTrend filter_ Protein[_impNA][_pVal].txt NA NA
plot_prnTrend NA NA filter2_ [...]Protein_Trend_{NZ}[_impNA][...].txt
anal_pepNMF filter_ Peptide[_impNA][_pVal].txt NA NA
anal_prnNMF filter_ Protein[_impNA][_pVal].txt NA NA
plot_pepNMFCon NA NA filter2_ [...]Peptide_NMF[...]_consensus.txt
plot_prnNMFCon NA NA filter2_ [...]Protein_NMF[...]_consensus.txt
plot_pepNMFCoef NA NA filter2_ [...]Peptide_NMF[...]_coef.txt
plot_prnNMFCoef NA NA filter2_ [...]Protein_NMF[...]_coef.txt
plot_metaNMF filter_, arrange_ Protein[_impNA][_pVal].txt NA NA
prnGSPA filter_ Protein[_impNA]_pVals.txt NA NA
prnGSPAHM NA NA filter2_ [...]Protein_GSPA_{NZ}[_impNA]_essmap.txt
gspaMap filter_ Protein[_impNA]_pVal.txt filter2_ [...]Protein_GSPA_{NZ}[_impNA].txt
anal_prnString filter_ Protein[_impNA][_pVals].txt NA NA

See Also

Metadata
load_expts for metadata preparation and a reduced working example in data normalization

Data normalization
normPSM for extended examples in PSM data normalization
PSM2Pep for extended examples in PSM to peptide summarization
mergePep for extended examples in peptide data merging
standPep for extended examples in peptide data normalization
Pep2Prn for extended examples in peptide to protein summarization
standPrn for extended examples in protein data normalization
purgePSM and purgePep for extended examples in data purging
pepHist and prnHist for extended examples in histogram visualization
extract_raws and extract_psm_raws for extracting MS file names

Variable arguments of filter_...
contain_str, contain_chars_in, not_contain_str, not_contain_chars_in, start_with_str, end_with_str, start_with_chars_in and ends_with_chars_in for data subsetting by character strings

Missing values
pepImp and prnImp for missing value imputation

Informatics
pepSig and prnSig for significance tests
pepVol and prnVol for volcano plot visualization
prnGSPA for gene set enrichment analysis by protein significance pVals
gspaMap for mapping GSPA to volcano plot visualization
prnGSPAHM for heat map and network visualization of GSPA results
prnGSVA for gene set variance analysis
prnGSEA for data preparation for online GSEA
pepMDS and prnMDS for MDS visualization
pepPCA and prnPCA for PCA visualization
pepLDA and prnLDA for LDA visualization
pepHM and prnHM for heat map visualization
pepCorr_logFC, prnCorr_logFC, pepCorr_logInt and prnCorr_logInt for correlation plots
anal_prnTrend and plot_prnTrend for trend analysis and visualization
anal_pepNMF, anal_prnNMF, plot_pepNMFCon, plot_prnNMFCon, plot_pepNMFCoef, plot_prnNMFCoef and plot_metaNMF for NMF analysis and visualization

Custom databases
Uni2Entrez for lookups between UniProt accessions and Entrez IDs
Ref2Entrez for lookups among RefSeq accessions, gene names and Entrez IDs
prepGO for gene ontology
prepMSig for molecular signatures
prepString and anal_prnString for STRING-DB

Column keys in PSM, peptide and protein outputs
system.file("extdata", "psm_keys.txt", package = "proteoQ")
system.file("extdata", "peptide_keys.txt", package = "proteoQ")
system.file("extdata", "protein_keys.txt", package = "proteoQ")

Examples


# ===================================
# PSM normalization
# ===================================

## !!!require the brief working example in `?load_expts`

## additional examples
# Mascot
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  
  # variable argument statement(s)
  filter_psms_at = exprs(pep_expect <= .1),
  filter_psms_more = exprs(pep_rank == 1, pep_exp_z > 1),
)

# MaxQuant
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  corrected_int = TRUE,
  rm_reverses = TRUE,
  
  # vararg statement(s)
  filter_psms_at = exprs(PEP <= 0.1),
)

# MSFragger
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),

  # vararg statement(s)
  filter_psms_at = exprs(Hyperscore >= 10),
)

# Spectrum Mill
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  
  # vararg statement(s)
  filter_psms_at = exprs(score >= 10),
)

###############################################
## Custom entrez lookups
#  (1) can overwrite the `proteoQ` default for 
#      species in "human", "mouse" and "rat"
#  (2) and are required for `other` species
###############################################
# see also `?Uni2Entrez` or `?Ref2Entrez` for more examples
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("org.Hs.eg.db")
BiocManager::install("org.Mm.eg.db")

library(org.Hs.eg.db)
library(org.Mm.eg.db)

library(proteoQ)
Ref2Entrez(species = human)
Ref2Entrez(species = mouse)

# see also Uni2Entrez(...) for Uniprot to Entrez lookups

normPSM(
  group_psm_by = pep_seq_mod, 
  group_pep_by = gene, 
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  entrez = c("~/proteoQ/dbs/entrez/refseq_entrez_hs.rds", 
             "~/proteoQ/dbs/entrez/refseq_entrez_mm.rds"),
)


## Not run: 
# wrong fasta 
normPSM(
  fasta = "~/proteoQ/dbs/fasta/wrong.fasta",
)

# no mouse entry annotation
normPSM(
  fasta = "~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
)

# bad vararg statement
normPSM(
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  filter_psms_at = exprs(column_key_not_in_psm_tables <= .1),
)
## End(Not run)


qzhang503/proteoQ documentation built on March 16, 2024, 5:27 a.m.