normPSM: Standardization of PSM
In qzhang503/proteoQ: Processing and Informatic Analysis of Mass Spectrometric Data

normPSM

R Documentation

Standardization of PSM

Description

normPSM standardizes PSM results from database search engines.

Usage

normPSM(
  dat_dir = NULL,
  expt_smry = "expt_smry.xlsx",
  frac_smry = "frac_smry.xlsx",
  fasta = NULL,
  entrez = NULL,
  group_psm_by = c("pep_seq_mod", "pep_seq"),
  group_pep_by = c("gene", "prot_acc"),
  pep_unique_by = c("group", "protein", "none"),
  mc_psm_by = c("peptide", "protein", "psm"),
  scale_rptr_int = FALSE,
  rptr_intco = 0,
  rptr_intrange = c(0, 100),
  rm_craps = FALSE,
  rm_krts = FALSE,
  rm_outliers = FALSE,
  rm_allna = FALSE,
  type_sd = c("log2_R", "N_log2_R", "Z_log2_R"),
  lfq_mbr = TRUE,
  mbr_ret_tol = 30,
  purge_phosphodata = TRUE,
  annot_kinases = FALSE,
  plot_rptr_int = TRUE,
  plot_log2FC_cv = TRUE,
  use_lowercase_aa = FALSE,
  use_spec_counts = FALSE,
  use_corrected_mqint = TRUE,
  rm_reverses = TRUE,
  ...
)

Arguments

`dat_dir`	A character string to the working directory. The default is to match the value under the global environment.
`expt_smry`	A character string to a `.xlsx` file containing the metadata of TMT or LFQ experiments. The default is `expt_smry.xlsx`.
`frac_smry`	A character string to a `.xlsx` file containing peptide fractionation summary. The default is `frac_smry.xlsx`.
`fasta`	Character string(s) to the name(s) of fasta file(s) with prepended directory path. The `fasta` database(s) need to match those used in MS/MS ion search. There is no default and users need to provide the correct file path(s) and name(s).
`entrez`	Character string(s) to the name(s) of entrez file(s) with prepended directory path. At the `NULL` default, a convenience lookup is available for species among `c("human", "mouse", "rat")`. For other species, users need to provide the file path(s) and name(s) for the lookup table(s). See also `Uni2Entrez` and `Ref2Entrez` for preparing custom entrez files.
`group_psm_by`	A character string specifying the method in PSM grouping. At the `pep_seq_mod` default (the first value listed under Usage), peptides with different variable modifications will be treated as different species and descriptive statistics will be calculated based on the same `pep_seq_mod` groups. At the `pep_seq` alternative, descriptive statistics will be calculated based on the same `pep_seq` groups.
`group_pep_by`	A character string specifying the method in peptide grouping. At the `gene` default (the first value listed under Usage), proteins with the same gene name but different accession numbers will be treated as one group. At the `prot_acc` alternative, descriptive statistics will be calculated based on the same `prot_acc` groups.
`pep_unique_by`	A character string for annotating the uniqueness of peptides. At the `group` default, the uniqueness of peptides is by groups with the collapses of same-set or sub-set proteins. At a more stringent criterion of `protein`, the uniqueness of peptides is by protein entries without grouping. On the other extreme of choice `none`, all peptides are treated as unique. A new column of `pep_isunique` with corresponding logical TRUE or FALSE will be added to the PSM reports. Note that the choice of `none` is only for convenience, as the same may be achieved by setting `use_unique_pep = FALSE` in Pep2Prn.
`mc_psm_by`	A character string specifying the method in the median centering of PSM `log2FC` across samples. At the `peptide` default, the median description of PSMs (grouped by `pep_seq` or `pep_seq_mod` according to `group_psm_by`) will be first calculated and the offsets to zero (on the log2 scale) will be used for the centering of PSMs across samples. At `mc_psm_by = protein`, the median description of PSMs (grouped by `prot_acc` or `gene` according to `group_pep_by`) will be calculated and the corresponding offsets to zero will be applied. At `mc_psm_by = psm`, PSMs will be median centered without grouping.
`scale_rptr_int`	Logical; if TRUE, scales (up) MS2 reporter-ion intensities by MS1 precursor intensity: `I_{MS1}*(I_{x}/\sum I_{MS2})`. `I_{MS1}`, MS1 precursor intensity; `I_{MS2}`, MS2 reporter-ion intensity; `I_{x}`, MS2 reporter-ion intensity under TMT channel `x`. Note that the scaling will not affect `log2FC`.
`rptr_intco`	Numeric; the threshold of reporter-ion intensity (TMT: `I126` etc.; LFQ: `I000`) being considered non-trivial. The default is 0 without cut-offs. The data nullification will not be applied synchronously to the precursor intensity (`pep_tot_int`) under the same PSM query. To guard against oddities such as higher MS2 reporter-ion intensities than their contributing MS1 precursor intensity, employ, for example, `filter_... = rlang::exprs(pep_tot_int >= my_ms1_cutoff)` during PSM2Pep. The rule of thumb is that `pep_tot_int` is a single column; thus the corresponding data filtration against it may be readily achieved without introducing new arguments. By contrast, `rptr_intco` applies to a set of columns, `I126` etc.; it might be slightly more laborious to express with suitable statements of `filter_` varargs.
`rptr_intrange`	Numeric vector of length two. The argument specifies the range of reporter-ion intensities (TMT: `I126` etc.; LFQ: `I000`) being considered non-trivial. The default is between the 0th and 100th percentiles without cut-offs. While argument `rptr_intco` employs a universal cut-off across samples by absolute values, `rptr_intrange` provides an alternative means of sample-specific thresholding of intensities by percentiles. The data nullification will not be applied synchronously to the precursor intensity under the same PSM query.
`rm_craps`	Logical; if TRUE, cRAP proteins will be removed. The default is FALSE.
`rm_krts`	Logical; if TRUE, keratin entries will be removed. The default is FALSE.
`rm_outliers`	Logical; if TRUE, PSM outlier removals will be performed for peptides with more than two identifying PSMs. Dixon's method will be used when `2 < n \le 25` and Rosner's method will be used when `n > 25`. The default is FALSE.
`rm_allna`	Logical; if TRUE, removes data rows that are exclusively NA across ratio columns of `log2_R126` etc. The setting also applies to `log2_R000` in LFQ.
`type_sd`	Character string; the type of log2Ratios for SD calculations. The value is one of `log2_R`, `N_log2_R` or `Z_log2_R`.
`lfq_mbr`	Logical; if TRUE, performs match-between-run (MBR) with Mzion LFQ data. Also requires `ms1full_[rawfile].rds` at the same file-folder level of `psmQ[...].txt`.
`mbr_ret_tol`	Retention time tolerance (in seconds) for LFQ-MBR.
`purge_phosphodata`	Logical; if TRUE and phosphorylation present as variable modification(s), entries without phosphorylation will be removed. The default is TRUE.
`annot_kinases`	Logical; if TRUE, proteins of human or mouse origins will be annotated with their kinase attributes. The default is FALSE.
`plot_rptr_int`	Logical; if TRUE, the distributions of reporter-ion intensities will be plotted. The default is TRUE. The argument is also applicable to the precursor intensity with MaxQuant LFQ.
`plot_log2FC_cv`	Logical; if TRUE, the distributions of the CV of peptide `log2FC` will be plotted. The default is TRUE.
`use_lowercase_aa`	Logical; if TRUE, modifications in amino acid residues will be abbreviated with lower-case and/or `^_~`. See the table below for details. The default is FALSE, as given under Usage.
`use_spec_counts`	Logical; if TRUE, uses spectrum counts for quantitation with Mascot or Mzion outputs.
`use_corrected_mqint`	A logical argument for uses with `MaxQuant` TMT. At the TRUE default, values under columns "Reporter intensity corrected..." in `MaxQuant` PSM results (`msms.txt`) will be used. Otherwise, "Reporter intensity" values without corrections will be used.
`rm_reverses`	A logical argument for uses with `MaxQuant` TMT and LFQ. At the TRUE default, `Reverse` entries will be removed.
`...`	`filter_`: Variable argument statements for the filtration of data rows. Each statement contains a list of logical expression(s). The `lhs` needs to start with `filter_`. The logical condition(s) at the `rhs` need to be enclosed in `exprs` with round parentheses. For example, `pep_expect` is a column key present in `Mascot` PSM exports and `filter_psms_at = exprs(pep_expect <= 0.1)` will remove PSM entries with `pep_expect > 0.1`.
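
The `scale_rptr_int` formula in the Arguments above can be checked by hand. The following is a minimal sketch in base R, not the proteoQ implementation; the intensities are made up for illustration and the choice of `I126` as the ratio denominator is an assumption.

```r
# Hypothetical single PSM; all intensities are invented for illustration.
ms1_int  <- 2e7                                   # I_MS1: MS1 precursor intensity
rptr_int <- c(I126 = 1000, I127N = 2000,          # I_x: MS2 reporter-ion
              I128N = 4000, I129N = 1000)         #      intensity per channel

# I_MS1 * (I_x / sum(I_MS2))
scaled <- ms1_int * rptr_int / sum(rptr_int)

# log2 ratios against channel I126 (an assumed reference), before and after
log2fc_raw    <- log2(rptr_int / rptr_int[["I126"]])
log2fc_scaled <- log2(scaled / scaled[["I126"]])

# The scaled intensities sum to the precursor intensity, and the ratios
# between channels are unchanged.
isTRUE(all.equal(sum(scaled), ms1_int))
isTRUE(all.equal(log2fc_raw, log2fc_scaled))
```

Because every channel of a PSM is multiplied by the same factor, the factor cancels in any within-PSM ratio, which is why `log2FC` is unaffected.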

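The difference between the absolute cut-off of `rptr_intco` and the percentile range of `rptr_intrange` may be easier to see side by side. The sketch below uses base R only; the helper names `nullify_by_cutoff` and `nullify_by_range` are hypothetical, and whether proteoQ treats the boundaries as inclusive or exclusive is an assumption here.

```r
# Hypothetical reporter-ion intensities: rows are PSMs, columns are channels.
ints <- data.frame(
  I126  = c(50, 500, 5000, 50000),
  I127N = c(80, 300, 7000, 20000)
)

# One absolute cut-off shared by all samples (in the spirit of `rptr_intco`).
# Assumption: values strictly below the cut-off are nullified.
nullify_by_cutoff <- function(x, cutoff = 0) {
  x[!is.na(x) & x < cutoff] <- NA
  x
}

# Sample-specific percentile limits (in the spirit of `rptr_intrange`).
# Assumption: values outside the closed percentile range are nullified.
nullify_by_range <- function(x, range = c(0, 100)) {
  lims <- stats::quantile(x, probs = range / 100, na.rm = TRUE)
  x[!is.na(x) & (x < lims[[1]] | x > lims[[2]])] <- NA
  x
}

by_cutoff <- as.data.frame(lapply(ints, nullify_by_cutoff, cutoff = 100))
by_range  <- as.data.frame(lapply(ints, nullify_by_range, range = c(10, 90)))
```

With the defaults (`cutoff = 0`, `range = c(0, 100)`) neither helper changes the data, mirroring the "without cut-offs" defaults described above.
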
Details

In each primary output file, "...PSM_N.txt", values under columns log2_R... are logarithmic ratios at base 2 relative to the average intensity of reference(s) within each multiplex TMT set, or to the row-mean intensity within each plex if no reference(s) are present. Values under columns N_log2_R... are log2_R... with median-centering alignment. Values under columns I... are raw reporter-ion intensities from database searches. Values under columns N_I... are normalized reporter-ion intensities. Values under columns sd_log2_R... are the standard deviations of the log2FC of peptides from ascribing PSMs. Character strings under pep_seq_mod denote peptide sequences with applicable variable modifications.
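
The relation between the I..., log2_R... and N_log2_R... columns can be sketched as follows. This is an illustration in base R under stated assumptions, not the package code: the intensities are invented, `I126` is assumed to be the only reference channel, and the median centering is done without grouping (comparable to `mc_psm_by = psm`).

```r
# Hypothetical reporter-ion intensities for one TMT plex; rows are PSMs.
ints <- data.frame(
  I126  = c(100, 200, 400),   # assumed reference channel
  I127N = c(200, 400, 800),
  I128N = c(100, 100, 100)
)
refs <- "I126"

# Denominator: mean of the reference channel(s) per row, or the row mean
# over all channels when no reference is present.
denom <- if (length(refs) > 0) {
  rowMeans(ints[, refs, drop = FALSE])
} else {
  rowMeans(ints)
}

# log2 ratios; `denom` has one value per row and is recycled down columns
log2_R <- log2(as.matrix(ints) / denom)
colnames(log2_R) <- sub("^I", "log2_R", colnames(ints))

# Median centering: subtract each sample's (column's) median log2 ratio
col_medians <- apply(log2_R, 2, median, na.rm = TRUE)
N_log2_R <- sweep(log2_R, 2, col_medians)
colnames(N_log2_R) <- paste0("N_", colnames(log2_R))
```

After centering, every column of `N_log2_R` has a median of zero, which is the "offset to zero" described for `mc_psm_by`.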

Nomenclature of pep_seq_mod:

Variable modification	Abbreviation
Non-terminal	A letter from upper to lower case, e.g., `mtFPEADILLK`
N-term	A hat to the left of a peptide sequence, e.g., `^QDGTHVVEAVDATHIGK`
C-term	A hat to the right of a peptide sequence, e.g., `DAYYNLCLPQRPnMI^`
Acetyl (Protein N-term)	An underscore to the left of a peptide sequence, e.g., `_mAsGVAVSDGVIK`
Amidated (Protein C-term)	An underscore to the right of a peptide sequence, e.g., `DAYYNLCLPQRPnMI_`
Other (Protein N-term)	A tilde to the left of a peptide sequence, e.g., `~mAsGVAVSDGVIK`
Other (Protein C-term)	A tilde to the right of a peptide sequence, e.g., `DAYYNLCLPQRPnMI~`
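
Strings in this notation can be mapped back to bare sequences with a couple of base R calls. The sketch below assumes the lower-case notation of the table above; the variable names `bare_seq` and `has_site_mod` are illustrative and are not proteoQ column keys.

```r
# Example strings taken from the nomenclature table above
pep_seq_mod <- c("mtFPEADILLK", "^QDGTHVVEAVDATHIGK",
                 "DAYYNLCLPQRPnMI_", "~mAsGVAVSDGVIK")

# Drop the terminal markers (^, _, ~) and restore upper-case residues
bare_seq <- toupper(gsub("[_~^]", "", pep_seq_mod))

# Flag entries with at least one lower-case (modified, non-terminal) residue
has_site_mod <- grepl("[a-z]", pep_seq_mod)

data.frame(pep_seq_mod, bare_seq, has_site_mod)
```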

Value

Outputs are interim and final PSM tables under the PSM sub-directory of dat_dir. Primary results are in standardized PSM tables of TMTset1_LCMSinj1_PSM_N.txt, TMTset2_LCMSinj1_PSM_N.txt, etc. The indexes of TMT experiment and LC/MS injection are indicated in the file names.

`Mascot`

Users will export PSM data from Mascot in .csv format and store them under the file folder indicated by dat_dir. The header information should be included during the .csv export. The file name(s) should start with the letter 'F' and end with a '.csv' extension (e.g., F004452.csv, F004453_this.csv etc.).

`MaxQuant`

Users will copy over msms.txt file(s) from MaxQuant to the dat_dir directory. The file name(s) should start with 'msms' and end with a '.txt' extension (e.g., msms.txt, msms_this.txt etc.).

`MSFragger`

Users will copy over psm.tsv file(s) from MSFragger to the dat_dir directory. The file name(s) should start with 'psm' and end with a '.tsv' extension (e.g., psm.tsv, psm_this.tsv etc.).

`Spectrum Mill`

Users will copy over PSMexport.1.ssv file(s) from Spectrum Mill to the dat_dir directory. The file name(s) should start with 'PSMexport' and end with a '.ssv' extension (e.g., PSMexport.ssv, PSMexport_this.ssv etc.).
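
The file-naming rules above can be checked before calling normPSM. The following sketch matches names against regular expressions written from the start/end rules stated in each section; the patterns are an interpretation of those rules rather than the ones used inside proteoQ, and the file listing is hypothetical.

```r
# Hypothetical listing; in practice this could come from list.files(dat_dir)
files <- c("F004452.csv", "F004453_this.csv", "msms.txt", "msms_this.txt",
           "psm.tsv", "PSMexport.1.ssv", "expt_smry.xlsx", "notes.txt")

# Assumed patterns: required prefix, anything in between, required extension
patterns <- c(
  Mascot       = "^F.*\\.csv$",
  MaxQuant     = "^msms.*\\.txt$",
  MSFragger    = "^psm.*\\.tsv$",
  SpectrumMill = "^PSMexport.*\\.ssv$"
)

# Files recognized per search engine
found <- lapply(patterns, function(p) grep(p, files, value = TRUE))
found
```

Files such as `expt_smry.xlsx` or `notes.txt` match none of the patterns and are left out of `found`.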

`Variable arguments and data files`

Variable argument (vararg) statements of filter_ and arrange_ are available in proteoQ for flexible filtration and ordering of data rows, via functions at users' interface. To take advantage of the feature, users need to be aware of the column keys in input files. As indicated by their names, filter_ and filter2_ perform row filtration against column keys from a primary data file, df, and secondary data file(s), df2, respectively. The same correspondence is applicable for arrange_ and arrange2_ varargs.

Users will typically employ either primary or secondary vararg statements, but not both. In the more extreme case of gspaMap(...), it links prnGSPA findings in df2 to the significance pVals and abundance fold changes in df for volcano plot visualizations by gene sets. The table below summarizes the df and the df2 for varargs in proteoQ.
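
How a set of `filter_` statements narrows down the rows of `df` can be sketched without any package dependencies. The helper `apply_filters` below is hypothetical and is not part of proteoQ; it uses `base::alist()` as a stand-in for `exprs()` and assumes that every symbol in a condition is a column key of the table.

```r
# Hypothetical PSM table with Mascot-style column keys
df <- data.frame(
  pep_seq    = c("AAK", "CCR", "DDK", "EEK"),
  pep_expect = c(0.01, 0.50, 0.05, 0.08),
  pep_rank   = c(1, 1, 2, 1),
  pep_exp_z  = c(2, 2, 3, 1),
  stringsAsFactors = FALSE
)

# Stand-ins for `filter_psms_at = exprs(...)` and `filter_psms_more = exprs(...)`
filters <- list(
  filter_psms_at   = alist(pep_expect <= .1),
  filter_psms_more = alist(pep_rank == 1, pep_exp_z > 1)
)

# Apply each logical condition in turn, keeping rows where it is TRUE
apply_filters <- function(df, filters) {
  for (conds in filters) {
    for (cond in conds) {
      # Assumption: all symbols in a condition refer to column keys
      missing_keys <- setdiff(all.vars(cond), names(df))
      if (length(missing_keys) > 0) {
        stop("Column key(s) not found: ", paste(missing_keys, collapse = ", "))
      }
      keep <- eval(cond, envir = df)
      df <- df[!is.na(keep) & keep, , drop = FALSE]
    }
  }
  df
}

res <- apply_filters(df, filters)
res$pep_seq
```

A condition that names a column absent from the table stops with an error in this sketch, which is the situation illustrated by the "bad vararg statement" case in the Examples.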

Utility	Vararg_	df	Vararg2_	df2
normPSM	filter_	Mascot, `F[...].csv`; MaxQuant, `msms[...].txt`; SM, `PSMexport[...].ssv`	NA	NA
PSM2Pep	NA	NA	NA	NA
mergePep	filter_	`TMTset1_LCMSinj1_Peptide_N.txt`	NA	NA
standPep	slice_	`Peptide.txt`	NA	NA
Pep2Prn	filter_	`Peptide.txt`	NA	NA
standPrn	slice_	`Protein.txt`	NA	NA
pepHist	filter_	`Peptide.txt`	NA	NA
prnHist	filter_	`Protein.txt`	NA	NA
pepSig	filter_	`Peptide[_impNA].txt`	NA	NA
prnSig	filter_	`Protein[_impNA].txt`	NA	NA
pepMDS	filter_	`Peptide[_impNA][_pVal].txt`	NA	NA
prnMDS	filter_	`Protein[_impNA][_pVal].txt`	NA	NA
pepPCA	filter_	`Peptide[_impNA][_pVal].txt`	NA	NA
prnPCA	filter_	`Protein[_impNA][_pVal].txt`	NA	NA
pepLDA	filter_	`Peptide[_impNA][_pVal].txt`	NA	NA
prnLDA	filter_	`Protein[_impNA][_pVal].txt`	NA	NA
pepEucDist	filter_	`Peptide[_impNA][_pVal].txt`	NA	NA
prnEucDist	filter_	`Protein[_impNA][_pVal].txt`	NA	NA
pepCorr_logFC	filter_	`Peptide[_impNA][_pVal].txt`	NA	NA
prnCorr_logFC	filter_	`Protein[_impNA][_pVal].txt`	NA	NA
pepHM	filter_, arrange_	`Peptide[_impNA][_pVal].txt`	NA	NA
prnHM	filter_, arrange_	`Protein[_impNA][_pVal].txt`	NA	NA
anal_prnTrend	filter_	`Protein[_impNA][_pVal].txt`	NA	NA
plot_prnTrend	NA	NA	filter2_	`[...]Protein_Trend_{NZ}[_impNA][...].txt`
anal_pepNMF	filter_	`Peptide[_impNA][_pVal].txt`	NA	NA
anal_prnNMF	filter_	`Protein[_impNA][_pVal].txt`	NA	NA
plot_pepNMFCon	NA	NA	filter2_	`[...]Peptide_NMF[...]_consensus.txt`
plot_prnNMFCon	NA	NA	filter2_	`[...]Protein_NMF[...]_consensus.txt`
plot_pepNMFCoef	NA	NA	filter2_	`[...]Peptide_NMF[...]_coef.txt`
plot_prnNMFCoef	NA	NA	filter2_	`[...]Protein_NMF[...]_coef.txt`
plot_metaNMF	filter_, arrange_	`Protein[_impNA][_pVal].txt`	NA	NA
prnGSPA	filter_	`Protein[_impNA]_pVals.txt`	NA	NA
prnGSPAHM	NA	NA	filter2_	`[...]Protein_GSPA_{NZ}[_impNA]_essmap.txt`
gspaMap	filter_	`Protein[_impNA]_pVal.txt`	filter2_	`[...]Protein_GSPA_{NZ}[_impNA].txt`
anal_prnString	filter_	`Protein[_impNA][_pVals].txt`	NA	NA

Examples


# ===================================
# PSM normalization
# ===================================

## !!!require the brief working example in `?load_expts`

## additional examples
# Mascot
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  
  # variable argument statement(s)
  filter_psms_at = exprs(pep_expect <= .1),
  filter_psms_more = exprs(pep_rank == 1, pep_exp_z > 1),
)

# MaxQuant
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  use_corrected_mqint = TRUE,
  rm_reverses = TRUE,
  
  # vararg statement(s)
  filter_psms_at = exprs(PEP <= 0.1),
)

# MSFragger
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),

  # vararg statement(s)
  filter_psms_at = exprs(Hyperscore >= 10),
)

# Spectrum Mill
normPSM(
  group_psm_by = pep_seq_mod,
  group_pep_by = prot_acc,
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  
  # vararg statement(s)
  filter_psms_at = exprs(score >= 10),
)

###############################################
## Custom entrez lookups
#  (1) can overwrite the `proteoQ` default for 
#      species in "human", "mouse" and "rat"
#  (2) and are required for `other` species
###############################################
# see also `?Uni2Entrez` or `?Ref2Entrez` for more examples
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("org.Hs.eg.db")
BiocManager::install("org.Mm.eg.db")

library(org.Hs.eg.db)
library(org.Mm.eg.db)

library(proteoQ)
Ref2Entrez(species = human)
Ref2Entrez(species = mouse)

# see also Uni2Entrez(...) for Uniprot to Entrez lookups

normPSM(
  group_psm_by = pep_seq_mod, 
  group_pep_by = gene, 
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  entrez = c("~/proteoQ/dbs/entrez/refseq_entrez_hs.rds", 
             "~/proteoQ/dbs/entrez/refseq_entrez_mm.rds"),
)


## Not run: 
# wrong fasta 
normPSM(
  fasta = "~/proteoQ/dbs/fasta/wrong.fasta",
)

# no mouse entry annotation
normPSM(
  fasta = "~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
)

# bad vararg statement
normPSM(
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
  filter_psms_at = exprs(column_key_not_in_psm_tables <= .1),
)
## End(Not run)

qzhang503/proteoQ documentation built on April 13, 2025, 8:31 a.m.