normPSM | R Documentation |
normPSM standardizes PSM results from database search engines.
normPSM(
dat_dir = NULL,
expt_smry = "expt_smry.xlsx",
frac_smry = "frac_smry.xlsx",
fasta = NULL,
entrez = NULL,
group_psm_by = c("pep_seq_mod", "pep_seq"),
group_pep_by = c("gene", "prot_acc"),
pep_unique_by = c("group", "protein", "none"),
mc_psm_by = c("peptide", "protein", "psm"),
scale_rptr_int = FALSE,
rptr_intco = 0,
rptr_intrange = c(0, 100),
rm_craps = FALSE,
rm_krts = FALSE,
rm_outliers = FALSE,
rm_allna = FALSE,
type_sd = c("log2_R", "N_log2_R", "Z_log2_R"),
lfq_mbr = TRUE,
mbr_ret_tol = 25,
purge_phosphodata = TRUE,
annot_kinases = FALSE,
plot_rptr_int = TRUE,
plot_log2FC_cv = TRUE,
use_lowercase_aa = FALSE,
use_spec_counts = FALSE,
use_corrected_mqint = TRUE,
rm_reverses = TRUE,
...
)
dat_dir |
A character string to the working directory. The default is to match the value under the global environment. |
expt_smry |
A character string to a |
frac_smry |
A character string to a |
fasta |
Character string(s) to the name(s) of fasta file(s) with
prepended directory path. The |
entrez |
Character string(s) to the name(s) of entrez file(s) with
prepended directory path. At the |
group_psm_by |
A character string specifying the method in PSM grouping.
At the |
group_pep_by |
A character string specifying the method in peptide
grouping. At the |
pep_unique_by |
A character string for annotating the uniqueness of
peptides. At the |
mc_psm_by |
A character string specifying the method in the median
centering of PSM |
scale_rptr_int |
Logical; if TRUE, scales (up) MS2 reporter-ion
intensities by MS1 precursor intensity: |
rptr_intco |
Numeric; the threshold of reporter-ion intensity (TMT:
|
rptr_intrange |
Numeric vector at length two. The argument specifies the
range of reporter-ion intensities (TMT: |
rm_craps |
Logical; if TRUE, cRAP proteins will be removed. The default is FALSE. |
rm_krts |
Logical; if TRUE, keratin entries will be removed. The default is FALSE. |
rm_outliers |
Logical; if TRUE, PSM outlier removals will be performed
for peptides with more than two identifying PSMs. Dixon's method will be
used when |
rm_allna |
Logical; if TRUE, removes data rows that are exclusively NA
across ratio columns of |
type_sd |
Character string; the type of log2Ratios for SD calculations.
The value is one |
lfq_mbr |
Logical; if TRUE, performs match-between-run (MBR) with Mzion
LFQ data. Also requires |
mbr_ret_tol |
Retention time tolerance (in seconds) for LFQ-MBR. |
purge_phosphodata |
Logical; if TRUE and phosphorylation present as variable modification(s), entries without phosphorylation will be removed. The default is TRUE. |
annot_kinases |
Logical; if TRUE, proteins of human or mouse origins will be annotated with their kinase attributes. The default is FALSE. |
plot_rptr_int |
Logical; if TRUE, the distributions of reporter-ion intensities will be plotted. The default is TRUE. The argument is also applicable to the precursor intensity with MaxQuant LFQ. |
plot_log2FC_cv |
Logical; if TRUE, the distributions of the CV of
peptide |
use_lowercase_aa |
Logical; if TRUE, modifications in amino acid residues
will be abbreviated with lower-case and/or |
use_spec_counts |
Logical; if TRUE, uses spectrum counts for quantitation with Mascot or Mzion outputs. |
use_corrected_mqint |
A logical argument for uses with |
rm_reverses |
A logical argument for uses with |
... |
|
In each primary output file, "...PSM_N.txt", values under columns log2_R... are base-2 logarithmic ratios relative to the average intensity of the reference(s) within each multiplex TMT set, or to the row-mean intensity within each plex if no reference is present. Values under columns N_log2_R... are log2_R... values after median-centering alignment. Values under columns I... are raw reporter-ion intensities from database searches. Values under columns N_I... are normalized reporter-ion intensities. Values under columns sd_log2_R... are the standard deviations of the log2FC of peptides from their ascribing PSMs. Character strings under pep_seq_mod denote peptide sequences with applicable variable modifications.
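The relationship between the I..., log2_R... and N_log2_R... columns can be illustrated with a minimal base R sketch. It is not the proteoQ implementation: the intensities are made up, the choice of the first channel as the reference is an assumption, and the median centering is applied per sample column, which is only one of the possible alignment schemes.

```r
# Toy reporter-ion intensities: 3 PSMs x 4 channels (made-up values).
intensity <- matrix(
  c(100, 200, 400, 800,
     50,  50, 100, 200,
     10,  40,  40, 160),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("psm", 1:3), paste0("I", 126:129))
)

# Assumption: the first channel is the only reference in this plex.
# Set ref_cols <- character(0) to fall back to the row mean.
ref_cols <- "I126"

# Denominator per PSM: mean reference intensity, or the row mean without references.
denom <- if (length(ref_cols) > 0) {
  rowMeans(intensity[, ref_cols, drop = FALSE], na.rm = TRUE)
} else {
  rowMeans(intensity, na.rm = TRUE)
}

# log2_R: base-2 log ratio of every channel to the per-row denominator.
log2_R <- log2(intensity / denom)

# N_log2_R: subtract each column's median so that all column medians sit at zero.
col_medians <- apply(log2_R, 2, median, na.rm = TRUE)
N_log2_R <- sweep(log2_R, 2, col_medians, FUN = "-")
```

After the centering step every column of N_log2_R has a median of zero, which is what makes samples comparable on a common scale.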
Nomenclature of pep_seq_mod:
Variable modification | Abbreviation |
Non-terminal | A letter changed from upper to lower case, e.g., mtFPEADILLK |
N-term | A hat to the left of a peptide sequence, e.g., ^QDGTHVVEAVDATHIGK |
C-term | A hat to the right of a peptide sequence, e.g., DAYYNLCLPQRPnMI^ |
Acetyl (Protein N-term) | An underscore to the left of a peptide sequence, e.g., _mAsGVAVSDGVIK |
Amidated (Protein C-term) | An underscore to the right of a peptide sequence, e.g., DAYYNLCLPQRPnMI_ |
Other (Protein N-term) | A tilde to the left of a peptide sequence, e.g., ~mAsGVAVSDGVIK |
Other (Protein C-term) | A tilde to the right of a peptide sequence, e.g., DAYYNLCLPQRPnMI~ |
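The nomenclature above can be decoded with a few lines of base R. The helper below, describe_pep_seq_mod, is hypothetical and not part of proteoQ; it only reads the terminal markers and counts lower-case residues, and it assumes that upper-casing the stripped string recovers the bare sequence.

```r
# Lookup tables taken from the nomenclature table (marker -> modification class).
n_term_key <- c("^" = "N-term",
                "_" = "Acetyl (Protein N-term)",
                "~" = "Other (Protein N-term)")
c_term_key <- c("^" = "C-term",
                "_" = "Amidated (Protein C-term)",
                "~" = "Other (Protein C-term)")

# Hypothetical helper: summarizes the modification markers in pep_seq_mod strings.
describe_pep_seq_mod <- function(x) {
  first <- substr(x, 1, 1)
  last  <- substr(x, nchar(x), nchar(x))
  core  <- gsub("[_~^]", "", x)  # drop terminal markers, keep residues
  data.frame(
    pep_seq_mod = x,
    pep_seq     = toupper(core),                 # assumed bare sequence
    n_term_mod  = unname(n_term_key[first]),     # NA when no N-terminal marker
    c_term_mod  = unname(c_term_key[last]),      # NA when no C-terminal marker
    n_lower     = lengths(regmatches(core, gregexpr("[[:lower:]]", core))),
    stringsAsFactors = FALSE
  )
}

res <- describe_pep_seq_mod(c(
  "mtFPEADILLK", "^QDGTHVVEAVDATHIGK", "DAYYNLCLPQRPnMI^",
  "_mAsGVAVSDGVIK", "DAYYNLCLPQRPnMI~"
))
print(res)
```

The n_lower column counts residues carrying a non-terminal variable modification; it does not say which modification is present, since that information is not encoded in the string.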
Outputs are interim and final PSM tables under the PSM subdirectory of dat_dir. Primary results are in standardized PSM tables named TMTset1_LCMSinj1_PSM_N.txt, TMTset2_LCMSinj1_PSM_N.txt, etc. The indexes of the TMT experiment and the LC/MS injection are indicated in the file names.
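Because the TMT set and LC/MS injection indexes are encoded in the output file names, they can be recovered with a regular expression. The function parse_psm_filename below is a hypothetical helper written for illustration, assuming the file names follow exactly the TMTset<i>_LCMSinj<j>_PSM_N.txt pattern shown above.

```r
# Hypothetical helper: extract TMT set and LC/MS injection indexes from file names.
parse_psm_filename <- function(files) {
  pattern <- "^TMTset([0-9]+)_LCMSinj([0-9]+)_PSM_N\\.txt$"
  base <- basename(files)
  hits <- regmatches(base, regexec(pattern, base))
  # Each hit holds the full match plus two capture groups; non-matches are empty.
  idx <- vapply(
    hits,
    function(h) if (length(h) == 3) as.integer(h[2:3]) else c(NA_integer_, NA_integer_),
    integer(2)
  )
  data.frame(file = base, tmt_set = idx[1, ], lcms_inj = idx[2, ],
             stringsAsFactors = FALSE)
}

psm_tables <- parse_psm_filename(c(
  "TMTset1_LCMSinj1_PSM_N.txt",
  "TMTset2_LCMSinj1_PSM_N.txt",
  "readme.txt"
))
print(psm_tables)
```

In practice the input would come from something like list.files(file.path(dat_dir, "PSM")), with dat_dir being the working directory passed to normPSM.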
Mascot
Users will export PSM data from Mascot in .csv format and store the files under the folder indicated by dat_dir. The header information should be included during the .csv export. The file name(s) should start with the letter 'F' and end with a '.csv' extension (e.g., F004452.csv, F004453_this.csv, etc.).
MaxQuant
Users will copy over msms.txt file(s) from MaxQuant to the dat_dir directory. The file name(s) should start with 'msms' and end with a '.txt' extension (e.g., msms.txt, msms_this.txt, etc.).
MSFragger
Users will copy over psm.tsv file(s) from MSFragger to the dat_dir directory. The file name(s) should start with 'psm' and end with a '.tsv' extension (e.g., psm.tsv, psm_this.tsv, etc.).
Spectrum Mill
Users will copy over PSMexport.1.ssv file(s) from Spectrum Mill to the dat_dir directory. The file name(s) should start with 'PSMexport' and end with a '.ssv' extension (e.g., PSMexport.ssv, PSMexport_this.ssv, etc.).
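The file-naming rules for the four search engines can be checked before calling normPSM. The sketch below is a hypothetical pre-flight helper in base R, not a proteoQ function; the regular expressions are a literal reading of the naming rules stated above.

```r
# Name patterns per search engine, transcribed from the naming rules above.
psm_file_patterns <- c(
  "Mascot"        = "^F.*\\.csv$",
  "MaxQuant"      = "^msms.*\\.txt$",
  "MSFragger"     = "^psm.*\\.tsv$",
  "Spectrum Mill" = "^PSMexport.*\\.ssv$"
)

# Hypothetical helper: label each file with the first engine whose pattern matches.
classify_psm_files <- function(files) {
  engine <- rep(NA_character_, length(files))
  for (nm in names(psm_file_patterns)) {
    hit <- grepl(psm_file_patterns[[nm]], basename(files))
    engine[is.na(engine) & hit] <- nm
  }
  setNames(engine, files)
}

files <- c("F004452.csv", "F004453_this.csv", "msms_this.txt",
           "psm.tsv", "PSMexport.1.ssv", "notes.txt")
engines <- classify_psm_files(files)
print(engines)
```

Files labeled NA do not match any of the documented patterns. To scan a real folder, files could be replaced with list.files(dat_dir).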
Variable arguments and data files
Variable argument (vararg) statements of filter_ and arrange_ are available in proteoQ for flexible filtration and ordering of data rows through the user-facing functions. To take advantage of the feature, users need to be aware of the column keys in the input files. As indicated by their names, filter_ and filter2_ perform row filtration against column keys from a primary data file, df, and secondary data file(s), df2, respectively. The same correspondence applies to the arrange_ and arrange2_ varargs.
Users will typically employ either primary or secondary vararg statements, but not both. In the more involved case of gspaMap(...), both are used: the utility links prnGSPA findings in df2 to the significance pVals and abundance fold changes in df for volcano plot visualizations by gene sets. The table below summarizes the df and the df2 for varargs in proteoQ.
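The idea behind filter_ statements, unevaluated conditions applied to column keys of a data table, can be mimicked in base R. The sketch below is an illustration only: proteoQ takes statements built with exprs(...), whereas this example uses base quote() and a hypothetical apply_filters helper, with a toy table whose column names are borrowed from the Mascot example.

```r
# Toy PSM table; column keys borrowed from the Mascot example, values made up.
df <- data.frame(
  pep_seq    = c("AAK", "CCR", "DDK", "EEK"),
  pep_expect = c(0.01, 0.50, 0.05, 0.08),
  pep_rank   = c(1, 1, 2, 1),
  pep_exp_z  = c(2, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Each named statement holds one or more unevaluated conditions.
filters <- list(
  filter_psms_at   = list(quote(pep_expect <= 0.1)),
  filter_psms_more = list(quote(pep_rank == 1), quote(pep_exp_z > 1))
)

# Hypothetical helper: apply every condition in turn, keeping rows where it is TRUE.
apply_filters <- function(df, filters) {
  for (stmt in filters) {
    for (cond in stmt) {
      keep <- eval(cond, envir = df, enclos = parent.frame())
      df <- df[!is.na(keep) & keep, , drop = FALSE]
    }
  }
  df
}

filtered <- apply_filters(df, filters)
print(filtered)
```

A condition that names a column absent from the table fails with an "object not found" error in this sketch, which is why knowing the column keys of the input file matters.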
Utility | Vararg_ | df | Vararg2_ | df2 |
normPSM | filter_ | Mascot, F[...].csv; MaxQuant, msms[...].txt; SM, PSMexport[...].ssv | NA | NA |
PSM2Pep | NA | NA | NA | NA |
mergePep | filter_ | TMTset1_LCMSinj1_Peptide_N.txt | NA | NA |
standPep | slice_ | Peptide.txt | NA | NA |
Pep2Prn | filter_ | Peptide.txt | NA | NA |
standPrn | slice_ | Protein.txt | NA | NA |
pepHist | filter_ | Peptide.txt | NA | NA |
prnHist | filter_ | Protein.txt | NA | NA |
pepSig | filter_ | Peptide[_impNA].txt | NA | NA |
prnSig | filter_ | Protein[_impNA].txt | NA | NA |
pepMDS | filter_ | Peptide[_impNA][_pVal].txt | NA | NA |
prnMDS | filter_ | Protein[_impNA][_pVal].txt | NA | NA |
pepPCA | filter_ | Peptide[_impNA][_pVal].txt | NA | NA |
prnPCA | filter_ | Protein[_impNA][_pVal].txt | NA | NA |
pepLDA | filter_ | Peptide[_impNA][_pVal].txt | NA | NA |
prnLDA | filter_ | Protein[_impNA][_pVal].txt | NA | NA |
pepEucDist | filter_ | Peptide[_impNA][_pVal].txt | NA | NA |
prnEucDist | filter_ | Protein[_impNA][_pVal].txt | NA | NA |
pepCorr_logFC | filter_ | Peptide[_impNA][_pVal].txt | NA | NA |
prnCorr_logFC | filter_ | Protein[_impNA][_pVal].txt | NA | NA |
pepHM | filter_, arrange_ | Peptide[_impNA][_pVal].txt | NA | NA |
prnHM | filter_, arrange_ | Protein[_impNA][_pVal].txt | NA | NA |
anal_prnTrend | filter_ | Protein[_impNA][_pVal].txt | NA | NA |
plot_prnTrend | NA | NA | filter2_ | [...]Protein_Trend_{NZ}[_impNA][...].txt |
anal_pepNMF | filter_ | Peptide[_impNA][_pVal].txt | NA | NA |
anal_prnNMF | filter_ | Protein[_impNA][_pVal].txt | NA | NA |
plot_pepNMFCon | NA | NA | filter2_ | [...]Peptide_NMF[...]_consensus.txt |
plot_prnNMFCon | NA | NA | filter2_ | [...]Protein_NMF[...]_consensus.txt |
plot_pepNMFCoef | NA | NA | filter2_ | [...]Peptide_NMF[...]_coef.txt |
plot_prnNMFCoef | NA | NA | filter2_ | [...]Protein_NMF[...]_coef.txt |
plot_metaNMF | filter_, arrange_ | Protein[_impNA][_pVal].txt | NA | NA |
prnGSPA | filter_ | Protein[_impNA]_pVals.txt | NA | NA |
prnGSPAHM | NA | NA | filter2_ | [...]Protein_GSPA_{NZ}[_impNA]_essmap.txt |
gspaMap | filter_ | Protein[_impNA]_pVal.txt | filter2_ | [...]Protein_GSPA_{NZ}[_impNA].txt |
anal_prnString | filter_ | Protein[_impNA][_pVals].txt | NA | NA |
Metadata
load_expts for metadata preparation and a reduced working example in data normalization
Data normalization
normPSM for extended examples in PSM data normalization
PSM2Pep for extended examples in PSM to peptide summarization
mergePep for extended examples in peptide data merging
standPep for extended examples in peptide data normalization
Pep2Prn for extended examples in peptide to protein summarization
standPrn for extended examples in protein data normalization
purgePSM and purgePep for extended examples in data purging
pepHist and prnHist for extended examples in histogram visualization
extract_raws and extract_psm_raws for extracting MS file names
Variable arguments of filter_...
contain_str, contain_chars_in, not_contain_str, not_contain_chars_in, start_with_str, end_with_str, start_with_chars_in and ends_with_chars_in for data subsetting by character strings
Missing values
pepImp and prnImp for missing value imputation
Informatics
pepSig and prnSig for significance tests
pepVol and prnVol for volcano plot visualization
prnGSPA for gene set enrichment analysis by protein significance pVals
gspaMap for mapping GSPA to volcano plot visualization
prnGSPAHM for heat map and network visualization of GSPA results
prnGSVA for gene set variance analysis
prnGSEA for data preparation for online GSEA
pepMDS and prnMDS for MDS visualization
pepPCA and prnPCA for PCA visualization
pepLDA and prnLDA for LDA visualization
pepHM and prnHM for heat map visualization
pepCorr_logFC, prnCorr_logFC, pepCorr_logInt and prnCorr_logInt for correlation plots
anal_prnTrend and plot_prnTrend for trend analysis and visualization
anal_pepNMF, anal_prnNMF, plot_pepNMFCon, plot_prnNMFCon, plot_pepNMFCoef, plot_prnNMFCoef and plot_metaNMF for NMF analysis and visualization
Custom databases
Uni2Entrez for lookups between UniProt accessions and Entrez IDs
Ref2Entrez for lookups among RefSeq accessions, gene names and Entrez IDs
prepGO for gene ontology
prepMSig for molecular signatures
prepString and anal_prnString for STRING-DB
Column keys in PSM, peptide and protein outputs
system.file("extdata", "psm_keys.txt", package = "proteoQ")
system.file("extdata", "peptide_keys.txt", package = "proteoQ")
system.file("extdata", "protein_keys.txt", package = "proteoQ")
# ===================================
# PSM normalization
# ===================================
## !!! requires the brief working example in `?load_expts`
## additional examples
# Mascot
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = prot_acc,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
# variable argument statement(s)
filter_psms_at = exprs(pep_expect <= .1),
filter_psms_more = exprs(pep_rank == 1, pep_exp_z > 1),
)
# MaxQuant
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = prot_acc,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
use_corrected_mqint = TRUE,
rm_reverses = TRUE,
# vararg statement(s)
filter_psms_at = exprs(PEP <= 0.1),
)
# MSFragger
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = prot_acc,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
# vararg statement(s)
filter_psms_at = exprs(Hyperscore >= 10),
)
# Spectrum Mill
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = prot_acc,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
# vararg statement(s)
filter_psms_at = exprs(score >= 10),
)
###############################################
## Custom entrez lookups
# (1) can overwrite the `proteoQ` default for
# species in "human", "mouse" and "rat"
# (2) and are required for `other` species
###############################################
# see also `?Uni2Entrez` or `?Ref2Entrez` for more examples
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("org.Mm.eg.db")
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(proteoQ)
Ref2Entrez(species = human)
Ref2Entrez(species = mouse)
# see also Uni2Entrez(...) for Uniprot to Entrez lookups
normPSM(
group_psm_by = pep_seq_mod,
group_pep_by = gene,
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
entrez = c("~/proteoQ/dbs/entrez/refseq_entrez_hs.rds",
"~/proteoQ/dbs/entrez/refseq_entrez_mm.rds"),
)
## Not run:
# wrong fasta
normPSM(
fasta = "~/proteoQ/dbs/fasta/wrong.fasta",
)
# no mouse entry annotation
normPSM(
fasta = "~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
)
# bad vararg statement
normPSM(
fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
"~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
filter_psms_at = exprs(column_key_not_in_psm_tables <= .1),
)
## End(Not run)