load_expts: Loads TMT or LFQ experiments
In qzhang503/proteoQ: Processing and Informatic Analysis of Mass Spectrometric Data

load_expts

R Documentation

Loads TMT or LFQ experiments

Description

load_expts processes .xlsx or .csv files containing the metadata of TMT or LFQ experiments. For simplicity, .xlsx is assumed throughout this document.

Usage

load_expts(
  dat_dir = NULL,
  expt_smry = "expt_smry.xlsx",
  frac_smry = "frac_smry.xlsx"
)

Arguments

`dat_dir`	A character string giving the path to the working directory. The default is to match the value under the global environment.
`expt_smry`	A character string to a `.xlsx` file containing the metadata of TMT or LFQ experiments. The default is `expt_smry.xlsx`.
`frac_smry`	A character string to a `.xlsx` file containing peptide fractionation summary. The default is `frac_smry.xlsx`.

`expt_smry.xlsx`

The expt_smry.xlsx file should be placed immediately under the folder defined by dat_dir. The tab containing the metadata of TMT or LFQ experiments should be named Setup. The spreadsheet therein comprises three tiers of fields: (1) essential, (2) optional default and (3) optional open. The essential columns contain the mandatory information of the experiments. The optional default columns serve as the fields for default lookups in sample selection, grouping, ordering, aesthetics, etc. The optional open fields allow users to define their own analysis, aesthetics, etc.

Essential column	Description
Sample_ID	Unique sample IDs
TMT_Channel	TMT channel names: `126`, `127N`, `127C` etc. (left void for LFQ)
TMT_Set	TMT experiment indexes 1, 2, 3, ... (auto-filled for LFQ)
LCMS_Injection	LC/MS injection indexes 1, 2, 3, ... under a `TMT_Set`
RAW_File	MS data file names originated by `MS` software(s)
Reference	Labels indicating reference samples in TMT or LFQ experiments

Sample_ID: values should be unique for entries at a unique combination of TMT_Channel and TMT_Set, or voided for unused entries. Samples with the same indexes of TMT_Channel and TMT_Set but different indexes of LCMS_Injection should have the same value in Sample_ID. No white space or special characters are allowed. See also posts for sample exclusion.
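
The Sample_ID rule above can be checked before calling load_expts. The following is a minimal base-R sketch for the TMT case, not part of proteoQ: the data frame `setup` and the helper `check_sample_ids` are hypothetical, and in practice `setup` would hold the contents of the Setup tab of expt_smry.xlsx.

```r
# Hypothetical Setup table: two TMT channels, one TMT set, two LC/MS injections
setup <- data.frame(
  Sample_ID      = c("W2_BI_1", "W2_JHU_1", "W2_BI_1", "W2_JHU_1"),
  TMT_Channel    = c("126", "127N", "126", "127N"),
  TMT_Set        = c(1, 1, 1, 1),
  LCMS_Injection = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a proteoQ function). Returns TRUE when
# (1) each TMT_Channel/TMT_Set combination carries exactly one Sample_ID
#     across its LC/MS injections,
# (2) each Sample_ID belongs to exactly one TMT_Channel/TMT_Set combination,
# (3) no Sample_ID contains white space.
# Void (NA or empty) Sample_ID entries are treated as unused and skipped.
check_sample_ids <- function(df) {
  used <- df[!is.na(df$Sample_ID) & nzchar(df$Sample_ID), ]
  key <- paste(used$TMT_Channel, used$TMT_Set, sep = "@")
  ids_per_key <- tapply(used$Sample_ID, key, function(x) length(unique(x)))
  keys_per_id <- tapply(key, used$Sample_ID, function(x) length(unique(x)))
  all(ids_per_key == 1L) &&
    all(keys_per_id == 1L) &&
    !any(grepl("[[:space:]]", used$Sample_ID))
}

check_sample_ids(setup)
```

The check for other special characters is left out on purpose, since the exact set of disallowed characters is not spelled out here.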

RAW_File: (a) For analysis with off-line fractionation of peptides before LC/MS, values under the RAW_File column should be left void. Instead, the correspondence between the fraction numbers and RAW_File names should be specified in a separate file, for example, frac_smry.xlsx. (b) For analysis without off-line fractionation, it is likewise recommended to leave the RAW_File column blank and instead enter the MS file names in frac_smry.xlsx.

The set of RAW_File names in metadata needs to be identifiable in PSM data. Subtle mismatches might occur when OS file names were altered by MS users and thus differ from those recorded internally in MS data for parsing by search engine(s). In that case, machine-generated MS file names should be used. In addition, MS files may occasionally make no contributions to PSM findings. In that case, users will be prompted to remove these MS file names.

Utilities extract_raws and extract_psm_raws may aid matching MS file names between metadata and PSM data. Utility extract_raws extracts the names of MS files under a file folder. Utility extract_psm_raws extracts the names of MS files that are available in PSM data.
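
Conceptually, the comparison these utilities support is a set difference between the file names listed in the metadata and those found in the PSM data. A minimal base-R sketch with made-up file names follows; the vectors `meta_raws` and `psm_raws` are hypothetical stand-ins for the RAW_File column and for the names recovered from PSM data, respectively.

```r
# Hypothetical file names; the third file was renamed at the OS level
meta_raws <- c("ms_1.raw", "ms_2.raw", "ms_3.raw")
psm_raws  <- c("ms_1.raw", "ms_2.raw", "MS_3_renamed.raw")

# Listed in metadata but not identifiable in PSM data
missing_in_psm <- setdiff(meta_raws, psm_raws)

# Present in PSM data but not listed in metadata
missing_in_meta <- setdiff(psm_raws, meta_raws)

missing_in_psm
missing_in_meta
```

Both results being empty would indicate that the two sets of names agree.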

Reference: reference entry(entries) are indicated with non-void string(s).

Optional default column	Description
Select	Samples to be selected for indicated analysis
Group	Aesthetic labels annotating the prior knowledge of sample groups, e.g., Ctrl_T1, Ctrl_T2, Disease_T1, Disease_T2, ...
Order	Numeric labels specifying the order of sample `groups`
Fill	Aesthetic labels for sample annotation by filled color
Color	Aesthetic labels for sample annotation by edge color
Shape	Aesthetic labels for sample annotation by shape
Size	Aesthetic labels for sample annotation by size
Alpha	Aesthetic labels for sample annotation by transparency

Exemplary optional open column	Description
Term	Categorical terms for statistical modeling.
Peptide_Yield	Yields of peptides in sample handling

`frac_smry.xlsx`

Column	Description
Sample_ID	Unique sample IDs (only required with LFQ)
TMT_Set	TMT experiment indexes (auto-filled for LFQ)
LCMS_Injection	LC/MS injection indexes
Fraction	Fraction indexes under a `TMT_Set`
RAW_File	MS data file names
PSM_File	Names of PSM files. Required only when one `RAW_File` can be linked to multiple PSM files (e.g. F012345.csv and F012346.csv both from ms_1.raw).
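
To make the layout concrete, the sketch below builds a small fractionation table in base R for one TMT set, one LC/MS injection and three fractions, and writes it out as a .csv file. The file names `ms_f1.raw` etc. are invented for illustration, the optional Sample_ID and PSM_File columns are omitted, and the output goes to a temporary folder rather than a real dat_dir.

```r
# Hypothetical fractionation summary: 1 TMT set x 1 injection x 3 fractions
frac_smry <- data.frame(
  TMT_Set        = rep(1, 3),
  LCMS_Injection = rep(1, 3),
  Fraction       = 1:3,
  RAW_File       = c("ms_f1.raw", "ms_f2.raw", "ms_f3.raw"),
  stringsAsFactors = FALSE
)

# Write to a temporary location (an assumption for this sketch only)
out_file <- file.path(tempdir(), "frac_smry.csv")
write.csv(frac_smry, out_file, row.names = FALSE)

# Read back to confirm the columns survive the round trip
read.csv(out_file)
```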

Data normalization
normPSM for extended examples in PSM data normalization
PSM2Pep for extended examples in PSM to peptide summarization
mergePep for extended examples in peptide data merging
standPep for extended examples in peptide data normalization
Pep2Prn for extended examples in peptide to protein summarization
standPrn for extended examples in protein data normalization
purgePSM and purgePep for extended examples in data purging
pepHist and prnHist for extended examples in histogram visualization
extract_raws and extract_psm_raws for extracting MS file names

User-friendly utilities for variable arguments of 'filter_...'
contain_str, contain_chars_in, not_contain_str, not_contain_chars_in, start_with_str, end_with_str, start_with_chars_in and ends_with_chars_in for data subsetting by character strings

Missing values
pepImp and prnImp for missing value imputation

Informatics
pepSig and prnSig for significance tests
pepVol and prnVol for volcano plot visualization
prnGSPA for gene set enrichment analysis by protein significance pVals
gspaMap for mapping GSPA to volcano plot visualization
prnGSPAHM for heat map and network visualization of GSPA results
prnGSVA for gene set variance analysis
prnGSEA for data preparation for online GSEA
pepMDS and prnMDS for MDS visualization
pepPCA and prnPCA for PCA visualization
pepLDA and prnLDA for LDA visualization
pepHM and prnHM for heat map visualization
pepCorr_logFC, prnCorr_logFC, pepCorr_logInt and prnCorr_logInt for correlation plots
anal_prnTrend and plot_prnTrend for trend analysis and visualization
anal_pepNMF, anal_prnNMF, plot_pepNMFCon, plot_prnNMFCon, plot_pepNMFCoef, plot_prnNMFCoef and plot_metaNMF for NMF analysis and visualization

Custom databases
Uni2Entrez for lookups between UniProt accessions and Entrez IDs
Ref2Entrez for lookups among RefSeq accessions, gene names and Entrez IDs
prepGO for gene ontology
prepMSig for molecular signatures
prepString and anal_prnString for STRING-DB

Workflow scripts
# TMT
system.file("extdata", "workflow_tmt_base.R", package = "proteoQ")
system.file("extdata", "workflow_tmt_ext.R", package = "proteoQ")

# LFQ
system.file("extdata", "workflow_lfq_base.R", package = "proteoQ")

Metadata files
# TMT, no fractionation — OK without 'frac_smry.xlsx'
# (a. no references)
system.file("extdata", "expt_smry_no_prefrac.xlsx", package = "proteoQDA")
# (b. W2 and W16 references)
system.file("extdata", "expt_smry_no_prefrac_ref_w2_w16.xlsx", package = "proteoQDA")

# TMT, prefractionation
# (a. no references)
system.file("extdata", "expt_smry_gtmt.xlsx", package = "proteoQDA")
system.file("extdata", "frac_smry_gtmt.xlsx", package = "proteoQDA")

# (b. W2 references)
system.file("extdata", "expt_smry_ref_w2.xlsx", package = "proteoQDA")
system.file("extdata", "frac_smry_gtmt.xlsx", package = "proteoQDA")

# (c. W2 and W16 references)
system.file("extdata", "expt_smry_ref_w2_w16.xlsx", package = "proteoQDA")
system.file("extdata", "frac_smry_gtmt.xlsx", package = "proteoQDA")

# TMT, prefractionation (global + phospho)
system.file("extdata", "expt_smry_tmt_cmbn.xlsx", package = "proteoQDA")
system.file("extdata", "frac_smry_tmt_cmbn.xlsx", package = "proteoQDA")

# TMT, prefractionation, one MS to multiple PSM files
system.file("extdata", "expt_smry_psmfiles.xlsx", package = "proteoQDA")
system.file("extdata", "frac_smry_psmfiles.xlsx", package = "proteoQDA")

# TMT, prefractionation, mixed-plexes
# (column PSM_File needed; as with this example,
# mixed-plexes results are actually from the same MS files
# but searched separately at 6- and 10-plex settings!)
system.file("extdata", "expt_smry_mixplexes.xlsx", package = "proteoQDA")
system.file("extdata", "frac_smry_mixplexes.xlsx", package = "proteoQDA")

# LFQ, prefractionation
system.file("extdata", "expt_smry_plfq.xlsx", package = "proteoQDA")
system.file("extdata", "frac_smry_plfq.xlsx", package = "proteoQDA")

Column keys in PSM, peptide and protein outputs
system.file("extdata", "psm_keys.txt", package = "proteoQ")
system.file("extdata", "peptide_keys.txt", package = "proteoQ")
system.file("extdata", "protein_keys.txt", package = "proteoQ")

MS1 peptide masses
calc_pepmasses for mono-isotopic masses of peptides from fasta databases
calc_monopeptide for mono-isotopic masses of peptides from individual sequences
parse_unimod for parsing Unimod fixed modifications, variable modifications and neutral losses
find_unimod for finding a Unimod

Examples


# ***********************************
# ************    TMT    ************
# ***********************************
  
# ===================================
# Fasta and PSM files
# ===================================
# fasta (all platforms)
library(proteoQDA)
fasta_dir <- "~/proteoQ/dbs/fasta/refseq"
copy_refseq_hs(fasta_dir)
copy_refseq_mm(fasta_dir)

# working directory (all platforms)
dat_dir <- "~/proteoQ/examples"

# metadata (all platforms)
copy_exptsmry_gtmt(dat_dir)
copy_fracsmry_gtmt(dat_dir)

# PSM (choose one of the platforms)
choose_one <- TRUE
if (!choose_one) {
  ## Mascot
  copy_mascot_gtmt(dat_dir)
  
  ## or MaxQuant
  # copy_maxquant_gtmt(dat_dir)
  
  ## or MSFragger
  # copy_msfragger_gtmt(dat_dir)
  
  ## or proteoM
  # copy_proteom_gtmt(dat_dir)
  
  ## or Spectrum Mill
  # (temporarily unavailable)
}

# ===================================
# PSM, peptide and protein processing
# ===================================
library(proteoQ)
load_expts("~/proteoQ/examples")

# PSM data standardization
normPSM(
  group_psm_by = pep_seq_mod, 
  group_pep_by = gene, 
  annot_kinases = TRUE, 
  
  # no default and required
  fasta = c("~/proteoQ/dbs/fasta/refseq/refseq_hs_2013_07.fasta",
            "~/proteoQ/dbs/fasta/refseq/refseq_mm_2013_07.fasta"),
)

# optional PSM purging
purgePSM()

# PSMs to peptides
PSM2Pep()

# peptide data merging
mergePep()

# peptide data standardization
standPep()

# peptide data histograms
pepHist()

# optional peptide purging
purgePep()

# peptides to proteins
Pep2Prn(use_unique_pep = TRUE)

# protein data standardization
standPrn()

# protein data histograms
prnHist()

# ===================================
# Optional significance tests
# (no NA imputation)
# ===================================
pepSig(
  W2_bat = ~ Term["W2.BI.TMT2-W2.BI.TMT1", 
                  "W2.JHU.TMT2-W2.JHU.TMT1", 
                  "W2.PNNL.TMT2-W2.PNNL.TMT1"],
  W2_loc = ~ Term_2["W2.BI-W2.JHU", 
                    "W2.BI-W2.PNNL", 
                    "W2.JHU-W2.PNNL"],
  W16_vs_W2 = ~ Term_3["W16-W2"], 
)

prnSig()

# ===================================
# optional NA imputation
# ===================================
pepImp(m = 2, maxit = 2)
prnImp(m = 5, maxit = 5)

# ===================================
# Optional significance tests
# (with NA imputation)
# ===================================
pepSig(
  impute_na = TRUE, 
  W2_bat = ~ Term["W2.BI.TMT2-W2.BI.TMT1", 
                  "W2.JHU.TMT2-W2.JHU.TMT1", 
                  "W2.PNNL.TMT2-W2.PNNL.TMT1"],
  W2_loc = ~ Term_2["W2.BI-W2.JHU", 
                    "W2.BI-W2.PNNL", 
                    "W2.JHU-W2.PNNL"],
  W16_vs_W2 = ~ Term_3["W16-W2"], 
)

prnSig(impute_na = TRUE)



# ***********************************
# ************    LFQ    ************
# ***********************************

# ===================================
# Fasta and PSM files
# ===================================
# fasta (all platforms)
library(proteoQDA)
fasta_dir <- "~/proteoQ/dbs/fasta/uniprot"
copy_uniprot_hsmm(fasta_dir)

# working directory (all platforms)
dat_dir <- "~/proteoQ/examples"

# metadata (all platforms)
copy_exptsmry_plfq(dat_dir)
copy_fracsmry_plfq(dat_dir)

# PSM (choose one of the platforms)
choose_one <- TRUE
if (!choose_one) {
  ## Mascot
  copy_mascot_plfq(dat_dir)
  
  ## or MaxQuant
  # copy_maxquant_plfq(dat_dir)
  
  ## or MSFragger
  # copy_msfragger_plfq(dat_dir)
  
  ## or proteoM
  # copy_proteom_plfq(dat_dir)
  
  ## or Spectrum Mill
  # (temporarily unavailable)
}


# ===================================
# PSM, peptide and protein processing
# ===================================
library(proteoQ)
load_expts("~/proteoQ/examples")

# PSM data standardization
normPSM(
  group_psm_by = pep_seq_mod, 
  group_pep_by = gene, 
  annot_kinases = TRUE, 
  fasta = c("~/proteoQ/dbs/fasta/uniprot/uniprot_hsmm_2020_03.fasta"),
)

# PSM purging not applicable with LFQ
# purgePSM()

# PSMs to peptides
PSM2Pep()

# peptide data merging
mergePep()

# peptide data standardization
standPep()

# peptide data histograms
pepHist()

# optional peptide purging
purgePep()

# peptides to proteins
Pep2Prn(use_unique_pep = TRUE)

# protein data standardization
standPrn()

# protein data histograms
prnHist()

# ===================================
# Optional significance tests
# (no NA imputation)
# ===================================
pepSig(
  fml_1 = ~ Term["BI-JHU", 
                 "JHU-PNNL", 
                 "(BI+JHU)/2-PNNL"],
)

prnSig()

# ===================================
# optional NA imputation
# ===================================
pepImp(m = 2, maxit = 2)
prnImp(m = 5, maxit = 5)

# ===================================
# Optional significance tests
# (with NA imputation)
# ===================================
pepSig(
  impute_na = TRUE, 
  fml_1 = ~ Term["BI-JHU", 
                 "JHU-PNNL", 
                 "(BI+JHU)/2-PNNL"],
)

prnSig(impute_na = TRUE)


# ***********************************
# ***********    SILAC    ***********
# ***********************************

# Database searches
library(proteoM)

matchMS(
  silac_mix = list(base = NULL, heavy = c("K8 (K)", "R10 (R)")),
  ...
)

# The remaining steps are the same as for LFQ
# ...




## Not run: 
load_expts(dat_dir = "~/proteoQ/examples", expt_smry = "expt_smry.xlsx")

# not working; `expt_smry = my_expt` is an expression
my_expt <- "expt_smry.xlsx"
load_expts(dat_dir = "~/proteoQ/examples", expt_smry = my_expt)

# need unquoting; 
# see also: https://dplyr.tidyverse.org/articles/programming.html
load_expts(dat_dir = "~/proteoQ/examples", expt_smry = !!my_expt)

## End(Not run)
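
One plausible reading of the note above is that the `expt_smry` argument is captured as an unevaluated expression, so a bare variable name is taken literally instead of being replaced by its value. The base-R sketch below illustrates that general mechanism with a hypothetical function `show_capture`; it is not proteoQ code, and the package's actual implementation may differ.

```r
# Hypothetical function: returns the argument as written by the caller,
# without evaluating it (base-R analogue of expression capture)
show_capture <- function(expt_smry = "expt_smry.xlsx") {
  as.character(substitute(expt_smry))
}

my_expt <- "expt_smry.xlsx"

# A literal string is passed through unchanged
show_capture("expt_smry.xlsx")

# A variable is seen as its name, not as the file name it holds
show_capture(my_expt)
```

Under this reading, unquoting with `!!` substitutes the value of `my_expt` before capture, which is why the last form in the example above is needed.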

qzhang503/proteoQ documentation built on April 13, 2025, 8:31 a.m.