library_generator: Digitizing and networking LC-MS/MS data

library_generator R Documentation

Digitizing and networking LC-MS/MS data

Description

The function offers three data processing algorithms to pick MS1/MS2 scans from DDA or targeted-mode LC-MS/MS data, merge them into a spectral library and create a spectral similarity-based molecular network.

Usage

library_generator(
  input_library = NULL,
  lcms_files = NULL,
  metadata_file = NULL,
  polarity = c("Positive", "Negative")[1],
  mslevel = c(1, 2),
  add.adduct = TRUE,
  adductType = NULL,
  processing.algorithm = c("Default", "compMS2Miner", "RMassBank")[1],
  params.search = list(mz_search = 0.01, ppm_search = 10, rt_search = 15, rt_gap = 30),
  params.ms.preprocessing = list(normalized = TRUE, baseline = 1000, relative = 0.1,
    max_peaks = 200, recalibration = 0),
  params.consensus = list(consensus = FALSE, consensus_method = c("consensus",
    "consensus2", "common_peaks", "most_recent")[1], consensus_window = 0.02),
  params.network = list(network = FALSE, similarity_method = "Cosine", min_frag_match =
    6, min_score = 0.6, topK = 10, max_comp_size = 100, reaction_type = "Metabolic",
    use_reaction = FALSE),
  params.user = list(sample_type = "", user_name = "", comments = "")
)

Arguments

input_library

Character or a list object. If character, the name of the existing library into which new scans are added; the file extension must be mgf, msp or RData. Set to NULL if the new library does not depend on a previous one.

lcms_files

A character vector of LC-MS/MS file names from which scans are extracted. All files must be in centroid mode with an mzML, mzXML or cdf extension!

metadata_file

A single character, NULL object or data frame. If it is a character, it should be the metadata file name. The file should be a tab-, comma- or semicolon-separated txt, dat or csv file. The metadata columns are:

  • ID: Unique structure identifier. Required for all algorithms.

  • PEPMASS: Targeted precursor mass. Required for Default and compMS2Miner.

  • RT: Targeted retention time in min. Required for compMS2Miner, optional for MergeION and RMassBank.

  • SMILES: Structure identifier. Required for RMassBank.

  • FILENAME: Chromatogram file with mzML, mzXML or cdf extension. Required for RMassBank, where it tells the algorithm in which file each compound can be found; optional for Default and compMS2Miner.

  • ADDUCT: Optional for all algorithms. If not provided, all inputs are considered as M+H or M-H depending on polarity. Please specify the adduct type if the metadata contains both positive and negative ions.

If metadata is NULL and the LC-MS files are acquired in DDA mode, an automated feature screening is performed for fragmented masses. Masses and retention times of these features are used for spectral library generation and molecular networking.
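
To illustrate the metadata columns described above, a minimal table for the Default algorithm could be built as follows. This is only a sketch: the identifiers, masses and retention times are made-up placeholders, and the file name is hypothetical.

```r
# Hypothetical two-compound metadata table (all values are placeholders).
metadata <- data.frame(
  ID      = c("CPD_001", "CPD_002"),   # structure identifiers
  PEPMASS = c(195.0877, 303.0499),     # targeted precursor m/z
  RT      = c(3.2, 7.8),               # targeted retention time in min
  ADDUCT  = c("M+H", "M+H"),           # optional column
  stringsAsFactors = FALSE
)

# Write a comma-separated file whose name could be passed as metadata_file.
metadata_path <- file.path(tempdir(), "metadata_example.csv")
write.csv(metadata, metadata_path, row.names = FALSE)
```

Since metadata_file also accepts a data frame, the object `metadata` could presumably be passed directly instead of the file name.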

polarity

A single character. Either "Positive" or "Negative". Ion mode of LC-MS/MS files.

mslevel

A numeric vector. 1 or 2 or c(1,2). 2 if MS2 scans are extracted, 1 if isotopic pattern of the precursor mass in the MS1 scan is extracted. c(1,2) if both MS1 and MS2 scans are extracted. Note: High-quality isotopic patterns in MS1 scans are useful for determining precursor formula!

add.adduct

Logical. If TRUE, additional adduct types will be calculated based on precursor masses of "M+H" and "M-H" adducts in the input metadata: "M+2H", "M+Na", "M+K", "M+NH4" and "M+" will be searched for positive ion mode, "M+COO-", "M+Cl" and "M+CH3COO-" for negative ion mode. If FALSE, no additional adduct types will be searched.
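
As a rough illustration of what calculating additional adducts from an "M+H" precursor involves, the sketch below derives a few positive-mode adduct masses using standard monoisotopic ion masses. It is not the package's own code; the variable names and the precursor value are placeholders chosen for the example.

```r
# Standard monoisotopic masses of the charge carriers (in Da).
proton <- 1.007276
adduct_shift <- c("M+Na" = 22.989218, "M+K" = 38.963158, "M+NH4" = 18.033823)

mz_MH <- 195.0877            # placeholder m/z of the "M+H" ion
M <- mz_MH - proton          # neutral monoisotopic mass

# Singly charged adducts add the ion mass; "M+2H" is doubly charged.
mz_adducts <- c(M + adduct_shift, "M+2H" = (M + 2 * proton) / 2)
mz_adducts
```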

processing.algorithm

A single character. "Default", "compMS2Miner" or "RMassBank".

params.search

Parameters for searching and collecting ions from chromatogram files in a list. These parameters define the tolerance window when input metadata is searched. The list must contain the following elements:

  • mz_search: Numeric. Absolute mass tolerance in Da.

  • ppm_search: Numeric. Relative mass tolerance in ppm.

  • rt_search: Numeric. Absolute retention time tolerance in seconds.

  • rt_gap: Numeric. Retention time gap in seconds. When two scans both match an input structure, they are both recorded as isomeric features of the same identifier if they are separated by at least this retention time gap. Please set it to 10000 if isomeric features should not be picked. This parameter is not used for RMassBank.
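
To give a feel for the size of these tolerances, the sketch below converts a ppm tolerance to Da and a retention time tolerance in seconds to the minute scale used by the RT metadata column. The helper `ppm_to_da` is hypothetical, and the symmetric window around the target is an assumption; the documentation does not state how the function combines mz_search and ppm_search internally.

```r
# Hypothetical helper, not part of the package: ppm tolerance expressed in Da.
ppm_to_da <- function(mz, ppm) mz * ppm / 1e6

params.search <- list(mz_search = 0.01, ppm_search = 10, rt_search = 15, rt_gap = 30)

target_mz <- 500
tol_da <- ppm_to_da(target_mz, params.search$ppm_search)  # 10 ppm at m/z 500 is 0.005 Da

# rt_search is given in seconds, whereas the metadata RT column is in minutes.
target_rt_min <- 3.2
rt_window_min <- target_rt_min + c(-1, 1) * params.search$rt_search / 60
```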

params.ms.preprocessing

Parameters for pre-processing scans found in chromatogram files in a list. It must contain:

  • normalized: Logical. TRUE if the intensities of extracted spectra need to be normalized so that the intensity of the highest peak is 100.

  • baseline: Numeric. Absolute intensity threshold above which a signal is considered a mass peak and written into the library.

  • relative: Numeric between 0 and 100. Relative intensity threshold with respect to the highest peak in each spectrum. Peaks above both the absolute and the relative thresholds are saved in the library.

  • max_peaks: Integer higher than 3. Maximum number of peaks kept per spectrum, starting from the highest peak.

  • recalibration: Numeric. Parameter used by RMassBank. 0 if the output is the experimental spectra. 1 if the output is the experimental masses along with annotated formulas. 2 if the output is the theoretical masses calculated from the elemental formulas.
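
The filtering steps listed above can be sketched on a single spectrum as follows. This is a hypothetical re-implementation for illustration only, not the package's code: the function name `preprocess_spectrum`, the order of the steps, and the reading of `relative` as a percentage of the highest peak are all assumptions.

```r
# Hypothetical illustration of baseline, relative, max_peaks and normalized.
preprocess_spectrum <- function(sp, baseline = 1000, relative = 0.1,
                                max_peaks = 200, normalized = TRUE) {
  # sp: two-column matrix (m/z, intensity)
  # Keep peaks above both the absolute and the relative threshold.
  keep <- sp[, 2] > baseline & sp[, 2] > max(sp[, 2]) * relative / 100
  sp <- sp[keep, , drop = FALSE]
  # Keep the max_peaks most intense peaks, then restore m/z order.
  ord <- order(sp[, 2], decreasing = TRUE)
  sp <- sp[head(ord, max_peaks), , drop = FALSE]
  sp <- sp[order(sp[, 1]), , drop = FALSE]
  # Scale so that the highest remaining peak has intensity 100.
  if (normalized && nrow(sp) > 0) sp[, 2] <- sp[, 2] / max(sp[, 2]) * 100
  sp
}

sp <- cbind(mz = c(100.1, 150.2, 200.3, 250.4),
            intensity = c(500, 20000, 4000, 1500))
processed <- preprocess_spectrum(sp, max_peaks = 2)
processed
```

In this toy spectrum the peak at 500 counts falls below the baseline, and max_peaks = 2 then keeps only the two most intense of the remaining peaks.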

params.consensus

Parameters for generating consensus scans that combine spectra of the same compound ID:

  • consensus: Logical. TRUE if consensus spectra are generated.

  • consensus_method: Character. Method for merging library "duplicates" by compound IDs.

    1. consensus: Default method. Consensus spectra are generated by merging spectra of the same compound ID. All peaks are kept; similar fragments are aligned by averaging m/z and intensity.

    2. common_peaks: Only peaks detected in ALL duplicated spectra are kept and averaged.

    3. most_recent: The most recent record is kept if duplicates are detected.

  • consensus_window: Numeric. m/z window (in Dalton) for spectra alignment, only used when consensus_method = "consensus" or "common_peaks". To generate consensus spectra, mass peaks in different spectra within the mass window are aligned by averaging their mass values and intensities. The metadata is kept only for the most recent spectrum.
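
The window-based alignment behind the "consensus" method can be sketched as below. This is a simplified, hypothetical version (the function `align_peaks` is not part of the package): it pools all peaks, groups neighbours whose m/z gap does not exceed the window, and averages each group. The actual grouping rule used by the package may differ.

```r
# Hypothetical sketch of window-based peak alignment across duplicate spectra.
align_peaks <- function(spectra, consensus_window = 0.02) {
  pooled <- do.call(rbind, spectra)
  pooled <- pooled[order(pooled[, 1]), , drop = FALSE]
  # Start a new group whenever the gap to the previous peak exceeds the window.
  group <- cumsum(c(TRUE, diff(pooled[, 1]) > consensus_window))
  # Average m/z and intensity within each group.
  cbind(mz = as.numeric(tapply(pooled[, 1], group, mean)),
        intensity = as.numeric(tapply(pooled[, 2], group, mean)))
}

sp1 <- cbind(c(100.00, 150.00), c(100, 40))
sp2 <- cbind(c(100.01, 200.00), c(80, 10))
consensus_sp <- align_peaks(list(sp1, sp2))
consensus_sp
```

Here the peaks at m/z 100.00 and 100.01 fall within the 0.02 Da window and are merged, while the two unmatched peaks are kept as they are.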

params.network

Parameters for networking consensus spectra library into a molecular network

  • network: Logical. TRUE if a network is built for consensus spectral library

  • similarity_method: Character. Similarity metric for networking and spectral library search. Must be "Matches", "Dot", "Cosine", "Spearman", "MassBank" or "NIST".

  • min_frag_match: Integer. Minimum number of common fragment ions (or neutral losses) that are shared to be considered for spectral similarity evaluation. We suggest setting this value to at least 6 for statistical meaningfulness.

  • min_score: Numeric between 0 and 1. Minimum similarity score to connect two nodes in the library, to annotate an unknown feature with the spectral library, or to connect two unknown features because they are similar. It does NOT affect similarity_method = "Matches".

  • topK: Integer higher than 0. For networking, the edge between two nodes is kept only if both nodes are within each other's topK most similar nodes. For example, if this value is set at 20, then a single node may be connected to up to 20 other nodes. Keeping this value low makes very large networks (many nodes) much easier to visualize. We suggest keeping this value at 10.

  • max_comp_size: Numeric between 0 and 200. Maximum number of nodes allowed in each network component. Default value is 100. Network component = cluster of connected nodes. Set to 0 for no limitation on component size.

  • reaction_type: Character. Either "Metabolic" or "Chemical". Type of transformation list used to annotate mass differences between connected features in the molecular network.

  • use_reaction: Logical. TRUE to keep only edges whose mass difference can be annotated to known metabolic or chemical reactions.
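
The interplay of min_score and topK described above can be sketched on a small similarity matrix. The function `filter_edges` is a hypothetical illustration, not the package's implementation, and the similarity values are made up.

```r
# Hypothetical sketch of min_score and mutual topK edge filtering.
filter_edges <- function(sim, min_score = 0.6, topK = 10) {
  diag(sim) <- 0                  # ignore self-similarity
  sim[sim < min_score] <- 0       # drop edges below the score threshold
  n <- nrow(sim)
  # in_top[i, j] is TRUE if j is among the topK most similar nodes of i.
  in_top <- matrix(FALSE, n, n)
  for (i in seq_len(n)) {
    cand <- which(sim[i, ] > 0)
    best <- cand[order(sim[i, cand], decreasing = TRUE)]
    in_top[i, head(best, topK)] <- TRUE
  }
  # Keep an edge only if the relation holds in both directions.
  which(in_top & t(in_top) & upper.tri(sim), arr.ind = TRUE)
}

sim <- matrix(c(1.0, 0.9, 0.7,
                0.9, 1.0, 0.8,
                0.7, 0.8, 1.0), nrow = 3, byrow = TRUE)
edges <- filter_edges(sim, min_score = 0.6, topK = 1)
edges
```

With topK = 1, node 3 ranks node 2 as its best match, but node 2 prefers node 1, so only the edge between nodes 1 and 2 survives.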

params.user

A list of additional parameters.

  • sample_type: Character. Type of LCMS samples added to the spectral library e.g. plasma, standards...

  • user_name: Character. Name of the user who processes the batch of LC-MS files.

  • comments: Character. Additional comments about the samples added.

adductType

User-specified adduct type, default is NULL. Set 'add.adduct' to TRUE and specify 'adductType' to filter the records down to 'adductType' before appending the additional adduct types.

Value

  • complete: Entire spectral library (historical + newly added records). It is a list object of two elements: "library$sp" ~ List of all extracted spectra. Each spectrum is a data matrix with two columns: m/z and intensity; "library$metadata" ~ Data frame containing metadata of extracted scans. PEPMASS and RT are updated based on scans detected in the chromatogram files. The following metadata columns are updated/added: FILENAME (raw data file from which the scan was isolated), MSLEVEL (1 or 2), TIC, PEPMASS_DEV (ppm error for detected precursor mass) and SCANNUMBER (scan number in raw chromatogram). The last three columns are PARAM_ALGORITHM (processing algorithm), PARAM_CREATION_TIME (date and time when the MS record was added) and SCANS (unique identifier for each record).

  • consensus: Consensus spectral library by merging MS/MS spectra with the same ID.

  • network: Consensus spectral library transformed into a molecular network based on MS/MS spectral similarity.
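
As a sketch of how the returned structure might be navigated, the example below builds a mock object with the documented "sp" and "metadata" elements and selects the MS2 spectra. The mock values are placeholders, and the assumption that the i-th spectrum corresponds to the i-th metadata row is not stated in the documentation.

```r
# Mock object mimicking the documented structure (placeholder values only).
lib <- list(complete = list(
  sp = list(cbind(c(100.1, 150.2), c(100, 35)),
            cbind(c(195.1, 196.1), c(100, 11))),
  metadata = data.frame(ID = c("CPD_001", "CPD_001"),
                        MSLEVEL = c(2, 1),
                        stringsAsFactors = FALSE)
))

# Select MS2 records, assuming spectra and metadata rows share the same order.
ms2_idx <- which(lib$complete$metadata$MSLEVEL == 2)
ms2_spectra <- lib$complete$sp[ms2_idx]
length(ms2_spectra)
```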

Author(s)

Youzhong Liu, YLiu186@ITS.JNJ.com

Examples


## Not run:
library(RMassBankData)

input_library = NULL # There is no historical spectral library; a brand new one is created here.
lcms_files <- list.files(system.file("spectra", package="RMassBankData"), ".mzML", full.names = TRUE)
metadata_file <- list.files(system.file(package = "MergeION"),".csv", full.names = TRUE)

polarity = "Positive"
mslevel = 2 # Only MS2 scans are extracted!
add.adduct = FALSE # No additional adducts are searched besides M+H

params.search = list(mz_search = 0.005, ppm_search = 10, rt_search = 15, rt_gap = 30)
params.ms.preprocessing = list(normalized = TRUE, baseline = 1000, relative = 0.01, max_peaks = 200, recalibration = 0)

# Building a spectral library with the default (SmartION) algorithm by simply gathering scans that match the metadata:
params.user = list(sample_type = "RMassBank data", user_name = "daniel", comments = "default algorithm, without building a consensus library")
processing.algorithm = "Default"
# Arguments are passed by name so that none is shifted into the adductType position of the signature.
lib = library_generator(input_library = input_library, lcms_files = lcms_files, metadata_file = metadata_file,
                        polarity = "Positive", mslevel = mslevel, add.adduct = add.adduct,
                        processing.algorithm = processing.algorithm,
                        params.search = params.search, params.ms.preprocessing = params.ms.preprocessing,
                        params.user = params.user)
lib1 = lib$complete
save(lib1, file = "test_default_complete.RData") # Save the library as RData

# Building a spectral library with the compMS2Miner algorithm and generating a consensus spectral library
processing.algorithm = "compMS2Miner"
params.consensus = list(consensus = TRUE, consensus_method = "consensus", consensus_window = 0.02)
params.user = list(sample_type = "RMassBank data", user_name = "daniel", comments = "compMS2Miner algorithm, building a consensus library")
# Arguments are passed by name so that none is shifted into the adductType position of the signature.
lib = library_generator(input_library = input_library, lcms_files = lcms_files, metadata_file = metadata_file,
                        polarity = "Positive", mslevel = mslevel, add.adduct = add.adduct,
                        processing.algorithm = processing.algorithm,
                        params.search = params.search, params.ms.preprocessing = params.ms.preprocessing,
                        params.consensus = params.consensus, params.user = params.user)
lib2 = lib$consensus
save(lib2, file = "test_compMS2Miner_consensus.RData")  # Save the library as RData

# Building a spectral library with the RMassBank algorithm (recalibration based on elemental formula annotation), creating a consensus spectral library and building a molecular network based on the consensus library
processing.algorithm = "RMassBank"
params.ms.preprocessing = list(normalized = TRUE, baseline = 1000, relative = 0.01, max_peaks = 200, recalibration = 2)
params.network = list(network = TRUE, similarity_method = "Cosine", min_frag_match = 6, min_score = 0.6, max_comp_size = 100, topK = 10, reaction_type = "Chemical", use_reaction = FALSE)
# Arguments are passed by name so that none is shifted into the adductType position of the signature.
lib3 = library_generator(input_library = input_library, lcms_files = lcms_files, metadata_file = metadata_file,
                         polarity = "Positive", mslevel = mslevel, add.adduct = add.adduct,
                         processing.algorithm = processing.algorithm,
                         params.search = params.search, params.ms.preprocessing = params.ms.preprocessing,
                         params.consensus = params.consensus, params.network = params.network,
                         params.user = params.user)
save(lib3, file = "test_RMassBank_consensus_network.RData") # Save the library as RData
## End(Not run)


daniellyz/MergeION2 documentation built on Jan. 26, 2024, 6:24 a.m.