metID.buildConsensus: Build consensus combinatorial metabolite identification


Description

This function is designed to be used at the end of the combinatorial metabolite identification process. It evaluates the multiple layers of evidence accumulated in the compMS2 class object, automatically ranks the possible annotations, and identifies the annotation with the greatest weight of evidence for every composite spectrum.

Usage

1

Arguments

object

a "compMS2" class object.

include

character vector of up to 7 options used to build the consensus combinatorial metabolite identification; see Details below for a description of each. If specific options are not supplied as a character vector, the default is to consider all 7, i.e. c('massAccuracy', 'spectralDB', 'inSilico', 'rtPred', 'chemSim', 'pubMed', 'substructure').

metIDWeights

numeric vector equal in length to the include vector (see above). The default is NULL, in which case a simple arithmetic mean is calculated across all the metabolite identification options included. If supplied, the metIDWeights are used to calculate a weighted mean of the included options. This can be used to generate a custom metabolite identification setting which best annotates the unknown metabolites. N.B. The sum of the metIDWeights vector must be 1. E.g. with include=c('massAccuracy', 'spectralDB', 'inSilico') and metIDWeights=c(0.2, 0.5, 0.3), massAccuracy is given a weight of 0.2 (20%), spectralDB matches a weight of 0.5 (50%) and the in silico fragmentation score a weight of 0.3 (30%). Options not listed in include, such as rtPred (predicted retention time), chemSim (nearest neighbour chemical similarity score) and pubMed (number of PubMed citations), are not included.
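The weighting described above can be illustrated with a minimal base R sketch. The score values are hypothetical and the calculation is only an illustration of a weighted mean, not the package's internal code:

```r
# Hypothetical per-evidence scores (0-1) for one candidate annotation.
scores <- c(massAccuracy = 0.9, spectralDB = 0.6, inSilico = 0.8)

# Options to include and their weights, as in the example above.
include <- c("massAccuracy", "spectralDB", "inSilico")
metIDWeights <- c(0.2, 0.5, 0.3)

# The weights must match include in length and sum to 1.
stopifnot(length(metIDWeights) == length(include),
          isTRUE(all.equal(sum(metIDWeights), 1)))

# Weighted mean consensus score: 0.2 * 0.9 + 0.5 * 0.6 + 0.3 * 0.8
weightedScore <- sum(scores[include] * metIDWeights)

# With metIDWeights = NULL a simple arithmetic mean would be used instead.
simpleScore <- mean(scores[include])
```

Here the weighted score is 0.72, whereas the unweighted mean is about 0.77, because the comparatively low spectralDB score carries half of the total weight.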

autoPossId

logical. If TRUE, the function automatically adds the name of the top annotation, based on the mean consensus annotation score, to the "metID comments" table (default = FALSE). Caution: if TRUE this will overwrite any existing possible_identities in the "metID comments" table. This functionality is intended as an automatic metabolite annotation tool for use prior to thorough examination of the data in compMS2Explorer, as part of an objective and seamless first-pass annotation workflow. The mean build consensus score can combine many orthogonal measurements of metabolite identification and provides a means to rapidly rank metabolite annotations.

minMeanBCscore

numeric. Minimum mean consensus score (a value between 0 and 1). If the argument autoPossId is TRUE, any metabolite annotations above this value are automatically added to the "metID comments" table. If the argument is not supplied, the default is the upper interquartile range of the mean BC score.
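As a rough sketch of how such a threshold could behave, the base R snippet below assumes that the default corresponds to the upper quartile (75th percentile) of the mean BC scores. That reading of "upper interquartile range" is an assumption, and the scores are hypothetical:

```r
# Hypothetical mean BC scores for five candidate annotations.
meanBCscore <- c(0.2, 0.4, 0.6, 0.8, 1.0)

# Assumed default threshold: the upper quartile (75th percentile).
minMeanBCscore <- unname(quantile(meanBCscore, probs = 0.75))

# Annotations strictly above the threshold are candidates for automatic addition.
aboveThreshold <- meanBCscore > minMeanBCscore
```

Supplying an explicit minMeanBCscore avoids depending on the distribution of scores in a particular data set.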

possContam

numeric. The number of times a possible annotation has to appear in the automatically generated possible annotations for it to be considered a contaminant and therefore not added to the "metID comments" table (default = 3, i.e. if a database name appears more than 3 times in the automatic annotation table it will be removed).
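The contaminant rule can be sketched in base R as a simple frequency filter. The annotation names are hypothetical and the snippet only illustrates the "appears more than possContam times" logic, not the package's implementation:

```r
# Hypothetical automatically generated top annotations, one per composite spectrum.
autoAnnot <- c("caffeine", "polyethylene glycol", "glucose",
               "polyethylene glycol", "polyethylene glycol",
               "polyethylene glycol", "caffeine")
possContam <- 3

# Count how many times each database name appears.
nameCounts <- table(autoAnnot)

# Names appearing more than possContam times are treated as possible contaminants.
contamNames <- names(nameCounts)[nameCounts > possContam]

# Keep only the annotations that were not flagged.
keptAnnot <- autoAnnot[!autoAnnot %in% contamNames]
```

In this example "polyethylene glycol" appears four times and is dropped, while "caffeine" (twice) and "glucose" (once) are kept.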

verbose

logical. If TRUE, display progress bars.

Details

Specifically the function looks at the following 7 pieces of evidence:

  1. "massAccuracy" monoisotopic mass similarity. The absolute mass deviation, between 0 and the upper mass accuracy limit (default 10 ppm), is used to generate a ranking score between 0-1.

  2. "spectralDB" spectral database match. If a match has been made to a spectral database using the function metID.matchSpectralDB, then a combination of the dot product score and the proportion of the composite spectrum explained is used to rank the annotations. A score between 0-1 is determined based on the average dot product and the proportion of the composite spectrum explained, where 1 is perfect agreement and 0 is no agreement. If no spectral database match has been made then the value is set to NA and this score will not be used in calculating the average ranking.

  3. "inSilico" in silico fragmentation data. The results of both the metID.metFrag and metID.CFM functions are considered. The total proportion of the composite spectrum explained by each in silico fragmentation method (a value between 0-1) is used to rank the annotations. If no in silico fragmentation match has been made then a value of NA is set and this score will not be used in calculating the average ranking.

  4. "rtPred" predicted retention time similarity. Annotations are ranked based on the retention time deviation from the predictive retention time model built using the function metID.rtPred. A relative score between 0-1 is calculated globally by taking the range of retention time deviation values.

  5. "chemSim" chemical similarity score. The mean maximum 1st neighbour (connected by correlation, metID.corrNetwork, and/or spectral similarity, metID.specSimNetwork) Tanimoto chemical similarity score calculated by metID.chemSim is used to rank annotations. A relative score between 0-1 is calculated globally by taking the range of mean maximum 1st neighbour chemical similarity scores.

  6. "pubMed" crude literature based plausibility. The number of PMIDs returned by searching the compound name in PubMed is used to generate a relative ranking score between 0-1. This aspect is highly reliant on the database name being the correct synonym with which to search the PubMed repository. In an effort to ensure phospholipids are correctly searched against PubMed, a set of regular expressions has been created to identify common phospholipid annotations and to search with the compound class name rather than an abbreviation containing positional and fatty acid chain length information, which would obscure the number of PubMed abstract IDs returned (see lipidAbbrev).

    This aspect is potentially time consuming (but only needs to be conducted once), as it complies closely with the NCBI recommendations from the section "Frequency, Timing and Registration of E-utility URL Requests" of the book "A General Introduction to the E-utilities" by Eric Sayers http://www.ncbi.nlm.nih.gov/books/NBK25497/:

    "In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI."

    This aspect is optional and will only work during these recommended times. However, the function can optionally wait automatically until the recommended time.

  7. "substructure" substructure score. The substructure score generated by the dbProb function is used to rank possible annotations.
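Two of the scoring ideas in the list above, a 0-1 score from the ppm mass deviation and a relative score taken over the global range of values, can be sketched in base R. The linear mapping, the min-max rescaling, and the helper names massAccScore and rangeScore are assumptions made for illustration and are not taken from the package source:

```r
# Assumed linear mapping: 0 ppm deviation scores 1, the ppm limit (or worse) scores 0.
massAccScore <- function(ppmError, ppmLimit = 10) {
  pmax(0, 1 - abs(ppmError) / ppmLimit)
}

# Assumed min-max rescaling over all annotations (a "global" relative score).
# Requires at least two distinct non-missing values in x.
rangeScore <- function(x, higherIsBetter = TRUE) {
  rng <- range(x, na.rm = TRUE)
  scaled <- (x - rng[1]) / (rng[2] - rng[1])
  if (higherIsBetter) scaled else 1 - scaled
}

# Mass deviations of 0, -5 and 12 ppm against the default 10 ppm limit.
massScores <- massAccScore(c(0, -5, 12))

# Retention time deviations: a smaller deviation should give a higher score.
rtScores <- rangeScore(c(0.1, 0.6, 1.1), higherIsBetter = FALSE)

# Chemical similarity: a higher similarity should give a higher score.
chemScores <- rangeScore(c(0.2, 0.5, 0.8))
```

Because the relative scores depend on the range across all annotations, the score of a given annotation can change when annotations are added to or removed from the data set.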

Depending on the availability of each of these pieces of evidence, a mean annotation ranking score is calculated for every annotation, and the best annotations can be automatically added.
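A minimal base R sketch of a mean score that skips unavailable evidence is shown below. The score table and annotation names are hypothetical, and the use of rowMeans with na.rm = TRUE is an assumption about how NA values could be excluded:

```r
# Hypothetical score table: rows are candidate annotations, columns are evidence types.
scoreTable <- data.frame(
  massAccuracy = c(0.9, 0.7, 0.4),
  spectralDB   = c(NA, 0.8, NA),
  inSilico     = c(0.5, NA, 0.6),
  row.names    = c("annotA", "annotB", "annotC")
)

# Mean over the available evidence only; NA entries are skipped per annotation.
meanBCscore <- rowMeans(scoreTable, na.rm = TRUE)

# Rank annotations from highest to lowest mean score and take the top one.
ranked <- sort(meanBCscore, decreasing = TRUE)
bestAnnot <- names(ranked)[1]
```

Note that each annotation is averaged over a different set of evidence here, so an annotation supported by a single high score can outrank one supported by several moderate scores.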

Source

Sayers E. A General Introduction to the E-utilities. In: Entrez Programming Utilities Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010-. Available from: http://www.ncbi.nlm.nih.gov/books/NBK25497


WMBEdmands/compMS2Miner documentation built on May 9, 2019, 10:04 p.m.