generateCompoundsMetFrag: Compound annotation with MetFrag

generateCompoundsMetFrag R Documentation

Compound annotation with MetFrag

Description

Uses the metfRag package or MetFrag CL for compound identification (see http://ipb-halle.github.io/MetFrag/).

Usage

generateCompoundsMetFrag(fGroups, ...)

## S4 method for signature 'featureGroups'
generateCompoundsMetFrag(
  fGroups,
  MSPeakLists,
  method = "CL",
  timeout = 300,
  timeoutRetries = 2,
  errorRetries = 2,
  topMost = 100,
  dbRelMzDev = 5,
  fragRelMzDev = 5,
  fragAbsMzDev = 0.002,
  adduct = NULL,
  database = "pubchem",
  extendedPubChem = "auto",
  chemSpiderToken = "",
  scoreTypes = compoundScorings("metfrag", database, onlyDefault = TRUE)$name,
  scoreWeights = 1,
  preProcessingFilters = c("UnconnectedCompoundFilter", "IsotopeFilter"),
  postProcessingFilters = c("InChIKeyFilter"),
  maxCandidatesToStop = 2500,
  identifiers = NULL,
  extraOpts = NULL
)

## S4 method for signature 'featureGroupsSet'
generateCompoundsMetFrag(
  fGroups,
  MSPeakLists,
  method = "CL",
  timeout = 300,
  timeoutRetries = 2,
  errorRetries = 2,
  topMost = 100,
  dbRelMzDev = 5,
  fragRelMzDev = 5,
  fragAbsMzDev = 0.002,
  adduct = NULL,
  ...,
  setThreshold = 0,
  setThresholdAnn = 0,
  setAvgSpecificScores = FALSE
)

Arguments

fGroups

featureGroups object which should be annotated. This should be the same or a subset of the object that was used to create the specified MSPeakLists. In the case of a subset only the remaining feature groups in the subset are considered.

... (sets workflow)

Further arguments passed to the non-sets workflow method.

MSPeakLists

A MSPeakLists object that was generated for the supplied fGroups.

method

Which method should be used for MetFrag execution: "CL" for MetFragCL and "R" for MetFragR. The former is usually much faster and recommended.

timeout

Maximum time (in seconds) before a MetFrag query for a feature group is stopped. Also see the timeoutRetries argument.

timeoutRetries

Maximum number of retries after reaching a timeout before completely skipping the MetFrag query for a feature group. Also see the timeout argument.

errorRetries

Maximum number of retries after an error occurred. This may be useful to handle e.g. connection errors.
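
The retry behaviour can be pictured with a small base R sketch. It only illustrates the documented semantics and is not patRoon's implementation: runWithRetries and flakyQuery are hypothetical helpers, and it assumes that N retries allow N + 1 attempts in total.

```r
# Hypothetical helper: run a query, retrying up to 'errorRetries' extra times.
runWithRetries <- function(query, errorRetries = 2) {
    for (attempt in seq_len(errorRetries + 1)) {
        result <- tryCatch(query(attempt), error = function(e) NULL)
        if (!is.null(result))
            return(list(result = result, attempts = attempt))
    }
    # All attempts failed: the feature group would be skipped.
    list(result = NULL, attempts = errorRetries + 1)
}

# Hypothetical query that fails twice (e.g. a connection error), then succeeds.
flakyQuery <- function(attempt) {
    if (attempt < 3)
        stop("connection error")
    "candidates"
}

res <- runWithRetries(flakyQuery, errorRetries = 2)
res$attempts
```

With errorRetries = 1 the same query would give up after two attempts and return no result.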

topMost

Only keep this number of candidates (per feature group) with the highest score. Set to NULL to always keep all candidates. Note, however, that this may result in significant CPU/RAM usage for large numbers of candidates.
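
As an illustration of what keeping the topMost candidates means, the following base R sketch ranks a made-up candidate table by score and keeps the best rows. The table, its column names and the values are assumptions for the example. Inside patRoon the filtering is done for you.

```r
# Hypothetical candidate table for one feature group.
candidates <- data.frame(
    identifier = c("A", "B", "C", "D"),
    score = c(0.2, 0.9, 0.5, 0.7),
    stringsAsFactors = FALSE
)
topMost <- 2

# Sort by decreasing score and keep the first 'topMost' rows.
ranked <- candidates[order(candidates$score, decreasing = TRUE), ]
kept <- head(ranked, topMost)
kept$identifier
```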

dbRelMzDev

Relative mass deviation (in ppm) for database search. Sets the DatabaseSearchRelativeMassDeviation option.

fragRelMzDev

Relative mass deviation (in ppm) for fragment matching. Sets the FragmentPeakMatchRelativeMassDeviation option.

fragAbsMzDev

Absolute mass deviation (in Da) for fragment matching. Sets the FragmentPeakMatchAbsoluteMassDeviation option.
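
To get a feeling for the relative (ppm) tolerances, the sketch below converts a ppm value into an absolute window in Da. The helper ppmToDa and the mass of 300 are assumptions made for the example. How MetFrag combines the relative and absolute fragment tolerances internally is not shown here.

```r
# Convert a relative deviation (ppm) into an absolute deviation (Da).
ppmToDa <- function(mz, ppm) mz * ppm * 1e-6

mass <- 300        # hypothetical mass used for the database search
dbRelMzDev <- 5    # ppm, the default shown in the usage section

tol <- ppmToDa(mass, dbRelMzDev)
searchWindow <- c(lower = mass - tol, upper = mass + tol)
searchWindow
```

By the same arithmetic, 5 ppm corresponds to 0.002 Da at m/z 400, which equals the default value of fragAbsMzDev.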

adduct

An adduct object (or something that can be converted to it with as.adduct). Examples: "[M-H]-", "[M+Na]+". If the featureGroups object has adduct annotations then these are used if adduct=NULL.

(sets workflow) The adduct argument is not supported for sets workflows, since the adduct annotations will then always be used.

database

Compound database to use. Valid values are: "pubchem", "chemspider", "for-ident", "comptox", "pubchemlite", "kegg", "sdf", "psv" and "csv". See section below for more information. Sets the MetFragDatabaseType option.

extendedPubChem

If database="pubchem": whether to use the extended database that includes information for compound scoring (i.e. number of patents/PubMed references). Note that downloading candidates from this database might take extra time. Valid values are: FALSE (never use it), TRUE (always use it) or "auto" (default, use if specified scorings demand it).

chemSpiderToken

A character string with the ChemSpider security token that should be set when the ChemSpider database is used. Sets the ChemSpiderToken option.

scoreTypes

A character vector defining the scoring types. See the Scorings section below for more information. Note that both generic and MetFrag specific names are accepted (i.e. the name and metfrag columns returned by compoundScorings). When a local database is used, the name should match what is given there (e.g. column names when database="csv"). Note that MetFrag may still report other scoring data, but these are not used for ranking. Sets the MetFragScoreTypes option.

scoreWeights

Numeric vector containing weights of the used scoring types. Order is the same as set in scoreTypes. Values are recycled if necessary. Sets the MetFragScoreWeights option.
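
The recycling of weights can be sketched in base R as follows. This assumes the usual R recycling behaviour as provided by rep_len. The chosen scoring names come from the Scorings table and the weights are made up.

```r
scoreTypes <- c("fragScore", "numberPatents", "pubMedReferences")
scoreWeights <- c(1, 0.5)

# Recycle the weights to the number of scoring types (assumed behaviour).
weights <- rep_len(scoreWeights, length(scoreTypes))
names(weights) <- scoreTypes
weights
```

Here the third scoring receives the first weight again, so specifying one weight per scoring type is the clearest way to avoid surprises.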

preProcessingFilters, postProcessingFilters

Character vectors defining the pre/post filters applied before/after fragmentation and scoring (e.g. "UnconnectedCompoundFilter", "IsotopeFilter", "ElementExclusionFilter"). Some filters require further options to be set. For all filters and more information refer to the Candidate Filters section on the MetFragR homepage. Sets the MetFragPreProcessingCandidateFilter and MetFragPostProcessingCandidateFilter options.

maxCandidatesToStop

If more than this number of candidate structures are found then processing will be aborted and no results for this feature group will be reported. Low values increase the chance of missing data, whereas values that are too high will use excessive computer resources and significantly slow down the process. Sets the MaxCandidateLimitToStop option.

identifiers

A list containing for each feature group a character vector with database identifiers that should be used to find candidates for a feature group (the list should be named by feature group names). If NULL all relevant candidates will be retrieved from the specified database. An example usage scenario is to obtain the list of candidate identifiers from a compounds object obtained with generateCompoundsSIRIUS using the identifiers method. This way, only those candidates will be searched by MetFrag that were generated by SIRIUS+CSI:FingerID. Sets the PrecursorCompoundIDs option.
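
The expected shape of the identifiers argument can be sketched as below. The feature group names and database identifiers are invented for the example. In practice they would come from your own data or from the identifiers method mentioned above.

```r
# Hypothetical feature group names and database identifiers.
ids <- list(
    "M120_R268_30" = c("1234", "5678"),
    "M137_R249_53" = "91011"
)

# Basic sanity checks on the expected shape: a named list of character vectors.
stopifnot(
    is.list(ids),
    !is.null(names(ids)),
    all(vapply(ids, is.character, logical(1)))
)

# The list would then be passed on, e.g. (not run):
# compounds <- generateCompoundsMetFrag(fGroups, MSPeakLists, identifiers = ids)
```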

extraOpts

A named list containing further settings for MetFrag. See the MetFragR and MetFrag CL homepages for all available options. Set to NULL to ignore.

setThreshold (sets workflow)

Minimum abundance for a candidate among all sets ('0-1'). For instance, a value of '1' means that the candidate needs to be present in all the set data.

setThresholdAnn (sets workflow)

As setThreshold, but only taking into account the set data that contain annotations for the feature group of the candidate.
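
The difference between setThreshold and setThresholdAnn can be illustrated with a small base R calculation. The set names, the presence flags and the use of a greater-than-or-equal comparison are assumptions for this sketch. It is not the code patRoon uses.

```r
sets <- c("positive", "negative")

# Hypothetical: does each set contain annotations for the feature group?
setHasAnnotations <- c(positive = TRUE, negative = FALSE)
# Hypothetical: was the candidate found in each set?
candidateInSet <- c(positive = TRUE, negative = FALSE)

# Abundance over all sets (compared against setThreshold).
abundanceAll <- sum(candidateInSet) / length(sets)
# Abundance over sets with annotations only (compared against setThresholdAnn).
abundanceAnn <- sum(candidateInSet[setHasAnnotations]) / sum(setHasAnnotations)

setThreshold <- 1
setThresholdAnn <- 1
keptBySetThreshold <- abundanceAll >= setThreshold
keptBySetThresholdAnn <- abundanceAnn >= setThresholdAnn
```

In this example the candidate has an abundance of 0.5 over all sets but 1 over the annotated sets, so it would fail setThreshold = 1 yet pass setThresholdAnn = 1.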

setAvgSpecificScores (sets workflow)

If TRUE then set specific scorings (e.g. MS/MS match) are also averaged.

Details

This function uses MetFrag to generate compound candidates. It is called when generateCompounds is used with algorithm="metfrag".

Several online compound databases such as PubChem and ChemSpider may be chosen for retrieval of candidate structures. This method requires the availability of MS/MS data, and feature groups without it will be ignored. Many options exist to score and filter the resulting data, and optimizing these is strongly recommended to improve results. The MetFrag options PeakList, IonizedPrecursorMass and ExperimentalRetentionTimeValue (in minutes) are automatically set from feature data.
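
A typical call could look like the sketch below. It uses only arguments from the usage section, wrapped in a hypothetical helper (annotateWithMetFrag) so that the example does not depend on particular data. The fGroups and mslists arguments are assumed to be existing featureGroups and MSPeakLists objects, and the chosen values are examples rather than recommendations.

```r
# Hypothetical wrapper around generateCompoundsMetFrag.
# 'fGroups' and 'mslists' are assumed to come from an existing patRoon workflow.
annotateWithMetFrag <- function(fGroups, mslists) {
    patRoon::generateCompoundsMetFrag(
        fGroups, mslists,
        method = "CL",              # MetFrag CL, usually the faster option
        adduct = "[M+H]+",          # example adduct
        database = "pubchem",
        dbRelMzDev = 5,             # ppm
        fragRelMzDev = 5,           # ppm
        fragAbsMzDev = 0.002,       # Da
        topMost = 50,               # keep the 50 best candidates per feature group
        maxCandidatesToStop = 1000  # skip feature groups with too many candidates
    )
}

# Not run: compounds <- annotateWithMetFrag(fGroups, mslists)
```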

Value

generateCompoundsMetFrag returns a compoundsMF object.

Scorings

MetFrag supports many different scorings to rank candidates. The compoundScorings function can be used to get an overview (some columns are omitted):

name                  metfrag                                           database
score                 Score
fragScore             FragmenterScore
metFusionScore        OfflineMetFusionScore
individualMoNAScore   OfflineIndividualMoNAScore
numberPatents         PubChemNumberPatents                              pubchem
numberPatents         Patent_Count                                      pubchemlite
pubMedReferences      PubChemNumberPubMedReferences                     pubchem
pubMedReferences      ChemSpiderNumberPubMedReferences                  chemspider
pubMedReferences      NUMBER_OF_PUBMED_ARTICLES                         comptox
pubMedReferences      PubMed_Count                                      pubchemlite
extReferenceCount     ChemSpiderNumberExternalReferences                chemspider
dataSourceCount       ChemSpiderDataSourceCount                         chemspider
referenceCount        ChemSpiderReferenceCount                          chemspider
RSCCount              ChemSpiderRSCCount                                chemspider
smartsInclusionScore  SmartsSubstructureInclusionScore
smartsExclusionScore  SmartsSubstructureExclusionScore
suspectListScore      SuspectListScore
retentionTimeScore    RetentionTimeScore
CPDATCount            CPDAT_COUNT                                       comptox
TOXCASTActive         TOXCAST_PERCENT_ACTIVE                            comptox
dataSources           DATA_SOURCES                                      comptox
pubChemDataSources    PUBCHEM_DATA_SOURCES                              comptox
EXPOCASTPredExpo      EXPOCAST_MEDIAN_EXPOSURE_PREDICTION_MG/KG-BW/DAY  comptox
ECOTOX                ECOTOX                                            comptox
NORMANSUSDAT          NORMANSUSDAT                                      comptox
MASSBANKEU            MASSBANKEU                                        comptox
TOX21SL               TOX21SL                                           comptox
TOXCAST               TOXCAST                                           comptox
KEMIMARKET            KEMIMARKET                                        comptox
MZCLOUD               MZCLOUD                                           comptox
pubMedNeuro           PubMedNeuro                                       comptox
CIGARETTES            CIGARETTES                                        comptox
INDOORCT16            INDOORCT16                                        comptox
SRM2585DUST           SRM2585DUST                                       comptox
SLTCHEMDB             SLTCHEMDB                                         comptox
THSMOKE               THSMOKE                                           comptox
ITNANTIBIOTIC         ITNANTIBIOTIC                                     comptox
STOFFIDENT            STOFFIDENT                                        comptox
KEMIMARKET_EXPO       KEMIMARKET_EXPO                                   comptox
KEMIMARKET_HAZ        KEMIMARKET_HAZ                                    comptox
REACH2017             REACH2017                                         comptox
KEMIWW_WDUIndex       KEMIWW_WDUIndex                                   comptox
KEMIWW_StpSE          KEMIWW_StpSE                                      comptox
KEMIWW_SEHitsOverDL   KEMIWW_SEHitsOverDL                               comptox
ZINC15PHARMA          ZINC15PHARMA                                      comptox
PFASMASTER            PFASMASTER                                        comptox
peakFingerprintScore  AutomatedPeakFingerprintAnnotationScore
lossFingerprintScore  AutomatedLossFingerprintAnnotationScore
agroChemInfo          AgroChemInfo                                      pubchemlite
bioPathway            BioPathway                                        pubchemlite
drugMedicInfo         DrugMedicInfo                                     pubchemlite
foodRelated           FoodRelated                                       pubchemlite
pharmacoInfo          PharmacoInfo                                      pubchemlite
safetyInfo            SafetyInfo                                        pubchemlite
toxicityInfo          ToxicityInfo                                      pubchemlite
knownUse              KnownUse                                          pubchemlite
disorderDisease       DisorderDisease                                   pubchemlite
identification        Identification                                    pubchemlite
annoTypeCount         FPSum                                             pubchemlite
annoTypeCount         AnnoTypeCount                                     pubchemlite
annotHitCount         AnnotHitCount                                     pubchemlite

The compoundScorings function is also useful to programmatically generate a set of scorings to be used for ranking with MetFrag. For instance, the following can be given to the scoreTypes argument to use all default scorings for PubChem: compoundScorings("metfrag", "pubchem", onlyDefault=TRUE)$name.

For all MetFrag scoring types refer to the Candidate Scores section on the MetFragR homepage.

Usage of MetFrag databases

When database="chemspider" setting the chemSpiderToken argument is mandatory.

If a local database is chosen via sdf, psv, or csv then its file location should be set with the LocalDatabasePath value via the extraOpts argument. For example: extraOpts = list(LocalDatabasePath = "C:/myDB.csv").
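
The arguments for a local database could be collected as in the following sketch. The file path is the one from the example above, and the choice of "fragScore" as the only scoring is an assumption made to keep the example independent of the columns in the database file. The fGroups and mslists objects in the commented call are assumed to exist already.

```r
# Arguments for using a local CSV database (values are examples).
localDBArgs <- list(
    database = "csv",
    extraOpts = list(LocalDatabasePath = "C:/myDB.csv"),
    scoreTypes = "fragScore"
)

# Not run: combine with existing featureGroups and MSPeakLists objects.
# compounds <- do.call(patRoon::generateCompoundsMetFrag,
#                      c(list(fGroups, mslists), localDBArgs))
```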

If database="pubchemlite" or database="comptox" and patRoonExt is not installed then the file location must be specified as above or by setting the patRoon.path.MetFragPubChemLite/patRoon.path.MetFragCompTox option. See the installation section in the handbook for more details.
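
Setting the database location through a package option could look as follows. The option name is taken from the text above, while the file path is purely hypothetical and should be replaced by the location of your own copy.

```r
# Hypothetical location of a local PubChemLite file.
options(patRoon.path.MetFragPubChemLite = "~/databases/PubChemLite.csv")

# Verify what is currently configured.
getOption("patRoon.path.MetFragPubChemLite")
```

The patRoon.path.MetFragCompTox option can be set in the same way when database="comptox" is used.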

Parallelization

generateCompoundsMetFrag uses multiprocessing to parallelize computations. See the parallelization section in the handbook for more details and the patRoon options documentation for the available configuration settings.

When local database files are used with generateCompoundsMetFrag (e.g. when database is set to "pubchemlite", "csv" etc.) and patRoon.MP.method="future", then the database file must be present on all the nodes. When pubchemlite or comptox is used, the location for these databases can be configured on the host with the respective package options (patRoon.path.MetFragPubChemLite and patRoon.path.MetFragCompTox) or made available by installing the patRoonExt package. Note that these files must also be present on the local host computer, even if it is not participating in computations.
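
Before starting a run it can help to confirm that the configured database file actually exists on the local host. The sketch below uses a hypothetical helper (hasLocalDB) and only covers the case where the location is configured through the package option. It does not detect databases provided by patRoonExt, and a similar check would have to be run on each node separately.

```r
# Hypothetical helper: report whether a local database file is available.
hasLocalDB <- function(path) {
    !is.null(path) && nzchar(path) && file.exists(path)
}

# Select the "future" multiprocessing method, as named in the text above.
options(patRoon.MP.method = "future")

# Check the configured PubChemLite location on this machine.
dbPath <- getOption("patRoon.path.MetFragPubChemLite")
if (!hasLocalDB(dbPath))
    message("PubChemLite database file not found on this host")
```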

References

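Ruttkies C, Schymanski EL, Wolf S, Hollender J, Neumann S (2016). "MetFrag relaunched: incorporating strategies beyond in silico fragmentation." Journal of Cheminformatics, 8(1). doi:10.1186/s13321-016-0115-9.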

See Also

generateCompounds for more details and other algorithms.


rickhelmus/patRoon documentation built on April 25, 2024, 8:15 a.m.