Generating workflow data

Introduction

Workflow functions

Each step in the non-target workflow is performed by a function that performs the heavy lifting of a workflow step behind the scenes and finally return the results. An important goal of patRoon is to support multiple algorithms for each workflow step, hence, when such a function is called you have to specify which algorithm you want to use. The available algorithms and their characteristics will be discussed in the next sections. An overview of all functions involved in generating workflow data is shown in the table below.

Workflow step | Function | Output S4 class --------------------- | ------------------------------------------------- | ------------ Data pre-treatment | convertMSFiles(), recalibrarateDAFiles() | - Finding features | findFeatures() | features Grouping features | groupFeatures() | featureGroups Suspect screening | screenSuspects() | featureGroupsScreening Componentization | generateComponents() | components MS peak lists | generateMSPeakLists() | MSPeakLists Formula annotation | generateFormulas() | formulas Compound annotation | generateCompounds() | compounds

Workflow output

The output of each workflow step is stored in objects derived from so called S4 classes. Knowing the details about the S4 class system of R is generally not important when using patRoon (and well written resources are available if you want to know more). In brief, usage of this class system allows a general data format that is used irrespective of the algorithm that was used to generate the data. For instance, when features have been found by [OpenMS] or [XCMS] they both return the same data format.

Another advantage of the S4 class system is the usage of so called generic functions. To put simply: a generic function performs a certain task for different types of data objects. A good example is the plotSpectrum() function which plots an (annotated) spectrum from data of MS peak lists or from formula or compound annotation:

# mslists, formulas, compounds contain results for MS peak lists and
# formula/compound annotations, respectively.

plotSpectrum(mslists, ...) # plot raw MS spectrum
plotSpectrum(formulas, ...) # plot annotated spectrum from formula annotation data
plotSpectrum(compounds, ...) # likewise but for compound annotation.

Overview of all functions and their output

makeRefURL <- function(node, page, label, tt) glue::glue("'{ node }' [URL=\"https://rickhelmus.github.io/patRoon/reference/{ page }.html\" label=<<U>{ label }</U>> tooltip=\"{ tt }\"]", node = node, page = page, label = label, tt = tt)
makeRefURLClass <- function(class) makeRefURL(class, paste0(class, "-class"), class, paste(class, "class"))
makeRefURLFunc <- function(func, page = func) makeRefURL(func, page, paste0(func, "()"), paste0(func, "()"))
makeRefURLPage <- function(node, page) makeRefURL(node, page, node, node)

URLs <- paste0(makeRefURLClass(c("transformationProducts", "featureGroupsScreening", "features", "featureGroups",
                                 "MSPeakLists", "formulas", "compounds", "components")),
               makeRefURLFunc(c("findFeatures", "groupFeatures", "generateTPs", "generateMSPeakLists",
                                "generateFormulas", "generateCompounds", "generateComponents")),
               makeRefURLFunc(c("screenSuspects", "annotateSuspects", "generateAnalysisInfo", "addFormulaScoring",
                                "selectIons", "convertToSuspects", "convertToMFDB"),
                              c("suspect-screening", "featureGroupsScreening-class", "analysis-information",
                                "compounds-class", "featureGroups-class", "transformationProducts-class",
                                "generics")),
               makeRefURLPage(c("Analysis information", "Suspect list", "database"),
                              c("analysis-information", "suspect-screening", "generateCompoundsMetFrag")), collapse = "\n")

plotGV(sprintf("
digraph workflow {
  newrank = true
  graph [ rankdir = TB, compound = true, style = invis, splines=ortho ]
  node [ fixedsize = true,
         width = 3.4,
         height = 1,
         fontsize = 24,
         style = filled ]

  subgraph cluster_suspAnn {
    graph [style = invis; margin=20 ]
    'Suspect list' -> screenSuspects -> featureGroupsScreening -> annotateSuspects
    annotateSuspects -> featureGroupsScreening
  }

  subgraph cluster_fGroups {
    graph [style = invis; margin=25 ]
    'Raw MS data' -> 'Analysis information' -> 'findFeatures' -> 'features' -> 'groupFeatures' -> 'featureGroups'
    generateAnalysisInfo -> 'Analysis information'
  }

  subgraph cluster_tps {
    graph [style = invis ]
    Parents -> generateTPs -> transformationProducts
    transformationProducts -> convertToSuspects
    convertToSuspects -> 'Suspect list' [constraint=none]
    transformationProducts -> convertToMFDB -> database
  }

  subgraph cluster_MSPL {
    graph [style = invis ]
    generateMSPeakLists -> MSPeakLists
  }

  subgraph cluster_form {
    graph [style = invis; margin=30 ]
    generateFormulas -> formulas
  }

  subgraph cluster_comp {
    graph [style = invis; margin=30 ]
    generateCompounds -> compounds
  }

  subgraph cluster_compon {
    graph [style = invis margin=25 ]
    generateComponents -> components
    components -> selectIons
    selectIons -> featureGroups
  }

  featureGroups -> generateMSPeakLists
  featureGroups -> generateFormulas
  featureGroups -> generateCompounds
  featureGroups -> generateComponents
  featureGroups -> screenSuspects [constraint=none]
  MSPeakLists -> generateFormulas
  MSPeakLists -> generateCompounds
  MSPeakLists -> annotateSuspects [style=dashed, constraint=none]
  formulas -> annotateSuspects [style=dashed, constraint=none]
  compounds -> annotateSuspects [style=dashed, constraint=none]
  formulas -> addFormulaScoring
  compounds -> addFormulaScoring
  addFormulaScoring -> compounds
  transformationProducts -> generateComponents [constraint=none, style=dashed]
  database -> generateCompounds [constraint=none, style=dashed]

  { rank=same; 'findFeatures'; Parents; 'Suspect list' }
  { rank=same; generateMSPeakLists; generateComponents }
  { rank=same; addFormulaScoring; compounds }
  { rank=same; 'Analysis information'; generateAnalysisInfo }

  Parents, 'Suspect list', 'Raw MS data', database [shape=cylinder; fillcolor=cadetblue1]
  generateAnalysisInfo, findFeatures, groupFeatures, generateTPs, convertToSuspects, convertToMFDB, screenSuspects,
    annotateSuspects, generateMSPeakLists, generateFormulas, generateCompounds, generateComponents,
    selectIons, addFormulaScoring [ shape=Mrecord; fillcolor=\"#FFE6CC\" ]
  transformationProducts, featureGroupsScreening, 'Analysis information', features, featureGroups,
    MSPeakLists, formulas, compounds, components [ shape=octagon; fillcolor=\"#D5E8D4\" ]

  %s
}", URLs), width = 920, height = 800)

The next sections in this chapter will further detail on how to actually perform the non-target workflow steps to generate data. The transformation product screening workflows are discussed in a separate chapter.

Preparations

Data pre-treatment

Prior to performing the actual non-target data processing workflow some preparations often need to be made. Often data has to be pre-treated, for instance, by converting it to an open format that is usable for subsequent workflow steps or to perform mass re-calibration. Some common functions are listed below.

Task | Function | Algorithms | Supported file formats ------------------------------------| ------------------------ | -------------------------------------- | ------- Conversion | convertMSFiles() | [OpenMS], [ProteoWizard], DataAnalysis | All common (algorithm dependent) Advanced (e.g. spectral filtering) | convertMSFiles() | [ProteoWizard] | All common Mass re-calibration | recalibrarateDAFiles() | DataAnalysis | Bruker

The convertMSFiles() function supports conversion between many different file formats typically used in non-target analysis. Furthermore, other pre-treatment steps are available (e.g. centroiding, filtering) when the [ProteoWizard] algorithm is used. For an overview of these functionalities see the MsConvert documentation. Some examples:

# Converts a single mzXML file to mzML format
convertMSFiles("standard-1.mzXML", to = "mzML", algorithm = "openms")

# Converts all Thermo files with ProteoWizard (the default) in the analyses/raw
# directory and stores the mzML files in analyses/raw. Afterwards, only MS1
# spectra are retained.
convertMSFiles("analyses/raw", "analyses/mzml", from = "thermo",
               centroid = "vendor", filters = "msLevel 1")

NOTE Most algorithms further down the workflow require the mzML or mzXML file format and additionally require that mass peaks have been centroided. When using the ProteoWizard algorithm (the default), centroiding by vendor algorithms is generally recommended (i.e. by setting centroid="vendor" as shown in the above example).

When Bruker MS data is used it can be automatically re-calibrated to improve its mass accuracy. Often this is preceeded by calling the setDAMethod() function to set a DataAnalysis method to all files in order to configure automatic re-calibration. The recalibrarateDAFiles() function performs the actual re-calibration. The getDACalibrationError() function can be used at anytime to request the current calibration error of each analysis. An example of these functions is shown below.

# anaInfo is a data.frame with information on analyses (see next section)
setDAMethod(anaInfo, "path/to/DAMethod.m") # configure Bruker files with given method that has automatic calibration setup
recalibrarateDAFiles(anaInfo) # trigger re-calibration for each analysis
getDACalibrationError(anaInfo) # get calibration error for each analysis (NOTE: also shown when previous function is finished)

Analysis information {#anaInfo}

The final bits of preparation is constructing the information for the analyses that need to be processed. In patRoon this is referred to as the analysis information and often stored in a variable anaInfo (of course you are free to choose a different name!). The analysis information should be a data.frame with the following columns:

The generateAnalysisInfo() function can be used to (semi-)automatically generate a suitable data.frame that contains all the required information for a set of analysis. For, instance:

# Take example data from patRoonData package (triplicate solvent blank + triplicate standard)
generateAnalysisInfo(paths = patRoonData::exampleDataPath(),
                     groups = c(rep("solvent-pos", 3), rep("standard-pos", 3)),
                     blanks = "solvent-pos")

(Note that for the example data the patRoonData::exampleAnalysisInfo() function can also be used.)

Alternatively, the newProject() function discussed in the next section can be used to interactively construct this information.

Automatic project generation with newProject() {#newProject}

The previous sections already highlighted some steps that have to be performed prior to starting a new non-target analysis workflow: data pre-treatment and gathering information on the analysis. Most of the times you will put this and other R code a script file so you can recall what you have done before (i.e. reproducible research).

The newProject() function can be used to setup a new project. When you run this function it will launch a small tool (see screenshot below) where you can select your analyses and configure the various workflow steps which you want to execute (e.g. data pre-treatment, finding features, annotation etc). After setting everything up the function will generate a template script which can easily be edited afterwards. In addition, you have the option to create a new RStudio project, which is advantegeous as it neatly seperates your data processing work from the rest.

knitr::include_graphics(file.path(vignDir, "newp.png"))

NOTE At the moment newProject() only works with [RStudio].

Features

Collecting features from the analyses consists of finding all features, grouping them across analyses (optionally after retention time alignment), and if desired suspect screening:

plotGV("
digraph Features {
  graph [ rankdir = LR, compound = true ]
  node [ shape = box,
         fixedsize = true,
         width = 2.2,
         height = 1,
         fontsize = 18,
         fillcolor = darkseagreen1,
         style = filled ]

    'Find features' -> 'Group features'
    'Suspect screening'
    'Group features' -> 'Suspect screening' [style=dashed]
}", height = 70, width = 500)

Finding and grouping features

Several algorithms are available for finding features. These are listed in the table below alongside their usage and general remarks.

Algorithm | Usage | Remarks ---------------- | ------------------------------------------- | -------------- [OpenMS] | findFeatures(algorithm = "openms", ...) | Uses the [FeatureFinderMetabo] algorithm [XCMS] | findFeatures(algorithm = "xcms", ...) | Uses xcms::xcmsSet() function [XCMS] (import) | importFeatures(algorithm = "xcms", ...) | Imports an existing xcmsSet object [XCMS3] | findFeatures(algorithm = "xcms3", ...) | Uses xcms::findChromPeaks() from the new XCMS3 interface [XCMS3] (import) | importFeatures(algorithm = "xcms3", ...) | Imports an existing XCMSnExp object [enviPick] | findFeatures(algorithm = "envipick", ...) | Uses enviPick::enviPickwrap() [KPIC2] | findFeatures(algorithm = "kpic2", ...) | Uses the [KPIC2] R package [KPIC2] (import) | importFeatures(algorithm = "kpic2", ...) | Imports features from [KPIC2] [SIRIUS] | findFeatures(algorithm = "sirius", ...) | Uses [SIRIUS] to find features [SAFD] | findFeatures(algorithm = "safd", ...) | Uses the [SAFD] algorithm (experimental) DataAnalysis | findFeatures(algorithm = "bruker", ...) | Uses Find Molecular Features from DataAnalysis (Bruker only)

Most often the performance of these algorithms heavily depend on the data and parameter settings that are used. Since obtaining a good feature dataset is crucial for the rest of the workflow, it is highly recommended to experiment with different settings (this process can also be automated, see the feature optimization section for more details). Some common parameters to look at are listed in the table below. However, there are many more parameters that can be set, please see the reference documentation for these (e.g. ?findFeatures).

Algorithm | Common parameters ---------------- | --------------------------------------------------------------------------------- [OpenMS] | noiseThrInt, chromSNR, chromFWHM, mzPPM, minFWHM, maxFWHM (see ?findFeatures) [XCMS] / [XCMS3] | peakwidth, mzdiff, prefilter, noise (assuming default centWave algorithm, see ?findPeaks.centWave / ?CentWaveParam) [enviPick] | dmzgap, dmzdens, drtgap, drtsmall, drtdens, drtfill, drttotal, minpeak, minint, maxint (see ?enviPickwrap) [KPIC2] | kmeans, level, min_snr (see ?findFeatures and ?getPIC / ?getPIC.kmeans) [SIRIUS] | The sirius algorithm is currently parameterless [SAFD] | mzRange, maxNumbIter, resolution, minInt (see ?findFeatures) DataAnalysis | See Find -> Parameters... -> Molecular Features in DataAnalysis.

NOTE Support for SAFD is still experimental and some extra work is required to set everything up. Please see the reference documentation for this algorithm (?findFeatures).

NOTE DataAnalysis feature settings have to be configured in DataAnalysis prior to calling findFeatures().

Similarly, for grouping features across analyses several algorithms are supported.

Algorithm | Usage | Remarks ---------------- | -------------------------------------------------------- | ------------------------------------- [OpenMS] | groupFeatures(algorithm = "openms", ...) | Uses the [FeatureLinkerUnlabeled] algorithm (and [MapAlignerPoseClustering] for retention alignment) [XCMS] | groupFeatures(algorithm = "xcms", ...) | Uses xcms::group() xcms::retcor() functions [XCMS] (import) | importFeatureGroupsXCMS(...) | Imports an existing xcmsSet object. [XCMS3] | groupFeatures(algorithm = "xcms3", ...) | Uses xcms::groupChromPeaks() and xcms::adjustRtime() functions [XCMS3] (import) | importFeatureGroupsXCMS3(...) | Imports an existing XCMSnExp object. [KPIC2] | groupFeatures(algorithm = "kpic2", ...) | Uses the [KPIC2] package [KPIC2] (import) | importFeatureGroupsKPIC2(...) | Imports a PIC set object [SIRIUS] | groupFeatures(anaInfo, algorithm = "sirius") | Finds and groups features with [SIRIUS] ProfileAnalysis | importFeatureGroups(algorithm = "brukerpa", ...) | Import .csv file exported from Bruker ProfileAnalysis TASQ | importFeatureGroups(algorithm = "brukertasq", ...) | Imports a Global result table (exported to Excel file and then saved as .csv file)

NOTE: Grouping features with the sirius algorithm will perform both finding and grouping features with [SIRIUS]. This algorithm cannot work with features from another algorithm.

Just like finding features, each algorithm has their own set of parameters. Often the defaults are a good start but it is recommended to have look at them. See ?groupFeatures for more details.

When using the [XCMS] algorithms both the 'classical' interface and latest XCMS3 interfaces are supported. Currently, both interfaces are mostly the same regarding functionalities and implementation. However, since future developments of XCMS are primarily focused the latter this interface is recommended.

Some examples of finding and grouping features are shown below.

# The anaInfo variable contains analysis information, see the previous section

# Finding features
fListOMS <- findFeatures(anaInfo, "openms") # OpenMS, with default settings
fListOMS2 <- findFeatures(anaInfo, "openms", noiseThrInt = 500, chromSNR = 10) # OpenMS, adjusted minimum intensity and S/N
fListXCMS <- findFeatures(anaInfo, "xcms", ppm = 10) # XCMS
fListXCMSImp <- importFeatures(anaInfo, "xcms", xset) # import XCMS xcmsSet object
fListXCMS3 <- findFeatures(anaInfo, "xcms3", CentWaveParam(peakwidth = c(5, 15))) # XCMS3
fListEP <- findFeatures(anaInfo, "envipick", minint = 1E3) # enviPick
fListKPIC2 <- findFeatures(anaInfo, "kpic2", kmeans = TRUE, level = 1E4) # KPIC2
fListSIRIUS <- findFeatures(anaInfo, "sirius") # SIRIUS

# Grouping features
fGroupsOMS <- groupFeatures(fListOMS, "openms") # OpenMS grouping, default settings
fGroupsOMS2 <- groupFeatures(fListOMS2, "openms", rtalign = FALSE) # OpenMS grouping, no RT alignment
fGroupsOMS3 <- groupFeatures(fListXCMS, "openms", maxGroupRT = 6) # group XCMS features with OpenMS, adjusted grouping parameter
# group enviPick features with XCMS3, disable minFraction
fGroupsXCMS <- groupFeatures(fListEP, "xcms3",
                             xcms::PeakDensityParam(sampleGroups = analInfo$group,
                                                    minFraction = 0))
# group with KPIC2 and set some custom grouping/aligning parameters
fGroupsKPIC2 <- groupFeatures(fListKPIC2, "kpic2", groupArgs = list(tolerance = c(0.002, 18)),
                              alignArgs = list(move = "loess"))
fGroupsSIRIUS <- groupFeatures(anaInfo, "sirius") # find/group features with SIRIUS

Suspect screening {#suspscr}

After features have been grouped a so called suspect screening step may be performed to find features that may correspond to suspects within a given suspect list. The screenSuspects() function is used for this purpose, for instance:

suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       mz = c(120.0556, 137.0709, 146.0600))
fGroupsSusp <- screenSuspects(fGroups, suspects)

Suspect list format {#suspFormat}

The example above has a very simple suspect list with just three compounds. The format of the suspect list is quite flexible, and can contain the following columns:

In most cases a suspect list is best made as a csv file which can then be imported with e.g. the read.csv() function. This is exactly what happen when you specify a suspect list when using the newProject() function.

Quite often, the ionized masses are not readily available and these have to be calculated. In this case, data in any of the SMILES/InChI/formula/neutralMass columns should be provided. Whenever possible, it is strongly recommended to fill in SMILES column (or InChI), as this will assist annotation. Applying this to the above example:

suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "Oc1ccc2ccccc2n1"))
fGroupsSusp <- screenSuspects(fGroups, suspects, adduct = "[M+H]+")

NOTE: It is highly recommended to install [OpenBabel] to automatically validate and amend chemical properties such as SMILES, InChI, formulae etc in the suspect list.

Since suspect matching now occurs by the neutral mass it is required that the adduct information for the feature groups are set. This is done either by setting the adduct function argument to screenSuspects or by feature group adduct annotations.

Finally, when the adduct is known for a suspect it can be specified in the suspect list:

# Aldicarb is measured with a sodium adduct.
suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "Aldicarb"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "CC(C)(C=NOC(=O)NC)SC"),
                       adduct = c("[M+H]+", "[M+H]+", "[M+Na]+"))
fGroupsSusp <- screenSuspects(fGroups, suspects)

To summarize:

The fragments_mz and fragments_formula columns in the suspect list can be used to specify known fragments for a suspect, which can help suspect annotation. The former specifies the ionized m/z of known MS/MS peaks, whereas the second specifies known formulas. Multiple values can be given by separating them with a semicolon:

suspects <- data.frame(name = c("1H-benzotriazole", "N-Phenyl urea", "2-Hydroxyquinoline"),
                       SMILES = c("[nH]1nnc2ccccc12", "NC(=O)Nc1ccccc1", "Oc1ccc2ccccc2n1"),
                       fragments_formula = c("C6H6N", "C6H8N;C7H6NO", ""),
                       fragments_mz = c("", "", "118.0652"))

Removing feature groups without hits

Note that any feature groups that were not matched to a suspect are not removed by default. If you want to remove these, you can use the onlyHits parameter:

fGroupsSusp <- screenSuspects(fGroups, suspects, onlyHits = TRUE) # remove any non-hits immediately

The advantage of removing non-hits is that it may significantly reduce the complexity of your dataset. On the other hand, retaining all features allows you to mix a full non-target analysis with a suspect screening workflow. The filter() function (discussed here) can also be used to remove feature groups without a hit at a later stage.

Combining screening results

The amend function argument to screenSuspects can be used to combine screening results from different suspect lists.

fGroupsSusp <- screenSuspects(fGroups, suspects)
fGroupsSusp <- screenSuspects(fGroupsSusp, suspects2, onlyHits = TRUE, amend = TRUE)

In this example the suspect lists defined in suspects and suspects2 are both used for screening. By setting amend=TRUE the original screening results (i.e. from suspects) are preserved. Note that onlyHits should only be set in the final call to screenSuspects to ensure that all feature groups are screened.

Componentization {#componentization}

In patRoon componentization refers to grouping related feature groups together in components. There are different methodologies to generate components:

The following algorithms are currently supported:

Algorithm | Usage | Remarks ----------------------- | -------------------------------------------------- | -------------------------------------------- [CAMERA] | generateComponents(algorithm = "camera", ...) | Clusters feature groups with similar chromatographic elution profiles and annotate by known chemical rules (adducts, isotopologues, in-source fragments). [RAMClustR] | generateComponents(algorithm = "ramclustr", ...) | As above. [cliqueMS] | generateComponents(algorithm = "cliquems", ...) | As above, but using feature components. [OpenMS] | generateComponents(algorithm = "openms", ...) | As above. Uses [MetaboliteAdductDecharger]. [nontarget] | generateComponents(algorithm = "nontarget", ...) | Uses the [nontarget] R package to perform unsupervised homologous series detection. Intensity clustering | generateComponents(algorithm = "intclust", ...) | Groups features with similar intensity profiles across analyses by hierarchical clustering. MS/MS clustering | generateComponents(algorithm = "specclust", ...) | Clusters feature groups with similar MS/MS spectra. Transformation products | generateComponents(algorithm = "tp", ...) | Discussed in its own chapter.

Features with similar chromatographic behaviour {#componentsChrom}

Isotopes, adducts and in-source fragments typically result in detection of multiple mass peaks by the mass spectrometer for a single chemical compound. While some feature finding algorithms already try to collapse (some of) these in to a single feature, this process is often incomplete (if performed at all) and it is not uncommon that multiple features will describe the same compound. To overcome this complexity several algorithms can be used to group features that undergo highly similar chromatographic behavior but have different m/z values. Basic chemical rules are then applied to the resulting components to annotate adducts, in-source fragments and isotopologues, which may be highly useful for general identification purposes.

Note that some algorithms were primarily designed for datasets where features are generally present in the majority of the analyses (as is relatively common in metabolomics). For environmental analyses, however, this is often not the case. For instance, consider the following situation with three feature groups that chromatographically overlap and therefore could be considered a component:

Feature group | m/z | analysis 1 | analysis 2 | analysis 3 ------------- | ---------- | ---------- | ---------- | ---------- #1 | 100.08827 | Present | Present | Absent #2 | 122.07021 | Present | Present | Absent #3 | 138.04415 | Absent | Absent | Present

Based on the mass differences from this example a cluster of [M+H]+, [M+Na]+ and [M+K]+ could be assumed. However, no features of the first two feature groups were detected in the third sample analysis, whereas the third feature group wasn't detected in the first two sample analysis. Based on this it seems unlikely that feature group #3 should be part of the component.

For the algorithms that operate on a 'feature group level' ([CAMERA] and [RAMClustR]), the relMinReplicates argument can be used to remove feature groups from a component that are not abundant. For instance, when this value is 0.5 (the default), and all the features of a component were detected in four different replicate groups in total, then only those feature groups are kept for which its features were detected in at least two different replicate groups (i.e. half of four).

Another approach to reduce unlikely adduct annotations is to use algorithms that operate on a 'feature level' ([cliqueMS] and [OpenMS]). These algorithms generate components for each sample analysis individually. The 'feature components' are then merged by a consensus approach where unlikely annotations are removed (the algorithm is described further in the reference manual, ?generateComponents).

Each algorithm supports many different parameters that may significantly influence the (quality of the) output. For instance, care has to be taken to avoid 'over-clustering' of feature groups which do not belong in the same component. This is often easily visible since the chromatographic peaks poorly overlap or are shaped differently. The checkComponents function (discussed here) can be used to quickly verify componentization results. For a complete listing all arguments see the reference manual (e.g. ?generateComponents).

Once the components with adduct and isotopes annotations are generated this data can be used to prioritize and improve the workflow.

Some example usage is shown below.

# Use CAMERA with defaults
componCAM <- generateComponents(fGroups, "camera", ionization = "positive")

# CAMERA with customized settings
componCAM2 <- generateComponents(fGroups, "camera", ionization = "positive",
                                 extraOpts = list(mzabs = 0.001, sigma = 5))

# Use RAMClustR with customized parameters
componRC <- generateComponents(fGroups, "ramclustr", ionization = "positive", hmax = 0.4,
                               extraOptsRC = list(cor.method = "spearman"),
                               extraOptsFM = list(ppm.error = 5))

# OpenMS with customized parameters
componOpenMS <- generateComponents(fGroups, "openms", ionization = "positive", chargeMax = 2,
                                   absMzDev = 0.002)

# cliqueMS with default parameters
componCliqueMS <- generateComponents(fGroups, "cliquems", ionization = "negative")

Homologues series

Homologues series can be automatically detected by interfacing with the [nontarget] R package. Components are made from feature groups that show increasing m/z and retention time values. Series are first detected within each replicate group. Afterwards, series from all replicates are linked in case (partial) overlap occurs and this overlap consists of the same feature groups (see figure below). Linked series are then finally merged if this will not cause any conflicts with other series: such a conflict typically occurs when two series are not only linked to each other.

plotGV("
digraph Homologs {
  graph [ rankdir = LR, compound = true, style = invis ]
  node [ shape = oval,
         fixedsize = true,
         width = 2.3,
         height = 0.8,
         fontsize = 25,
         fillcolor = darkseagreen1,
         style = filled ]

  subgraph cluster3 {
    I [fillcolor=skyblue]; J [fillcolor=skyblue]; K [fillcolor=skyblue]
    G -> H -> I -> J -> K [style=invis]
  }

  subgraph cluster4 {
    _I [label=I, fillcolor=skyblue]; X [fillcolor=indianred1]; _K [label=K, fillcolor=skyblue]
    _I -> X -> _K -> L [style=invis]
  }

  H -> _I [ltail=cluster3, lhead=cluster4, style=invis]

  I -> _I [constraint=false, style=dashed, arrowhead=none]
  K -> _K [constraint=false, style=dashed, arrowhead=none]

  subgraph cluster1 {
    C [fillcolor=skyblue]; D [fillcolor=skyblue]
    A -> B -> C -> D [style=invis]
  }

  subgraph cluster2 {
    _C [label=C, fillcolor=skyblue]; _D [label=D, fillcolor=skyblue]
    _C -> _D -> E -> F [style=invis]
  }

  B -> _C [ltail=cluster1, lhead=cluster2, style=invis]

  C -> _C [constraint=false, style=dashed, arrowhead=none]
  D -> _D [constraint=false, style=dashed, arrowhead=none]
}", height = 250, width = 600)

The series that are linked can be interactively explored with the plotGraph() function (discussed here).

Common function arguments to generateComponents() are listed below.

Argument | Remarks -------------------- | ----------------------------------------------------------------------- ionization | Ionization mode: "positive" or "negative". Not needed if adduct annotations are available. rtRange, mzRange | Retention and m/z increment range. Retention times can be negative to allow series with increasing m/z values and decreasing retention times. elements | Vector with elements to consider. rtDev, absMzDev | Maximum retention time and m/z deviation. ... | Further arguments passed to the homol.search() function.

# default settings
componNT <- generateComponents(fGroups, "nontarget", ionization = "positive")

# customized settings
componNT2 <- generateComponents(fGroups, "nontarget", ionization = "positive",
                                elements = c("C", "H"), rtRange = c(-60, 60))

Intensity and MS/MS similarity {#compClust}

The previous componentization methods utilized chemical properties to relate features. The two componentization algorithms described in this section use a statistical approach based on hierarchical clustering. The first algorithm normalizes all feature intensities and then clusters features with similar intensity profiles across sample analyses together. The second algorithm compares all MS/MS spectra from all feature groups, and then uses hierarchical clustering to generate components from feature groups that have a high MS/MS spectrum similarity.

Some common arguments to generateComponents() are listed below. It is recommended to test various settings (especially for method) to optimize the clustering results.

Argument | Algorithm | Default | Remarks --------------------------------------------- | ----------- | ---------------- | ---------------------------------------------- method | All | "complete" | Clustering method. See ?hclust metric | intclust | "euclidean" | Metric used to calculate the distance matrix. See ?daisy. normalized |intclust|TRUE| Whether normalized feature intensities should be used. Detailed [here](#fNorm).average|intclust|TRUE| Whether intensities of replicates should first be averaged.MSPeakLists|specclust| - | The [MS peak lists] object used for spectral similarity calculationsspecSimParams|specclust|getDefSpecSimParams()| Parameters used for [spectral similarity calculation](#specSim).maxTreeHeight,deepSplit,minModuleSize| All |1,TRUE,1| Used for dynamic cluster assignment. See?cutreeDynamicTree`.

The components are generated by automatically assigning clusters using the [dynamicTreeCut] R package. However, the cluster assignment can be performed manually or with different parameters, as is demonstrated below.

The resulting components are stored in an object from the componentsIntClust or componentsSpecClust S4 class, which are both derived from the componentsClust class (which in turn is derived from the components class). Several methods are defined that can be used on these objects to re-assign clusters, perform plotting operations and so on. Below are some examples. For plotting see the relevant visualization section. More info can be found in the reference manual (e.g. ?componentsIntClust, ?componentsSpecClust and ?componentsClust).

# generate intensity profile components with default settings
componInt <- generateComponents(fGroups, "intclust")

# manually re-assign clusters
componInt <- treeCut(componInt, k = 10)

# automatic re-assignment of clusters (adjusted max tree height)
componInt <- treeCutDynamic(componInt, maxTreeHeight = 0.7)

# MS/MS similarity components
componMSMS <- generateComponents(fGroups, "specclust", MSPeakLists = mslists)

Incorporating adduct and isotopic data {#incorpAdductIso}

With mass spectrometry it is common that multiple m/z values are detected for a single compound. These may be different adducts (e.g. [M+H]+, [M+Na]+, [M-H]-), the different isotopes of the molecule or a combination thereof. When multiple m/z values are measured for the same compound, the feature finding algorithm may yield a distinct feature for each, which adds complexity to the data. In the previous section it was discussed how componentization can help to find feature groups that belong to the same adduct and/or isotope clusters. This section explains how this data can be used to simplify the feature dataset. Furthermore, this section also covers adduct annotations for feature groups which may improve and simplify the general workflow.

Selecting features with preferential adducts/isotopes

The selectIons function forms the bridge between feature group and componentization data. This function uses the adduct and isotope annotations to select preferential feature groups. For adduct clusters this means that only the feature group that has a preferential adduct (e.g. [M+H]+) is kept while others (e.g. [M+Na]+) are removed. If none of the adduct annotations are considered preferential, the most intense feature group is kept instead. For isotopic clusters typically only the feature group with the monoisotopic mass (i.e. M0) is kept.

The behavior of selectIons is configurable with the following parameters:

Argument | Remarks ---------------- | ------------------------------------------------------- prefAdduct | The preferential adduct. Usually "[M+H]+" or "[M-H]-". onlyMonoIso | If TRUE and a feature group is with isotopic annotations then it is only kept if it is monoisotopic. chargeMismatch | How charge mismatches between adduct and isotope annotations are dealt with. Valid options are "isotope", "adduct", "none" or "ignore". See the reference manual for selectIons for more details.

In case componentization did not lead to an adduct annotation for a feature group it will never be removed and simply be annotated with the preferential adduct. Similarly, when no isotope annotations are available and onlyMonoIso=TRUE, the feature group will not be removed.

Although selectIons operates fairly conservative, it is still recommended to verify the componentization results in advance, for instance with the checkComponents function discussed here. Furthermore, the next subsection explains how adduct annotations can be corrected manually if needed.

An example usage is shown below.

fGroupsSel <- selectIons(fGroups, componCAM, "[M+H]+")

Setting adduct annotations for feature groups

The adducts() function can be used to obtain a character vector with adduct annotations for each feature group. When no adduct annotations are available it will simply return an empty character vector.

When the selectIons function is used it will automatically add adduct annotations based on the componentization data. In addition, the adducts()<- function can be used to manually add or change adduct annotations.

adducts(fGroups) # no adduct annotations
adducts(fGroupsSel)[1:5] # adduct annotations set by selectIons()

adducts(fGroupsSel)[3] <- "[M+Na]+" # modify annotation
adducts(fGroupsSel)[1:5] # verify

NOTE Adduct annotations are always available with sets workflows.

Using adduct annotations in the workflow {#useAdductAnn}

When feature groups have adduct annotations available this may simplify and improve the workflow. The adduct and ionization arguments used for suspect screening, formula/compound annotation and some componentization algorithms do not have to be set anymore, since this data can be obtained from the adduct annotations. Furthermore, these algorithms may improve their results, since the algorithms are now able to use adduct information for each feature group individually, instead of assuming that all feature groups have the same adduct.

Annotation

The annotation consists of collecting MS peak lists and then formula and/or compound annotation:

plotGV("
digraph Annotation {
  graph [ rankdir = LR ]
  node [ shape = box,
         fixedsize = true,
         width = 2.3,
         height = 1,
         fontsize = 18,
         fillcolor = darkseagreen1,
         style = filled ]

    'MS peak lists' -> 'Formula annotation'
    'MS peak lists' -> 'Compound annotation'
    'Formula annotation':n -> 'Suspect annotation':n [style = dotted]
    'Compound annotation':s -> 'Suspect annotation':s [style = dotted]
    'Formula annotation':e -> 'Compound annotation':e [style = dotted, constraint = false]
}", height = 120, width = 500)

Note that compound annotation is normally not dependent upon formula annotation. However, formula data can be used to improve ranking of candidates afterwards by the addFormulaScoring() function, which will be discussed later in this section. Furthermore, suspect annotation is not mandatory, and may use data from peak lists, formulae and/or comounds.

MS peak lists

Algorithm | Usage | Remarks ------------ | ------------------------------------------------------- | ----------------------------------------------------- [mzR] | generateMSPeakLists(algorithm = "mzr", ...) | Uses [mzR] for spectra retrieval. Recommended default. DataAnalysis | generateMSPeakLists(algorithm = "bruker", ...) | Loads data after automatically generating MS and MS/MS spectra in DataAnalysis DataAnalysis FMF | generateMSPeakLists(algorithm = "brukerfmf", ...) | Uses spectra from the find molecular features algorithm.

The recommended default algorithm is mzr: this algorithm is generally faster and is not limited to a vendor data format as it will read the open mzML and mzXML file formats. On the other hand, when DataAnalysis is used with Bruker data the spectra can be automatically background subtracted and there is no need for file conversion. Note that the brukerfmf algorithm only works when findFeatures() was called with the bruker algorithm.

When generateMSPeakists() is called it will

  1. Find all MS and MS/MS spectra that 'belong' to a feature. For MS spectra this means that all spectra close to the retention time of a feature will be collected. In addition, for MS/MS normally only spectra will be considered that have a precursor mass close to that of the feature (however, this can be disabled for data that was recorded with data independent acquisition (DIA, MS^E, bbCID, ...)).
  2. Average all MS and MS/MS spectra to produce peak lists for each feature.
  3. Average all peak lists for features within the same group.

Data from either (2) or (3) is used for subsequent annotation steps. Formula calculation can use either (as a trade-off between possibly more accurate results by outlier removal vs speed), whereas compound annotation will always use data from (3) since annotating single features (as opposed to their groups) would take a very long time.

There are several common function arguments to generateMSPeakLists() that can be used to optimize its behaviour:

Argument | Algorithm(s) | Remarks ------------------------------------ | --------------------- | ---------------------------------------------------------------------- maxMSRtWindow | mzr, bruker | Maximum time window +/- the feature retention time (in seconds) to collect spectra for averaging. Higher values may significantly increase processing times. precursorMzWindow | mzr | Maximum precursor m/z search window to find MS/MS spectra. Set to NULL to disable (i.e. for DIA experiments). topMost | mzr | Only retain feature data for no more than this amount analyses with highest intensity. For instance, a value of 1 will only keep peak lists for the feature with highest intensity in a feature group.
bgsubtr | bruker | Perform background subtraction (if the spectra type supports this, e.g. MS and bbCID) minMSIntensity, minMSMSIntensity | bruker, brukerfmf | Minimum MS and MS/MS intensity. Note that DataAnalysis reports many zero intensity peaks so a value of at least 1 is recommended. MSMSType | bruker | The type of spectra that should be used for MSMS: "BBCID" for bbCID experiments, otherwise "MSMS" (the default).

In addition, several parameters can be set that affect spectral averaging. These parameters are passed as a list to the avgFeatParams (mzr algorithm only) and avgFGroupParams arguments, which affect averaging of feature and feature group data, respectively. Some typical parameters include:

See ?generateMSPeakLists for all possible parameters.

A suitable list object to set averaging parameters can be obtained with the getDefAvgPListParams() function.

# lower default clustering window, other settings remain default
avgPListParams <- getDefAvgPListParams(clusterMzWindow = 0.001)

# Apply to both feature and feature group averaging
plists <- generateMSPeakLists(fGroups, "mzr", avgFeatParams = avgPListParams, avgFGroupParams = avgPListParams)

Formulae

Formulae can be automatically calculated for all features using the generateFormulas() function. The following algorithms are currently supported:

Algorithm | Usage | Remarks ------------ | ---------------------------------------------- | ------------------------------------------------ [GenForm] | generateFormulas(algorithm = "genform", ...) | Bundled with patRoon. Reasonable default. [SIRIUS] | generateFormulas(algorithm = "sirius", ...) | Requires MS/MS data. DataAnalysis | generateFormulas(algorithm = "bruker", ...) | Requires FMF features (i.e. findFeatures(algorithm = "bruker", ...)). Uses SmartFormula algorithms.

Calculation with [GenForm] is often a good default. It is fast and basic rules can be applied to filter out obvious non-existing formulae. A possible drawback of GenForm, however, is that may become slow when many candidates are calculated, for instance, due to a relative high feature m/z (e.g. >600) or loose elemental restricitions. More thorough calculation is performed with [SIRIUS]: this algorithm often yields fewer and often more plausible results. However, [SIRIUS] requires MS/MS data (hence features without will not have results) and formula prediction may not work well for compounds that structurally deviate from the training sets used by [SIRIUS]. Calculation with DataAnalysis is only possible when features are obtained with DataAnalysis as well. An advantage is that analysis files do not have to be converted, however, compared to other algorithms calculation is often relative slow.

There are two methods for formula assignment:

  1. Formulae are first calculated for each individual feature within a feature group. These results are then pooled, outliers are removed and remaining formulae are assigned to the feature group (i.e. calculateFeatures = TRUE).
  2. Formulae are directly calculated for each feature group by using group averaged peak lists (see previous section) (i.e. calculateFeatures = FALSE).

The first method is more thorough and the possibility to remove outliers may sometimes result in better formula assignment. However, the second method is much faster and generally recommended for large number of analyses.

By default, formulae are either calculated by only MS/MS data (SIRIUS) or with both MS and MS/MS data (GenForm/Bruker). The latter also allows formula calculation when no MS/MS data is present. Furthermore, with Bruker algorithms, data from both MS and MS/MS formula data can be combined to allow inclusion of candidates that would otherwise be excluded by e.g. poor MS/MS data. However, a disadvantage is that formulae needs to be calculated twice. The MSMode argument (listed below) can be used to customize this behaviour.

An overview of common parameters that are typically set to customize formula calculation is listed below.

Argument | Algorithm(s) | Remarks ----------------- | ------------------- | ------------------------------------------------------------------------------- relMzDev | genform, sirius | The maximum relative m/z deviation for a formula to be considered (in ppm). elements | genform, sirius | Which elements to consider. By default "CHNOP". Try to limit possible elements as much as possible. calculateFeatures | genform, sirius | Whether formulae should be calculated first for all features (see discussion above) (always TRUE with DataAnalysis). featThresholdAnn | All | Minimum relative amount (0-1) that a candidate formula for a feature group should be found among all annotated features (e.g. 1 means that a candidate is only considered if it was assigned to all annotated features). adduct | All | The adduct to consider for calculation (e.g. "[M+H]+", "[M-H]-", more details in the adduct section). Don't set this when adduct annotations are available. MSMode | genform, bruker | Whether formulae should be generated only from MS data ("ms"), MS/MS data ("msms") or both ("both"). The latter is default, see discussion above. profile | sirius | Instrument profile, e.g. "qtof", "orbitrap", "fticr".

Some typical examples:

formulasGF <- generateFormulas(fGroups, mslists, "genform") # GenForm, default settings
formulasGF2 <- generateFormulas(fGroups, mslists, "genform", calculateFeatures = FALSE) # direct feature group assignment (faster)
formulasSIR <- generateFormulas(fGroups, mslists, "sirius", elements = "CHNOPSClBr") # SIRIUS, common elements for pollutant
formulasSIR2 <- generateFormulas(fGroups, mslists, "sirius", adduct = "[M-H]-") # SIRIUS, negative ionization
formulasBr <- generateFormulas(fGroups, mslists, "bruker", MSMode = "MSMS") # Only consider MSMS data (SmartFormula3D)

Compounds {#compounds}

An important step in a typical non-target workflow is structural identification for features of interest, as this information may finally reveal what a feature is. In a first step all possible candidate structures for a feature are obtained from a database (based on e.g. monoisotopic mass or formula). These candidates are then ranked, for instance, by matching the feature MS/MS data with in-silico or library MS/MS spectra or its relevance to the environment.

Structure assignment in patRoon is performed automatically for all feature groups with the generateCompounds() function. Currently, this function supports the following algorithms:

Algorithm | Usage | Remarks ---------------------- | ----------------------------------------------- | ------------------------------------------------ [MetFrag] | generateCompounds(algorithm = "metfrag", ...) | Supports many databases (including offline and custom), matching MS/MS data with in-silico and library MS/MS data, and many other scorings to rank candidates. [SIRIUS] with [CSI:FingerID] | generateCompounds(algorithm = "sirius", ...) | Matches with in-silico MS/MS data, incorporates formula annotations to improve candidate selection. Library | generateCompounds(algorithm = "library", ...) | Obtains candidates by matching MS/MS data with an offline MS library, e.g. obtained from [MassBank.eu] or [MoNA].

All algorithms rank their candidates by matching MS/MS data with in-silico generated MS/MS data (MetFrag and SIRIUS) and/or experimental MS/MS data from an MS library (MetFrag with MoNA scoring and Library algorithm). The latter may yield better candidates, and the Library algorithm is also generally much faster. However, in-silico annotation is not limited by the availability of experimental MS/MS data.

Compound annotation is often a relative time and resource intensive procedure. For this reason, annotation occurs for each feature group and not individual features. Nevertheless, it is not uncommon that this is the most time consuming step in the workflow. For this reason, prioritization of features is highly important, even more so to avoid 'abusing' servers when an online database is used for compound retrieval.

Database selection for MetFrag and SIRIUS

Selecting the right database is important for proper candidate assignment. If the 'right' chemical compound is not present in the used database, it is impossible to assign the correct structure. Luckily, however, several large databases such as [PubChem] and [ChemSpider] are openly available which contain tens of millions of compounds. On the other hand, these databases may also lead to many unlikely candidates and therefore more specialized (or custom databases) may be preferred. Which database will be used is dictated by the database argument to generateCompounds(), currently the following options exist:

Database | Algorithm(s) | Remarks ------------------- | ----------------------- | ----------------------- pubchem | "metfrag", "sirius" | [PubChem] is currently the largest compound database and is used by default. chemspider | "metfrag" | [ChemSpider] is another large database. Requires security token from here (see next section). comptox | "metfrag" | The EPA [CompTox] contains many compounds and scorings relevant to environmental studies. Needs manual download (see next section). pubchemlite | "metfrag" | A specialized subset of the PubChem database. Needs manual download (see next section). for-ident | "metfrag" | The [FOR-IDENT] (STOFF-IDENT) database for water related substances. kegg | "metfrag", "sirius" | The [KEGG] database for biological compounds hmdb | "metfrag", "sirius" | The [HMDB] contains many human metabolites. bio | "sirius" | Selects all supports biological databases. csv, psv, sdf | "metfrag" | Custom database (see next section). [CSV example][csvDB-ex].

Configuring MetFrag databases and scoring

Some extra configuration may be necessary when using certain databases with MetFrag. In order to use the ChemSpider database a security token should be requested and set with the chemSpiderToken argument to generateCompounds(). The CompTox and PubChemLite databases need to be manually downloaded from [CompTox][CompTox-dl] (or variations with [smoking][CompTox-smoke] or [wastewater][CompTox-WW] metadata) and [PubChemLite][PCLite-dl] (or the PubChem derived [OECD PFAS database][PC-PFAS]). The file location of this and other local databases (csv, psv, sdf) needs to be manually configured, see the examples below and/or ?generateCompounds for more information on how to do this.

# PubChem: the default
compsMF <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+")

# ChemSpider: needs security token
compsMF2 <- generateCompounds(fGroups, mslists, "metfrag", database = "chemspider",
                              chemSpiderToken = "MY_TOKEN_HERE", adduct = "[M+H]+")

# CompTox: set global option to database path
options(patRoon.path.MetFragCompTox = "~/CompTox_17March2019_SelectMetaData.csv")
compsMF3 <- generateCompounds(fGroups, mslists, "metfrag", database = "comptox", adduct = "[M+H]+")

# CompTox: set database location without global option
compsMF4 <- generateCompounds(fGroups, mslists, "metfrag", database = "comptox", adduct = "[M+H]+",
                              extraOpts = list(LocalDatabasePath = "~/CompTox_17March2019_SelectMetaData.csv"))

# Same, but for custom database
compsMF5 <- generateCompounds(fGroups, mslists, "metfrag", database = "csv", adduct = "[M+H]+",
                              extraOpts = list(LocalDatabasePath = "~/mydb.csv"))

An example of a custom .csv database can be found [here][csvDB-ex].

With MetFrag compound databases are not only used to retrieve candidate structures but are also used to obtain metadata for further ranking. Each database has its own scorings, a table with currently supported scorings can be obtained with the compoundScorings() function (some columns omitted):

makeTable(compoundScorings()[, c("name", "metfrag", "database", "default")])

The first two columns contain the generic and original MetFrag naming schemes for each scoring type. While both naming schemes can be used, the generic is often shorter and harmonized with other algorithms (e.g. SIRIUS). The database column specifies for which databases a particular scoring is available (empty if not database specific). Most scorings are selected by default (as specified by the default column), however, this behaviour can be customized by using the scoreTypes argument:

# Only in-silico and PubChem number of patents scorings
compsMF1 <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+",
                              scoreTypes = c("fragScore" "numberPatents"))

# Custom scoring in custom database
compsMF2 <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+",
                              database = "csv",
                              extraOpts = list(LocalDatabasePath = "~/mydb.csv"),
                              scoreTypes = c("fragScore", "myScore", "myScore2"))

By default ranking is performed with equal weight (i.e. 1) for all scorings. This can be changed by the scoreWeights argument, which should be a vector containing the weights for all scorings following the order of scoreTypes, for instance:

compsMF <- generateCompounds(fGroups, mslists, "metfrag", adduct = "[M+H]+",
                             scoreTypes = c("fragScore" "numberPatents"),
                             scoreWeights = c(1, 2))

Sometimes thousands or more structural candidates are found when annotating a feature group. In this situation processing all these candidates will too involving (especially when external databases are used). To avoid this a default cut-off is set: when the number of candidates exceed a certain amount the search will be aborted and no results will be reported for that feature group. The maximum number of candidates can be set with the maxCandidatesToStop argument. The default value is relative conservative, especially for local databases it may be useful to increase this number.

MetFrag error and timeout handling

The use of online databases has the drawback that an error may occur, for instance, as a result of a connection error or when the aforementioned maximum number of candidates is reached (maxCandidatesToStop argument). By default, the processing is restarted if an error has occurred (configured by the errorRetries argument). Similarly, the timeoutRetries and timeout arguments can be used to avoid being 'stuck' on obtaining results, for instance, due to an unstable internet connection. If no compounds could be assigned due to an error a warning will be issued. In this case it is best to see what went wrong by manually checking the log files, which by default are stored in the log/metfrag folder.

Annotation with the Library algorithm {#compAnnLib}

To use the Library algorithm we first need to load an MS library. Currently, MS libraries in the MSP and [MoNA] JSON formats are supported. Note that the former format is not so well standardized, and the support in patRoon was mainly tailored for MSP files from [MassBank.eu] and [MoNA]. To load the MS library the loadMSLibrary() function is used:

mslibrary <- loadMSLibrary("~/MassBank_NIST.msp", "msp") # MassBank.eu MSP library
mslibrary <- loadMSLibrary("~/MoNA-export-CASMI_2016.msp", "msp") # MoNA MSP library
mslibrary <- loadMSLibrary("~/MoNA-export-MassBank.json", "json") # MoNA JSON library

NOTE Currently it is only possible to load formula annotated MS/MS peaks with the MoNA JSON format.

Once loaded, the MS library can be post-processed with various filtering, subsetting and export functionality, which may be useful for more tailored compound annotation. This is further discussed in the advanced chapter.

The compound annotation is performed with generateCompounds():

compsLib <- generateCompounds(fGroups, mslists, "library", MSLibrary = mslibrary)

# set minimum MS/MS spectral match for candidates to 0.5
compsLib <- generateCompounds(fGroups, mslists, "library", MSLibrary = mslibrary, minSim = 0.5)

Formula scoring

Ranking of candidate structures may further be improved by incorporating formula information by using the addFormulaScoring() function:

comps <- addFormulaScoring(coms, formulas, updateScore = TRUE)

Here, corresponding formula and explained fragments will be used to calculate a formulaScore for each candidate. Note that SIRIUS candidates are already based on calculated formulae, hence, running this function on SIRIUS results is less sensible unless scoring from another formula calculation algorithm is desired.

Further options and parameters

There are many more options and parameters that affect compound annotation. For a full overview please have a look at the reference manual (e.g. by running ?generateCompounds).

Suspect annotation {#suspAnn}

The data obtained during the previously described annotation steps can be used to improve a suspect screening workflow. The annotateSuspects() method uses the annotation data to calculate various annotation properties for each suspect, such as their rank in formula/compound candidates, which fragments from the suspect list were matched, and a rough indication of the identification level according to @Schymanski2014

fGroupsSusp <- annotateSuspects(fGroupsSusp, MSPeakLists = mslists,
                                formulas = formulas, compounds = compounds)

The calculation of identification levels is performed by a set of pre-defined rules. The genIDLevelRulesFile() can be used to inspect the default rules or to create your own rules file, which can subsequently passed to annotateSuspects() with the IDFile argument. See ?annotateSuspects for more details on the file format and options. The default identification levels can be summarized as follows:

Level | Description | Rules ----- | ----------- | -------------------------------- 1 | Target match | Retention time deviates <12 seconds from suspect list. At least 3 (or all if the suspect list contains less) fragments from the suspect list must match. 2a | Good MS/MS library match | Suspect is top ranked in the compounds results. The individualMoNAScore (MetFrag) or libMatch (Library algorithm) is at least 0.9 and no other candidates were matched with the MS library. 3a | Fair library match | The individualMoNAScore or libMatch is at least 0.4. 3b | Known MS/MS match | At least 3 (or all if the suspect list contains less) fragments from the suspect list must match. 3c | Good in-silico MS/MS match | The annotation MS/MS similarity (annSimComp column) is at least 0.7. 4a | Good formula MS/MS match | Suspect is top ranked formula candidate, annotation MS/MS similarity (annSimForm column) is at least 0.7 and isotopic match (isoScore) of at least 0.5. The latter two scores are at least 0.2 higher than next best ranked candidate. 4b | Good formula isotopic pattern match | Suspect is top ranked formula candidate and isotopic match (isoScore) of at least 0.9 and at least 0.2 higher than next best ranked candidate. 5 | Unknown | All else.

In general, the more data provided by the suspect list and to annotateSuspects(), the better identification level estimation works. For instance, when considering the default rules, either the fragments_mz or fragments_formula column is necessary to be able assign a level 3b. Similarly, the suspect list needs retention times (as well as fragment data) to be able to assign level 1. As you can imagine, providing the annotation workflow objects (i.e. MSPeakLists, formulas, compounds) to annotateSuspects() is necessary for calculation of most levels.

The annotateSuspects() function will log decisions for identification level assignments to the log/ sub-directory in the current working directory. This is useful to inspect level assignments and especially useful when you customized any rules.

NOTE: The current identification level rules are only optimized for GenForm and MetFrag annotation algorithms.



rickhelmus/patRoon documentation built on April 25, 2024, 8:15 a.m.