Advanced usage {#advanced_usage}

Adducts {#adducts}

When generating formulae or compound annotations, and for several other functionalities, the adduct species must be specified. Behind the scenes, different algorithms typically use different formats. For instance, in order to specify a protonated species...

In addition, most algorithms only accept a limited set of possible adducts, which do not necessarily all overlap with each other. The GenFormAdducts() and MetFragAdducts() functions list the possible adducts for GenForm and MetFrag, respectively.

In order to simplify the situation patRoon internally uses its own format and converts it automatically to the algorithm specific format when necessary. Furthermore, during conversion it checks if the specified adduct format is actually allowed by the algorithm. Adducts in patRoon are stored in the adduct S4 class. Objects from this class specify which elements are added and/or subtracted, the final charge and the number of molecules present in the adduct (e.g. 2 for a dimer).

adduct(add = "H") # [M+H]+
adduct(sub = "H", charge = -1) # [M-H]-
adduct(add = "K", sub = "H2", charge = -1) # [M+K-H2]-
adduct(add = "H3", charge = 3) # [M+H3]3+
adduct(add = "H", molMult = 2) # [2M+H]+

An easier way to generate adduct objects is by using the as.adduct() function:

as.adduct("[M+H]+")
as.adduct("[M+H2]2+")
as.adduct("[2M+H]+")
as.adduct("[M-H]-")
as.adduct("+H", format = "genform")
as.adduct(1, isPositive = TRUE, format = "metfrag")

In fact, the adduct argument to workflow functions such as generateFormulas() and generateCompounds() is automatically converted to an adduct class with the as.adduct() function if necessary:

formulas <- generateFormulas(..., adduct = adduct(sub = "H", charge = -1))
formulas <- generateFormulas(..., adduct = "[M-H]-") # same as above

More details can be found in the reference manual (?adduct and ?`adduct-utils`).

Feature intensity normalization {#fNorm}

Feature intensities are often compared between sample analyses, for instance, to evaluate trends between sample points. However, matrix effects, varying detector sensitivity and differences in analysed sample amount may complicate such comparison. For this reason, it may be desired to normalize the feature intensities.

The normInts() function is used to normalize feature intensities (peak heights and areas). Two different types are supported:

  1. Feature normalization: normalization occurs by intensities within the same sample analysis
  2. Group normalization: normalization occurs by intensities among features within the same group

Both normalization types can be combined.

Feature normalization

Feature normalization itself supports the following normalization methods:

Method | Usage | Description
------------------ | ---------------------------------- | --------------------------------------------------
TIC | normInts(featNorm = "tic", ...) | Normalizes by the combined intensity of all features, also known as the Total Ion Current (TIC).
Internal Standard | normInts(featNorm = "istd", ...) | Uses internal standards (IS) to normalize feature intensities.
Concentration | normInts(featNorm = "conc", ...) | Normalizes feature intensities of a sample analysis by its normalization concentration (explained below).
None | normInts(featNorm = "none", ...) | Performs no feature normalization. Set this if you only want to perform group normalization (discussed in the next section).

Normalization concentration

All methods (except "none") are influenced by the normalization concentration, which is a property set for each sample analysis. For IS normalization, this should equal the concentration of the IS present in the sample. Otherwise, the normalization concentration represents the injected sample amount. The normalization concentration is defined in the norm_conc column of the analysis information. For example:

# obtain analysis information as usual, but add normalization concentrations.
# The blanks are set to NA, and will therefore not be normalized.
generateAnalysisInfo(paths = patRoonData::exampleDataPath(),
                     groups = c(rep("solvent", 3), rep("standard", 3)),
                     blanks = "solvent",
                     norm_concs = c(NA, NA, NA, 2, 2, 1))

The normalization concentration does not need to be an absolute value; in the end, only the relative numbers between the sample analyses matter. For example, if the concentrations for two analyses are c(1, 2) or c(1.5, 3.0) the normalization outcome is the same. Setting the concentration to NA (or 0) will skip normalization for an analysis. If the normalization concentration is absent from the analysis information it defaults to 1.

Internal standard normalization

For IS normalization an internal standard list should be specified, which describes the properties of the internal standards. Essentially, the format of this list is exactly the same as a suspect list. Example lists can be found in the patRoonData package:

patRoonData::ISTDListPos[1:5, ]

As can be seen from above, labelled isotopes can be specified with square brackets, e.g. [2]H for deuterium.

The next step is to perform the normalization with normInts():

fGroupsNorm <- normInts(fGroups, featNorm = "istd", standards = patRoonData::ISTDListPos, adduct = "[M+H]+",
                        ISTDRTWindow = 20, ISTDMZWindow = 200, minISTDs = 2)

This will do the following: internal standards are assigned to each feature group based on closeness in retention time and m/z (configurable with the ISTDRTWindow and ISTDMZWindow arguments, while minISTDs sets the minimum number of internal standards that should be assigned), and the feature intensities are subsequently normalized by the intensities of their assigned internal standards.

To evaluate the assignments for a particular feature group, the internalStandardAssignments() and plotGraph() functions can be used:

fg <- names(fGroupsNorm)[2]
internalStandardAssignments(fGroupsNorm)[[fg]] # IS assignments for 2nd feature group
plotGraph(fGroupsNorm) # interactively explore assignments

Other methods

Like IS normalization, the other feature normalization methods are also performed with normInts():

fGroupsNorm <- normInts(fGroups, featNorm = "tic") # TIC normalization
fGroupsNorm <- normInts(fGroups, featNorm = "conc") # Concentration normalization

Group normalization

Normalizing feature intensities among group members is easily performed by setting groupNorm=TRUE:

# only perform group normalization
fGroupsNorm <- normInts(fGroups, featNorm = "none", groupNorm = TRUE)
# first perform TIC feature normalization and then group normalization
fGroupsNorm <- normInts(fGroups, featNorm = "tic", groupNorm = TRUE)

Using normalized intensities

The normalized intensity (peak height/area) values can easily be obtained with as.data.table():

as.data.table(fGroupsNorm, normalized = TRUE)[1:5]
# can be combined with other parameters
as.data.table(fGroupsNorm, normalized = TRUE, average = TRUE, areas = TRUE)[1:5]
# feature values (no need to set normalized=TRUE)
as.data.table(fGroupsNorm, features = TRUE)[1:5, .(group, analysis, intensity_rel, area_rel)]

Several other patRoon functions also accept the normalized argument to use normalized data, such as plotInt() (discussed here), plotVolcano() (discussed here) and generateComponents() with intensity clustering (discussed here).
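
For instance, a quick sketch of plotting normalized intensity profiles (the average argument is assumed here to work as for as.data.table()):

# plot replicate-averaged, normalized intensity profiles
plotInt(fGroupsNorm, average = TRUE, normalized = TRUE)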

Default normalization

If normalized data is requested (normalized=TRUE, see previous section) and normInts() was not called on the feature group data, a default normalization will occur. This is nothing more than a group normalization (normInts(groupNorm=TRUE, ...)), and was mainly implemented to ensure backwards compatibility with previous patRoon versions.
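
For instance, requesting normalized data from a feature groups object that was not processed by normInts() triggers this default group normalization:

# fGroups was not processed by normInts(): a default group normalization is applied
as.data.table(fGroups, normalized = TRUE)[1:5]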

Feature parameter optimization {#fOpt}

Many different parameters exist that may affect the output quality of feature finding and grouping. To avoid time consuming manual experimentation, functionality is provided to largely automate the optimization process. The methodology, which uses design of experiments (DoE), is based on the excellent Isotopologue Parameter Optimization (IPO) R package. The functionality of this package is directly integrated in patRoon. Some functionality was added or changed, the most important being support for other feature finding and grouping algorithms besides [XCMS] and basic optimization support for qualitative parameters. Nevertheless, the core optimization algorithms are largely untouched.

This section will introduce the most important concepts and functionality. Please see the reference manual for more information (e.g. ?`feature-optimization`).

NOTE The [SIRIUS] and [SAFD] algorithms are not yet supported.

Parameter sets

Before starting an optimization experiment we have to define parameter sets. These sets contain the parameters and (initial) numeric ranges that should be tested. A parameter set is defined as a regular list, and can be easily constructed with the generateFeatureOptPSet() and generateFGroupsOptPSet() functions (for feature finding and feature grouping, respectively).

pSet <- generateFeatureOptPSet("openms") # default test set for OpenMS
pSet <- generateFeatureOptPSet("openms", chromSNR = c(5, 10)) # add parameter
# of course manually making a list is also possible (e.g. if you don't want to test the default parameters)
pSet <- list(noiseThrInt = c(1000, 5000))

When optimizing with [XCMS] or [KPIC2] a few things have to be considered. First of all, when using the XCMS3 interface (i.e. algorithm="xcms3") the underlying methods used for finding and grouping features and for retention alignment should be set. If these are not set, default methods will be used.

pSet <- list(method = "centWave", ppm = c(2, 8))
pSet <- list(ppm = c(2, 8)) # same: centWave is default

# get defaults, but for different grouping/alignment methods
pSetFG <- generateFGroupsOptPSet("xcms3", groupMethod = "nearest", retAlignMethod = "peakgroups")

In addition, when optimizing feature grouping (both XCMS interfaces and KPIC2) we need to set the grouping and retention alignment parameters in two different nested lists: these are groupArgs/retcorArgs (algorithm="xcms"), groupParams/retAlignParams (algorithm="xcms3") or groupArgs/alignArgs (algorithm="kpic2").

pSetFG <- list(groupParams = list(bw = c(20, 30))) # xcms3
pSetFG <- list(retcorArgs = list(gapInit = c(0, 7))) # xcms
pSetFG <- list(groupArgs = list(mz_weight = c(0.3, 0.9))) # kpic2

When a parameter set has been defined it should be used as input for the optimizeFeatureFinding() or optimizeFeatureGrouping() functions.

ftOpt <- optimizeFeatureFinding(anaInfo, "openms", pSet)
fgOpt <- optimizeFeatureGrouping(fList, "openms", pSetFG) # fList is an existing features object

Similar to findFeatures(), the first argument to optimizeFeatureFinding() should be the analysis information. Note that it is not uncommon to perform the optimization with only a subset of (representative) analyses (i.e. to reduce processing time).

ftOpt <- optimizeFeatureFinding(anaInfo[1:2, ], "openms", pSet) # only use first two analyses

From the parameter set a design of experiments will be automatically created. Obviously, the more parameters are specified, the longer such an experiment will take. After an experiment has finished, the optimization algorithm will start a new experiment where numeric ranges for each parameter are increased or decreased in order to more accurately find optimum values. Hence, the numeric ranges specified in the parameter set are only initial ranges, and will be changed in subsequent experiments. After each experiment iteration the results will be evaluated and a new experiment will be started as long as better results were obtained during the last experiment (although there is a hard limit defined by the maxIterations argument).
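
As a small illustration, the maxIterations argument mentioned above can be lowered to bound the total optimization time:

# allow at most five experiment iterations per parameter set
ftOpt <- optimizeFeatureFinding(anaInfo, "openms", pSet, maxIterations = 5)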

For some parameters it is recommended or even necessary to set hard limits on the numeric ranges that are allowed to be tested. For instance, setting a minimum feature intensity threshold is highly recommended to avoid excessive processing time and potentially suboptimal results due to excessive amounts of resulting features. Configuring absolute parameter ranges is done by setting the paramRanges argument.

# set minimum intensity threshold (but no max)
ftOpt <- optimizeFeatureFinding(anaInfo, "openms",
                                list(noiseThrInt = c(1000, 5000)), # initial testing range
                                paramRanges = list(noiseThrInt = c(500, Inf))) # never test below 500

Depending on the used algorithm, several default absolute limits are imposed. These may be obtained with the getDefFeaturesOptParamRanges() and getDefFGroupsOptParamRanges() functions.
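
For example (assuming these functions take the algorithm name, like the other optimization functions):

getDefFeaturesOptParamRanges("openms") # default limits for feature finding with OpenMS
getDefFGroupsOptParamRanges("openms") # default limits for feature grouping with OpenMS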

The common operation is to optimize numeric parameters. However, parameters that are not numeric (i.e. qualitative) need a different approach. In this case you will need to define multiple parameter sets, where each set defines a different qualitative value.

ftOpt <- optimizeFeatureFinding(anaInfo, "openms",
                                list(chromFWHM = c(4, 8), isotopeFilteringModel = "metabolites (5% RMS)"),
                                list(chromFWHM = c(4, 8), isotopeFilteringModel = "metabolites (2% RMS)"))

In the above example there are two parameter sets: both define the numeric chromFWHM parameter, whereas the qualitative isotopeFilteringModel parameter has a different value for each. Note that we had to specify chromFWHM twice; this can be remedied by using the templateParams argument:

ftOpt <- optimizeFeatureFinding(anaInfo, "openms",
                                list(isotopeFilteringModel = "metabolites (5% RMS)"),
                                list(isotopeFilteringModel = "metabolites (2% RMS)"),
                                templateParams = list(chromFWHM = c(4, 8)))

As its name suggests, the templateParams argument serves as a template parameter set, and its values are essentially combined with each given parameter set. Note that current support for optimizing qualitative parameters is relatively basic and time consuming. This is because tests are essentially repeated for each parameter set (e.g. in the above example the chromFWHM parameter is optimized twice, each time with a different value for isotopeFilteringModel).

Processing optimization results

The results of an optimization process are stored in objects from the S4 optimizationResult class. Several methods are defined to inspect and visualize these results.

The optimizedParameters() function is used to inspect the best parameter settings. Similarly, the optimizedObject() function can be used to obtain the object that was created with these settings (i.e. a features or featureGroups object).

optimizedParameters(ftOpt) # best settings for whole experiment
optimizedObject(ftOpt) # features object with best settings for whole experiment

Results can also be obtained for specific parameter sets/iterations.

optimizedParameters(ftOpt, 1) # best settings for first parameter set
optimizedParameters(ftOpt, 1, 1) # best settings for first parameter set and experiment iteration
optimizedObject(ftOpt, 1) # features object with best settings for first parameter set

The plot() function can be used to visualize optimization results. This function will plot graphs for results of all tested parameter pairs. The graphs can be contour, image or perspective plots (as specified by the type argument).

plot(ftOpt, paramSet = 1, DoEIteration = 1) # contour plots for first param set/experiment
plot(ftOpt, paramSet = 1, DoEIteration = 1, type = "persp") # pretty perspective plots

Please refer to the reference manual for more methods to inspect optimization results (e.g. ?optimizationResult).

Chromatographic peak qualities {#peakQual}

The algorithms used by findFeatures detect chromatographic peaks automatically to find the features. However, it is common that not all detected features have 'proper' chromatographic peaks, and some features could be just noise. The [MetaClean] R package supports various quality measures for chromatographic peaks. The quality measures include Gaussian fit, symmetry, sharpness and others. In addition, [MetaClean] averages all feature data for each feature group and adds a few additional group specific quality measures (e.g. retention time consistency). Please see @Chetnik2020 for more details. The calculations are integrated into patRoon, and are easily performed with the calculatePeakQualities() generic function.

fList <- calculatePeakQualities(fList) # calculate for all features
fGroups <- calculatePeakQualities(fGroups) # calculate for all features and groups

Most often only the featureGroups method is used, unless you want to filter features (discussed below) prior to grouping.

An extension in patRoon is that the qualities are used to calculate peak scores. The score for each quality measure is calculated by normalizing and scaling the values into a 0-1 range, where zero is the worst and one the best. Note that most scores are relative, hence, the values should only be used to compare features among each other. Finally, a totalScore is calculated which sums all individual scores and serves as a rough overall score indicator for a feature (group).

The qualities and scores are easily obtained with the as.data.table() function.

# (limit rows/columns for clarity)
as.data.table(fList)[1:5, 26:30]
# the qualities argument is necessary to include the scores.
# valid values are: "quality", "score" or "both"
as.data.table(fGroups, qualities = "both")[1:5, 25:29]

The feature quality values can also be reviewed interactively with reports generated with report (see [Reporting]) and with checkFeatures (see here). The filter function can be used to filter out low scoring features and feature groups:

# only keep features with at least 0.3 Modality score and 0.5 symmetry score
fList <- filter(fList, qualityRange = list(ModalityScore = c(0.3, Inf),
                                           SymmetryScore = c(0.5, Inf)))

# same as above
fGroups <- filter(fGroups, featQualityRange = list(ModalityScore = c(0.3, Inf),
                                                   SymmetryScore = c(0.5, Inf)))

# filter group averaged data
fGroups <- filter(fGroups, groupQualityRange = list(totalScore = c(0.5, Inf)))

Applying machine learning with MetaClean

An important feature of [MetaClean] is to use the quality measures to train a machine learning model to automatically recognize 'good' and 'bad' features. patRoon provides a few extensions to simplify training and using a model. Furthermore, while MetaClean was primarily designed to work with [XCMS], the extensions of patRoon allow the usage of data from all the algorithms supported by patRoon.

The getMCTrainData function can be used to convert data from a feature check session to training data that can be used by [MetaClean]. This allows you to interactively select good/bad peaks. The workflow looks like this:

# untick the 'keep' checkbox for all 'bad' feature groups
checkFeatures(fGroupsTrain, "train_session.yml")

# get train data. This gives data comparable to MetaClean::getPeakQualityMetrics()
trainData <- getMCTrainData(fGroupsTrain, "train_session.yml")

# use train data with MetaClean with MetaClean::runCrossValidation(),
# MetaClean::getEvaluationMeasures(), MetaClean::trainClassifier() etc
# --> see the MetaClean vignette for details

Once you have created a model with [MetaClean] it can be used with the predictCheckFeaturesSession() function:

predictCheckFeaturesSession(fGroups, "model_session.yml", model)

This will generate another check session file: all the feature groups that are considered good will be marked with a 'keep' state, the others without. As described elsewhere, the checkFeatures function is used to review the results from a session and the filter function can be used to remove unwanted feature groups. Note that calculatePeakQualities() must be called before getMCTrainData/predictCheckFeaturesSession can be used.

NOTE MetaClean only predicts at the feature group level. Thus, only the kept feature groups from a feature check session will be used for training, and any individual features that were marked as removed will be ignored.
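
A minimal sketch of applying a predicted session is shown below; the checkFeaturesSession argument of filter() is assumed here to accept the session file generated above.

# interactively review the predictions
checkFeatures(fGroups, "model_session.yml")
# remove the feature groups that were not marked as 'keep' in the session
fGroupsF <- filter(fGroups, checkFeaturesSession = "model_session.yml")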

Exporting and converting feature data

The feature group data obtained during the workflow can be exported to various formats with the export() generic function. There are currently three formats supported: "brukerpa" (Bruker ProfileAnalysis), "brukertasq" (Bruker TASQ) and "mzmine" (mzMine). The former exports a 'bucket table' which can be loaded in ProfileAnalysis, the second and third export a target list that can be processed with TASQ and mzMine, respectively.

The getXCMSSet() function converts a features or featureGroups object to an xcmsSet object which can be used for further processing with [xcms]. Similarly, the getXCMSnExp() function can be used for conversion to an XCMS3 style XCMSnExp object, and the getPICSet() function can be used to convert features to [KPIC2] data.

Some examples for these functions are shown below.

export(fGroups, "brukertasq", out = "my_targets.csv")

# convert features to xcmsSet.
# NOTE: loadRawData should only be FALSE when the analysis data files cannot be
# loaded by the algorithm (e.g. when features were obtained with DataAnalysis and data was not exported to mz(X)ML)
xset <- getXCMSSet(fList, loadRawData = TRUE)
xsetg <- getXCMSSet(fGroups, loadRawData = TRUE) # get grouped xcmsSet

# using the new XCMS3 interface
xdata <- getXCMSnExp(fList)
xdata <- getXCMSnExp(fGroups)

# KPIC2 conversion. Like XCMS it optionally loads the raw data.
picSet <- getPICSet(fList, loadRawData = TRUE)

Algorithm consensus {#consensus}

With patRoon you have the option to choose between several algorithms for most workflow steps. Each algorithm differs in efficiency and robustness, and may be optimized towards certain data properties. Comparing their output is therefore advantageous in order to design an optimum workflow. The consensus() generic function compares results from different algorithms and returns a consensus, which may be based on minimal overlap, uniqueness or simply a combination of all results from the involved objects. The output from the consensus() function is of a similar type as the input types and is therefore compatible with any 'regular' further data processing operations (e.g. input for other workflow steps or plotting). Note that a consensus can also be made from objects generated by the same algorithm, for instance, to compare or combine results obtained with different parameters (e.g. different databases used for compound annotation).

The consensus() generic is defined for most workflow objects. Some of its common function arguments are listed below.

Argument | Classes | Remarks
------------- | -------------------------------------------------- | -------------------------------------------------------------
obj, ... | All | Two or more objects (of the same type) that should be compared to generate the consensus.
compThreshold, relAbundance, absAbundance, formThreshold | compounds, formulas, featureGroupsComparison | The minimum overlap (relative/absolute) for a result (feature, candidate) to be kept.
uniqueFrom | compounds, formulas, transformationProductsStructure, featureGroupsComparison | Only keep unique results from specified objects.
uniqueOuter | compounds, formulas, transformationProductsStructure, featureGroupsComparison | Should be combined with uniqueFrom. If TRUE then only results are kept which are also unique between the objects specified with uniqueFrom.

Note that current support for generating a consensus between components objects is very simplistic; here results are not compared, but the consensus simply consists of a combination of all the components from each object.

Generating a consensus for feature groups involves first generating a featureGroupsComparison object. This step is necessary since (small) deviations between retention times and/or mass values reported by different feature finding/grouping algorithms complicate a direct comparison. The comparison objects are made by the comparison() function, and its results can be visualized by the plotting functions discussed in the previous chapter.

Some examples are shown below.

compoundsCons <- consensus(compoundsMF, compoundsSIR) # combine MetFrag/SIRIUS results
compoundsCons <- consensus(compoundsMF, compoundsSIR,
                           compThreshold = 1) # only keep results that overlap

TPsCons <- consensus(TPsLib, TPsBT) # combine library and BioTransformer TPs

fGroupComp <- comparison(fGroupsXCMS, fGroupsOpenMS, fGroupsEnviPick,
                         groupAlgo = "openms")
plotVenn(fGroupComp) # visualize overlap/uniqueness
fGroupsCons <- consensus(fGroupComp,
                         uniqueFrom = 1:2) # only keep results unique in OpenMS+XCMS
fGroupsCons <- consensus(fGroupComp,
                         uniqueFrom = 1:2,
                         uniqueOuter = TRUE) # as above, but also exclude any overlap between OpenMS/XCMS

MS libraries {#MSLibraries}

The loadMSLibrary() function is used to load MS spectral libraries, and was already briefly introduced for compound annotation. Currently, loading of MSP files and [MoNA] JSON files is supported, while loading of formula annotations for MS peaks is currently only supported for the latter. The underlying algorithms implement several optimizations to efficiently load large numbers of records. Furthermore, loadMSLibrary() automatically verifies record data such as formulas, adducts and masses, and automatically calculates missing or invalid data where possible.

mslibraryMSP <- loadMSLibrary("MoNA-export-CASMI_2016.msp", "msp")
mslibraryJSON <- loadMSLibrary("MoNA-export-CASMI_2016.json", "json")

Several advanced parameters are available that influence the loading of MS library data, see the reference manual (?loadMSLibrary) for details.

Once loaded, the usual methods are available to inspect its data:

show(mslibraryMSP)
mslibraryMSP[[1]] # MS/MS spectrum for the first record
mslibraryJSON[["SM801601"]] # a record with annotations
 # overview of all metadata (select few columns for readability)
records(mslibraryJSON)[, .(DB_ID, Name, InChIKey, formula)]
# convert all data to a data.table (may be huge!)
as.data.table(mslibraryMSP)[, .(DB_ID, SMILES, formula, mz, intensity)]

Furthermore, like many other objects in patRoon, the MS library objects can be subset and filtered:

mslibrarySub <- mslibrary[1:100] # only keep first 100 records

# only keep records with a neutral mass of 100-200
mslibraryF <- filter(mslibrary, massRange = c(100, 200))
# remove records with neutral mass below 100
mslibraryF <- filter(mslibrary, massRange = c(0, 100), negate = TRUE)
# only keep mass peaks with m/z 100-500
mslibraryF <- filter(mslibrary, mzRangeSpec = c(100, 500))
# remove low intensity peaks (<1%) and only keep top 10
mslibraryF <- filter(mslibrary, relMinIntensity = 0.01, topMost = 10)
# only keep records with annotated mass peaks
mslibraryF <- filter(mslibraryJSON, onlyAnnotated = TRUE)

In addition, the properties filter may be useful to tailor the library data. The library properties can be obtained as follows:

oldW <- options(width = 80) # HACK: make sure vectors below are shown nicely
names(records(mslibrary)) # get all property names
unique(records(mslibrary)[["Instrument_type"]]) # Get the available instrument types
options(width = oldW[[1]])

Then to filter the MS library:

# only keep APCI instrument types
mslibraryF <- filter(mslibrary, properties = list(Instrument_type = c("LC-APCI-ITFT", "APCI-ITFT")))
# remove Q-TOF by negation
mslibraryF <- filter(mslibrary, properties = list(Instrument_type = "LC-ESI-QTOF"), negate = TRUE)

More advanced filtering can be performed with the delete() generic function, see the reference manual for details (?MSLibrary).

Finally, functionality exists to convert, export and merge MS libraries:

# Convert the MS library to a suspect list.
# By setting collapse to TRUE, all records with the same first block InChIKey
# are collapsed and mass peaks are averaged.
suspL <- convertToSuspects(mslibrary, adduct = "[M+H]+", collapse = TRUE)
# Amend custom suspect list with library data (fragments_mz column)
suspL <- convertToSuspects(mslibrary, adduct = "[M+H]+", suspects = patRoonData::suspectsPos)

export(mslibrary, out = "myMSLib.msp") # export to a new MSP library

mslibraryM <- merge(mslibraryMSP, mslibraryJSON) # merge two libraries

Compound clustering {#compclust}

When large databases such as [PubChem] or [ChemSpider] are used for compound annotation, it is common to find many candidate structures for even a single feature. While choosing the right scoring settings can significantly improve their ranking, it is still very much possible that many candidates of potential interest remain. In this situation it might help to perform compound clustering. During this process, all candidates for a feature are clustered hierarchically on the basis of similar chemical structure. From the resulting cluster the maximum common substructure (MCS) can be derived, which represents the largest possible substructure that 'fits' in all candidates. By visual inspection of the MCS it may be possible to identify likely common structural properties of a feature.

In order to perform compound clustering the makeHCluster() generic function should be used. This function heavily relies on chemical fingerprinting functionality provided by [rcdk].

compounds <- generateCompounds(...) # get our compounds
compsClust <- makeHCluster(compounds)

This function accepts several arguments to fine-tune the clustering process; for an overview of all arguments see the reference manual (?makeHCluster).

The resulting object is of type compoundsCluster. Several methods are defined to modify and inspect these results:

# plot MCS of first cluster from candidates of M192_R355_191
plotStructure(compsClust, groupName = "M192_R355_191", 1)

# plot dendrogram
plot(compsClust, groupName = "M192_R355_191")

# re-assign clusters for a feature group
compsClust <- treeCut(compsClust, k = 5, groupName = "M192_R355_191")
# ditto, but automatic cluster determination
compsClust <- treeCutDynamic(compsClust, minModuleSize = 3, groupName = "M192_R355_191")

For a complete overview see the reference manual (?compoundsCluster).

Feature regression analysis

Some basic support in patRoon is available to perform simple linear regression on feature intensities versus given experimental conditions. Examples of such conditions are dilution factor, sampling time or the initial concentration of a parent in a degradation experiment. By testing if there is a significant linearity, features of interest can be isolated in a relatively easy way. Originally, this functionality was implemented as a very basic method to perform rough calculations of concentrations. However, the next section describes a much better way by using the MS2Quant package. Regardless, this functionality still uses 'concentrations' as terminology for the experimental conditions of interest. The conditions are specified in the conc column of the analysis information, for instance:

# obtain analysis information as usual, but add some experimental parameters of interest (dilution, time etc).
# The blanks are set to NA, whereas the standards are set to increasing levels.
anaInfo <- generateAnalysisInfo(paths = patRoonData::exampleDataPath(),
                                groups = c(rep("solvent", 3), rep("standard", 3)),
                                blanks = "solvent",
                                concs = c(NA, NA, NA, 1, 2, 3))

If no experimental conditions are available for a particular analysis then the conc value should be NA. For these analyses the experimental condition will be calculated using the regression model obtained from the other analyses.

The as.data.table() function (or as.data.frame()) can then be used to calculate regression data:

# use areas for quantitation and make sure that feature data is reported
# (only relevant columns are shown for clarity)
as.data.table(fGroups, areas = TRUE, features = TRUE, regression = TRUE)
as.data.table(fGroupsConc, areas = TRUE, features = TRUE, regression = TRUE)[, c("group", "conc", "RSQ", "intercept", "slope", "conc_reg")]

The calculated experimental conditions are stored in the conc_reg column (this column is only present if features=TRUE). In addition, the table also contains other regression data such as RSQ, intercept and slope. To perform basic trend analysis the RSQ (i.e. R squared) can be used:

fGroupsTab <- as.data.table(fGroups, areas = TRUE, features = FALSE, regression = TRUE)
# subset fGroups with reasonable correlation
increasingFGroups <- fGroups[, fGroupsTab[RSQ >= 0.8, group]]

Predicting toxicities and concentrations (MS2Tox and MS2Quant integration) {#pred}

The [MS2Tox] and [MS2Quant] R packages predict toxicities and feature concentrations using a machine learning approach. The predictions are performed with either SMILES data or fingerprints calculated from MS/MS data with SIRIUS+CSI:FingerID. While using SMILES data is generally more accurate, using MS/MS fingerprints is generally faster and may be more suitable for features without a known or suspected structure.

In patRoon the predictions are done in two steps:

  1. The LC50 values (toxicity prediction) or response factors (concentration prediction) are calculated for given SMILES or MS/MS fingerprint data using MS2Tox/MS2Quant. This step is performed by the predictTox()/predictRespFactors() method functions.
  2. The predicted values are used to assign toxicities/concentrations to the feature data. This is performed by the calculateTox()/calculateConcs() method functions.

Various workflow data can be used to perform the predictions for step 1:

  a. Suspect hits that were obtained with screenSuspects (see suspect screening).
  b. Formula annotations obtained with SIRIUS+CSI:FingerID.
  c. Compound annotations obtained with SIRIUS+CSI:FingerID.
  d. Compound annotations obtained with other algorithms, e.g. MetFrag.

For (a) and (d) SMILES is used to perform the calculations, for (b) MS/MS fingerprints are used, and for (c) either can be used.

NOTE For option b, make sure that getFingerprints=TRUE when running generateFormulas() in order to obtain fingerprints.
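
A small sketch of obtaining such formula annotations is shown below (mslists is assumed to be an MS peak lists object generated earlier in the workflow):

# generate SIRIUS formula annotations and keep the CSI:FingerID fingerprints
formulasSIR <- generateFormulas(fGroups, mslists, "sirius", getFingerprints = TRUE)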

Predicting toxicities

Some example workflows are shown below:

# Calculate toxicity for suspect hits.
fGroupsSuspTox <- predictTox(fGroupsSusp)
fGroupsSuspTox <- calculateTox(fGroupsSuspTox)

# Calculate toxicity for compound hits. Limit to the top 5 to reduce calculation time.
compoundsTop5 <- filter(compounds, topMost = 5)
compoundsTox <- predictTox(compoundsTop5)
fGroupsTox <- calculateTox(fGroups, compoundsTox)

It is also possible to calculate toxicities from multiple workflow objects:

fGroupsSuspTox <- predictTox(fGroupsSusp) # as above

# Predict toxicities from compound candidates, using both SMILES and MS/MS fingerprints
# compoundsSuspSIR is an object produced with generateCompounds() with algorithm="sirius"
compoundsSuspSIRTox <- predictTox(compoundsSuspSIR, type = "both")

# Assign toxicities to feature groups from both suspect hits and SIRIUS annotations
fGroupsSuspTox <- calculateTox(fGroupsSuspTox, compoundsSuspSIRTox)

More details are in the reference manual: ?`pred-tox`.

Predicting concentrations

The workflow to predict concentrations is quite similar to predicting toxicities. However, before we can start we first have to specify the calibrants and the LC gradient elution program.

The calibrant data is used by MS2Quant to convert predicted ionization efficiencies to actual response factors, which are specific to the used LC instrument and methodology. For this purpose, several mixtures with known concentrations (i.e. standards) should be measured alongside your samples. The calibrants should be specified as a data.frame, for instance:

# generate calibrants with some imaginary intensities/concentrations
calibrants <- data.table(name = c("DEET", "1h-benzotriazole", "Caffeine", "Atrazine", "Carbamazepine", "Venlafaxine"),
                         conc = rep(c(1, 2, 5, 10, 25, 50), each = 6))
setorderv(calibrants, "name")
calibrants[, SMILES := patRoonData::suspectsPos[match(name, patRoonData::suspectsPos$name), "SMILES"]]
calibrants[, rt := patRoonData::suspectsPos[match(name, patRoonData::suspectsPos$name), "rt"]]
fgScr <- screenSuspects(fGroups, calibrants, onlyHits = TRUE, adduct = "[M+H]+")
fgTab <- as.data.table(fgScr, average = TRUE, collapseSuspects = NULL)
calibrants <- merge(calibrants, fgTab[, c("susp_name", "standard-pos"), with = FALSE], by.x = "name", by.y = "susp_name")
setnames(calibrants, "standard-pos", "intensity")
withr::with_seed(1, calibrants[, intensity := round(runif(.N, 0.9, 1.1) * (intensity / 20 * conc))])
setcolorder(calibrants, c("name", "SMILES", "intensity", "conc"))
makeTable(calibrants)

The intensity column should contain either the peak intensity (height) or area. Note that some feature detection algorithms can sometimes produce inaccurate peak areas, and the area determination methodology is often different among algorithms. For this reason, using peak intensities may be more reliable, however, it is worth testing this with your data.

It is also possible to use the getQuantCalibFromScreening() function to automatically create the calibrant table from feature group data:

calibList <- data.frame(...) # this should be a suspect list with your calibrants
fGroups <- screenSuspects(fGroups, calibList) # screen for the calibrants
concs <- data.frame(...) # concentration data for each calibrant compound, see below
calibrants <- getQuantCalibFromScreening(fGroups, concs)
calibrants <- getQuantCalibFromScreening(fGroups, concs, areas = TRUE) # obtain feature areas instead of intensities

The first step is to perform a screening for the calibrant compounds. Please ensure that this list contains SMILES data, and to ensure correct feature assignment it is highly recommended to include retention times. The second requirement for getQuantCalibFromScreening() is a table with concentrations for each calibrant compound, e.g.:

concs <- data.frame(
    name = c("DEET", "1h-benzotriazole", "Caffeine", "Atrazine", "Carbamazepine", "Venlafaxine"),
    standard_1 = c(1.00, 1.05, 1.10, 0.99, 1.01, 1.12),
    standard_2 = c(2.00, 2.15, 2.20, 1.98, 2.02, 1.82),
    standard_5 = c(5.01, 5.05, 5.22, 5.00, 4.88, 4.65),
    standard_10 = c(10.2, 10.11, 10.23, 11.77, 11.75, 12.13),
    standard_25 = c(25.3, 25.12, 25.34, 24.89, 24.78, 24.68),
    standard_50 = c(50.34, 50.05, 50.10, 49.97, 49.71, 50.52)
)
concs

The concentrations are specified separately for each calibrant compound. The column names should follow the names of the replicate groups assigned to the standards. The concentration unit is µg/l by default. The next section describes how this can be changed.

The gradient elution program is also specified by a data.frame, which for every time point (in seconds!) describes the percentage of 'B'. In this case, 'B' represents the total amount of organic modifier.

eluent <- data.frame(
    time = c(0, 180, 600, 900, 960),
    B = c(10, 30, 100, 100, 10)
)
plot(eluent, type = "l")
eluent

Then, the workflow to predict concentrations is very similar to that for predicting toxicities (previous section):

# Calculate concentrations for suspect hits.
fGroupsSuspConc <- predictRespFactors(
    fGroupsSusp,
    calibrants = calibrants, eluent = eluent,
    organicModifier = "MeCN", # organic modifier: MeOH or MeCN
    pHAq = 4 # pH of the aqueous part of the mobile phase
)
# set areas to TRUE if the calibrant table contains areas
fGroupsSuspConc <- calculateConcs(fGroupsSuspConc, areas = FALSE)

As was shown for toxicities it is possible to use different data sources (e.g. compound annotations, suspects) for predictions.
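
For instance, a sketch for compound annotations could look as follows (it is assumed here that the compounds method of predictRespFactors() accepts the feature groups and the same calibrant/eluent arguments, and that calculateConcs() combines the results analogously to calculateTox()):

# predict response factors for the top 5 SIRIUS candidates and assign concentrations
compoundsConc <- predictRespFactors(compoundsTop5, fGroups, calibrants = calibrants, eluent = eluent,
                                    organicModifier = "MeCN", pHAq = 4)
fGroupsConc <- calculateConcs(fGroups, compoundsConc)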

More details are in the reference manual: ?`pred-quant`.

Toxicity and concentration units

The default unit for toxicity and concentration data is µg/l. However, this can be configured when calling the predictTox()/predictRespFactors() functions:

fGroupsSuspTox <- predictTox(fGroupsSusp) # default unit: ug/l
fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "ug/l") # same as above
fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "mM") # millimolar
fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "log mM") # unit used by MS2Tox

# calculated concentrations are ng/l, calibrants are specified in ug/l
# (by default calibConcUnit=concUnit)
fGroupsSuspConc <- predictRespFactors(
    fGroupsSusp, calibrants = calibrants, eluent = eluent,
    organicModifier = "MeCN", pHAq = 4,
    concUnit = "ng/l", calibConcUnit = "ug/l"
)

See the reference manuals (?`pred-tox`/?`pred-quant`) for more details on which units can be specified.

Inspecting predicted values

The raw toxicity and concentration data assigned to feature groups can be retrieved with the toxicities() and concentrations() method functions, respectively.

toxicities(fGroupsSuspTox)
concentrations(fGroupsSuspConc)

If there were multiple candidates for a single feature group then these are split over the table rows:

toxicities(fGroupsTox)

The as.data.table() method function, which was discussed previously, can be used to summarize toxicity and concentration values.

# NOTE: NA values are filtered and columns are subset for readability
as.data.table(fGroupsTox)[!is.na(LC50), c("group", "LC50", "LC50_types")]
concCols <- c("group", paste0(analyses(fGroupsSuspConc), "_conc"), "conc_types")
as.data.table(fGroupsSuspConc)[!is.na(conc_types), concCols, with = FALSE]

The as.data.table() method function aggregates the data for a feature group in case multiple candidates were assigned to it. By default the values are mean averaged, but this can be changed with the toxAggrParams/concAggrParams arguments, for instance:

# as above, but aggregate by taking maximum values
as.data.table(fGroupsTox, toxAggrParams = getDefPredAggrParams(max))[!is.na(LC50), c("group", "LC50", "LC50_types")]

If the as.data.table() method is used on suspect screening results, and predictions were performed directly for suspect hits, then predicted values can be reported for each individual suspect match instead of being aggregated per feature group:

# Reports predicted values for each suspect separately. If multiple suspects are assigned to a feature group,
# then each suspect match is split into a different row. 
as.data.table(fGroupsSuspTox, collapseSuspects = NULL)

Finally, the reporting functionality can be used to get an overview of all predicted values, both aggregated and raw.
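
For instance (a simple sketch; report() accepts many more arguments that control what is included):

# create a report that includes the assigned concentrations and toxicities
report(fGroupsSuspConc)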

Using predicted values to prioritize data

The filter() method function that was introduced before can also be used to filter data based on predicted toxicities, response factors and concentrations. For instance, this allows you to remove annotation candidates that are unlikely to be toxic, suspect hits whose response factors are too low for sensitive detection, or features with very low concentrations. Some examples are shown below.

# compoundsSuspSIRTox is an object with predicted toxicities (LC50 values) for each candidate
# we can use the common scoreLimits filter to select a range of allowed values (min/max)
compoundsSuspSIRToxF <- filter(compoundsSuspSIRTox, scoreLimits = list(LC50_SMILES = c(0, 1E4)))

# for suspects with predicted toxicities/response factors there are dedicated filters
fGroupsSuspConcF <- filter(fGroupsSuspConc, minRF = 5E4) # remove suspect hits with response factor <5E4
fGroupsSuspToxF <- filter(fGroupsSuspTox, maxLC50 = 100) # remove suspect hits with LC50 values > 100

# similarly, for feature data there are dedicated filters.
# note that these aggregate data prior to filtering (see previous section)
fGroupsConcF <- filter(fGroupsConc, absMinConc = 0.02)
# only keep features with concentrations that are at least 1% of their toxicity
# note that both concentrations/toxicity values should have been calculated with calculateConcs()/calculateTox()
fGroupsConcToxF <- filter(fGroupsConcTox, absMinConcTox = 0.01)

# also get rid of features without concentrations (these are ignored by default)
fGroupsConcF <- filter(fGroupsConc, absMinConc = 0.02, removeNA = TRUE)
# like as.data.table we can configure how values are aggregated
# here the minimum is used instead of the default mean
fGroupsToxF <- filter(fGroupsTox, absMaxTox = 5E3, predAggrParams = getDefPredAggrParams(min))

More details are found in the reference manual (?`feature-filtering`).

Fold changes {#FCCalc}

A specific statistical way to prioritize feature data is by fold changes (FC). This is a relatively simple method to quickly identify (significant) changes between two sample groups. A typical use case is to compare the feature intensities before and after an experiment.

To perform FC calculations we first need to specify its parameters. This is best achieved with the getFCParams() function:

getFCParams(c("before", "after"))

In this example we generate a list with parameters in order to make a comparison between two replicate groups: before and after. Several advanced parameters are available to tweak the calculation process. These are explained in the reference manual (?featureGroups).
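
For instance, the significance thresholds could be adjusted as follows (the thresholdFC/thresholdPV parameter names are an assumption; see the reference manual for the actual names):

# require a log2 fold change of at least 1 and a p-value below 0.01 (assumed parameter names)
myFCParams <- getFCParams(c("before", "after"), thresholdFC = 1, thresholdPV = 0.01)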

The as.data.table function for feature groups is used to perform the FC calculations.

fGroupsO <- fGroups
fGroups <- filter(fGroupsUF, absMinIntensity = 1E4)
myFCParams <- getFCParams(c("solvent-pos", "standard-pos")) # compare solvent/standard
as.data.table(fGroups, FCParams = myFCParams)[, c("group", "FC", "FC_log", "PV", "PV_log", "classification")]

The classification column allows you to easily identify if and how a feature changes between the two sample groups. This can also be used to prioritize feature groups:

tab <- as.data.table(fGroups, FCParams = myFCParams)
# only keep feature groups that significantly increase or decrease
fGroupsChanged <- fGroups[, tab[classification %in% c("increase", "decrease")]$group]

The plotVolcano function can be used to visualize the FC data:

plotVolcano(fGroups, myFCParams)
fGroups <- fGroupsO

Caching

In patRoon the results of lengthy processing operations, such as finding features and generating annotation data, are cached. This means that when you run such a calculation again (without changing any parameters), the data is simply loaded from the cache instead of being re-generated. This is very useful, for instance, if you have closed your R session and want to continue with data processing at a later stage.

The cache data is stored in a [sqlite] database file. This file is stored by default under the name cache.sqlite in the current working directory (for this reason it is very important to always restore your working directory!). However, the name and location can be changed by setting a global package option:

options(patRoon.cache.fileName = "~/myCacheFile.sqlite")

For instance, this might be useful if you want to use a shared cache file between projects.

After a while you may see that your cache file can get quite large. This is especially true when testing different parameters to optimize your workflow. Furthermore, you may want to clear the cache after you have updated patRoon and want to make sure that the latest code is used to generate the data. At any point you can simply remove the cache file. A more fine-tuned approach that doesn't wipe all your cached data is to use the clearCache() function. With this function you can selectively remove parts of the cache file. The function has two arguments: what, which specifies what should be removed, and file, which specifies the path to the cache file. The latter only needs to be specified if you want to manage a different cache file.

In order to figure out what is in the cache you can run clearCache() without any arguments:

clearCache()

Using this output you can re-run the function again, for instance:

clearCache("featuresOpenMS")
clearCache(c("featureGroupsOpenMS", "formulasGenForm")) # clear multiple
clearCache("OpenMS") # clear all with OpenMS in name (ie partial matched)
clearCache("all") # same as simply removing the file

Parallelization

Some steps in the non-target screening workflow are inherently computationally intensive. To reduce computational times patRoon is able to perform parallelization for most of the important functionality. This is especially useful if you have a modern system with multiple CPU cores and sufficient RAM.

For various technical reasons several parallelization techniques are used; these can be categorized as parallelization of R functions and multiprocessing. The next sections describe both parallelization approaches in order to let you optimize the workflow.

Parallelization of R functions

Several functions of patRoon support parallelization.

Function | Purpose | Remarks
------------------------ | ----------------------------------- | ------------------------
findFeatures | Obtain feature data | Only envipick and kpic2 algorithms.
generateComponents | Generate components | Only cliquems algorithm.
report | Reporting data |
generateTPs | Obtain transformation products | Only cts algorithm.
optimizeFeatureFinding, optimizeFeatureGrouping | Optimize feature finding/grouping parameters | Discussed here.
calculatePeakQualities | Calculate feature (group) qualities | Discussed here.
predictTox / predictRespFactors | Prediction of toxicities/concentrations | Only compounds methods. Discussed here.

The parallelization is achieved with the [future] and [future.apply] R packages. To enable parallelization of these functions the parallel argument must be set to TRUE and the future framework must be properly configured in advance. For example:

# setup three workers to run in parallel
future::plan("multisession", workers = 3)

# find features with enviPick in parallel
fList <- findFeatures(anaInfo, "envipick", parallel = TRUE)

It is important to properly configure the right future plan. Please see the documentation of the [future] package for more details.

Multiprocessing

patRoon relies on several external (command-line) tools to generate workflow data. These commands may be executed in parallel to reduce computational times ('multiprocessing'). The table below outlines the tools that are executed in parallel.

Tool | Used by | Notes
--------------------- | -------------------------------------------- | ---------------------------------
msConvert | convertMSFiles(algorithm="pwiz", ...) |
FileConverter | convertMSFiles(algorithm="openms", ...) |
FeatureFinderMetabo | findFeatures(algorithm="openms", ...) |
julia | findFeatures(algorithm="safd", ...) |
SIRIUS | findFeatures(algorithm="sirius", ...) |
MetaboliteAdductDecharger | generateComponents(algorithm="openms", ...) |
GenForm | generateFormulas(algorithm="genform", ...) |
SIRIUS | generateFormulas(algorithm="sirius", ...), generateCompounds(algorithm="sirius", ...) | Only if splitBatches=TRUE
MetFrag | generateCompounds(algorithm="metfrag", ...) |
pngquant | reportHTML(...) | Only if optimizePng=TRUE
BioTransformer | generateTPs(algorithm = "biotransformer") | Disabled by default (see ?generateTPs for details).

Multiprocessing is either performed by executing processes in the background with the [processx] R package (classic interface) or by futures, which were introduced in the previous section. An overview of the characteristics of both parallelization techniques is shown below.

classic | future
--------------------------------------------------- | -------------------------------------------------------
requires little or no configuration | configuration needed to setup
works with all tools | doesn't work with pngquant and slower with GenForm
only supports parallelization on the local computer | allows both local and cluster computing

Which method is used is controlled by the patRoon.MP.method package option. Note that reportHTML() will always use the classic method for pngquant.

Classic multiprocessing interface

The classic interface is the 'original' method implemented in patRoon, and is therefore well tested and optimized. It is easier to setup, works well with all tools, and is therefore the default method. It is enabled as follows:

options(patRoon.MP.method = "classic")

The number of parallel processes is configured through the patRoon.MP.maxProcs option. By default it is set to the number of available CPU cores, which usually results in the best performance. However, you may want to lower this, for instance, to keep your computer more responsive while processing or to limit the RAM used by the data processing workflow.

options(patRoon.MP.maxProcs = 2) # do not execute more than two tools in parallel. 

This will change the parallelization for the complete workflow. However, it may be desirable to change this for only a part of the workflow. This is easily achieved with the withOpt() function.

# do not execute more than two tools in parallel.
options(patRoon.MP.maxProcs = 2)

# ... but execute up to four GenForm processes
withOpt(MP.maxProcs = 4, {
    formulas <- generateFormulas(fGroups, "genform", ...)
})

The withOpt function will temporarily change the given option(s) while executing a given code block and restore it afterwards (it is very similar to the with_options() function from the [withr] R package). Furthermore, notice how withOpt() does not require you to prefix the option names with patRoon..

Multiprocessing with futures

The primary goal of the "future" method is to allow parallel processing on one or more external computers. Since it uses the [future] R package, many approaches are supported, such as local parallelization (similar to the classic method), cluster computing via multiple networked computers and more advanced HPC approaches such as slurm via the [future.batchtools] R package. This parallelization method can be activated as follows:

options(patRoon.MP.method = "future")

# set a future plan

# example 1: start a local cluster with four nodes
future::plan("cluster", workers = 4)

# example 2: start a networked cluster with four nodes on PC with hostname "otherpc"
future::plan("cluster", workers = rep("otherpc", 4)) 

Please see the documentation of the respective packages (e.g. [future] and [future.batchtools]) for more details on how to configure the workers.

The withOpt() function introduced in the previous subsection can also be used to temporarily switch between parallelization approaches, for instance:

# default to future parallelization
options(patRoon.MP.method = "future")
future::plan("cluster", workers = 4)

# ... do workflow

# do classic parallelization for GenForm
withOpt(MP.method = "classic", {
    formulas <- generateFormulas(fGroups, "genform", ...)
})

# .. do more workflow

Logging

Most tools that are executed in parallel will log their output to text files. These files may contain valuable information, for instance, when an error occurred. By default, the logfiles are stored in the log directory placed in the current working directory. However, you can change this location by setting the patRoon.MP.logPath option. If you set this option to FALSE then no logging occurs.
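
For example:

options(patRoon.MP.logPath = "~/myLogs") # store log files in a custom location
options(patRoon.MP.logPath = FALSE) # disable logging completely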

Notes when using parallelization with futures

Some important notes when using the future parallelization method:



