When generating formulae and compound annotations and some other functionalities it is required to specify the adduct species. Behind the scenes, different algorithms typically use different formats. For instance, in order to specify a protonated species...
GenForm
either accepts "M+H"
and "+H"
MetFrag
either accepts the numeric code 1
or "[M+H]+"
SIRIUS
accepts "[M+H]+"
In addition, most algorithms only accept a limited set of possible adducts, which do not necessarily all overlap with each other. The GenFormAdducts()
and MetFragAdducts()
functions list the possible adducts for GenForm
and MetFrag
, respectively.
In order to simplify the situation patRoon
internally uses its own format and converts it automatically to the algorithm specific format when necessary. Furthermore, during conversion it checks if the specified adduct format is actually allowed by the algorithm. Adducts in patRoon
are stored in the adduct
S4 class. Objects from this class specify which elements are added and/or subtracted, the final charge and the number of molecules present in the adduct (e.g. 2 for a dimer).
adduct(add = "H") # [M+H]+ adduct(sub = "H", charge = -1) # [M-H]- adduct(add = "K", sub = "H2", charge = -1) # [M+K-H2]- adduct(add = "H3", charge = 3) # [M+H3]3+ adduct(add = "H", molMult = 2) # [2M+H]+
A more easy way to generate adduct objects is by using the as.adduct()
function:
as.adduct("[M+H]+") as.adduct("[M+H2]2+") as.adduct("[2M+H]+") as.adduct("[M-H]-") as.adduct("+H", format = "genform") as.adduct(1, isPositive = TRUE, format = "metfrag")
In fact, the adduct
argument to workflow functions such as generateFormulas()
and generateCompounds()
is automatically converted to an adduct
class with the as.adduct()
function if necessary:
formulas <- generateFormulas(..., adduct = adduct(sub = "H", charge = -1)) formulas <- generateFormulas(..., adduct = "[M-H]-") # same as above
More details can be found in the reference manual (?adduct
and ?`adduct-utils`
).
Feature intensities are often compared between sample analyses, for instance, to evaluate trends between sample points. However, matrix effects, varying detector sensitivity and differences in analysed sample amount may complicate such comparison. For this reason, it may be desired to normalize the feature intensities.
The normInts()
function is used to normalize feature intensities (peak heights and areas). Two different types are supported:
Both normalization types can be combined.
Feature normalization itself supports the following normalization methods:
Method | Usage | Description
------------------ | ---------------------------------- | --------------------------------------------------
TIC | normInts(featNorm = "tic", ...)
| Normalizes by the combined intensity of all features, also known as the Total Ion Current (TIC).
Internal Standard | normInts(featNorm = "istd", ...)
| Uses internal standards (IS) to normalize feature intensities.
Concentration | normInts(featNorm = "conc", ...)
| Normalizes feature intensities of a sample analysis by its normalization concentration (explained below).
None | normInts(featNorm = "none", ...)
| Performs no feature normalization. Set this if you only want to perform group normalization (discussed in the next section).
All methods (except "none"
) are influenced by the normalization concentration, which is a property set for each sample analysis. For IS normalization, this should equal the concentration of the IS present in the sample. Otherwise the normalization concentration resembles the injected sample amount. The normalization concentration is defined in the norm_conc
column of the analysis information. For example:
# obtain analysis information as usual, but add normalization concentrations. # The blanks are set to NA, and will therefore not be normalized. generateAnalysisInfo(paths = patRoonData::exampleDataPath(), groups = c(rep("solvent", 3), rep("standard", 3)), blanks = "solvent", norm_concs = c(NA, NA, NA, 2, 2, 1))
The normalization concentration does not need to be an absolute value. In the end, what matters are the relative numbers between the sample analyses. For example, if the concentrations for two analyses are c(1, 2)
or c(1.5, 3.0)
the normalization occurs the same. Setting the concentration to NA
(or 0
) will skip normalization for an analysis. If the normalization concentration is absent from the analysis information it will be defaulted to 1
.
For IS normalization an internal standard list should be specified with the properties of the internal standards. Essentially, the format of this list is exactly the same as a suspect list. Example lists can be found in the patRoonData
package:
patRoonData::ISTDListPos[1:5, ]
As can be seen from above, labelled isotopes can be specified with square brackets, e.g. [2]H for deuterium.
The next step is to perform the normalization with normInts()
:
fGroupsNorm <- normInts(fGroups, featNorm = "istd", standards = patRoonData::ISTDListPos, adduct = "[M+H]+", ISTDRTWindow = 20, ISTDMZWindow = 200, minISTDs = 2)
This will do the following:
standards
argument).ISTDRTWindow
argument), m/z (ISTDMZWindow
argument) and a minimum number (minISTDs
). If the number of IS candidates within specified retention time and m/z windows is below minISTDs
, the close(st) candidate(s) outside these windows are additionally chosen.To evaluate the assignments for a particular feature group, the internalStandardAssignments()
function and plotGraph()
functions can be used:
fg <- names(fGroupsNorm)[2] internalStandardAssignments(fGroupsNorm)[[fg]] # IS assignments for 2nd feature group plotGraph(fGroupsNorm) # interactively explore assignments
Like IS normalization, other feature normalization methods also occurs with normInts()
:
fGroupsNorm <- normInts(fGroups, featNorm = "tic") # TIC normalization fGroupsNorm <- normInts(fGroups, featNorm = "conc") # Concentration normalization
Normalizing feature intensities among group member is easily performed by setting groupNorm=TRUE
:
# only perform group normalization fGroupsNorm <- normInts(fGroups, featNorm = "none", groupNorm = TRUE) # first perform TIC feature normalization and then group normalization fGroupsNorm <- normInts(fGroups, featNorm = "tic", groupNorm = TRUE)
The normalized intensity (peak heigh/area) values can easily be obtained with as.data.table()
:
as.data.table(fGroupsNorm, normalized = TRUE)[1:5] # can be combined with other parameters as.data.table(fGroupsNorm, normalized = TRUE, average = TRUE, areas = TRUE)[1:5] # feature values (no need to set normalized=TRUE) as.data.table(fGroupsNorm, features = TRUE)[1:5, .(group, analysis, intensity_rel, area_rel)]
Several other patRoon
functions also accept the normalized
argument to use normalized data, such as plotInt()
(discussed here), plotVolcano()
(discussed here) and generateComponents()
with intensity clustering (discussed here).
If normalized data is requested (normalized=TRUE
, see previous section) and normInts()
was not called on the feature group data, a default normalization will occur. This is nothing more than a group normalization (normInts(groupNorm=TRUE, ...)
), and was mainly implemented to ensure backwards compatibility with previous patRoon
versions.
Many different parameters exist that may affect the output quality of feature finding and grouping. To avoid time consuming manual experimentation, functionality is provided to largely automate the optimization process. The methodology, which uses design of experiments (DoE), is based on the excellent Isotopologue Parameter Optimization (IPO) R
package. The functionality of this package is directly integrated in patRoon. Some functionality was added or changed, the most important being support for other feature finding and grouping algorithms besides [XCMS] and basic optimization support for qualitative parameters. Nevertheless, the core optimization algorithms are largely untouched.
This section will introduce the most important concepts and functionality. Please see the reference manual for more information (e.g. ?`feature-optimization`
).
NOTE The [SIRIUS] and [SAFD] algorithms are currently not (yet) supported.
Before starting an optimization experiment we have to define parameter sets. These sets contain the parameters and (initial) numeric ranges that should be tested. A parameter set is defined as a regular list
, and can be easily constructed with the generateFeatureOptPSet()
and generateFGroupsOptPSet()
functions (for feature finding and feature grouping, respectively).
pSet <- generateFeatureOptPSet("openms") # default test set for OpenMS pSet <- generateFeatureOptPSet("openms", chromSNR = c(5, 10)) # add parameter # of course manually making a list is also possible (e.g. if you don't want to test the default parameters) pSet <- list(noiseThrInt = c(1000, 5000))
When optimizing with [XCMS] or [KPIC2] a few things have to be considered. First of all, when using the XCMS3 interface (i.e. algorithm="xcms3"
) the underlying method that should be used for finding and grouping features and retention alignment should be set. In case these are not set default methods will be used.
pSet <- list(method = "centWave", ppm = c(2, 8)) pSet <- list(ppm = c(2, 8)) # same: centWave is default # get defaults, but for different grouping/alignment methods pSetFG <- generateFGroupsOptPSet("xcms3", groupMethod = "nearest", retAlignMethod = "peakgroups")
In addition, when optimizing feature grouping (both XCMS
interfaces and KPIC2
) we need to set the grouping and retention alignment parameters in two different nested lists: these are groupArgs
/retcorArgs
(algorithm="xcms"
), groupParams
/retAlignParams
(algorithm="xcms3"
) or groupArgs
/alignArgs
(algorithm="kpic2"
).
pSetFG <- list(groupParams = list(bw = c(20, 30))) # xcms3 pSetFG <- list(retcorArgs = list(gapInit = c(0, 7))) # xcms pSetFG <- list(groupArgs = list(mz_weight = c(0.3, 0.9))) # kpic2
When a parameter set has been defined it should be used as input for the optimizeFeatureFinding()
or optimizeFeatureGrouping()
functions.
ftOpt <- optimizeFeatureFinding(anaInfo, "openms", pSet) fgOpt <- optimizeFeatureGrouping(fList, "openms", pSetFG) # fList is an existing features object
Similar to findFeatures()
, the first argument to optimizeFeatureFinding()
should be the analysis information. Note that it is not uncommon to perform the optimization with only a subset of (representative) analyses (i.e. to reduce processing time).
ftOpt <- optimizeFeatureFinding(anaInfo[1:2, ], "openms", pSet) # only use first two analyses
From the parameter set a design of experiment will be automatically created. Obviously, the more parameters are specified, the longer such an experiment will take. After an experiment has finished, the optimization algorithm will start a new experiment where numeric ranges for each parameter are increased or decreased in order to more accurately find optimum values. Hence, the numeric ranges specified in the parameter set are only initial ranges, and will be changed in subsequent experiments. After each experiment iteration the results will be evaluated and a new experiment will be started as long as better results were obtained during the last experiment (although there is a hard limit defined by the maxIterations
argument).
For some parameters it is recommended or even necessary to set hard limits on the numeric ranges that are allowed to be tested. For instance, setting a minimum feature intensity threshold is highly recommended to avoid excessive processing time and potentially suboptimal results due to excessive amounts of resulting features. Configuring absolute parameter ranges is done by setting the paramRanges
argument.
# set minimum intensity threshold (but no max) ftOpt <- optimizeFeatureFinding(anaInfo, "openms", list(noiseThrInt = c(1000, 5000), # initial testing range paramRanges = list(noiseThrInt = c(500, Inf))) # never test below 500
Depending on the used algorithm, several default absolute limits are imposed. These may be obtained with the getDefFeaturesOptParamRanges()
and getDefFGroupsOptParamRanges()
functions.
The common operation is to optimize numeric parameters. However, parameters that are not numeric (i.e. qualitative) need a different approach. In this case you will need to define multiple parameter sets, where each set defines a different qualitative value.
ftOpt <- optimizeFeatureFinding(anaInfo, "openms", list(chromFWHM = c(4, 8), isotopeFilteringModel = "metabolites (5% RMS)"), list(chromFWHM = c(4, 8), isotopeFilteringModel = "metabolites (2% RMS)"))
In the above example there are two parameter sets: both define the numeric chromFWHM
parameter, whereas the qualitative isotopeFilteringModel
parameter has a different value for each. Note that we had to specify the chromFWHM
twice, this can be remediated by using the templateParams
argument:
ftOpt <- optimizeFeatureFinding(anaInfo, "openms", list(isotopeFilteringModel = "metabolites (5% RMS)"), list(isotopeFilteringModel = "metabolites (2% RMS)"), templateParams = list(chromFWHM = c(4, 8)))
As its name suggests, the templateParams
argument serves as a template parameter set, and its values are essentially combined with each given parameter set. Note that current support for optimizing qualitative parameters is relatively basic and time consuming. This is because tests are essentially repeated for each parameter set (e.g. in the above example the chromFWHM
parameter is optimized twice, each time with a different value for isotopeFilteringModel
).
The results of an optimization process are stored in objects from the S4 optimizationResult
class. Several methods are defined to inspect and visualize these results.
The optimizedParameters()
function is used to inspect the best parameter settings. Similarly, the optimizedObject()
function can be used to obtain the object that was created with these settings (i.e. a features
or featureGroups
object).
optimizedParameters(ftOpt) # best settings for whole experiment optimizedObject(ftOpt) # features object with best settings for whole experiment
Results can also be obtained for specific parameter sets/iterations.
optimizedParameters(ftOpt, 1) # best settings for first parameter set optimizedParameters(ftOpt, 1, 1) # best settings for first parameter set and experiment iteration optimizedObject(ftOpt, 1) # features object with best settings for first parameter set
The plot()
function can be used to visualize optimization results. This function will plot graphs for results of all tested parameter pairs. The graphs can be contour, image or perspective plots (as specified by the type
argument).
plot(ftOpt, paramSet = 1, DoEIteration = 1) # contour plots for first param set/experiment plot(ftOpt, paramSet = 1, DoEIteration = 1, type = "persp") # pretty perspective plots
Please refer to the reference manual for more methods to inspect optimization results (e.g. ?optimizationResult
).
The algorithms used by findFeatures
detect chromatographic peaks automatically to find the features. However, it is common that not all detected features have 'proper' chromatographic peaks, and some features could be just noise. The [MetaClean] R
package supports various quality measures for chromatographic peaks. The quality measures include Gaussian fit, symmetry, sharpness and others. In addition, [MetaClean] averages all feature data for each feature group and adds a few additional group specific quality measures (e.g. retention time consistency). Please see @Chetnik2020 for more details. The calculations are integrated into patRoon
, and are easily performed with the calculatePeakQualities()
generic function.
fList <- calculatePeakQualities(fList) # calculate for all features fGroups <- calculatePeakQualities(fGroups) # calculate for all features and groups
Most often the featureGroups
method is only used, unless you want to filter features (discussed below) prior to grouping.
An extension in patRoon
is that the qualities are used to calculate peak scores. The score for each quality measure is calculated by normalizing and scaling the values into a 0-1
range, where zero is the worst and one the best. Note that most scores are relative, hence, the values should only be used to compare features among each other. Finally, a totalScore
is calculated which sums all individual scores and serves as a rough overall score indicator for a feature (group).
The qualities and scores are easily obtained with the as.data.table()
function.
# (limit rows/columns for clarity) as.data.table(fList)[1:5, 26:30] # the qualities argument is necessary to include the scores. # valid values are: "quality", "score" or "both" as.data.table(fGroups, qualities = "both")[1:5, 25:29]
The feature quality values can also be reviewed interactively with reports generated with report
(see [Reporting]) and with checkFeatures
(see here). The filter
function can be used filter out low scoring features and feature groups:
# only keep features with at least 0.3 Modality score and 0.5 symmetry score fList <- filter(fList, qualityRange = list(ModalityScore = c(0.3, Inf), SymmetryScore = c(0.5, Inf))) # same as above fGroups <- filter(fGroups, featQualityRange = list(ModalityScore = c(0.3, Inf), SymmetryScore = c(0.5, Inf))) # filter group averaged data fGroups <- filter(fGroups, groupQualityRange = list(totalScore = c(0.5, Inf)))
An important feature of [MetaClean] is to use the quality measures to train a machine learning model to automatically recognize 'good' and 'bad' features. patRoon
provides a few extensions to simplify training and using a model. Furthermore, while MetaClean
was primarily designed to work with [XCMS], the extensions of patRoon
allow the usage of data from all the algorithms supported by patRoon
.
The getMCTrainData
function can be used to convert data from a feature check session to training data that can be used by [MetaClean]. This allows you to use interactively select good/bad peaks. The workflow looks like this:
# untick the 'keep' checkbox for all 'bad' feature groups checkFeatures(fGroupsTrain, "train_session.yml") # get train data. This gives comparable data as MetaClean::getPeakQualityMetrics() trainData <- getMCTrainData(fGroupsTrain, "train_session.yml") # use train data with MetaClean with MetaClean::runCrossValidation(), # MetaClean::getEvaluationMeasures(), MetaClean::trainClassifier() etc # --> see the MetaClean vignette for details
Once you have created a model with [MetaClean] it can be used with the predictCheckFeaturesSession()
function:
predictCheckFeaturesSession(fGroups, "model_session.yml", model)
This will generate another check session file: all the feature groups that are considered good will be with a 'keep' state, the others without. As described elsewhere, the checkFeatures
function is used to review the results from a session and the filter
function can be used to remove unwanted feature groups. Note that calculatePeakQualitites()
must be called before getMCTrainData
/predictCheckFeaturesSession
can be used.
NOTE
MetaClean
only predicts at the feature group level. Thus, only the kept feature groups from a feature check session will be used for training, and any indivual features that were marked as removed will be ignored.
The feature group data obtained during the workflow can be exported to various formats with the export()
generic function. There are currently three formats supported: "brukerpa"
(Bruker ProfileAnalysis), "brukertasq"
(Bruker TASQ) and "mzmine"
(mzMine). The former exports a 'bucket table' which can be loaded in ProfileAnalysis, the second and third export a target list that can be processed with TASQ and mzMine, respectively.
The getXCMSSet()
function converts a features
or featureGroups
object to an xcmsSet
object which can be used for further processing with [xcms]. Similarly, the getXCMSnExp()
function can be used for conversion to an XCMS3 style XCMSnExp
object, and the getPICSet()
function can be used to convert features
to [KPIC2] data.
Some examples for these functions are shown below.
export(fGroups, "brukertasq", out = "my_targets.csv") # convert features to xcmsSet. # NOTE: loadRawData should only be FALSE when the analysis data files cannot be # loaded by the algorithm (e.g. when features were obtained with DataAnalysis and data was not exported to mz(X)ML) xset <- getXCMSSet(fList, loadRawData = TRUE) xsetg <- getXCMSSet(fGroups, loadRawData = TRUE) # get grouped xcmsSet # using the new XCMS3 interface xdata <- getXCMSnExp(fList) xdata <- getXCMSnExp(fGroups) # KPIC2 conversion. Like XCMS it optionally loads the raw data. picSet <- getPICSet(fList, loadRawData = TRUE)
With patRoon
you have the option to choose between several algorithms for most workflow steps. Each algorithm is typically characterized by its efficiency, robustness, and may be optimized towards certain data properties. Comparing their output is therefore advantageous in order to design an optimum workflow. The consensus()
generic function will compare different results from different algorithms and returns a consensus, which may be based on minimal overlap, uniqueness or simply a combination of all results from involved objects. The output from the consensus()
function is of similar type as the input types and is therefore compatible to any 'regular' further data processing operations (e.g. input for other workflow steps or plotting). Note that a consensus can also be made from objects generated by the same algorithm, for instance, to compare or combine results obtained with different parameters (e.g. different databases used for compound annotation).
The consensus()
generic is defined for most workflow objects. Some of its common function arguments are listed below.
Argument | Classes | Remarks
------------- | -------------------------------------------------- | -------------------------------------------------------------
obj
, ...
| All | Two or more objects (of the same type) that should be compared to generate the consensus.
compThreshold
, relAbundance
, absAbundance
, formThreshold
| compounds
, formulas
, featureGroupsComparison
| The minimum overlap (relative/absolute) for a result (feature, candidate) to be kept.
uniqueFrom
| compounds
, formulas
, transformationProductsStructure
, featureGroupsComparison
| Only keep unique results from specified objects.
uniqueOuter
| compounds
, formulas
, transformationProductsStructure
, featureGroupsComparison
| Should be combined with uniqueFrom
. If TRUE
then only results are kept which are also unique between the objects specified with uniqueFrom
.
Note that current support for generating a consensus between components
objects is very simplistic; here results are not compared, but the consensus simply consists a combination of all the components from each object.
Generating a consensus for feature groups involves first generating a featureGroupsComparison
object. This step is necessary since (small) deviations between retention times and/or mass values reported by different feature finding/grouping algorithms complicates a direct comparison. The comparison objects are made by the comparison()
function, and its results can be visualized by the plotting functions discussed in the previous chapter.
Some examples are shown below
compoundsCons <- consensus(compoundsMF, compoundsSIR) # combine MetFrag/SIRIUS results compoundsCons <- consensus(compoundsMF, compoundsSIR, compThreshold = 1) # only keep results that overlap TPsCons <- consensus(TPsLib, TPsBT) # combine library and BioTransformer TPs fGroupComp <- comparison(fGroupsXCMS, fGroupsOpenMS, fGroupsEnviPick, groupAlgo = "openms") plotVenn(fGroupComp) # visualize overlap/uniqueness fGroupsCons <- consensus(fGroupComp, uniqueFrom = 1:2) # only keep results unique in OpenMS+XCMS fGroupsCons <- consensus(fGroupComp, uniqueFrom = 1:2, uniqueOuter = TRUE) # as above, but also exclude any overlap between OpenMS/XCMS
The loadMSLibraries()
function is used to load MS spectral libraries, and was already briefly introduced for compound annotation. Currently, loading of MSP files and [MoNA] JSON files is supported, while loading of formula annotations for MS peaks is currently only supported for the latter. The underlying algorithms implement several optimizations to efficiently load large number of records. Furthermore, loadMSLibraries()
automatically verifies record data such as formulas, adducts and masses, and automatically calculates missing or invalid data where possible.
mslibraryMSP <- loadMSLibrary("MoNA-export-CASMI_2016.msp", "msp") mslibraryJSON <- loadMSLibrary("MoNA-export-CASMI_2016.json", "json")
Several advanced parameters are available that influence the loading of MS library data, see the reference manual (?loadMSLibrary
) for details.
Once loaded, the usual methods are available to inspect its data:
show(mslibraryMSP) mslibraryMSP[[1]] # MS/MS spectrum for first candidate mslibraryJSON[["SM801601"]] # a record with annotations # overview of all metadata (select few columns for readability) records(mslibraryJSON)[, .(DB_ID, Name, InChIKey, formula)] # convert all data to a data.table (may be huge!) as.data.table(mslibraryMSP)[, .(DB_ID, SMILES, formula, mz, intensity)]
Furthermore, like many other objects in patRoon
, the MS library objects can be subset and filtered:
mslibrarySub <- mslibrary[1:100] # only keep first 100 records # only keep records a neutral mass of 100-200 mslibraryF <- filter(mslibrary, massRange = c(100, 200)) # remove records with neutral mass below 100 mslibraryF <- filter(mslibrary, massRange = c(0, 100), negate = TRUE) # only keep mass peaks with m/z 100-500 mslibraryF <- filter(mslibrary, mzRangeSpec = c(100, 500)) # remove low intensity peaks (<1%) and only keep top 10 mslibraryF <- filter(mslibrary, relMinIntensity = 0.01, topMost = 10) # only keep mass peak with annotations mslibraryF <- filter(mslibraryJSON, onlyAnnotated = TRUE)
In addition, the properties
filter may be useful to tailor the library data. The library properties can be obtained as following:
oldW <- options(width = 80) # HACK: make sure vectors below are shown nicely names(records(mslibrary)) # get all property names unique(records(mslibrary)[["Instrument_type"]]) # Get the available instrument types options(width = oldW[[1]])
Then to filter the MS library:
# only keep APCI instrument types mslibraryF <- filter(mslibrary, properties = list(Instrument_type = c("LC-APCI-ITFT", "APCI-ITFT"))) # remove Q-TOF by negation mslibraryF <- filter(mslibrary, properties = list(Instrument_type = "LC-ESI-QTOF"), negate = TRUE)
More advanced filtering can be performed with the delete()
generic function, see the reference manual for details (?MSLibrary
).
Finally, functionality exists to convert, export and merge MS libraries:
# Convert the MS library to a suspect list. # By setting collapse to TRUE, all records with the same first block InChIKey # are collapsed and mass peaks are averaged. suspL <- convertToSuspects(mslibrary, adduct = "[M+H]+", collapse = TRUE) # Amend custom suspect list with library data (fragments_mz column) suspL <- convertToSuspects(mslibrary, adduct = "[M+H]+", suspects = patRoonData::suspectsPos) export(mslibrary, out = "myMSLib.msp") # export to a new MSP library mslibraryM <- merge(mslibraryMSP, mslibraryJSON) # merge two libraries
When large databases such as [PubChem] or [ChemSpider] are used for compound annotation, it is common to find many candidate structures for even a single feature. While choosing the right scoring settings can significantly improve their ranking, it is still very much possible that many candidates of potential interest remain. In this situation it might help to perform compound clustering. During this process, all candidates for a feature are clustered hierarchically on basis of similar chemical structure. From the resulting cluster the maximum common substructure (MCS) can be derived, which represents the largest possible substructure that 'fit' in all candidates. By visual inspection of the MCS it may be possible to identify likely common structural properties of a feature.
In order to perform compound clustering the makeHCluster()
generic function should be used. This function heavily relies on chemical fingerprinting functionality provided by [rcdk].
compounds <- generateCompounds(...) # get our compounds compsClust <- makeHCluster(compounds)
This function accepts several arguments to fine tune the clustering process:
method
: the clustering method (e.g. "complete"
(default), "ward.D2"
), see ?hclust
for optionsfpType
: finger printing type ("extended"
by default), see ?get.fingerprint
fpSimMethod
: similarity method for generating the distance method ("tanimoto"
by default), see ?fp.sim.matrix
For all arguments see the reference manual (?makeHClust
).
The resulting object is of type compoundsCluster
. Several methods are defined to modify and inspect these results:
# plot MCS of first cluster from candidates of M192_R355_191 plotStructure(compsClust, groupName = "M192_R355_191", 1) # plot dendrogram plot(compsClust, groupName = "M192_R355_191") # re-assign clusters for a feature group compsClust <- treeCut(compsClust, k = 5, groupName = "M192_R355_191") # ditto, but automatic cluster determination compsClust <- treeCutDynamic(compsClust, minModuleSize = 3, groupName = "M192_R355_191")
For a complete overview see the reference manual (?compoundsCluster
).
Some basic support in patRoon
is available to perform simple linear regression on feature intensities vs given experimental conditions. Examples of such conditions are dilution factor, sampling time or initial concentration of a parent in a degradation experiment. By testing if there is a significant linearity, features of interest can be isolated in a relative easy way. Originally, this functionality was implemented as a very basic method to perform rough calculations of concentrations. However, the next sections describes a much better way by using the MS2Quant
package. Regardless, this functionality still uses 'concentrations' as terminology for the experimental conditions of interest. The conditions are specified in the conc
column of the analysis information, for instance:
# obtain analysis information as usual, but add some experimental parameters of interest (dilution, time etc). # The blanks are set to NA, whereas the standards are set to increasing levels. anaInfo <- generateAnalysisInfo(paths = patRoonData::exampleDataPath(), groups = c(rep("solvent", 3), rep("standard", 3)), blanks = "solvent", concs = c(NA, NA, NA, 1, 2, 3))
If no experimental conditions are available for a particular analysis then the conc
value should be NA
. For these analyses the experimental condition will be calculated using the regression model obtained from the other analyses.
The as.data.table()
function (or as.data.frame()
) can then be used to calculate regression data:
# use areas for quantitation and make sure that feature data is reported # (only relevant columns are shown for clarity) as.data.table(fGroups, areas = TRUE, features = TRUE, regression = TRUE)
as.data.table(fGroupsConc, areas = TRUE, features = TRUE, regression = TRUE)[, c("group", "conc", "RSQ", "intercept", "slope", "conc_reg")]
The calculated experimental conditions are stored in the conc_reg
column (this column is only present if features=TRUE
). In addition, the table also contains other regression data such as RSQ
, intercept
and slope
. To perform basic trend analysis the RSQ
(i.e. R squared) can be used:
fGroupsTab <- as.data.table(fGroups, areas = TRUE, features = FALSE, regression = TRUE) # subset fGroups with reasonable correlation increasingFGroups <- fGroups[, fGroupsTab[RSQ >= 0.8, group]]
The [MS2Tox] and [MS2Quant] R
packages predict toxicities and feature concentrations using a machine learning approach. The predictions are performed with either SMILES data or fingerprints calculated from MS/MS data with SIRIUS+CSI:FingerID
. While using SMILES data is generally more accurate, using MS/MS fingerprints is generally faster and may be more suitable for features without know or suspected structure.
In patRoon
the predictions are done in two steps:
MS2Tox
/MS2Quant
. This step is performed by the predictTox()
/predictConcs()
method function.calculateTox()
/calculateConcs()
method function.Various workflow data can be used to perform the predictions for step 1:
a. Suspect hits that were obtained with screenSuspects
(see suspect screening).
b. Formula annotations obtained with SIRIUS+CSI:FingerID
.
c. Compound annotations obtained with SIRIUS+CSI:FingerID
.
d. Compound annotations obtained with other algorithms, e.g. MetFrag
.
For a and d SMILES is used to perform the calculations, for b MS/MS fingerprints are used and for c either can be used.
NOTE For option b, make sure that
getFingerprints=TRUE
when runninggenerateFormulas()
in order to obtain fingerprints.
Some example workflows are shown below:
# Calculate toxicity for suspect hits. fGroupsSuspTox <- predictTox(fGroupsSusp) fGroupsSuspTox <- calculateTox(fGroupsSuspTox) # Calculate toxicity for compound hits. Limit to the top 5 to reduce calculation time. compoundsTop5 <- filter(compounds, topMost = 5) compoundsTox <- predictTox(compoundsTop5) fGroupsTox <- calculateTox(fGroups, compoundsTox)
It is also possible to calculate toxicities from multiple workflow objects:
fGroupsSuspTox <- predictTox(fGroupsSusp) # as above # Predict toxicities from compound candidates, using both SMILES and MS/MS fingerprints # compoundsSuspSIR is an object produced with generateCompounds() with algorithm="sirius" compoundsSuspSIRTox <- predictTox(compoundsSuspSIR, type = "both") # Assign toxicities to feature groups from both suspect hits and SIRIUS annotations fGroupsSuspTox <- calculateTox(fGroupsSuspTox, compoundsSuspSIRTox)
More details are in the reference manual: ?`pred-tox`
.
The workflow to predict concentrations is quite similar to predicting toxicities. However, before we can start we first have to specify the calibrants and the LC gradient elution program.
The calibrant data is used by MS2Quant
to convert predicted ionization efficiencies to actual response factors, which are specific to the used LC instrument and methodology. For this purpose, several mixtures with known concentrations (i.e. standards) should be measured alongside your samples. The calibrants should be specified as a data.frame
, for instance:
# generate calibrants with some imaginary intensities/concentrations calibrants <- data.table(name = c("DEET", "1h-benzotriazole", "Caffeine", "Atrazine", "Carbamazepine", "Venlafaxine"), conc = rep(c(1, 2, 5, 10, 25, 50), each = 6)) setorderv(calibrants, "name") calibrants[, SMILES := patRoonData::suspectsPos[match(name, patRoonData::suspectsPos$name), "SMILES"]] calibrants[, rt := patRoonData::suspectsPos[match(name, patRoonData::suspectsPos$name), "rt"]] fgScr <- screenSuspects(fGroups, calibrants, onlyHits = TRUE, adduct = "[M+H]+") fgTab <- as.data.table(fgScr, average = TRUE, collapseSuspects = NULL) calibrants <- merge(calibrants, fgTab[, c("susp_name", "standard-pos"), with = FALSE], by.x = "name", by.y = "susp_name") setnames(calibrants, "standard-pos", "intensity") withr::with_seed(1, calibrants[, intensity := round(runif(.N, 0.9, 1.1) * (intensity / 20 * conc))]) setcolorder(calibrants, c("name", "SMILES", "intensity", "conc")) makeTable(calibrants)
The intensity column should contain either the peak intensity (height) or area. Note that some feature detection algorithms can sometimes produce inaccurate peak areas, and the area determination methodology is often different among algorithms. For this reason, using peak intensities may be more reliable, however, it is worth testing this with your data.
It is also possible to use the getQuantCalibFromScreening()
function to automatically create the calibrant table from feature group data:
calibList <- data.frame(...) # this should be a suspect list with your calibrants fGroups <- screenSuspects(fGroups, calibList) # screen for the calibrants concs <- data.frame(...) # concentration data for each calibrant compound, see below calibrants <- getQuantCalibFromScreening(fGroups, concs) calibrants <- getQuantCalibFromScreening(fGroups, concs, areas = TRUE) # obtain feature areas instead of intensities
The first step is to perform a screening for the calibrant compounds. Please ensure that this list should contains SMILES data, and to ensure correct feature assignment it is highly recommended to include retention times. The second requirement for getQuantCalibFromScreening()
is a table with concentrations for each calibrant compound, e.g.:
concs <- data.frame( name = c("DEET", "1h-benzotriazole", "Caffeine", "Atrazine", "Carbamazepine", "Venlafaxine"), standard_1 = c(1.00, 1.05, 1.10, 0.99, 1.01, 1.12), standard_2 = c(2.00, 2.15, 2.20, 1.98, 2.02, 1.82), standard_5 = c(5.01, 5.05, 5.22, 5.00, 4.88, 4.65), standard_10 = c(10.2, 10.11, 10.23, 11.77, 11.75, 12.13), standard_25 = c(25.3, 25.12, 25.34, 24.89, 24.78, 24.68), standard_50 = c(50.34, 50.05, 50.10, 49.97, 49.71, 50.52) ) concs
The concentrations are specified separately for each calibrant compound. The column names should follow the names of the replicate groups assigned to the standards. The concentration unit is µg/l by default. The next section describes how this can be changed.
The gradient elution program is also specified by a data.frame
, which for every time point (in seconds!) describes the percentage of 'B'. In this case, 'B' represents the total amount of organic modifier.
eluent <- data.frame( time = c(0, 180, 600, 900, 960), B = c(10, 30, 100, 100, 10) ) plot(eluent, type = "l") eluent
Then, the workflow to predict concentrations is very similar then predicting toxicities (previous section):
# Calculate concentrations for suspect hits. fGroupsSuspConc <- predictRespFactors( fGroupsSusp, calibrants = calibrants, eluent = eluent, organicModifier = "MeCN", # organic modifier: MeOH or MeCN pHAq = 4 # pH of the aqueous part of the mobile phase ) # set areas to TRUE if the calibrant table contains areas fGroupsSuspConc <- calculateConcs(fGroupsSuspConc, areas = FALSE)
As was shown for toxicities it is possible to use different data sources (e.g. compound annotations, suspects) for predictions.
More details are in the reference manual: ?`pred-quant`
.
The default unit for toxicity and concentration data is µg/l. However, this can be configured when calling the predictTox()
/predictRespFactors()
functions:
fGroupsSuspTox <- predictTox(fGroupsSusp) # default unit: ug/l fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "ug/l") # same as above fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "mM") # millimolar fGroupsSuspTox <- predictTox(fGroupsSusp, concUnit = "log mM") # unit used by MS2Tox # calculated concentrations are ng/l, calibrants are specified in ug/l # (by default calibConcUnit=concUnit) fGroupsSuspConc <- predictRespFactors( fGroupsSusp, calibrants = calibrants, eluent = eluent, organicModifier = "MeCN", pHAq = 4, concUnit = "ng/l", calibConcUnit = "ug/l" )
See the reference manuals (?`pred-tox`
/?`pred-quant`
) For more details on which units can be specified
The raw toxicity and concentration data assigned to feature groups can be retrieved with the toxicities()
and concentrations()
method functions, respectively.
toxicities(fGroupsSuspTox) concentrations(fGroupsSuspConc)
If there were multiple candidates for a single feature group then these are split over the table rows:
toxicities(fGroupsTox)
The as.data.table()
method function, which was discussed previously, can be used to summarize toxicity and concentration values.
# NOTE: NA values are filtered and columns are subset for readability as.data.table(fGroupsTox)[!is.na(LC50), c("group", "LC50", "LC50_types")] concCols <- c("group", paste0(analyses(fGroupsSuspConc), "_conc"), "conc_types") as.data.table(fGroupsSuspConc)[!is.na(conc_types), concCols, with = FALSE]
The as.data.table()
method function aggregates the data for a feature group in case multiple candidates were assigned to it. By default the values are mean averaged, but this be changed with the toxAggrParams
/concAggrParams
arguments, for instance:
# as above, but aggregate by taking maximum values as.data.table(fGroupsTox, toxAggrParams = getDefPredAggrParams(max))[!is.na(LC50), c("group", "LC50", "LC50_types")]
If the as.data.table()
method is used on suspect screening results, and predictions were performed directly for suspect hits, then predicted values can be reported for individual suspect match instead of aggregating them per feature group:
# Reports predicted values for each suspect separately. If multiple suspects are assigned to a feature group, # then each suspect match is split into a different row. as.data.table(fGroupsSuspTox, collapseSuspects = NULL)
Finally, the reporting functionality can be used to overview all predicted values, both aggregated and raw.
The filter()
method function that was introduced before can also be used to filter data based on predicted toxicities, response factors and concentrations. For instance, this allows you to remove annotation candidates which are unlikely to be toxic or sensitive enough to be detected or any features with very low concentrations. Some examples are shown below.
# compoundsSuspSIRTox is an object with predicted toxicities (LC50 values) for each candidate # we can use the common scoreLimits filter to select a range of allowed values (min/max) compoundsSuspSIRToxF <- filter(compoundsSuspSIRTox, scoreLimits = list(LC50_SMILES = c(0, 1E4))) # for suspects with predicted toxicities/response factors there are dedicated filters fGroupsSuspConcF <- filter(fGroupsSuspConc, minRF = 5E4) # remove suspect hits with response factor <5E4 fGroupsSuspToxF <- filter(ffGroupsSuspTox, maxLC50 = 100) # remove suspect hits with LC50 values > 100 # similarly, for feature data there are dedicated filters. # note that these aggregate data prior to filtering (see previous section) fGroupsConcF <- filter(fGroupsConc, absMinConc = 0.02) # only keep features with concentrations that are at least 1% of their toxicity # note that both concentrations/toxicity values should have been calculated with calculateConcs()/calculateTox() fGroupsConcToxF <- filter(fGroupsConcTox, absMinConcTox = 0.01) # also get rid of features without concentrations (these are ignored by default) fGroupsConcF <- filter(fGroupsConc, absMinConc = 0.02, removeNA = TRUE) # like as.data.table we can configure how values are aggregated # here the minimum is used instead of the default mean fGroupsToxF <- filter(fGroupsTox, absMaxTox = 5E3, predAggrParams = getDefPredAggrParams(min))
More details are found in the reference manual (?`feature-filtering`
).
A specific statistical way to prioritize feature data is by Fold changes (FC). This is a relative simple method to quickly identify (significant) changes between two sample groups. A typical use case is to compare the feature intensities before and after an experiment.
To perform FC calculations we first need to specify its parameters. This is best achieved with the getFCParams()
function:
getFCParams(c("before", "after"))
In this example we generate a list
with parameters in order to make a comparison between two replicate groups: before
and after
. Several advanced parameters are available to tweak the calculation process. These are explained in the reference manual (?featureGroups
).
The as.data.table
function for feature groups is used to perform the FC calculations.
fGroupsO <- fGroups fGroups <- filter(fGroupsUF, absMinIntensity = 1E4)
myFCParams <- getFCParams(c("solvent-pos", "standard-pos")) # compare solvent/standard as.data.table(fGroups, FCParams = myFCParams)[, c("group", "FC", "FC_log", "PV", "PV_log", "classification")]
The classification
column allows you to easily identify if and how a feature changes between the two sample groups. This can also be used to prioritize feature groups:
tab <- as.data.table(fGroups, FCParams = myFCParams) # only keep feature groups that significantly increase or decrease fGroupsChanged <- fGroups[, tab[classification %in% c("increase", "decrease")]$group]
The plotVolcano
function can be used to visually the FC data:
plotVolcano(fGroups, myFCParams)
fGroups <- fGroupsO
In patRoon
lengthy processing operations such as finding features and generating annotation data is cached. This means that when you run such a calculation again (without changing any parameters), the data is simply loaded from the cache data instead of re-generating it. This in turn is very useful, for instance, if you have closed your R session and want to continue with data processing at a later stage.
The cache data is stored in a [sqlite] database file. This file is stored by default under the name cache.sqlite
in the current working directory (for this reason it is very important to always restore your working directory!). However, the name and location can be changed by setting a global package option:
options(patRoon.cache.fileName = "~/myCacheFile.sqlite")
For instance, this might be useful if you want to use a shared cache file between projects.
After a while you may see that your cache file can get quite large. This is especially true when testing different parameters to optimize your workflow. Furthermore, you may want to clear the cache after you have updated patRoon
and want to make sure that the latest code is used to generate the data. At any point you can simply remove the cache file. A more fine tuned approach which doesn't wipe all your cached data is by using the clearCache()
function. With this function you can selectively remove parts of the cache file. The function has two arguments: what
, which specifies what should be removed, and file
which specifies the path to the cache file. The latter only needs to be specified if you want to manage a different cache file.
In order to figure what is in the cache you can run clearCache()
without any arguments:
clearCache()
Using this output you can re-run the function again, for instance:
clearCache("featuresOpenMS") clearCache(c("featureGroupsOpenMS", "formulasGenForm")) # clear multiple clearCache("OpenMS") # clear all with OpenMS in name (ie partial matched) clearCache("all") # same as simply removing the file
Some steps in the non-target screening workflow are inherently computationally intensive. To reduce computational times patRoon
is able to perform parallelization for most of the important functionality. This is especially useful if you have a modern system with multiple CPU cores and sufficient RAM.
For various technical reasons several parallelization techniques are used, these can be categorized as parallelization of R
functions and multiprocessing. The next sections describe both parallelization approaches in order to let you optimize the workflow.
Several functions of patRoon
support parallelization.
Function | Purpose | Remarks
------------------------ | ----------------------------------- | ------------------------
findFeatures
| Obtain feature data | Only envipick
and kpic2
algorithms.
generateComponents
| Generate components | Only cliquems
algorithm.
report
| Reporting data |
generateTPs
| Obtain transformation products | Only cts
algorithm.
optimizeFeatureFinding
, optimizeFeatureGrouping
| Optimize feature finding/grouping parameters | Discussed here.
calculatePeakQualities
| Calculate feature (group) qualities | Discussed here.
predictTox
/ predictRespFactors
| Prediction of toxicities/concentrations } | Only compounds
methods. Discussed here.
The parallelization is achieved with the [future] and [future.apply] R
packages. To enable parallelization of these functions the parallel
argument must be set to TRUE
and the future framework must be properly configured in advance. For example:
# setup three workers to run in parallel future::plan("multisession", workers = 3) # find features with enviPick in parallel fList <- findFeatures(anaInfo, "envipick", parallel = TRUE)
It is important to properly configure the right future plan. Please see the documentation of the [future] package for more details.
patRoon
relies on several external (command-line) tools to generate workflow data. These commands may be executed in parallel to reduce computational times ('multiprocessing'). The table below outlines the tools that are executed in parallel.
Tool | Used by | Notes
--------------------- | -------------------------------------------- | ---------------------------------
msConvert
| convertMSFiles(algorithm="pwiz", ...)
|
FileConverter
| convertMSFiles(algorithm="openms", ...)
|
FeatureFinderMetabo
| findFeatures(algorithm="openms", ...)
|
julia
| findFeatures(algorithm="safd", ...)
|
SIRIUS
| findFeatures(algorithm="sirius", ...)
|
MetaboliteAdductDecharger
| generateComponents(algorithm="openms", ...)
|
GenForm
| generateFormulas(agorithm="genform", ...)
|
SIRIUS
| generateFormulas(agorithm="sirius", ...)
, generateCompounds(agorithm="sirius", ...)
| Only if splitBatches=TRUE
MetFrag
| generateCompounds(agorithm="metfrag", ...)
|
pngquant
| reportHTML(...)
| Only if optimizePng=TRUE
BioTransformer
| generateTPs(algorithm = "biotransformer")
| Disabled by default (see ?generateTPs
for details).
Multiprocessing is either performed by executing processes in the background with the [processx] R
package (classic interface) or by futures, which were introduced in the previous section. An overview of the characteristics of both parallelization techniques is shown below.
classic
| future
--------------------------------------------------- | -------------------------------------------------------
requires little or no configuration | configuration needed to setup
works with all tools | doesn't work with pngquant
and slower with GenForm
only supports parallelization on the local computer | allows both local and cluster computing
Which method is used is controlled by the patRoon.MP.method
package option. Note that reportHTML()
will always use the classic method for pngquant
.
The classic interface is the 'original' method implemented in patRoon
, and is therefore well tested and optimized. It is easier to setup, works well with all tools, and is therefore the default method. It is enabled as follows:
options(patRoon.MP.method = "classic")
The number of parallel processes is configured through the patRoon.MP.maxProcs
option. By default it is set to the number of available CPU cores, which results usually in the best performance. However, you may want to lower this, for instance, to keep your computer more responsive while processing or limit the RAM used by the data processing workflow.
options(patRoon.MP.maxProcs = 2) # do not execute more than two tools in parallel.
This will change the parallelization for the complete workflow. However, it may be desirable to change this for only a part the workflow. This is easily achieved with the withOpt()
function.
# do not execute more than two tools in parallel. options(patRoon.MP.maxProcs = 2) # ... but execute up to four GenForm processes withOpt(MP.maxProcs = 4, { formulas <- generateFormulas(fGroups, "genform", ...) })
The withOpt
function will temporarily change the given option(s) while executing a given code block and restore it afterwards (it is very similar to the with_options()
function from the [withr] R
package). Furthermore, notice how withOpt()
does not require you to prefix the option names with patRoon.
.
The primary goal of the "future" method is to allow parallel processing on one or more external computers. Since it uses the [future] R
package, many approaches are supported, such as local parallelization (similar to the classic
method), cluster computing via multiple networked computers and more advanced HPC approaches such as slurm
via the [future.batchtools] R
package. This parallelization method can be activated as follows:
options(patRoon.MP.method = "future") # set a future plan # example 1: start a local cluster with four nodes future::plan("cluster", workers = 4) # example 2: start a networked cluster with four nodes on PC with hostname "otherpc" future::plan("cluster", workers = rep("otherpc", 4))
Please see the documentation of the respective packages (e.g. [future] and [future.batchtools]) for more details on how to configure the workers.
The withOpt()
function introduced in the previous subsection can also be used to temporarily switch between parallelization approaches, for instance:
# default to future parallelization options(patRoon.MP.method = "future") future::plan("cluster", workers = 4) # ... do workflow # do classic parallelization for GenForm withOpt(MP.method = "classic", { formulas <- generateFormulas(fGroups, "genform", ...) }) # .. do more workflow
Most tools that are executed in parallel will log their output to text files. These files may contain valuable information, for instance, when an error occurred. By default, the logfiles are stored in the log
directory placed in the current working directory. However, you can change this location by setting the patRoon.MP.logPath
option. If you set this option to FALSE
then no logging occurs.
Some important notes when using the future
parallelization method:
GenForm
currently performs less optimal with future multiprocessing to the classic
approach. Nevertheless, it may still be interesting to use the future
method to move the computations to another system to free up resources on your local system.patRoon.MP.futureSched
option sets the value for the future.scheduling
argument to the future_lapply()
function, and therefore allows you to tweak the scheduling.patRoon
is present and with the same version on all computing hosts.MetFrag
and SIRIUS
, and local compound databases, such as as PubChemLite
, are also with the same version and are configured properly. See the [Installation] section for more details.future::plan("sequential")
and see if it works or you get more descriptive error messages.patRoon
, updating R
packages etc, simply re-execute future::plan(...)
.future.debug
package option to TRUE
may give you more insight what is happening to find problems.Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.