Processing workflow data {#processing}

The previous chapter mainly discussed how to create workflow data. This chapter will discuss how to use the data.

Inspecting results

Several generic functions exist that can be used to inspect data that is stored in a particular object (e.g. features, compounds etc):

Generic | Classes | Remarks --------------------------------------------------- | ----------------------------- | ------------------------------------------ length() | All | Returns the length of the object (e.g. number of features, compounds etc) algorithm() | All | Returns the name of the algorithm used to generate the object. groupNames() | All | Returns all the unique identitifiers (or names) of the feature groups for which this object contains results. names() | featureGroups, components | Returns names of the feature groups (similar to groupNames()) or components show() | All | Prints general information. "[[" / "$" operators | All | Extract general information, see below. as.data.table() / as.data.frame() | All | Convert data to a data.table or data.frame, see below. analysisInfo(), analyses(), replicateGroups() | features, featureGroups | Returns the analysis information, analyses or replicate groups for which this object contains data. groupInfo() | featureGroups | Returns feature group information (m/z and retention time values). screenInfo() | featureGroupsScreening | Returns information on hits from suspect screening. componentInfo() | components | Returns information for all components. annotatedPeakList() | formulas, compounds | Returns a table with annotated mass peaks (see below).

The common R extraction operators "[[", "$" can be used to obtain data for a particular feature groups, analysis etc:

# Feature table (only first columns for readability)
fList[["standard-1"]][, 1:6]

# Feature group intensities
fGroups$M120_R268_30
fGroups[[1, "M120_R268_30"]] # only first analysis

# obtains MS/MS peak list  (feature group averaged data)
mslists[["M120_R268_30"]]$MSMS

# get all formula candidates for a feature group
formulas[["M120_R268_30"]][, 1:7]

# get all compound candidates for a feature group
compounds[["M120_R268_30"]][, 1:4]

# get a table with information of a component
components[["CMP7"]][, 1:6]

A more sophisticated way to obtain data from a workflow object is to use as.data.table() or as.data.frame(). These functions will convert all information within the object to a table (data.table or data.frame) and allow various options to add extra information. An advantage is that this common data format can be used with many other functions within R. The output is in a tidy format.

NOTE If you are not familiar with data.table and want to know more see [data.table]. Briefly, this is a more efficient and largely compatible alternative to the regular data.frame.

NOTE The as.data.frame() methods defined in patRoon simply convert the results from as.data.table(), hence, both functions are equal in their usage and are defined for the same object classes.

Some typical examples are shown below.

# obtain table with all features (only first columns for readability)
as.data.table(fList)[, 1:6]

# Returns group info and intensity values for each feature group
as.data.table(fGroups, average = TRUE) # average intensities for replicates

# As above, but with suspect matches on separate rows and additional screening information
# (select some columns to simplify the output below)
as.data.table(fGroupsSusp, average = TRUE, collapseSuspects = NULL,
              onlyHits = TRUE)[, c("group", "susp_name", "susp_compRank", "susp_annSimBoth", "susp_estIDLevel")]

# Returns all peak lists for each feature group
as.data.table(mslists)

# Returns all formula candidates for each feature group with scoring
# information, neutral loss etc
as.data.table(formulas)[, 1:6]

# Returns all compound candidates for each feature group with scoring and other metadata
as.data.table(compounds)[, 1:4]

# Returns table with all components (including feature group info, annotations etc)
as.data.table(components)[, 1:6]

Finally, the annotatedPeakList() function is useful to inspect annotation results for a formula or compound candidate:

# formula annotations for the first formula candidate of feature group M137_R249_53
annotatedPeakList(formulas, index = 1, groupName = "M137_R249_53",
                  MSPeakLists = mslists)

# compound annotation for first candidate of feature group M137_R249_53
annotatedPeakList(compounds, index = 1, groupName = "M137_R249_53",
                  MSPeakLists = mslists)

More advanced examples for these functions are shown below.

# Feature table, can also be accessed by numeric index
fList[[1]]
mslists[["standard-1", "M120_R268_30"]] # feature data (instead of feature group averaged)
formulas[[1, "M120_R268_30"]] # feature data (if available, i.e. calculateFeatures=TRUE)
components[["CMP1", 1]] # only for first feature group in component

as.data.frame(fList) # classic data.frame format, works for all objects
as.data.table(fGroups) # return non-averaged intensities (default)
as.data.table(fGroups, features = TRUE) # include feature information
as.data.table(mslists, averaged = FALSE) # peak lists for each feature
as.data.table(mslists, fGroups = fGroups) # add feature group information

as.data.table(formulas, countElements = c("C", "H")) # include C/H counts (e.g. for van Krevelen plots)
# add various information for organic matter characterization (common elemental
# counts/ratios, classifications etc)
as.data.table(formulas, OM = TRUE)

as.data.table(compounds, fGroups = fGroups) # add feature group information
as.data.table(compounds, fragments = TRUE) # include information of all annotated fragments

annotatedPeakList(formulas, index = 1, groupName = "M120_R268_30",
                  MSPeakLists = mslists, onlyAnnotated = TRUE) # only include annotated peaks
annotatedPeakList(compounds, index = 1, groupName = "M120_R268_30",
                  MSPeakLists = mslists, formulas = formulas) # include formula annotations

Filtering {#filtering}

During a non-target workflow it is not uncommon that some kind of data-cleanup is necessary. Datasets are often highly complex, which makes separating data of interest from the rest highly important. Furthermore, general cleanup typically improves the quality of the dataset, for instance by removing low scoring annotation results or features that are unlikely to be 'correct' (e.g. noise or present in blanks). For this reason patRoon supports many different filters that easily clean data produced during the workflow in a highly customizable way.

All major workflow objects (e.g. featureGroups, compounds, components etc.) support filtering operations by the filter() generic. This function takes the object to be filtered as first argument and any remaining arguments describe the desired filter options. The filter() generic function then returns the modified object back. Some examples are shown below.

# remove low intensity (<500) features
features <- filter(features, absMinIntensity = 500)

# remove features with intensities lower than 5 times the blank
fGroups <- filter(fGroups, blankThreshold = 5)

# only retain compounds with >1 explained MS/MS peaks
compounds <- filter(compounds, minExplainedPeaks = 1)

The following sections will provide a more detailed overview of available data filters.

NOTE Some other R packages (notably dplyr) also provide a filter() generic function. To use the filter() function from different packages you may need to explicitly specify which one to use in your script. This can be done by prefixing it with the package name, e.g. patRoon::filter(...), dplyr::filter(...) etc.

Features

There are many filters available for feature data:

Filter | Classes | Remarks ------------------------------------------ | --------------------------- | --------------------------------------------------------- absMinIntensity, relMinIntensity | features, featureGroups | Minimum intensity preAbsMinIntensity, preRelMinIntensity | featureGroups | Minimum intensity prior to other filtering (see below) retentionRange, mzRange, mzDefectRange, chromWidthRange | features, featureGroups | Filter by feature properties absMinAnalyses, relMinAnalyses | featureGroups | Minimum feature abundance in all analyses absMinReplicates, relMinReplicates | featureGroups | Minimum feature abundance in different replicates absMinFeatures, relMinFeatures | featureGroups | Only keep analyses with at least this amount of features absMinReplicateAbundance, relMinReplicateAbundance | featureGroups | Minimum feature abundance in a replicate group maxReplicateIntRSD | featureGroups | Maximum relative standard deviation of feature intensities in a replicate group. blankThreshold | featureGroups | Minimum intensity factor above blank intensity rGroups | featureGroups | Only keep (features of) these replicate groups results | featureGroups | Only keep feature groups with formula/compound annotations or componentization results

Application of filters to feature data is important for (environmental) non-target analysis. Especially blank and replicate filters (i.e. blankThreshold and absMinReplicateAbundance/relMinReplicateAbundance) are important filters and are highly recommended to always apply for cleaning up your dataset.

All filters are available for feature group data, whereas only a subset is available for feature objects. The main reason is that other filters need grouping of features between analyses. Regardless, in patRoon filtering feature data is less important, and typically only needed when the number of features are extremely large and direct grouping is undesired.

From the table above you can notice that many filters concern both absolute and relative data (i.e. as prefixed with abs and rel). When a relative filter is used the value is scaled between 0 and 1. For instance:

# remove features not present in at least half of the analyses within a replicate group
fGroups <- filter(fGroups, relMinReplicateAbundance = 0.5)

An advantage of relative filters is that you will not have to worry about the data size involved. For instance, in the above example the filter always takes half of the number of analyses within a replicate group, even when replicate groups have different number of analyses.

Note that multiple filters can be specified at once. Especially for feature group data the order of filtering may impact the final results, this is explained further in the reference manual (i.e. ?`feature-filtering`).

Some examples are shown below.

# filter features prior to grouping: remove any features eluting before first 2 minutes
fList <- filter(fList, retentionRange = c(120, Inf))

# common filters for feature groups
fGroups <- filter(fGroups,
                  absMinIntensity = 500, # remove features <500 intensity
                  relMinReplicateAbundance = 1, # features should be in all analysis of replicate groups
                  maxReplicateIntRSD = 0.75, # remove features with intensity RSD in replicates >75%
                  blankThreshold = 5, # remove features <5x intensity of (average) blank intensity
                  removeBlanks = TRUE) # remove blank analyses from object afterwards

# filter by feature properties
fGroups <- filter(mzDefectRange = c(0.8, 0.9),
                  chromWidthRange = c(6, 120))

# remove features not present in at least 3 analyses
fGroups <- filter(fGroups, absMinAnalyses = 3)

# remove features not present in at least 20% of all replicate groups
fGroups <- filter(fGroups, relMinReplicates = 0.2)

# only keep data present in replicate groups "repl1" and "repl2"
# all other features and analyses will be removed
fGroups <- filter(fGroups, rGroups = c("repl1", "repl2"))

# only keep feature groups with compound annotations
fGroups <- filter(fGroups, results = compounds)
# only keep feature groups with formula or compound annotations
fGroups <- filter(fGroups, results = list(formulas, compounds))

Suspect screening

Several additional filters are available for feature groups obtained with screenSuspects():

Filter | Classes | Remarks ----------------------------------------- | ------------------------ | --------------------------------------------------------- onlyHits | featureGroupsScreening | Only retain feature groups assigned to one or more suspects. selectHitsBy | featureGroupsScreening | Select the feature group that matches best with a suspect (in case there are multiple). selectBestFGroups | featureGroupsScreening | Select the suspect that matches best with a feature group (in case there are multiple). maxLevel, maxFormRank, maxCompRank | featureGroupsScreening | Only retain suspect hits with identification/annotation ranks below a threshold. minAnnSimForm, minAnnSimComp, minAnnSimBoth | featureGroupsScreening | Remove suspect hits with annotation similarity scores below this value. absMinFragMatches, relMinFragMatches | featureGroupsScreening | Only keep suspect hits with a minimum (relative) number of fragment matches from the suspect list.

NOTE: most filters only remove suspect hit results. Set onlyHits=TRUE to also remove any feature groups that end up without suspect hits.

The selectHitsBy and selectBestFGroups filters are useful to remove duplicate hits (one suspect assigned to multiple feature groups or multiple feature groups assigned to the same suspect, respectively). The former selects based on either best identification level (selectHitsBy="level") or highest mean intensity (selectHitsBy="intensity"). The selectBestFGroups can only be TRUE/FALSE and always selects by best identification level.

Some examples are shown below.

# only keep feature groups assigned to at least one suspect
fGroupsSusp <- filter(fGroupsSusp, onlyHits = TRUE)
# remove duplicate suspect to feature group matches and keep the best
fGroupsSusp <- filter(fGroupsSusp, selectHitsBy = "level")
# remove suspect hits with ID levels >3 and make sure no feature groups
# are present without suspect hits afterwards
fGroupsSusp <- filter(fGroupsSusp, maxLevel = 3, onlyHits = TRUE)

Annotation

There are various filters available for handling annotation data:

Filter | Classes | Remarks --------------------------------------------- | ------------------------------ | ------------------------------------------------- absMSIntThr, absMSMSIntThr, relMSIntThr, relMSMSIntThr | MSPeakLists | Minimum intensity of mass peaks topMSPeaks, topMSMSPeaks | MSPeakLists | Only keep most intense mass peaks withMSMS | MSPeakLists | Only keep results with MS/MS data minMSMSPeaks | MSPeakLists | Only keep an MS/MS peak list if it contains a minimum number of peaks (excluding the precursor peak) annotatedBy | MSPeakLists | Only keep MS/MS peaks that have formula or compound annotations minExplainedPeaks | formulas, compounds | Minimum number of annotated mass peaks elements, fragElements, lossElements | formulas, compounds | Restrain elemental composition topMost | formulas, compounds | Only keep highest ranked candidates minScore, minFragScore, minFormulaScore | compounds | Minimum compound scorings scoreLimits | formulas, compounds | Minimum/Maximum scorings OM | formulas, compounds | Only keep candidates with likely elemental composition found in organic matter

Several intensity related filters are available to clean-up MS peak list data. For instance, the topMSPeaks/topMSMSPeaks filters provide a simple way to remove noisy data by only retaining a defined number of most intense mass peaks. Note that none of these filters will remove the precursor mass peak of the feature itself.

The filters applicable to formula and compound annotation generally concern minimal scoring or chemical properties. The former is useful to remove unlikely candidates, whereas the second is useful to focus on certain study specific chemical properties (e.g. known neutral losses).

Common examples are shown below.

# intensity filtering
mslists <- filter(mslists,
                  absMSIntThr = 500, # minimum MS mass peak intensity of 500
                  relMSMSIntThr = 0.1) # minimum MS/MS mass peak intensity of 10%

# only retain 10 most intens mass peaks
# (feature mass is always retained)
mslists <- filter(mslists, topMSPeaks = 10)

# remove MS/MS peaks without compound annotations
mslists <- filter(mslists, annotatedBy = compounds)

# remove MS/MS peaks not annotated by either a formula or compound candidate
mslists <- filter(mslists, annotatedBy = list(formulas, compounds))

# only keep formulae with 1-10 sulphur or phosphorus elements
formulas <- filter(formulas, elements = c("S1-10", "P1-10"))

# only keep candidates with MS/MS fragments that contain 1-10 carbons and 0-2 oxygens
formulas <- filter(formulas, fragElements = "C1-10O0-2")

# only keep candidates with CO2 neutral loss
formulas <- filter(formulas, lossElements = "CO2")

# only keep the 15 highest ranked candidates with at least 1 annotated MS/MS peak
compounds <- filter(compounds, minExplainedPeaks = 1, topMost = 15)

# minimum in-silico score
compounds <- filter(compounds, minFragScore = 10)

# candidate should be referenced in at least 1 patent
# (only works if database lists number of patents, e.g. PubChem)
compounds <- filter(compounds,
                    scoreLimits = list(numberPatents = c(1, Inf))

NOTE As of patRoon 2.0 MS peak lists are not re-generated after a filtering operation (unless the reAverage parameter is explicity set to TRUE). The reason for this change is that re-averaging invalidates any formula/compound annotation data (e.g. used for plotting and reporting) that were generated prior to the filter operation.

Components

Finally several filters are available for components:

Filter | Remarks ---------------------------- | ----------------- size | Minimum component size adducts, isotopes | Filter features by adduct/istopes annotation rtIncrement, mzIncrement | Filter homologs by retention/mz increment range

Note that these filters are only applied if the components contain the data the filter works on. For instance, filtering by adducts will not affect components obtained from homologous series.

As before, some typical examples are shown below.

# only keep components with at least 4 features
componInt <- filter(componInt, minSize = 4)

# remove all features from components are not annotated as an adduct
componRC <- filter(componRC, adducts = TRUE)

# only keep protonated and sodium adducts
componRC <- filter(componRC, adducts = c("[M+H]+", "[M+Na]+"))

# remove all features not recognized as isotopes
componRC <- filter(componRC, isotopes = FALSE)

# only keep monoisotopic mass
componRC <- filter(componRC, isotopes = 0)

# min/max rt/mz increments for homologs
componNT <- filter(componNT, rtIncrement = c(10, 30),
                   mzIncrement = c(16, 50))

NOTE As mentioned before, components are still in a relative young development phase and results should always be verified!

Negation

All filters support negation: if enabled all specified filters will be executed in an opposite manner. Negation may not be so commonly used, but allows greater flexibility which is sometimes needed for advanced filtering steps. Furthermore, it is also useful to specifically isolate the data that otherwise would have been removed. Some examples are shown below.

# keep all features/analyses _not_ present from replicate groups "repl1" and "repl2"
fGroups <- filter(fGroups, rGroups = c("repl1", "repl2"), negate = TRUE)

# only retain features with a mass defect outside 0.8-0.9
fGroups <- filter(mzDefectRange = c(0.8, 0.9), negate = TRUE)

# remove duplicate suspect hits and only keep the _worst_ hit
fGroupsSusp <- filter(fGroupsSusp, selectHitsBy = "level", negate = TRUE)

# remove candidates with CO2 neutral loss
formulas <- filter(formulas, lossElements = "CO2", negate = TRUE)

# select 15 worst ranked candidates
compounds <- filter(compounds, topMost = 15, negate = TRUE)

# only keep components with <5 features
componInt <- filter(componInt, minSize = 5, negate = TRUE)

Subsetting {#subset}

The previous section discussed the filter() generic function to perform various data cleaning operations. A more generic way to select data is by subsetting: here you can manually specify which parts of an object should be retained. Subsetting is supported for all workflow objects and is performed by the R subset operator ("["). This operator either subsets by one or two arguments, which are referred to as the i and j arguments.

Class | Argument i | Argument j | Remarks --------------- | -------------- | -------------- | ------------------------------------------------ features | analyses | | featureGroups | analyses | feature groups | MSPeakLists | analyses | feature groups | peak lists for feature groups will be re-averaged when subset on analyses (by default) formulas | feature groups | | compounds | feature groups | | components | components | feature groups |

For objects that support two-dimensional subsetting (e.g. featureGroups, MSPeakLists), either the i or j argument is optional. Furthermore, unlike subsetting a data.frame, the position of i and j does not change when only one argument is specified:

df[1, 1] # subset data.frame by first row/column
df[1] # subset by first column
df[1, ] # subset by first row

fGroups[1, 1] # subset by first analysis/feature group
fGroups[, 1] # subset by first feature group (i.e. column)
fGroups[1] # subset by first analysis (i.e. row)

The subset operator allows three types of input:

When a logical vector is used as input it will be re-cycled if necessary. For instance, the following will select by the first, third, fifth, etc. analysis.

fGroups[c(TRUE, FALSE)]

In order to select by a character you will need to know the names for each element. These can, for instance, be obtained by the groupNames() (feature group names), analyses() (analysis names) and names() (names for components or feature groups for featureGroups objects) generic functions.

Some more examples of common subsetting operations are shown below.

# select first three analyses
fList[1:3]

# select first three analyses and first 500 feature groups
fGroups[1:3, 1:500]

# select all feature groups from first component
fGroupsNT <- fGroups[, componNT[[1]]$group]

# only keep feature groups with formula annotation results
fGroupsForms <- fGroups[, groupNames(formulas)]

# only keep feature groups with either formula or compound annotation results
fGroupsAnn <- fGroups[, union(groupNames(formulas), groupNames(compounds))]

# select first 15 components
components[1:15]

# select by name
components[c("CMP1", "CMP5")]

# only retain feature groups in components for which compound annotations are
# available
components[, groupNames(compounds)]

In addition, feature groups can also be subset by given replicate groups or annotation/componentization results (similar to filter()). Similarly, suspect screening results can also be subset by given suspect names.

# equal as filter(fGroups, rGroups = ...)
fGroups[rGroups = c("repl1", "repl2")]
# equal as filter(fGroups, results = ...)
fGroups[results = compounds]
# only keep feature groups assigned to given suspects
fGroupsSusp[suspects = c("1H-benzotriazole", "2-Hydroxyquinoline")]

NOTE As of patRoon 2.0 MS peak lists are not re-generated after a subsetting operation (unless the reAverage parameter is explicity set to TRUE). The reason for this change is that re-averaging invalidates any formula/compound annotation data (e.g. used for plotting and reporting) that were generated prior to the subset operation.

Prioritization workflow

An important use case of subsetting is prioritization of data. For instance, after statistical analysis only certain feature groups are deemed relevant for the rest of the workflow. A common prioritization workflow is illustrated below:

plotGV("
digraph Prioritization {
  graph [ rankdir = LR ]
  node [ shape = box,
         fixedsize = true,
         width = 2.3,
         height = 1,
         fontsize = 18,
         fillcolor = darkseagreen1,
         style = filled ]

    'Object conversion' -> 'Prioritization' -> Subsetting
}", height = 120, width = 500)

During the first step the workflow object is converted to a suitable format, most often using the as.data.frame() function. The converted data is then used as input for the prioritization strategy. Finally, these results are then used to select the data of interest in the original object.

A very simplified example of such a process is shown below.

featTab <- as.data.frame(fGroups, average = TRUE)

# prioritization: sort by (averaged) intensity of the "sample" replicate group
# (from high to low) and then obtain the feature group identifiers of the top 5.
featTab <- featTab[order(featTab$standard, decreasing = TRUE), ]
groupsOfInterest <- featTab$group[1:5]

# subset the original data
fGroups <- fGroups[, groupsOfInterest]

# fGroups now only contains the feature groups for which intensity values in the
# "sample" replicate group were in the top 5

Deleting data

The delete() generic function can be used to manually delete workflow data. This function is used internally within patRoon to implement filtering and subsetting operations, but may also be useful for advanced data processing.

Like the subset operator this function accepts a i and j parameter to specify which data should be operated on:

Class | Argument i | Argument j ----------------------- | ------------- | ------------- features | analysis | feature index featureGroups | analysis | feature group formulas, compounds | feature group | candidate index components | component | feature group

If i or j is not specified (NULL) then data is removed for the complete selection. Some examples are shown below:

# delete 2nd feature in analysis-1
fList <- delete(fList, i = "analysis-1", j = 2)
# delete first ten features in all analyses
fList <- delete(fList, i = NULL, j = 1:10)

# completely remove third/fourth analyses from feature groups
fGroups <- delete(fGroups, i = 3:4)
# delete specific feature group
fGroups <- delete(fGroups, j = "M120_R268_30")
# delete range of feature groups
fGroups <- delete(fGroups, j = 500:750)

# remove all results for a feature group
formulas <- delete(formulas, i = "M120_R268_30")

# remove top candidate for all feature groups
compounds <- delete(compounds, j = 1)

# remove a component
components <- delete(components, i = "CMP1")
# remove specific feature group from a component
components <- delete(components, i = "CMP1", j = "M120_R268_30")
# remove specific feature group from all components
components <- delete(components, j = "M120_R268_30")

The j parameter can also be a function: in this case it is called repeatedly on parts of the data to select what should be deleted. How the function is called and what it should return depends on the workflow data class:

Class | Called on every | First argument | Second argument | Return value ----------------------- | --------------- | -------------------------------- | ------------------ | ------------------------------------------- features | analysis | data.table with features | analysis name | Features indices (as integer or logical) featureGroups | feature group | vector with group intensities | feature group name | The analyses of the features to remove (as character, integer, logical) formulas, compounds | feature group | data.table with annotations | feature group name | Candidate indices (rows) components | component | data.table with the component | component name | The feature groups (as character, integer)

Some examples for this:

# remove features with intensities below 5000
fList <- delete(fList, j = function(f, ...) f$intensity <= 5E3)

# same, but for features in all feature groups from specific analyses
fGroups <- delete(i = 1:3, j = function(g, ...) g <= 5E3)

# remove formula candidates with high relative mass deviation
formulas <- delete(formulas, j = function(ft, ...) ft$error > 5)

Unique and overlapping features {#unOv}

Often an analysis batch is composed of different sample groups, such as different treatments, influent/effluent etc. In such scenarios it may be highly interesting to evaluate uniqueness or overlap between these samples. Furthermore, extracting overlapping or unique features is a simple but effective prioritization strategy.

The overlap() and unique() functions can be used to extract overlapping and unique features between replicate groups, respectively. Both functions return a subset of the given featureGroups object. An overview of their arguments is given below.

Argument | Function(s) | Remarks ------------ | ----------------------- | ---------------------------------------------------------------- which | unique(), overlap() | The replicate groups to compare. relativeTo | unique() | Only return unique features compared to these replicate groups (NULL for all). Replicate groups in which are ignored. outer | unique() | If TRUE then only return features which are also unique among the compared replicates groups. exclusive | overlap | Only keep features that only overlap between the compared replicate groups.

Some examples:

# only keep features uniquely present in replicate group "repl1"
fGroupsUn1 <- unique(fGroups, which = "repl1")
# only keep features in repl1/repl2 which are not in repl3
fGroupsUn2 <- unique(fGroups, which = c("repl1", "repl2"),
                     relativeTo = "repl3")
# only keep features that are only present in repl1 OR repl2
fGroupsUn3 <- unique(fGroups, which = c("repl1", "repl2"),
                     outer = TRUE)

# only keep features overlapping in repl1/repl2
fGroupsOv1 <- overlap(fGroups, which = c("repl1", "repl2"))
# only keep features overlapping in repl1/repl2 AND are not present in any other
# replicate group
fGroupsOv2 <- overlap(fGroups, which = c("repl1", "repl2"),
                      exclusive = TRUE)

In addition, several plotting functions are discussed in the visualization section that visualize overlap and uniqueness of features.

MS similarity {#specSim}

The spectral similarity is used to compare spectra from different features. For this purpose the spectrumSimilarity function can be used. This function operates on [MS peak lists], and accepts the following function arguments:

Argument | Remarks -------------------------- | --------------------------------------------------------------------------------------- MSPeakLists | The [MS peak lists] object from which peak lists data should be taken. groupName1, groupName2 | The name(s) of the first and second feature group(s) to compare analysis1, analysis2 | The analysis names of the data to be compared. Set this when feature data (instead of feature group data) should be compared. MSLevel | The MS level: 1 or 2 for MS and MS/MS, respectively. specSimParams | Parameters that define how similarities are calculated. NAToZero | If TRUE then NA values are converted to zeros. NA values are reported if a comparison cannot be made because of missing peak list data.

The specSimParams argument defines the parameters for similarity calculations. It is a list, and the default values are obtained with the getDefSpecSimParams() function:

getDefSpecSimParams()

The method field describes the calculation measure: this is either "cosine" or "jaccard".

The shift field is primarily useful when comparing MS/MS data and defines if and how a spectral shift should be performed prior to similarity calculation:

To override a default setting, simply pass it as an argument to getDefSpecSimParams:

getDefSpecSimParams(shift = "both")

For more details on the various similarity calculation parameters see the reference manual (?getDefSpecSimParams).

Some examples are shown below:

# similarity between MS spectra with default parameters
spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M137_R249_53")

# similarity between MS/MS spectra with default parameters
spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M192_R355_191",
                   MSLevel = 2)

# As above, with jaccard calculation
spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M192_R355_191",
                   MSLevel = 2, specSimParams = getDefSpecSimParams(method = "jaccard"))

# With shifting
spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M192_R355_191",
                   MSLevel = 2, specSimParams = getDefSpecSimParams(shift = "both"))

The spectrumSimilarity function can also be used to calculate multiple similarities. Simply specify multiple feature group names for the groupNameX parameters. Alternatively, if you want to compare the same set of feature groups with each other pass their names only as the groupName1 parameter:

# compare two pairs
spectrumSimilarity(mslists,
                   groupName1 = c("M120_R268_30", "M137_R249_53"),
                   groupName2 = c("M146_R309_68", "M192_R355_191"),
                   MSLevel = 2, specSimParams = getDefSpecSimParams(shift = "both"))

# compare all
spectrumSimilarity(mslists, groupName1 = groupNames(mslists),
                   MSLevel = 2, specSimParams = getDefSpecSimParams(shift = "both"))

Visualization

Features and annatation data {#vis_feat_ann}

Several generic functions are available to visualize feature and annotation data:

Generic | Classes | Remarks ----------------- | ---------------------------------------------------- | --------------------------------------------------------------- plot() | featureGroups, featureGroupsComparison | Scatter plot for retention and m/z values plotInt() | featureGroups | Intensity profiles across analyses plotChroms() | featureGroups, components | Plot extracted ion chromatograms (EICs) plotSpectrum() | MSPeakLists, formulas, compounds, components | Plots (annotated) spectra plotStructure() | compounds | Draws candidate structures plotScores() | formulas, compounds | Barplot with candidate scoring plotGraph() | componentsNT | Draws interactive graphs of linked homologous series

The most common plotting functions are plotChroms(), which plots chromatographic data for features, and plotSpectrum(), which will plot (annotated) spectra. An overview of their most important function arguments are shown below.

Argument | Generic | Remarks --------------------------------------------- | -------------------------------- | ------------------------------------------------------------------- retMin | plotChroms() | If TRUE plot retention times in minutes EICParams | plotChroms() | Advanced parameters to control the creation of extracted ion chromatograms (described below) showPeakArea, showFGroupRect | plotChroms() | Fill peak areas / draw rectangles around feature groups? title | plotChroms(), plotSpectrum() | Override plot title colourBy | plotChroms() | Colour individual feature groups ("fGroups") or replicate groups ("rGroups"). By default nothing is coloured ("none") showLegend | plotChroms() | Display a legend? (only if colourBy!="none") xlim, ylim | plotChroms(), plotSpectrum() | Override x/y axis ranges, i.e. to manually set plotting range. groupName, analysis, precursor, index | plotSpectrum() | What to plot. See examples below. MSLevel | plotSpectrum() | Whether to plot an MS or MS/MS spectrum (only MSPeakLists) formulas | plotSpectrum() | Whether formula annotations should be added (only compounds) plotStruct | plotSpectrum() | Whether the structure should be added to the plot (only compounds) mincex | plotSpectrum() | Minimum annotation font size (only formulas/compounds)

Note that we can use subsetting to select which feature data we want to plot, e.g.

plotChroms(fGroups[1:2]) # only plot EICs from first and second analyses.
plotChroms(fGroups[, 1]) # only plot all features of first group

The plotStructure() function will draw a chemical structure for a compound candidate. In addition, this function can draw the maximum common substructure (MCS) of multiple candidates in order to assess common structural features.

# structure for first candidate
plotStructure(compounds, index = 1, groupName = "M120_R268_30")
# MCS for first three candidates
plotStructure(compounds, index = 1:3, groupName = "M120_R268_30")

Some other common and less common plotting operations are shown below.

plot(fGroups) # simple scatter plot of retention and m/z values

plotChroms(fGroups) # plot EICs for all features
plotChroms(fGroups[, 1], # only plot all features of first group
           colourBy = "rGroup") # and mark them individually per replicate group
plotChroms(components, index = 7, fGroups = fGroups) # EICs from a component

plotSpectrum(mslists, "M120_R268_30") # non-annotated MS spectrum
plotSpectrum(mslists, "M120_R268_30", MSLevel = 2) # non-annotated MS/MS spectrum
# formula annotated spectrum
plotSpectrum(formulas, index = 1, groupName = "M120_R268_30",
             MSPeakLists = mslists)
# compound annotated spectrum, with added formula annotations
plotSpectrum(compounds, index = 1, groupName = "M120_R268_30", MSPeakLists = mslists,
             formulas = formulas, plotStruct = TRUE)
# custom intensity range (e.g. to zoom in)
plotSpectrum(compounds, index = 1, groupName = "M120_R268_30", MSPeakLists = mslists,
             ylim = c(0, 5000), plotStruct = FALSE)
plotSpectrum(components, index = 7) # component spectrum
# Inspect homologous series
plotGraph(componNT)

Extracted Ion Chromatogram parameters

The EICParams argument to plotChroms() is used to specify more advanced parameters for the creation of extracted ion chromatograms (EICs). Some parameters of interest:

Parameter | Description ----------------- | ------------------------------------------------- rtWindow | Expands the EIC retention time range +/- the feature peak width (in seconds). This is e.g. useful to zoom out. topMost | Only consider this amount of highest intensity features in a group. topMostByRGroup | If TRUE then the topMost parameter concerns the top most intense features in a replicate group (e.g. topMost=1 would draw the most intense feature for each replicate group). onlyPresent | Only create EICs for analyses where a feature was detected? Setting to FALSE is useful to inspect if a feature was 'missed'.

The parameters are configured by giving a named list to the EICParams argument. To obtain such a list with default settings, the getDefEICParams() function can be used:

getDefEICParams()

Any arguments specified to this function will alter the values of the returned parameter list. Some examples:

# investigate if any features were not detected in a feature group
plotChroms(fGroups[, 10], EICParams = getDefEICParams(onlyPresent = FALSE))

# get overview of all feature groups
plotChroms(fGroups,
           colourBy = "fGroup", # unique colour for each group
           EICParams = getDefEICParams(topMost = 1), # only most intense feature in each group
           showPeakArea = TRUE, # show integrated areas
           showFGroupRect = FALSE,
           showLegend = FALSE) # no legend (too busy for many feature groups)

The reference manual (?EICParams) gives a full detail on all parameters.

Overlapping and unique data {#visComp}

There are three functions that can be used to visualize overlap and uniqueness between data:

Generic | Classes ----------- | ------------------------------------------------------------------- plotVenn | featureGroups, featureGroupsComparison, formulas, compounds plotUpSet | featureGroups, featureGroupsComparison, formulas, compounds plotChord | featureGroups, featureGroupsComparison

The most simple comparison plot is a Venn diagram (i.e. plotVenn()). This function is especially useful for two or three-way comparisons. More complex comparisons are better visualized with [UpSet] diagrams (i.e. plotUpSet()). Finally, chord diagrams (i.e. plotChord()) provide visually pleasing diagrams to assess overlap between data.

These functions can either be used to compare feature data or different objects of the same type. The former is typically used to compare overlap or uniqueness between features in different replicate groups, whereas comparison between objects is useful to visualize differences in algorithmic output. Besides visualization, note that both operations can also be performed to modify or combine objects (see unique and overlapping features and algorithm consensus).

As usual, some examples are shown below.

fGroupsO <- fGroups
fGroups <- fGroupsRG
plotUpSet(fGroups) # compare replicate groups
plotVenn(fGroups, which = c("repl1", "repl2")) # compare some replicate groups
plotChord(fGroups, average = TRUE) # overlap between replicate groups
# compare with custom made groups
plotChord(fGroups, average = TRUE,
          outer = c(repl1 = "grp1", repl2 = "grp1", repl3 = "grp2", repl4 = "grp3"))

# compare GenForm and SIRIUS results
plotVenn(formsGF, formsSIR,
         labels = c("GF", "SIR")) # manual labeling
fGroups <- fGroupsO

MS similarity

The plotSpectrum function is also useful to visually compare (annotated) spectra. This works for MSPeakLists, formulas and compounds object data.

plotSpectrum(mslists, groupName = c("M120_R268_30", "M137_R249_53"), MSLevel = 2)
plotSpectrum(compounds, groupName = c("M120_R268_30", "M146_R309_68"), index = c(1, 1),
             MSPeakLists = mslists)

The specSimParams argument, which was discussed in MS similarity, can be used to configure the similarity calculation:

plotSpectrum(mslists, groupName = c("M120_R268_30", "M137_R249_53"), MSLevel = 2,
             specSimParams = getDefSpecSimParams(shift = "both"))

Hierarchical clustering results {#plotClust}

In patRoon hierarchical clustering is used for some componentization algorithms and to cluster candidate compounds with similar chemical structure (see compound clustering). The functions below can be used to visualize their results.

Generic | Classes | Remarks ------------------- | ---------------------------------------- | -------------------------------------------------- plot() | All | Plots a dendrogram plotInt() | componentsIntClust | Plots normalized intensity profiles in a cluster plotHeatMap() | componentsIntClust | Plots an heatmap plotSilhouettes() | componentsClust | Plot silhouette information to determine the cluster amount plotStructure() | compoundsCluster | Plots the maximum common substructure (MCS) of a cluster

plot(componInt) # dendrogram
plot(compsClust, groupName = "M120_R268_30") # dendrogram for clustered compounds
plotInt(componInt, index = 4) # intensities of 4th cluster
plotHeatMap(componInt) # plot heatmap
plotHeatMap(componInt, interactive = TRUE) # interactive heatmap (with zoom-in!)
plotSilhouettes(componInt, 5:20) # plot silhouettes (e.g. to obtain ideal cluster amount)

Generating EICs in DataAnalysis

If you have Bruker data and the DataAnalysis software installed, you can automatically add EIC data in a DataAnalysis session. The addDAEIC() will do this for a single m/z in one analysis, whereas the addAllDAEICs() function adds EICs for all features in a featureGroups object.

# add a single EIC with background subtraction
addDAEIC("mysample", "~/path/to/sample", mz = 120.1234, bgsubtr = TRUE)
# add TIC for MS/MS signal of precursor 120.1234 (value of mz is ignored for TICs)
addDAEIC("mysample", "~/path/to/sample", mz = 100, ctype = "TIC",
         mtype = "MSMS", fragpath = "120.1234", name = "MSMS 120")

addAllDAEICs(fGroups) # add EICs for all features
addAllDAEICs(fGroups[, 1:50]) # as usual, subsetting can be used for partial data

Interactively explore and review data {#intReview}

The checkFeatures and checkComponents functions start a graphical user interface (GUI) which allows you to interactively explore and review feature and components data, respectively.

checkFeatures(fGroups) # inspect features and feature groups
checkComponents(componCAM, fGroups) # inspect components

Both functions allow you to easily explore the data in an interactive way. Furthermore, these functions allow you to remove unwanted data. This is useful to remove for example features that are actually noise and feature groups that shouldn't be in the same component. To remove an unwanted feature, feature group or components, simply uncheck its 'keep' checkbox. The next step is to save the selections you made. A check session is a file that stores which data should be removed. Once the session file is saved the filter function can be used to actually remove the data:

fGroupsF <- filter(fGroups, checkFeaturesSession = TRUE)
componCAMF <- filter(componCAM, checkComponentsSession = TRUE)

If you saved the session and you re-launch the GUI it will restore the selections made earlier. The clearSession argument can be used to fully clear a session before starting the GUI, hence, all the data will be restored to their 'keep state'.

checkFeatures(fGroups, clearSession = TRUE) # start GUI with fresh session

It is also possible to use multiple different sessions. This is especially useful if you do not want to overwrite previous session data or want to inspect different objects. In this case the session file name should be specified:

checkFeatures(fGroups, "mysession.yml")
fGroupsF <- filter(fGroups, checkFeaturesSession = "mysession.yml")

The default session names are "checked-features.yml" and "checked-components.yml" for feature and component data, respectively.

The extension of session file names is .yml since the [YAML] file format is used. An advantage of this format is that it is easily readable and editable with a text editor.

Note that the session data is tied to the feature group names of your data. This means that, for instance, when you re-group your feature data after changing some parameters, the session data you prepared earlier cannot be used anymore. Since probably quite some manual work went into creating the session file, a special function is available to import a session that was made for previous data. This function tries its best to guess the new feature group name based on similarity of their retention times and m/z values.

checkFeatures(fGroups) # do manual inspection

fGroups <- groupFeatures(fList, ...) # re-group with different parameters

importCheckFeaturesSession("checked-features.yml", "checked-features-new.yml", fGroups)

checkFeatures(fGroups, session = "checked-features-new.yml") # inspect new data

Take care to monitor the messages that importCheckFeaturesSession may output, as it may be possible that some 'old' feature groups are not found or are matched by multiple candidates of the new dataset.

Some additional parameters exist to the functions described in this section. As usualy check the reference manual for more details (e.g. ?checkFeatures).

NOTE Although the GUI tools described here allow you to easily filter out results, it is highly recommended to first prioritize your data to avoid doing a lot of unneeded manual work.

Reporting {#report}

The previous sections showed various functionality to inspect and visualize results. An easy way to do this automatically is by using the reporting functionality of patRoon. There are currently two interfaces: a legacy interface that is described in the next subsection, and the modernized version discussed here.

The reports are generated by the report() function, which combines the data generated during the workflow. This function outputs an HTML file (other formats may follow in future versions), which can be opened with a regular web browser to interactively explore and visualize the data. The report combines chromatograms, mass spectra, tables with feature and annotation properties and many other useful ways to easily explore your data.

Which data is reported is controlled by the following function arguments:

Argument | Description -------------------------- | -------------------------------------------------------------------------------- fGroups | The featureGroups object that should be reported (mandatory). MSPeakLists | The MS peak lists object used for annotations (mandatory if formulas/compounds are specified). formulas, compounds | The formulas and compounds objects that should be used to report feature annotations. compsCluster | The result object from compound clustering. components | Any componentization results, e.g. with adduct annotations and from transformation product screening. TPs | Output from object from generated transformation products.

Most of these arguments are optional, and if not specified this part of the workflow is simply not reported. This also means that reporting can be performed at every stage during the workflow, which, for instance, can be useful to quickly inspect results when testing out various settings to generate workflow data. More advanced arguments to report() are discussed in the reference manual (?reporting).

Some examples:

report(fGroups) # only report feature groups
report(fGroups[, 1:50]) # same, but only first 50, e.g. to do a quick inspection

# include feature annotations
report(fGroups, MSPeakLists = mslists, formulas = formulas, compounds = compounds)

# TP screening
report(fGroups, MSPeakLists = mslists, formulas = formulas, compounds = compounds,
           components = componentsTP, TPs = TPs)

The report itself is primarily configured through a report settings file, which is an easily editable [YAML] file. The default file is as follows:

`r patRoon:::readAllFile(system.file("report", "settings.yml", package = "patRoon"))`

A detailed description for all the settings can be found in the reference manual (?reporting). The table below summarizes the most interesting options:

Parameter | Description ----------------------------------------- | ----------------------------------------------------------- general --> format | The output format. Currently only "html". general --> selfContained | If set (true) then the output will be a self contained .html file. Handy to share reports, but not recommended for large amounts of data. features --> chromatograms --> features | If enabled (true) then the report includes chromatograms of individual features. If set to all then also chromatograms are generated for analyses in which a feature was not detected. This is especially useful to inspect if features were 'missed' during feature detection or accidentally removed after a filter step. formulas/compounds --> topMost | Specifies the maximum number of top-ranked candidates to plot. Often it will take a considerable amount of time to report all candidates, hence, by default this is limited.

When the newProject tool is used to create a new patRoon project a template settings file (report.yml) is automatically created. Otherwise, this file can be generated with the genReportSettingsFile() function. Simply running this function without any arguments is enough:

genReportSettingsFile()

Legacy interface

The legacy interface was the default reporting interface for patRoon versions older than 2.2. The interface now mainly serves for backward compatibility reasons, but may still be useful since the new interface does not (yet) support all the formats from the legacy interface. The following three reporting functions are available:

Like the report() function described above, the arguments to these functions control which data will be reported. However, these functions do not use a report settings file, and all configuration happens through function arguments. Some common arguments are listed below; for a complete listing see the reference manual (?`reporting-legacy`).

Argument | Functions | Remarks ------------------------------------- | ----------------------------- | ---------------------------------------------------------- fGroups, formulas, compounds, formulas, components, compsCluster, TPs | All | Objects to plot. Only fGroups is mandatory. MSPeakLists | reportPDF(), reportHTML() | The MSPeakLists object that was used to generate annotation data. Only needs to be specified if formulas or compounds are reported. path | All | Directory path where report files will be stored ("report" by default). formulasTopMost, compoundsTopMost | reportPDF(), reportHTML() | Report no more than this amount of highest ranked candidates. selfContained | reportHTML() | Outputs to a single and self contained .html file. Handy to share reports, but not recommended for large amounts of data.

Some typical examples:

reportHTML(fGroups) # simple interactive report with feature data
# generate PDFs with feature and compound annotation data
reportPDF(fGroups, compounds = compounds, MSPeakLists = mslists)
reportCSV(fGroups, path = "myReport") # change destination path

# generate report with all workflow types and increase maximum number of
# compound candidates to top 10
reportHTML(fGroups, formulas = formulas, compounds = compounds,
           components = components, MSPeakLists = mslists,
           compsCluster = compsClust,
           compoundsTopMost = 10)



rickhelmus/patRoon documentation built on April 25, 2024, 8:15 a.m.