The previous chapter mainly discussed how to create workflow data. This chapter will discuss how to use the data.
Several generic functions exist that can be used to inspect data that is stored in a particular object (e.g. features, compounds etc):
Generic | Classes | Remarks
--------------------------------------------------- | ----------------------------- | ------------------------------------------
length()
| All | Returns the length of the object (e.g. number of features, compounds etc)
algorithm()
| All | Returns the name of the algorithm used to generate the object.
groupNames()
| All | Returns all the unique identitifiers (or names) of the feature groups for which this object contains results.
names()
| featureGroups
, components
| Returns names of the feature groups (similar to groupNames()
) or components
show()
| All | Prints general information.
"[["
/ "$"
operators | All | Extract general information, see below.
as.data.table()
/ as.data.frame()
| All | Convert data to a data.table
or data.frame
, see below.
analysisInfo()
, analyses()
, replicateGroups()
| features
, featureGroups
| Returns the analysis information, analyses or replicate groups for which this object contains data.
groupInfo()
| featureGroups
| Returns feature group information (m/z and retention time values).
screenInfo()
| featureGroupsScreening
| Returns information on hits from suspect screening.
componentInfo()
| components
| Returns information for all components.
annotatedPeakList()
| formulas
, compounds
| Returns a table with annotated mass peaks (see below).
The common R
extraction operators "[["
, "$"
can be used to obtain data for a particular feature groups, analysis etc:
# Feature table (only first columns for readability) fList[["standard-1"]][, 1:6] # Feature group intensities fGroups$M120_R268_30 fGroups[[1, "M120_R268_30"]] # only first analysis # obtains MS/MS peak list (feature group averaged data) mslists[["M120_R268_30"]]$MSMS # get all formula candidates for a feature group formulas[["M120_R268_30"]][, 1:7] # get all compound candidates for a feature group compounds[["M120_R268_30"]][, 1:4] # get a table with information of a component components[["CMP7"]][, 1:6]
A more sophisticated way to obtain data from a workflow object is to use as.data.table()
or as.data.frame()
. These functions will convert all information within the object to a table (data.table
or data.frame
) and allow various options to add extra information. An advantage is that this common data format can be used with many other functions within R
. The output is in a tidy format.
NOTE If you are not familiar with
data.table
and want to know more see [data.table]. Briefly, this is a more efficient and largely compatible alternative to the regulardata.frame
.NOTE The
as.data.frame()
methods defined inpatRoon
simply convert the results fromas.data.table()
, hence, both functions are equal in their usage and are defined for the same object classes.
Some typical examples are shown below.
# obtain table with all features (only first columns for readability) as.data.table(fList)[, 1:6] # Returns group info and intensity values for each feature group as.data.table(fGroups, average = TRUE) # average intensities for replicates # As above, but with suspect matches on separate rows and additional screening information # (select some columns to simplify the output below) as.data.table(fGroupsSusp, average = TRUE, collapseSuspects = NULL, onlyHits = TRUE)[, c("group", "susp_name", "susp_compRank", "susp_annSimBoth", "susp_estIDLevel")] # Returns all peak lists for each feature group as.data.table(mslists) # Returns all formula candidates for each feature group with scoring # information, neutral loss etc as.data.table(formulas)[, 1:6] # Returns all compound candidates for each feature group with scoring and other metadata as.data.table(compounds)[, 1:4] # Returns table with all components (including feature group info, annotations etc) as.data.table(components)[, 1:6]
Finally, the annotatedPeakList()
function is useful to inspect annotation results for a formula or compound candidate:
# formula annotations for the first formula candidate of feature group M137_R249_53 annotatedPeakList(formulas, index = 1, groupName = "M137_R249_53", MSPeakLists = mslists) # compound annotation for first candidate of feature group M137_R249_53 annotatedPeakList(compounds, index = 1, groupName = "M137_R249_53", MSPeakLists = mslists)
More advanced examples for these functions are shown below.
# Feature table, can also be accessed by numeric index fList[[1]] mslists[["standard-1", "M120_R268_30"]] # feature data (instead of feature group averaged) formulas[[1, "M120_R268_30"]] # feature data (if available, i.e. calculateFeatures=TRUE) components[["CMP1", 1]] # only for first feature group in component as.data.frame(fList) # classic data.frame format, works for all objects as.data.table(fGroups) # return non-averaged intensities (default) as.data.table(fGroups, features = TRUE) # include feature information as.data.table(mslists, averaged = FALSE) # peak lists for each feature as.data.table(mslists, fGroups = fGroups) # add feature group information as.data.table(formulas, countElements = c("C", "H")) # include C/H counts (e.g. for van Krevelen plots) # add various information for organic matter characterization (common elemental # counts/ratios, classifications etc) as.data.table(formulas, OM = TRUE) as.data.table(compounds, fGroups = fGroups) # add feature group information as.data.table(compounds, fragments = TRUE) # include information of all annotated fragments annotatedPeakList(formulas, index = 1, groupName = "M120_R268_30", MSPeakLists = mslists, onlyAnnotated = TRUE) # only include annotated peaks annotatedPeakList(compounds, index = 1, groupName = "M120_R268_30", MSPeakLists = mslists, formulas = formulas) # include formula annotations
During a non-target workflow it is not uncommon that some kind of data-cleanup is necessary. Datasets are often highly complex, which makes separating data of interest from the rest highly important. Furthermore, general cleanup typically improves the quality of the dataset, for instance by removing low scoring annotation results or features that are unlikely to be 'correct' (e.g. noise or present in blanks). For this reason patRoon
supports many different filters that easily clean data produced during the workflow in a highly customizable way.
All major workflow objects (e.g. featureGroups
, compounds
, components
etc.) support filtering operations by the filter()
generic. This function takes the object to be filtered as first argument and any remaining arguments describe the desired filter options. The filter()
generic function then returns the modified object back. Some examples are shown below.
# remove low intensity (<500) features features <- filter(features, absMinIntensity = 500) # remove features with intensities lower than 5 times the blank fGroups <- filter(fGroups, blankThreshold = 5) # only retain compounds with >1 explained MS/MS peaks compounds <- filter(compounds, minExplainedPeaks = 1)
The following sections will provide a more detailed overview of available data filters.
NOTE Some other
R
packages (notablydplyr
) also provide afilter()
generic function. To use thefilter()
function from different packages you may need to explicitly specify which one to use in your script. This can be done by prefixing it with the package name, e.g.patRoon::filter(...)
,dplyr::filter(...)
etc.
There are many filters available for feature data:
Filter | Classes | Remarks
------------------------------------------ | --------------------------- | ---------------------------------------------------------
absMinIntensity
, relMinIntensity
| features
, featureGroups
| Minimum intensity
preAbsMinIntensity
, preRelMinIntensity
| featureGroups
| Minimum intensity prior to other filtering (see below)
retentionRange
, mzRange
, mzDefectRange
, chromWidthRange
| features
, featureGroups
| Filter by feature properties
absMinAnalyses
, relMinAnalyses
| featureGroups
| Minimum feature abundance in all analyses
absMinReplicates
, relMinReplicates
| featureGroups
| Minimum feature abundance in different replicates
absMinFeatures
, relMinFeatures
| featureGroups
| Only keep analyses with at least this amount of features
absMinReplicateAbundance
, relMinReplicateAbundance
| featureGroups
| Minimum feature abundance in a replicate group
maxReplicateIntRSD
| featureGroups
| Maximum relative standard deviation of feature intensities in a replicate group.
blankThreshold
| featureGroups
| Minimum intensity factor above blank intensity
rGroups
| featureGroups
| Only keep (features of) these replicate groups
results
| featureGroups
| Only keep feature groups with formula/compound annotations or componentization results
Application of filters to feature data is important for (environmental) non-target analysis. Especially blank and replicate filters (i.e. blankThreshold
and absMinReplicateAbundance
/relMinReplicateAbundance
) are important filters and are highly recommended to always apply for cleaning up your dataset.
All filters are available for feature group data, whereas only a subset is available for feature objects. The main reason is that other filters need grouping of features between analyses. Regardless, in patRoon
filtering feature data is less important, and typically only needed when the number of features are extremely large and direct grouping is undesired.
From the table above you can notice that many filters concern both absolute and relative data (i.e. as prefixed with abs
and rel
). When a relative filter is used the value is scaled between 0 and 1. For instance:
# remove features not present in at least half of the analyses within a replicate group fGroups <- filter(fGroups, relMinReplicateAbundance = 0.5)
An advantage of relative filters is that you will not have to worry about the data size involved. For instance, in the above example the filter always takes half of the number of analyses within a replicate group, even when replicate groups have different number of analyses.
Note that multiple filters can be specified at once. Especially for feature group data the order of filtering may impact the final results, this is explained further in the reference manual (i.e. ?`feature-filtering`
).
Some examples are shown below.
# filter features prior to grouping: remove any features eluting before first 2 minutes fList <- filter(fList, retentionRange = c(120, Inf)) # common filters for feature groups fGroups <- filter(fGroups, absMinIntensity = 500, # remove features <500 intensity relMinReplicateAbundance = 1, # features should be in all analysis of replicate groups maxReplicateIntRSD = 0.75, # remove features with intensity RSD in replicates >75% blankThreshold = 5, # remove features <5x intensity of (average) blank intensity removeBlanks = TRUE) # remove blank analyses from object afterwards # filter by feature properties fGroups <- filter(mzDefectRange = c(0.8, 0.9), chromWidthRange = c(6, 120)) # remove features not present in at least 3 analyses fGroups <- filter(fGroups, absMinAnalyses = 3) # remove features not present in at least 20% of all replicate groups fGroups <- filter(fGroups, relMinReplicates = 0.2) # only keep data present in replicate groups "repl1" and "repl2" # all other features and analyses will be removed fGroups <- filter(fGroups, rGroups = c("repl1", "repl2")) # only keep feature groups with compound annotations fGroups <- filter(fGroups, results = compounds) # only keep feature groups with formula or compound annotations fGroups <- filter(fGroups, results = list(formulas, compounds))
Several additional filters are available for feature groups obtained with screenSuspects()
:
Filter | Classes | Remarks
----------------------------------------- | ------------------------ | ---------------------------------------------------------
onlyHits
| featureGroupsScreening
| Only retain feature groups assigned to one or more suspects.
selectHitsBy
| featureGroupsScreening
| Select the feature group that matches best with a suspect (in case there are multiple).
selectBestFGroups
| featureGroupsScreening
| Select the suspect that matches best with a feature group (in case there are multiple).
maxLevel
, maxFormRank
, maxCompRank
| featureGroupsScreening
| Only retain suspect hits with identification/annotation ranks below a threshold.
minAnnSimForm
, minAnnSimComp
, minAnnSimBoth
| featureGroupsScreening
| Remove suspect hits with annotation similarity scores below this value.
absMinFragMatches
, relMinFragMatches
| featureGroupsScreening
| Only keep suspect hits with a minimum (relative) number of fragment matches from the suspect list.
NOTE: most filters only remove suspect hit results. Set
onlyHits=TRUE
to also remove any feature groups that end up without suspect hits.
The selectHitsBy
and selectBestFGroups
filters are useful to remove duplicate hits, i.e. the same suspect assigned to multiple feature groups or multiple suspects assigned to the same feature group, respectively. The former selects based on either best identification level (selectHitsBy="level"
) or highest mean intensity (selectHitsBy="intensity"
). The selectBestFGroups
can only be TRUE
/FALSE
and always selects by best identification level.
Some examples are shown below.
# only keep feature groups assigned to at least one suspect fGroupsSusp <- filter(fGroupsSusp, onlyHits = TRUE) # remove duplicate suspect to feature group matches and keep the best fGroupsSusp <- filter(fGroupsSusp, selectHitsBy = "level") # remove suspect hits with ID levels >3 and make sure no feature groups # are present without suspect hits afterwards fGroupsSusp <- filter(fGroupsSusp, maxLevel = 3, onlyHits = TRUE)
There are various filters available for handling annotation data:
Filter | Classes | Remarks
--------------------------------------------- | ------------------------------ | -------------------------------------------------
absMSIntThr
, absMSMSIntThr
, relMSIntThr
, relMSMSIntThr
| MSPeakLists
| Minimum intensity of mass peaks
topMSPeaks
, topMSMSPeaks
| MSPeakLists
| Only keep most intense mass peaks
withMSMS
| MSPeakLists
| Only keep results with MS/MS data
minMSMSPeaks
| MSPeakLists
| Only keep an MS/MS peak list if it contains a minimum number of peaks (excluding the precursor peak)
annotatedBy
| MSPeakLists
| Only keep MS/MS peaks that have formula or compound annotations
minExplainedPeaks
| formulas
, compounds
| Minimum number of annotated mass peaks
elements
, fragElements
, lossElements
| formulas
, compounds
| Restrain elemental composition
topMost
| formulas
, compounds
| Only keep highest ranked candidates
minScore
, minFragScore
, minFormulaScore
| compounds
| Minimum compound scorings
scoreLimits
| formulas
, compounds
| Minimum/Maximum scorings
OM
| formulas
, compounds
| Only keep candidates with likely elemental composition found in organic matter
Several intensity related filters are available to clean-up MS peak list data. For instance, the topMSPeaks
/topMSMSPeaks
filters provide a simple way to remove noisy data by only retaining a defined number of most intense mass peaks. Note that none of these filters will remove the precursor mass peak of the feature itself.
The filters applicable to formula and compound annotation generally concern minimal scoring or chemical properties. The former is useful to remove unlikely candidates, whereas the second is useful to focus on certain study specific chemical properties (e.g. known neutral losses).
Common examples are shown below.
# intensity filtering mslists <- filter(mslists, absMSIntThr = 500, # minimum MS mass peak intensity of 500 relMSMSIntThr = 0.1) # minimum MS/MS mass peak intensity of 10% # only retain 10 most intens mass peaks # (feature mass is always retained) mslists <- filter(mslists, topMSPeaks = 10) # remove MS/MS peaks without compound annotations mslists <- filter(mslists, annotatedBy = compounds) # remove MS/MS peaks not annotated by either a formula or compound candidate mslists <- filter(mslists, annotatedBy = list(formulas, compounds)) # only keep formulae with 1-10 sulphur or phosphorus elements formulas <- filter(formulas, elements = c("S1-10", "P1-10")) # only keep candidates with MS/MS fragments that contain 1-10 carbons and 0-2 oxygens formulas <- filter(formulas, fragElements = "C1-10O0-2") # only keep candidates with CO2 neutral loss formulas <- filter(formulas, lossElements = "CO2") # only keep the 15 highest ranked candidates with at least 1 annotated MS/MS peak compounds <- filter(compounds, minExplainedPeaks = 1, topMost = 15) # minimum in-silico score compounds <- filter(compounds, minFragScore = 10) # candidate should be referenced in at least 1 patent # (only works if database lists number of patents, e.g. PubChem) compounds <- filter(compounds, scoreLimits = list(numberPatents = c(1, Inf))
NOTE As of
patRoon 2.0
MS peak lists are not re-generated after a filtering operation (unless thereAverage
parameter is explicity set toTRUE
). The reason for this change is that re-averaging invalidates any formula/compound annotation data (e.g. used for plotting and reporting) that were generated prior to the filter operation.
Finally several filters are available for components:
Filter | Remarks
---------------------------- | -----------------
size
| Minimum component size
adducts
, isotopes
| Filter features by adduct/istopes annotation
rtIncrement
, mzIncrement
| Filter homologs by retention/mz increment range
Note that these filters are only applied if the components contain the data the filter works on. For instance, filtering by adducts will not affect components obtained from homologous series.
As before, some typical examples are shown below.
# only keep components with at least 4 features componInt <- filter(componInt, minSize = 4) # remove all features from components are not annotated as an adduct componRC <- filter(componRC, adducts = TRUE) # only keep protonated and sodium adducts componRC <- filter(componRC, adducts = c("[M+H]+", "[M+Na]+")) # remove all features not recognized as isotopes componRC <- filter(componRC, isotopes = FALSE) # only keep monoisotopic mass componRC <- filter(componRC, isotopes = 0) # min/max rt/mz increments for homologs componNT <- filter(componNT, rtIncrement = c(10, 30), mzIncrement = c(16, 50))
NOTE As mentioned before, components are still in a relative young development phase and results should always be verified!
All filters support negation: if enabled all specified filters will be executed in an opposite manner. Negation may not be so commonly used, but allows greater flexibility which is sometimes needed for advanced filtering steps. Furthermore, it is also useful to specifically isolate the data that otherwise would have been removed. Some examples are shown below.
# keep all features/analyses _not_ present from replicate groups "repl1" and "repl2" fGroups <- filter(fGroups, rGroups = c("repl1", "repl2"), negate = TRUE) # only retain features with a mass defect outside 0.8-0.9 fGroups <- filter(mzDefectRange = c(0.8, 0.9), negate = TRUE) # remove duplicate suspect hits and only keep the _worst_ hit fGroupsSusp <- filter(fGroupsSusp, selectHitsBy = "level", negate = TRUE) # remove candidates with CO2 neutral loss formulas <- filter(formulas, lossElements = "CO2", negate = TRUE) # select 15 worst ranked candidates compounds <- filter(compounds, topMost = 15, negate = TRUE) # only keep components with <5 features componInt <- filter(componInt, minSize = 5, negate = TRUE)
The previous section discussed the filter()
generic function to perform various data cleaning operations. A more generic way to select data is by subsetting: here you can manually specify which parts of an object should be retained. Subsetting is supported for all workflow objects and is performed by the R subset operator ("["
). This operator either subsets by one or two arguments, which are referred to as the i
and j
arguments.
Class | Argument i
| Argument j
| Remarks
--------------- | -------------- | -------------- | ------------------------------------------------
features
| analyses | |
featureGroups
| analyses | feature groups |
MSPeakLists
| analyses | feature groups | peak lists for feature groups will be re-averaged when subset on analyses (by default)
formulas
| feature groups | |
compounds
| feature groups | |
components
| components | feature groups |
For objects that support two-dimensional subsetting (e.g. featureGroups
, MSPeakLists
), either the i
or j
argument is optional. Furthermore, unlike subsetting a data.frame
, the position of i
and j
does not change when only one argument is specified:
df[1, 1] # subset data.frame by first row/column df[1] # subset by first column df[1, ] # subset by first row fGroups[1, 1] # subset by first analysis/feature group fGroups[, 1] # subset by first feature group (i.e. column) fGroups[1] # subset by first analysis (i.e. row)
The subset operator allows three types of input:
TRUE
.When a logical vector is used as input it will be re-cycled if necessary. For instance, the following will select by the first, third, fifth, etc. analysis.
fGroups[c(TRUE, FALSE)]
In order to select by a character
you will need to know the names for each element. These can, for instance, be obtained by the groupNames()
(feature group names), analyses()
(analysis names) and names()
(names for components or feature groups for featureGroups
objects) generic functions.
Some more examples of common subsetting operations are shown below.
# select first three analyses fList[1:3] # select first three analyses and first 500 feature groups fGroups[1:3, 1:500] # select all feature groups from first component fGroupsNT <- fGroups[, componNT[[1]]$group] # only keep feature groups with formula annotation results fGroupsForms <- fGroups[, groupNames(formulas)] # only keep feature groups with either formula or compound annotation results fGroupsAnn <- fGroups[, union(groupNames(formulas), groupNames(compounds))] # select first 15 components components[1:15] # select by name components[c("CMP1", "CMP5")] # only retain feature groups in components for which compound annotations are # available components[, groupNames(compounds)]
In addition, feature groups can also be subset by given replicate groups or annotation/componentization results (similar to filter()
). Similarly, suspect screening results can also be subset by given suspect names.
# equal as filter(fGroups, rGroups = ...) fGroups[rGroups = c("repl1", "repl2")] # equal as filter(fGroups, results = ...) fGroups[results = compounds] # only keep feature groups assigned to given suspects fGroupsSusp[suspects = c("1H-benzotriazole", "2-Hydroxyquinoline")]
NOTE As of
patRoon 2.0
MS peak lists are not re-generated after a subsetting operation (unless thereAverage
parameter is explicity set toTRUE
). The reason for this change is that re-averaging invalidates any formula/compound annotation data (e.g. used for plotting and reporting) that were generated prior to the subset operation.
An important use case of subsetting is prioritization of data. For instance, after statistical analysis only certain feature groups are deemed relevant for the rest of the workflow. A common prioritization workflow is illustrated below:
plotGV(" digraph Prioritization { graph [ rankdir = LR ] node [ shape = box, fixedsize = true, width = 2.3, height = 1, fontsize = 18, fillcolor = darkseagreen1, style = filled ] 'Object conversion' -> 'Prioritization' -> Subsetting }", height = 120, width = 500)
During the first step the workflow object is converted to a suitable format, most often using the as.data.frame()
function. The converted data is then used as input for the prioritization strategy. Finally, these results are then used to select the data of interest in the original object.
A very simplified example of such a process is shown below.
featTab <- as.data.frame(fGroups, average = TRUE) # prioritization: sort by (averaged) intensity of the "sample" replicate group # (from high to low) and then obtain the feature group identifiers of the top 5. featTab <- featTab[order(featTab$standard, decreasing = TRUE), ] groupsOfInterest <- featTab$group[1:5] # subset the original data fGroups <- fGroups[, groupsOfInterest] # fGroups now only contains the feature groups for which intensity values in the # "sample" replicate group were in the top 5
The delete()
generic function can be used to manually delete workflow data. This function is used internally within patRoon
to implement filtering and subsetting operations, but may also be useful for advanced data processing.
Like the subset operator this function accepts a i
and j
parameter to specify which data should be operated on:
Class | Argument i
| Argument j
----------------------- | ------------- | -------------
features
| analysis | feature index
featureGroups
| analysis | feature group
formulas
, compounds
| feature group | candidate index
components
| component | feature group
If i
or j
is not specified (NULL
) then data is removed for the complete selection. Some examples are shown below:
# delete 2nd feature in analysis-1 fList <- delete(fList, i = "analysis-1", j = 2) # delete first ten features in all analyses fList <- delete(fList, i = NULL, j = 1:10) # completely remove third/fourth analyses from feature groups fGroups <- delete(fGroups, i = 3:4) # delete specific feature group fGroups <- delete(fGroups, j = "M120_R268_30") # delete range of feature groups fGroups <- delete(fGroups, j = 500:750) # remove all results for a feature group formulas <- delete(formulas, i = "M120_R268_30") # remove top candidate for all feature groups compounds <- delete(compounds, j = 1) # remove a component components <- delete(components, i = "CMP1") # remove specific feature group from a component components <- delete(components, i = "CMP1", j = "M120_R268_30") # remove specific feature group from all components components <- delete(components, j = "M120_R268_30")
The j
parameter can also be a function: in this case it is called repeatedly on parts of the data to select what should be deleted. How the function is called and what it should return depends on the workflow data class:
Class | Called on every | First argument | Second argument | Return value
----------------------- | --------------- | -------------------------------- | ------------------ | -------------------------------------------
features
| analysis | data.table
with features | analysis name | Features indices (as integer
or logical
)
featureGroups
| feature group | vector with group intensities | feature group name | The analyses of the features to remove (as character
, integer
, logical
)
formulas
, compounds
| feature group | data.table
with annotations | feature group name | Candidate indices (rows)
components
| component | data.table
with the component | component name | The feature groups (as character
, integer
)
Some examples for this:
# remove features with intensities below 5000 fList <- delete(fList, j = function(f, ...) f$intensity <= 5E3) # same, but for features in all feature groups from specific analyses fGroups <- delete(i = 1:3, j = function(g, ...) g <= 5E3) # remove formula candidates with high relative mass deviation formulas <- delete(formulas, j = function(ft, ...) ft$error > 5)
Often an analysis batch is composed of different sample groups, such as different treatments, influent/effluent etc. In such scenarios it may be highly interesting to evaluate uniqueness or overlap between these samples. Furthermore, extracting overlapping or unique features is a simple but effective prioritization strategy.
The overlap()
and unique()
functions can be used to extract overlapping and unique features between replicate groups, respectively. Both functions return a subset of the given featureGroups
object. An overview of their arguments is given below.
Argument | Function(s) | Remarks
------------ | ----------------------- | ----------------------------------------------------------------
which
| unique()
, overlap()
| The replicate groups to compare.
relativeTo
| unique()
| Only return unique features compared to these replicate groups (NULL
for all). Replicate groups in which
are ignored.
outer
| unique()
| If TRUE
then only return features which are also unique among the compared replicates groups.
exclusive
| overlap
| Only keep features that only overlap between the compared replicate groups.
Some examples:
# only keep features uniquely present in replicate group "repl1" fGroupsUn1 <- unique(fGroups, which = "repl1") # only keep features in repl1/repl2 which are not in repl3 fGroupsUn2 <- unique(fGroups, which = c("repl1", "repl2"), relativeTo = "repl3") # only keep features that are only present in repl1 OR repl2 fGroupsUn3 <- unique(fGroups, which = c("repl1", "repl2"), outer = TRUE) # only keep features overlapping in repl1/repl2 fGroupsOv1 <- overlap(fGroups, which = c("repl1", "repl2")) # only keep features overlapping in repl1/repl2 AND are not present in any other # replicate group fGroupsOv2 <- overlap(fGroups, which = c("repl1", "repl2"), exclusive = TRUE)
In addition, several plotting functions are discussed in the visualization section that visualize overlap and uniqueness of features.
The spectral similarity is used to compare spectra from different features. For this purpose the spectrumSimilarity
function can be used. This function operates on [MS peak lists], and accepts the following function arguments:
Argument | Remarks
-------------------------- | ---------------------------------------------------------------------------------------
MSPeakLists
| The [MS peak lists] object from which peak lists data should be taken.
groupName1
, groupName2
| The name(s) of the first and second feature group(s) to compare
analysis1
, analysis2
| The analysis names of the data to be compared. Set this when feature data (instead of feature group data) should be compared.
MSLevel
| The MS level: 1
or 2
for MS and MS/MS, respectively.
specSimParams
| Parameters that define how similarities are calculated.
NAToZero
| If TRUE
then NA
values are converted to zeros. NA
values are reported if a comparison cannot be made because of missing peak list data.
The specSimParams
argument defines the parameters for similarity calculations. It is a list
, and the default values are obtained with the getDefSpecSimParams()
function:
getDefSpecSimParams()
The method
field describes the calculation measure: this is either "cosine"
or "jaccard"
.
The shift
field is primarily useful when comparing MS/MS data and defines if and how a spectral shift should be performed prior to similarity calculation:
"none"
: The default, no shifting is performed."precursor"
The mass difference between the precursor mass of both spectra (i.e. the feature mass) is first calculated. This difference is then subtracted from each of the mass peaks of the second spectrum. This shifting increases similarity if the MS fragmentation process itself occurs similarly (i.e. if both features show similar neutral losses)."both
" This combines both shifting methods: first peaks are aligned that have the same mass, then the precursor
strategy is applied for the remaining mass peaks. This shifting method yields higher similarities if either fragment masses or neutral losses are similar.To override a default setting, simply pass it as an argument to getDefSpecSimParams
:
getDefSpecSimParams(shift = "both")
For more details on the various similarity calculation parameters see the reference manual (?getDefSpecSimParams
).
Some examples are shown below:
# similarity between MS spectra with default parameters spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M137_R249_53") # similarity between MS/MS spectra with default parameters spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M192_R355_191", MSLevel = 2) # As above, with jaccard calculation spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M192_R355_191", MSLevel = 2, specSimParams = getDefSpecSimParams(method = "jaccard")) # With shifting spectrumSimilarity(mslists, groupName1 = "M120_R268_30", groupName2 = "M192_R355_191", MSLevel = 2, specSimParams = getDefSpecSimParams(shift = "both"))
The spectrumSimilarity
function can also be used to calculate multiple similarities. Simply specify multiple feature group names for the groupNameX
parameters. Alternatively, if you want to compare the same set of feature groups with each other pass their names only as the groupName1
parameter:
# compare two pairs spectrumSimilarity(mslists, groupName1 = c("M120_R268_30", "M137_R249_53"), groupName2 = c("M146_R309_68", "M192_R355_191"), MSLevel = 2, specSimParams = getDefSpecSimParams(shift = "both")) # compare all spectrumSimilarity(mslists, groupName1 = groupNames(mslists), MSLevel = 2, specSimParams = getDefSpecSimParams(shift = "both"))
Several generic functions are available to visualize feature and annotation data:
Generic | Classes | Remarks
----------------- | ---------------------------------------------------- | ---------------------------------------------------------------
plot()
| featureGroups
, featureGroupsComparison
| Scatter plot for retention and m/z values
plotInt()
| featureGroups
| Intensity profiles across analyses
plotChroms()
| featureGroups
, components
| Plot extracted ion chromatograms (EICs)
plotSpectrum()
| MSPeakLists
, formulas
, compounds
, components
| Plots (annotated) spectra
plotStructure()
| compounds
| Draws candidate structures
plotScores()
| formulas
, compounds
| Barplot with candidate scoring
plotGraph()
| componentsNT
| Draws interactive graphs of linked homologous series
The most common plotting functions are plotChroms()
, which plots chromatographic data for features, and plotSpectrum()
, which will plot (annotated) spectra. An overview of their most important function arguments are shown below.
Argument | Generic | Remarks
--------------------------------------------- | -------------------------------- | -------------------------------------------------------------------
retMin
| plotChroms()
| If TRUE
plot retention times in minutes
EICParams
| plotChroms()
| Advanced parameters to control the creation of extracted ion chromatograms (described below)
showPeakArea
, showFGroupRect
| plotChroms()
| Fill peak areas / draw rectangles around feature groups?
title
| plotChroms()
, plotSpectrum()
| Override plot title
colourBy
| plotChroms()
| Colour individual feature groups ("fGroups"
) or replicate groups ("rGroups"
). By default nothing is coloured ("none"
)
showLegend
| plotChroms()
| Display a legend? (only if colourBy!="none"
)
xlim
, ylim
| plotChroms()
, plotSpectrum()
| Override x/y axis ranges, i.e. to manually set plotting range.
groupName
, analysis
, precursor
, index
| plotSpectrum()
| What to plot. See examples below.
MSLevel
| plotSpectrum()
| Whether to plot an MS or MS/MS spectrum (only MSPeakLists
)
formulas
| plotSpectrum()
| Whether formula
annotations should be added (only compounds
)
plotStruct
| plotSpectrum()
| Whether the structure should be added to the plot (only compounds
)
mincex
| plotSpectrum()
| Minimum annotation font size (only formulas
/compounds
)
Note that we can use subsetting to select which feature data we want to plot, e.g.
plotChroms(fGroups[1:2]) # only plot EICs from first and second analyses. plotChroms(fGroups[, 1]) # only plot all features of first group
The plotStructure()
function will draw a chemical structure for a compound candidate. In addition, this function can draw the maximum common substructure (MCS) of multiple candidates in order to assess common structural features.
# structure for first candidate plotStructure(compounds, index = 1, groupName = "M120_R268_30") # MCS for first three candidates plotStructure(compounds, index = 1:3, groupName = "M120_R268_30")
Some other common and less common plotting operations are shown below.
plot(fGroups) # simple scatter plot of retention and m/z values plotChroms(fGroups) # plot EICs for all features plotChroms(fGroups[, 1], # only plot all features of first group colourBy = "rGroup") # and mark them individually per replicate group plotChroms(components, index = 7, fGroups = fGroups) # EICs from a component plotSpectrum(mslists, "M120_R268_30") # non-annotated MS spectrum plotSpectrum(mslists, "M120_R268_30", MSLevel = 2) # non-annotated MS/MS spectrum # formula annotated spectrum plotSpectrum(formulas, index = 1, groupName = "M120_R268_30", MSPeakLists = mslists) # compound annotated spectrum, with added formula annotations plotSpectrum(compounds, index = 1, groupName = "M120_R268_30", MSPeakLists = mslists, formulas = formulas, plotStruct = TRUE) # custom intensity range (e.g. to zoom in) plotSpectrum(compounds, index = 1, groupName = "M120_R268_30", MSPeakLists = mslists, ylim = c(0, 5000), plotStruct = FALSE) plotSpectrum(components, index = 7) # component spectrum
# Inspect homologous series plotGraph(componNT)
The EICParams
argument to plotChroms()
is used to specify more advanced parameters for the creation of extracted ion chromatograms (EICs). Some parameters of interest:
Parameter | Description
----------------- | -------------------------------------------------
rtWindow
| Expands the EIC retention time range +/- the feature peak width (in seconds). This is e.g. useful to zoom out.
topMost
| Only consider this amount of highest intensity features in a group.
topMostByRGroup
| If TRUE
then the topMost
parameter concerns the top most intense features in a replicate group (e.g. topMost=1
would draw the most intense feature for each replicate group).
onlyPresent
| Only create EICs for analyses where a feature was detected? Setting to FALSE
is useful to inspect if a feature was 'missed'.
The parameters are configured by giving a named list
to the EICParams
argument. To obtain such a list
with default settings, the getDefEICParams()
function can be used:
getDefEICParams()
Any arguments specified to this function will alter the values of the returned parameter list. Some examples:
# investigate if any features were not detected in a feature group plotChroms(fGroups[, 10], EICParams = getDefEICParams(onlyPresent = FALSE)) # get overview of all feature groups plotChroms(fGroups, colourBy = "fGroup", # unique colour for each group EICParams = getDefEICParams(topMost = 1), # only most intense feature in each group showPeakArea = TRUE, # show integrated areas showFGroupRect = FALSE, showLegend = FALSE) # no legend (too busy for many feature groups)
The reference manual (?EICParams
) gives a full detail on all parameters.
There are three functions that can be used to visualize overlap and uniqueness between data:
Generic | Classes
----------- | -------------------------------------------------------------------
plotVenn
| featureGroups
, featureGroupsComparison
, formulas
, compounds
plotUpSet
| featureGroups
, featureGroupsComparison
, formulas
, compounds
plotChord
| featureGroups
, featureGroupsComparison
The most simple comparison plot is a Venn diagram (i.e. plotVenn()
). This function is especially useful for two or three-way comparisons. More complex comparisons are better visualized with [UpSet] diagrams (i.e. plotUpSet()
). Finally, chord diagrams (i.e. plotChord()
) provide visually pleasing diagrams to assess overlap between data.
These functions can either be used to compare feature data or different objects of the same type. The former is typically used to compare overlap or uniqueness between features in different replicate groups, whereas comparison between objects is useful to visualize differences in algorithmic output. Besides visualization, note that both operations can also be performed to modify or combine objects (see unique and overlapping features and algorithm consensus).
As usual, some examples are shown below.
fGroupsO <- fGroups fGroups <- fGroupsRG
plotUpSet(fGroups) # compare replicate groups plotVenn(fGroups, which = c("repl1", "repl2")) # compare some replicate groups plotChord(fGroups, average = TRUE) # overlap between replicate groups # compare with custom made groups plotChord(fGroups, average = TRUE, outer = c(repl1 = "grp1", repl2 = "grp1", repl3 = "grp2", repl4 = "grp3")) # compare GenForm and SIRIUS results plotVenn(formsGF, formsSIR, labels = c("GF", "SIR")) # manual labeling
fGroups <- fGroupsO
The plotSpectrum
function is also useful to visually compare (annotated) spectra. This works for MSPeakLists
, formulas
and compounds
object data.
plotSpectrum(mslists, groupName = c("M120_R268_30", "M137_R249_53"), MSLevel = 2) plotSpectrum(compounds, groupName = c("M120_R268_30", "M146_R309_68"), index = c(1, 1), MSPeakLists = mslists)
The specSimParams
argument, which was discussed in MS similarity, can be used to configure the similarity calculation:
plotSpectrum(mslists, groupName = c("M120_R268_30", "M137_R249_53"), MSLevel = 2, specSimParams = getDefSpecSimParams(shift = "both"))
In patRoon
hierarchical clustering is used for some componentization algorithms and to cluster candidate compounds with similar chemical structure (see compound clustering). The functions below can be used to visualize their results.
Generic | Classes | Remarks
------------------- | ---------------------------------------- | --------------------------------------------------
plot()
| All | Plots a dendrogram
plotInt()
| componentsIntClust
| Plots normalized intensity profiles in a cluster
plotHeatMap()
| componentsIntClust
| Plots an heatmap
plotSilhouettes()
| componentsClust
| Plot silhouette information to determine the cluster amount
plotStructure()
| compoundsCluster
| Plots the maximum common substructure (MCS) of a cluster
plot(componInt) # dendrogram plot(compsClust, groupName = "M120_R268_30") # dendrogram for clustered compounds plotInt(componInt, index = 4) # intensities of 4th cluster
plotHeatMap(componInt) # plot heatmap
plotHeatMap(componInt, interactive = TRUE) # interactive heatmap (with zoom-in!) plotSilhouettes(componInt, 5:20) # plot silhouettes (e.g. to obtain ideal cluster amount)
If you have Bruker data and the DataAnalysis software installed, you can automatically add EIC data in a DataAnalysis session. The addDAEIC()
will do this for a single m/z in one analysis, whereas the addAllDAEICs()
function adds EICs for all features in a featureGroups
object.
# add a single EIC with background subtraction addDAEIC("mysample", "~/path/to/sample", mz = 120.1234, bgsubtr = TRUE) # add TIC for MS/MS signal of precursor 120.1234 (value of mz is ignored for TICs) addDAEIC("mysample", "~/path/to/sample", mz = 100, ctype = "TIC", mtype = "MSMS", fragpath = "120.1234", name = "MSMS 120") addAllDAEICs(fGroups) # add EICs for all features addAllDAEICs(fGroups[, 1:50]) # as usual, subsetting can be used for partial data
The checkFeatures
and checkComponents
functions start a graphical user interface (GUI) which allows you to interactively explore and review feature and components data, respectively.
checkFeatures(fGroups) # inspect features and feature groups checkComponents(componCAM, fGroups) # inspect components
Both functions allow you to easily explore the data in an interactive way. Furthermore, these functions allow you to remove unwanted data. This is useful to remove for example features that are actually noise and feature groups that shouldn't be in the same component. To remove an unwanted feature, feature group or components, simply uncheck its 'keep' checkbox. The next step is to save the selections you made. A check session is a file that stores which data should be removed. Once the session file is saved the filter
function can be used to actually remove the data:
fGroupsF <- filter(fGroups, checkFeaturesSession = TRUE) componCAMF <- filter(componCAM, checkComponentsSession = TRUE)
If you saved the session and you re-launch the GUI it will restore the selections made earlier. The clearSession
argument can be used to fully clear a session before starting the GUI, hence, all the data will be restored to their 'keep state'.
checkFeatures(fGroups, clearSession = TRUE) # start GUI with fresh session
It is also possible to use multiple different sessions. This is especially useful if you do not want to overwrite previous session data or want to inspect different objects. In this case the session file name should be specified:
checkFeatures(fGroups, "mysession.yml") fGroupsF <- filter(fGroups, checkFeaturesSession = "mysession.yml")
The default session names are "checked-features.yml"
and "checked-components.yml"
for feature and component data, respectively.
The extension of session file names is .yml
since the [YAML] file format is used. An advantage of this format is that it is easily readable and editable with a text editor.
Note that the session data is tied to the feature group names of your data. This means that, for instance, when you re-group your feature data after changing some parameters, the session data you prepared earlier cannot be used anymore. Since probably quite some manual work went into creating the session file, a special function is available to import a session that was made for previous data. This function tries its best to guess the new feature group name based on similarity of their retention times and m/z values.
checkFeatures(fGroups) # do manual inspection fGroups <- groupFeatures(fList, ...) # re-group with different parameters importCheckFeaturesSession("checked-features.yml", "checked-features-new.yml", fGroups) checkFeatures(fGroups, session = "checked-features-new.yml") # inspect new data
Take care to monitor the messages that importCheckFeaturesSession
may output, as it may be possible that some 'old' feature groups are not found or are matched by multiple candidates of the new dataset.
Some additional parameters exist to the functions described in this section. As usualy check the reference manual for more details (e.g. ?checkFeatures
).
NOTE Although the GUI tools described here allow you to easily filter out results, it is highly recommended to first prioritize your data to avoid doing a lot of unneeded manual work.
The previous sections showed various functionality to inspect and visualize results. An easy way to do this automatically is by using the reporting functionality of patRoon
. There are currently two interfaces: a legacy interface that is described in the next subsection, and the modernized version discussed here.
The reports are generated by the report()
function, which combines the data generated during the workflow. This function outputs an HTML
file (other formats may follow in future versions), which can be opened with a regular web browser to interactively explore and visualize the data. The report combines chromatograms, mass spectra, tables with feature and annotation properties and many other useful ways to easily explore your data.
Which data is reported is controlled by the following function arguments:
Argument | Description
-------------------------- | --------------------------------------------------------------------------------
fGroups
| The featureGroups
object that should be reported (mandatory).
MSPeakLists
| The MS peak lists object used for annotations (mandatory if formulas
/compounds
are specified).
formulas
, compounds
| The formulas
and compounds
objects that should be used to report feature annotations.
compsCluster
| The result object from compound clustering.
components
| Any componentization results, e.g. with adduct annotations and from transformation product screening.
TPs
| Output from object from generated transformation products.
Most of these arguments are optional, and if not specified this part of the workflow is simply not reported. This also means that reporting can be performed at every stage during the workflow, which, for instance, can be useful to quickly inspect results when testing out various settings to generate workflow data. More advanced arguments to report()
are discussed in the reference manual (?reporting
).
Some examples:
report(fGroups) # only report feature groups report(fGroups[, 1:50]) # same, but only first 50, e.g. to do a quick inspection # include feature annotations report(fGroups, MSPeakLists = mslists, formulas = formulas, compounds = compounds) # TP screening report(fGroups, MSPeakLists = mslists, formulas = formulas, compounds = compounds, components = componentsTP, TPs = TPs)
The report itself is primarily configured through a report settings file, which is an easily editable [YAML] file. The default file is as follows:
`r patRoon:::readAllFile(system.file("report", "settings.yml", package = "patRoon"))`
A detailed description for all the settings can be found in the reference manual (?reporting
). The table below summarizes the most interesting options:
Parameter | Description
----------------------------------------- | -----------------------------------------------------------
general --> format
| The output format. Currently only "html"
.
general --> selfContained
| If set (true
) then the output will be a self contained .html
file. Handy to share reports, but not recommended for large amounts of data.
features --> chromatograms --> features
| If enabled (true
) then the report includes chromatograms of individual features. If set to all
then also chromatograms are generated for analyses in which a feature was not detected. This is especially useful to inspect if features were 'missed' during feature detection or accidentally removed after a filter step.
formulas/compounds --> topMost
| Specifies the maximum number of top-ranked candidates to plot. Often it will take a considerable amount of time to report all candidates, hence, by default this is limited.
When the newProject tool is used to create a new patRoon
project a template settings file (report.yml
) is automatically created. Otherwise, this file can be generated with the genReportSettingsFile()
function. Simply running this function without any arguments is enough:
genReportSettingsFile()
The legacy interface was the default reporting interface for patRoon
versions older than 2.2
. The interface now mainly serves for backward compatibility reasons, but may still be useful since the new interface does not (yet) support all the formats from the legacy interface. The following three reporting functions are available:
reportCSV()
: exports workflow data to comma-separated value (csv) filesreportPDF()
: generates simple reports by plotting workflow data in portable document files (PDFs)reportHTML()
: generates interactive and easily explorable reportsLike the report()
function described above, the arguments to these functions control which data will be reported. However, these functions do not use a report settings file, and all configuration happens through function arguments. Some common arguments are listed below; for a complete listing see the reference manual (?`reporting-legacy`
).
Argument | Functions | Remarks
------------------------------------- | ----------------------------- | ----------------------------------------------------------
fGroups
, formulas
, compounds
, formulas
, components
, compsCluster
, TPs
| All | Objects to plot. Only fGroups
is mandatory.
MSPeakLists
| reportPDF()
, reportHTML()
| The MSPeakLists
object that was used to generate annotation data. Only needs to be specified if formulas
or compounds
are reported.
path
| All | Directory path where report files will be stored ("report"
by default).
formulasTopMost
, compoundsTopMost
| reportPDF()
, reportHTML()
| Report no more than this amount of highest ranked candidates.
selfContained
| reportHTML()
| Outputs to a single and self contained .html
file. Handy to share reports, but not recommended for large amounts of data.
Some typical examples:
reportHTML(fGroups) # simple interactive report with feature data # generate PDFs with feature and compound annotation data reportPDF(fGroups, compounds = compounds, MSPeakLists = mslists) reportCSV(fGroups, path = "myReport") # change destination path # generate report with all workflow types and increase maximum number of # compound candidates to top 10 reportHTML(fGroups, formulas = formulas, compounds = compounds, components = components, MSPeakLists = mslists, compsCluster = compsClust, compoundsTopMost = 10)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.