addProcessing: Data manipulation and analysis methods

applyProcessingR Documentation

Data manipulation and analysis methods

Description

Various data analysis functions are available for Spectra objects. These can be categorized into functions that either return a Spectra object (with the manipulated data) and functions that directly return the result from the calculation. For the former category, the data manipulations are cached in the result object's processing queue and only exectuted on-the-fly when the respective data gets extracted from the Spectra (see section The processing queue for more information).

For the second category, the calculations are directly executed and the result, usually one value per spectrum, returned. Generally, to reduce memory demand, a chunk-wise processing of the data is performed.

Usage

applyProcessing(
  object,
  f = processingChunkFactor(object),
  BPPARAM = bpparam(),
  ...
)

processingLog(x)

scalePeaks(x, by = sum, msLevel. = uniqueMsLevels(x))

## S4 method for signature 'Spectra'
addProcessing(object, FUN, ..., spectraVariables = character())

## S4 method for signature 'Spectra'
bin(
  x,
  binSize = 1L,
  breaks = NULL,
  msLevel. = uniqueMsLevels(x),
  FUN = sum,
  zero.rm = TRUE
)

## S4 method for signature 'Spectra'
containsMz(
  object,
  mz = numeric(),
  tolerance = 0,
  ppm = 20,
  which = c("any", "all"),
  BPPARAM = bpparam()
)

## S4 method for signature 'Spectra'
containsNeutralLoss(
  object,
  neutralLoss = 0,
  tolerance = 0,
  ppm = 20,
  BPPARAM = bpparam()
)

## S4 method for signature 'Spectra'
entropy(object, normalized = TRUE)

## S4 method for signature 'ANY'
entropy(object, ...)

## S4 method for signature 'Spectra'
pickPeaks(
  object,
  halfWindowSize = 2L,
  method = c("MAD", "SuperSmoother"),
  snr = 0,
  k = 0L,
  descending = FALSE,
  threshold = 0,
  msLevel. = uniqueMsLevels(object),
  ...
)

## S4 method for signature 'Spectra'
replaceIntensitiesBelow(
  object,
  threshold = min,
  value = 0,
  msLevel. = uniqueMsLevels(object)
)

## S4 method for signature 'Spectra'
reset(object, ...)

## S4 method for signature 'Spectra'
smooth(
  x,
  halfWindowSize = 2L,
  method = c("MovingAverage", "WeightedMovingAverage", "SavitzkyGolay"),
  msLevel. = uniqueMsLevels(x),
  ...
)

## S4 method for signature 'Spectra'
spectrapply(
  object,
  FUN,
  ...,
  chunkSize = integer(),
  f = factor(),
  BPPARAM = SerialParam()
)

Arguments

object

A Spectra object.

f

For spectrapply() and applyProcessing(): factor defining how object should be splitted for eventual parallel processing. Defaults to factor() for spectrapply() hence the object is not splitted while it defaults to f = processingChunkSize(object) for applyProcessing() splitting thus the object by default into chunks depending on processingChunkSize().

BPPARAM

Parallel setup configuration. See BiocParallel::bpparam() for more information. This is passed directly to the backendInitialize() method of the MsBackend. See also processingChunkSize() for additional information on parallel processing.

...

Additional arguments passed to internal and downstream functions.

x

A Spectra.

by

For scalePeaks(): function to calculate a single numeric from intensity values of a spectrum by which all intensities (of that spectrum) should be divided by. The default by = sum will divide intensities of each spectrum by the sum of intensities of that spectrum.

msLevel.

integer defining the MS level(s) of the spectra to which the function should be applied (defaults to all MS levels of object.

FUN

For addProcessing(): function to be applied to the peak matrix of each spectrum in object. For bin(): function to aggregate intensity values of peaks falling into the same bin. Defaults to FUN = sum thus summing up intensities. For spectrapply() and chunkapply(): function to be applied to each individual or each chunk of Spectra.

spectraVariables

For addProcessing(): character with additional spectra variables that should be passed along to the function defined with FUN. See function description for details.

binSize

For bin(): numeric(1) defining the size for the m/z bins. Defaults to binSize = 1.

breaks

For bin(): numeric defining the m/z breakpoints between bins.

zero.rm

For bin(): logical(1) indicating whether to remove bins with zero intensity. Defaults to TRUE, meaning the function will discard bins created with an intensity of 0 to enhance memory efficiency.

mz

For containsMz(): numeric with the m/z value(s) of the mass peaks to check.

tolerance

For containsMz() and neutralLoss(): numeric(1) allowing to define a constant maximal accepted difference between m/z values for peaks to be matched.

ppm

For containsMz() and neutralLoss(): numeric(1) defining a relative, m/z-dependent, maximal accepted difference between m/z values for peaks to be matched.

which

For containsMz(): either "any" or "all" defining whether any (the default) or all provided mz have to be present in the spectrum.

neutralLoss

for containsNeutralLoss(): numeric(1) defining the value which should be subtracted from the spectrum's precursor m/z.

normalized

for entropy(): logical(1) whether the normalized entropy should be calculated (default). See also MsCoreUtils::nentropy() for details.

halfWindowSize

For pickPeaks(): integer(1), used in the identification of the mass peaks: a local maximum has to be the maximum in the window from (i - halfWindowSize):(i + halfWindowSize). For smooth(): integer(1), used in the smoothing algorithm, the window reaches from (i - halfWindowSize):(i + halfWindowSize).

method

For pickPeaks(): character(1), the noise estimators that should be used, currently the the Median Absolute Deviation (method = "MAD") and Friedman's Super Smoother (method = "SuperSmoother") are supported. For smooth(): character(1), the smoothing function that should be used, currently, the Moving-Average- (method = "MovingAverage"), Weighted-Moving-Average- (⁠method = "WeightedMovingAverage")⁠, Savitzky-Golay-Smoothing (method = "SavitzkyGolay") are supported.

snr

For pickPeaks(): double(1) defining the Signal-to-Noise-Ratio. The intensity of a local maximum has to be higher than snr * noise to be considered as peak.

k

For pickPeaks(): integer(1), number of values left and right of the peak that should be considered in the weighted mean calculation.

descending

For pickPeaks(): logical, if TRUE just values betwee the nearest valleys around the peak centroids are used.

threshold

For pickPeaks(): a numeric(1) defining the proportion of the maximal peak intensity. Only values above the threshold are used for the weighted mean calculation. For replaceIntensitiesBelow(): a numeric(1) defining the threshold or a function to calculate the threshold for each spectrum on its intensity values. Defaults to threshold = min.

value

For replaceIntensitiesBelow(): numeric(1) defining the value with which intensities should be replaced with.

chunkSize

For spectrapply(): size of the chunks into which the Spectra should be split. This parameter overrides parameters f and BPPARAM.

Value

See the documentation of the individual functions for a description of the return value.

Data analysis methods returning a Spectra

The methods listed here return a Spectra object as a result.

  • addProcessing(): adds an arbitrary function that should be applied to the peaks matrix of every spectrum in object. The function (can be passed with parameter FUN) is expected to take a peaks matrix as input and to return a peaks matrix. A peaks matrix is a numeric matrix with two columns, the first containing the m/z values of the peaks and the second the corresponding intensities. The function has to have ... in its definition. Additional arguments can be passed with .... With parameter spectraVariables it is possible to define additional spectra variables from object that should be passed to the function FUN. These will be passed by their name (e.g. specifying spectraVariables = "precursorMz" will pass the spectra's precursor m/z as a parameter named precursorMz to the function. The only exception is the spectra's MS level, these will be passed to the function as a parameter called spectrumMsLevel (i.e. with spectraVariables = "msLevel" the MS levels of each spectrum will be submitted to the function as a parameter called spectrumMsLevel). Examples are provided in the package vignette.

  • bin(): aggregates individual spectra into discrete (m/z) bins. Binning is performed only on spectra of the specified MS level(s) (parameter msLevel, by default all MS levels of x). The bins can be defined with parameter breaks which by default are equally sized bins, with size being defined by parameter binSize, from the minimal to the maximal m/z of all spectra (of MS level msLevel) within x. The same bins are used for all spectra in x. All intensity values for peaks falling into the same bin are aggregated using the function provided with parameter FUN (defaults to FUN = sum, i.e. all intensities are summed up). Note that the binning operation is applied to the peak data on-the-fly upon data access and it is possible to revert the operation with the reset() function (see description of reset() below).

  • countIdentifications: counts the number of identifications each scan has led to. See countIdentifications() for more details.

  • pickPeaks(): picks peaks on individual spectra using a moving window-based approach (window size = 2 * halfWindowSize). For noisy spectra there are currently two different noise estimators available, the Median Absolute Deviation (method = "MAD") and Friedman's Super Smoother (method = "SuperSmoother"), as implemented in the MsCoreUtils::noise(). The method supports also to optionally refine the m/z value of the identified centroids by considering data points that belong (most likely) to the same mass peak. Therefore the m/z value is calculated as an intensity weighted average of the m/z values within the peak region. The peak region is defined as the m/z values (and their respective intensities) of the 2 * k closest signals to the centroid or the closest valleys (descending = TRUE) in the 2 * k region. For the latter the k has to be chosen general larger. See MsCoreUtils::refineCentroids() for details. If the ratio of the signal to the highest intensity of the peak is below threshold it will be ignored for the weighted average.

  • replaceIntensitiesBelow(): replaces intensities below a specified threshold with the provided value. Parameter threshold can be either a single numeric value or a function which is applied to all non-NA intensities of each spectrum to determine a threshold value for each spectrum. The default is threshold = min which replaces all values which are <= the minimum intensity in a spectrum with value (the default for value is 0). Note that the function specified with threshold is expected to have a parameter na.rm since na.rm = TRUE will be passed to the function. If the spectrum is in profile mode, ranges of successive non-0 peaks <= threshold are set to 0. Parameter msLevel. allows to apply this to only spectra of certain MS level(s).

  • scalePeaks(): scales intensities of peaks within each spectrum depending on parameter by. With by = sum (the default) peak intensities are divided by the sum of peak intensities within each spectrum. The sum of intensities is thus 1 for each spectrum after scaling. Parameter msLevel. allows to apply the scaling of spectra of a certain MS level. By default (msLevel. = uniqueMsLevels(x)) intensities for all spectra will be scaled.

  • smooth(): smooths individual spectra using a moving window-based approach (window size = 2 * halfWindowSize). Currently, the Moving-Average- (method = "MovingAverage"), Weighted-Moving-Average- (⁠method = "WeightedMovingAverage")⁠, weights depending on the distance of the center and calculated 1/2^(-halfWindowSize:halfWindowSize)) and Savitzky-Golay-Smoothing (method = "SavitzkyGolay") are supported. For details how to choose the correct halfWindowSize please see MsCoreUtils::smooth().

Data analysis methods returning the result from the calculation

The functions listed in this section return immediately the result from the calculation. To reduce memory demand (and allow parallel processing) the calculations a chunk-wise processing is generally performed.

  • chunkapply(): apply an arbitrary function to chunks of spectra. See chunkapply() for details and examples.

  • containsMz(): checks for each of the spectra whether they contain mass peaks with an m/z equal to mz (given acceptable difference as defined by parameters tolerance and ppm - see MsCoreUtils::common() for details). Parameter which allows to define whether any (which = "any", the default) or all (which = "all") of the mz have to match. The function returns NA if mz is of length 0 or is NA.

  • containsNeutralLoss(): checks for each spectrum in object if it has a peak with an m/z value equal to its precursor m/z - neutralLoss (given acceptable difference as defined by parameters tolerance and ppm). Returns NA for MS1 spectra (or spectra without a precursor m/z).

  • entropy(): calculates the entropy of each spectra based on the metrics suggested by Li et al. (https://doi.org/10.1038/s41592-021-01331-z). See also MsCoreUtils::nentropy() in the MsCoreUtils package for details.

  • estimatePrecursorIntensity(): defines the precursor intensities for MS2 spectra using the intensity of the matching MS1 peak from the closest MS1 spectrum (i.e. the last MS1 spectrum measured before the respective MS2 spectrum). With method = "interpolation" it is also possible to calculate the precursor intensity based on an interpolation of intensity values (and retention times) of the matching MS1 peaks from the previous and next MS1 spectrum. See estimatePrecursorIntensity() for examples and more details.

  • estimatePrecursorMz(): for DDA data: allows to estimate a fragment spectra's precursor m/z based on the reported precursor m/z and the data from the previous MS1 spectrum. See estimatePrecursorMz() for details.

  • neutralLoss(): calculates neutral loss spectra for fragment spectra. See neutralLoss() for detailed documentation.

  • spectrapply(): applies a given function to each individual spectrum or sets of a Spectra object. By default, the Spectra is split into individual spectra (i.e. Spectra of length 1) and the function FUN is applied to each of them. An alternative splitting can be defined with parameter f. Parameters for FUN can be passed using .... The returned result and its order depend on the function FUN and how object is split (hence on f, if provided). Parallel processing is supported and can be configured with parameter BPPARAM, is however only suggested for computational intense FUN. As an alternative to the (eventual parallel) processing of the full Spectra, spectrapply() supports also a chunk-wise processing. For this, parameter chunkSize needs to be specified. object is then split into chunks of size chunkSize which are then (stepwise) processed by FUN. This guarantees a lower memory demand (especially for on-disk backends) since only the data for one chunk needs to be loaded into memory in each iteration. Note that by specifying chunkSize, parameters f and BPPARAM will be ignored. See also chunkapply() above or examples below for details on chunk-wise processing.

The processing queue

Operations that modify mass peak data, i.e. the m/z and intensity values of a Spectra are generally not applied immediately to the data but are cached within the object's processing queue. These operations are then applied to the data only upon request, for example when m/z and/or intensity values are extracted. This lazy execution guarantees that the same functionality can be applied to any Spectra object, regardless of the type of backend that is used. Thus, data manipulation operations can also be applied to data that is read only. As a side effect, this enables also to undo operations using the reset() function.

Functions related to the processing queue are:

  • applyProcessing(): for Spectra objects that use a writeable backend only: apply all steps from the lazy processing queue to the peak data and write it back to the data storage. Parameter f allows to specify how object should be split for parallel processing. This should either be equal to the dataStorage, or f = rep(1, length(object)) to disable parallel processing alltogether. Other partitionings might result in errors (especially if a MsBackendHdf5Peaks backend is used).

  • processingLog(): returns a character vector with the processing log messages.

  • reset(): restores the data to its original state (as much as possible): removes any processing steps from the lazy processing queue and calls reset() on the backend which, depending on the backend, can also undo e.g. data filtering operations. Note that a ⁠reset*(⁠ call after applyProcessing() will not have any effect. See examples below for more information.

Author(s)

Sebastian Gibb, Johannes Rainer, Laurent Gatto, Philippine Louail, Nir Shahaf, Mar Garcia-Aloy

See Also

  • compareSpectra() for calculation of spectra similarity scores.

  • processingChunkSize() for information on parallel and chunk-wise data processing.

  • Spectra for a general description of the Spectra object.

Examples


## Load a `Spectra` object with LC-MS/MS data.
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
    package = "msdata")
sps_dda <- Spectra(fl)
sps_dda


##  --------  FUNCTIONS RETURNING A SPECTRA  --------

## Replace peak intensities below 40 with a value of 1
sps_mod <- replaceIntensitiesBelow(sps_dda, threshold = 20, value = 1)
sps_mod

## Get the intensities of the first spectrum before and after the
## operation
intensity(sps_dda[1])
intensity(sps_mod[1])

## Remove all peaks with an intensity below 5.
sps_mod <- filterIntensity(sps_dda, intensity = c(5, Inf))

intensity(sps_mod)

## In addition it is possible to pass a function to `filterIntensity()`: in
## the example below we want to keep only peaks that have an intensity which
## is larger than one third of the maximal peak intensity in that spectrum.
keep_peaks <- function(x, prop = 3) {
    x > max(x, na.rm = TRUE) / prop
}
sps_mod <- filterIntensity(sps_dda, intensity = keep_peaks)
intensity(sps_mod)

## We can also change the proportion by simply passing the `prop` parameter
## to the function. To keep only peaks that have an intensity which is
## larger than half of the maximum intensity:
sps_mod <- filterIntensity(sps_dda, intensity = keep_peaks, prop = 2)
intensity(sps_mod)

## With the `scalePeaks()` function we can alternatively scale the
## intensities of mass peaks per spectrum to relative intensities. This
## is specifically useful for fragment (MS2) spectra. We below thus
## scale the intensities per spectrum by the total sum of intensities
## (such that the sum of all intensities per spectrum is 1).
## Below we scale the intensities of all MS2 spectra in our data set.
sps_mod <- scalePeaks(sps_dda, msLevel = 2L)

## MS1 spectra were not affected
sps_mod |>
    filterMsLevel(1L) |>
    intensity()

## Intensities of MS2 spectra were scaled
sps_mod |>
    filterMsLevel(2L) |>
    intensity()

## Since data manipulation operations are by default not directly applied to
## the data but only cached in the internal processing queue, it is also
## possible to remove these data manipulations with the `reset()` function:
tmp <- reset(sps_mod)
tmp
lengths(sps_dda) |> head()
lengths(sps_mod) |> head()
lengths(tmp) |> head()

## Data manipulation operations cached in the processing queue can also be
## applied to the mass peaks data with the `applyProcessing()` function, if
## the `Spectra` uses a backend that supports that (i.e. allows replacing
## the mass peaks data). Below we first change the backend to a
## `MsBackendMemory()` and then use the `applyProcessing()` to modify the
## mass peaks data
sps_dda <- setBackend(sps_dda, MsBackendMemory())
sps_mod <- filterIntensity(sps_dda, intensity = c(5, Inf))
sps_mod <- applyProcessing(sps_mod)
sps_mod

## While we can't *undo* this filtering operation now using the `reset()`
## function, accessing the data would now be faster, because the operation
## does no longer to be applied to the original data before returning to the
## user.


##  --------  FUNCTIONS RETURNING THE RESULT  --------

## With the `spectrapply()` function it is possible to apply an
## arbitrary function to each spectrum in a Spectra.
## In the example below we calculate the mean intensity for each spectrum
## in a subset of the sciex_im data. Note that we can access all variables
## of each individual spectrum either with the `$` operator or the
## corresponding method.
res <- spectrapply(sps_dda[1:20], FUN = function(x) mean(x$intensity[[1]]))
head(res)

## As an alternative, applying a function `FUN` to a `Spectra` can be
## performed *chunk-wise*. The advantage of this is, that only the data for
## one chunk at a time needs to be loaded into memory reducing the memory
## demand. This type of processing can be performed by specifying the size
## of the chunks (i.e. number of spectra per chunk) with the `chunkSize`
## parameter
spectrapply(sps_dda[1:20], lengths, chunkSize = 5L)

## Precursor intensity estimation. Some manufacturers don't report the
## precursor intensity for MS2 spectra:
sps_dda |>
    filterMsLevel(2L) |>
    precursorIntensity()

## This intensity can however be estimated from the previously measured
## MS1 scan with the `estimatePrecursorIntensity()` function:
pi <- estimatePrecursorIntensity(sps_dda)

## This function returned the result as a `numeric` vector with one
## value per spectrum:
pi

## We can replace the precursor intensity values of the originating
## object:
sps_dda$precursorIntensity <- pi
sps_dda |>
    filterMsLevel(2L) |>
    precursorIntensity()


rformassspectrometry/Spectra documentation built on Dec. 19, 2024, 1:05 p.m.