binMS: Consolidate mass spectrometry observations


Description

Combines mass spectrometry observations that are believed to belong to the same underlying compound into a single observation. A mass spectrometer may produce multiple reads for a single compound; binMS attempts to recover the underlying compounds through a binning procedure, described in more detail in Details.

Usage

binMS(mass_spec, mtoz, charge, mass = NULL, time_peak_reten,
  ms_inten = NULL, time_range, mass_range, charge_range, mtoz_diff, time_diff)

Arguments

mass_spec

Either a matrix or data.frame. This object must contain mass spectrometry abundances, and may optionally contain mass-to-charge values, charge state information, or additional extraneous variables. The mass spectrometry data is expected to be in a form with each column corresponding to a variable and each row corresponding to a mass-to-charge level.

For example, suppose that a collection of mass spectrometry intensity observations has provided data for 50 fractions across 20,000 mass-to-charge values. Then the input for mass_spec should be a matrix or data.frame with 20,000 rows and 50 or more columns. The additional columns beyond the 50 containing the mass spectrometry intensities can be the mass-to-charge data, the charge data, or other extraneous variables (the extraneous variables will be discarded when constructing the msDat object).

mtoz

A vector of either length 1 or length equal to the number of mass-to-charge values for which mass spectrometry data was collected, and which helps identify the mass-to-charge values for this data in one of several ways.

One way to provide the information is to provide a numeric vector where each entry provides the mass-to-charge value for a corresponding row of mass spectrometry data. Then the k-th entry of the vector would provide the mass-to-charge value for the k-th row of the mass spectrometry data.

A second way is to provide a single number which specifies the column index in the matrix or data.frame provided as the argument for the mass_spec parameter, such that this column contains the mass-to-charge information.

A third way is to provide a single character string which gives the column name in the matrix or data.frame provided as the argument for the mass_spec parameter, such that this column contains the mass-to-charge information. Partial matching is supported.
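The three ways of specifying a variable can be illustrated with a small base R sketch. The helper resolve_column and the toy data frame below are hypothetical and are not part of PepSAVIms; the sketch assumes a data.frame input and only shows how a full-length vector, a column index, and a partially matched column name can all refer to the same values.

```r
# Hypothetical helper (not part of PepSAVIms): resolve a column specification
# given as a full-length vector, a column index, or a (partial) column name.
resolve_column <- function(dat, spec) {
  if (length(spec) == nrow(dat) && nrow(dat) > 1) {
    return(as.numeric(spec))                 # values supplied directly
  }
  if (is.numeric(spec)) {
    return(dat[[spec]])                      # single column index
  }
  idx <- pmatch(spec, names(dat))            # partial name matching
  if (is.na(idx)) stop("column name is missing or ambiguous: ", spec)
  dat[[idx]]
}

# Toy data; the column names are assumptions made for this illustration
toy <- data.frame(mtoz = c(500.1, 620.4, 731.9), charge = c(2, 3, 2))

by_vector <- resolve_column(toy, c(500.1, 620.4, 731.9))
by_index  <- resolve_column(toy, 1)
by_name   <- resolve_column(toy, "mt")       # partial match for "mtoz"
```

All three calls yield the same numeric vector of mass-to-charge values.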

charge

The information for the charge parameter can be provided in the same manner as for the mass-to-charge values.

mass

The information for the mass need not be provided, as it can be derived using the mass-to-charge and charge information; in this case the parameter should be given its default, i.e. NULL. If however the information for mass is already included in the dataset in hand, then providing it to the function will be slightly more efficient than re-performing the calculations. The information for the mass parameter can be provided in the same manner as for the mass-to-charge values.

time_peak_reten

The information for the time_peak_reten parameter can be provided in the same manner as for the mass-to-charge and other information; this parameter specifies the time at which the peak retention level of the compound was achieved.

ms_inten

Either NULL or a vector of mode character or mode numeric specifying which of the variables in the argument to mass_spec are to be retained as the mass spectrometry intensity data. If NULL, then the entirety of the data in mass_spec, after removing the variables that are specified by the other arguments, is taken to be the mass spectrometry intensity data. If it is a numeric vector, then the entries should provide the column indices for the region of interest in the argument for mass_spec. If it is a character vector, then the entries should uniquely specify the region of interest through partial string matching.

time_range

A length-2 numeric vector specifying the lower bound and upper bound (inclusive) of allowed peak retention time occurrence for an observation to be included in the consolidation process.

mass_range

A length-2 numeric vector specifying the lower bound and upper bound (inclusive) of allowed mass for an observation to be included in the consolidation process.

charge_range

A length-2 numeric vector specifying the lower bound and upper bound (inclusive) of allowed electrical charge state for an observation to be included in the consolidation process.

mtoz_diff

A single numerical value such that any two observations with a larger absolute difference between their mass-to-charge values are considered to have originated from different underlying compounds. Two observations with a smaller absolute difference between their mass-to-charge values could potentially be considered to originate from the same underlying compound, contingent on other criteria also being met. Nonpositive values are allowed; because no absolute difference can be smaller than such a value, it has the effect of not consolidating any groups, and consequently reduces the function to a filtering routine only.

time_diff

A single numerical value such that any two observations with a larger absolute difference between their peak elution times are considered to have originated from different underlying compounds. Two observations with a smaller absolute difference between their peak elution times could potentially be considered to originate from the same underlying compound, contingent on other criteria also being met. Nonpositive values are allowed; because no absolute difference can be smaller than such a value, it has the effect of not consolidating any groups, and consequently reduces the function to a filtering routine only.

Details

The algorithm described in what follows attempts to combine mass spectrometry observations that are believed to belong to the same underlying compound into a single observation for each compound. There are two conceptually separate steps.

The first step is as follows. All observations must satisfy each of the following criteria for inclusion in the binning process.

  1. Each observation must have its peak elution time occur during the interval specified by time_range

  2. Each observation must have a mass that falls within the interval specified by mass_range

  3. Each observation must have an electrical charge state that falls within the interval specified by charge_range
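The three inclusion criteria amount to a row filter with inclusive bounds. The following is a minimal base R sketch on made-up data, not the package's implementation; the column names and the helper in_range are assumptions introduced for illustration.

```r
# Toy data; column names are assumptions made for this illustration only
obs <- data.frame(
  time   = c(20, 20, 30, 50),
  mass   = c(2500, 1500, 4000, 5000),
  charge = c(2, 3, 1, 4)
)

time_range   <- c(14, 45)
mass_range   <- c(2000, 15000)
charge_range <- c(2, 10)

# TRUE when x lies within the closed interval [rng[1], rng[2]]
in_range <- function(x, rng) x >= rng[1] & x <= rng[2]

# An observation is kept only if it satisfies all three criteria
keep <- in_range(obs$time, time_range) &
  in_range(obs$mass, mass_range) &
  in_range(obs$charge, charge_range)

filtered <- obs[keep, ]
```

Here only the first row survives: the second fails the mass criterion, the third fails the charge criterion, and the fourth fails the time criterion.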

Once a set of observations satisfying the above criteria is obtained, a second step attempts to combine observations believed to belong to the same underlying compound. The algorithm considers two observations that satisfy each of the following criteria to belong to the same compound.

  1. The absolute difference in Daltons of the mass-to-charge value between the two observations is less than the value specified by mtoz_diff

  2. The absolute difference of the peak elution time between the two observations is less than the value specified by time_diff

  3. The electrical charge state must be the same for the two observations
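The pairwise criteria can be written as a single predicate. The function same_compound below is a hypothetical illustration in base R, not a function exported by PepSAVIms; it assumes each observation is a list with elements mtoz, time, and charge, and it uses strict inequalities as stated in the criteria above.

```r
# Hypothetical predicate; a and b are lists with elements mtoz, time, charge
same_compound <- function(a, b, mtoz_diff, time_diff) {
  abs(a$mtoz - b$mtoz) < mtoz_diff &&
    abs(a$time - b$time) < time_diff &&
    a$charge == b$charge
}

a  <- list(mtoz = 500.00, time = 20.0, charge = 2)
b  <- list(mtoz = 500.03, time = 20.5, charge = 2)
c3 <- list(mtoz = 500.03, time = 20.5, charge = 3)

res_match  <- same_compound(a, b,  mtoz_diff = 0.05, time_diff = 1)  # all criteria met
res_charge <- same_compound(a, c3, mtoz_diff = 0.05, time_diff = 1)  # charge differs
res_zero   <- same_compound(a, b,  mtoz_diff = 0,    time_diff = 0)  # thresholds of zero
```

With thresholds of zero the strict inequalities can never hold, so no pair is ever treated as the same compound.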

Then the binning algorithm is defined as follows. Consider an observation that satisfies the inclusion criteria; this observation is compared pairwise with every other observation that satisfies the inclusion criteria. If a pair of observations satisfies the criteria determining them to belong to the same underlying compound then the two observations are merged into a single observation. The two previous observations are removed from the working set, and the process starts over with the newly created observation. The process repeats until no other observation in the working set meets the criteria determining it to belong to the same underlying compound as that of the current observation; at this point it is considered that all observations belonging to the compound have been found, and the process starts over with a new observation.

The merging process has not yet been defined; it is performed by averaging the mass-to-charge values and peak elution times, and summing the mass spectrometry intensities at each fraction. Although observations are merged pairwise, when multiple observations are combined in a sequence of pairings, the averages give equal weight to all of the original observations. In other words, if a pair of observations is merged, and then a third observation is merged with the new observation created by combining the original two, then the mass-to-charge value and peak elution time of the new observation are obtained by summing the values for each of the three original observations and dividing by three. The merging process for more than three observations is conducted similarly.
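One way to keep the averages equally weighted under pairwise merging is to carry a count of how many original observations each merged observation represents. The sketch below is an assumption about how this could be done, not the package's source; merge_obs and the field n are hypothetical names.

```r
# Hypothetical representation: each observation carries a count `n` of the
# original observations it was built from, so averages stay equally weighted.
merge_obs <- function(a, b) {
  n <- a$n + b$n
  list(
    mtoz  = (a$mtoz * a$n + b$mtoz * b$n) / n,  # equal-weight mean
    time  = (a$time * a$n + b$time * b$n) / n,
    inten = a$inten + b$inten,                  # summed at each fraction
    n     = n
  )
}

o1 <- list(mtoz = 500.00, time = 20, inten = c(1, 0, 2), n = 1)
o2 <- list(mtoz = 500.02, time = 21, inten = c(0, 3, 1), n = 1)
o3 <- list(mtoz = 500.04, time = 22, inten = c(2, 2, 2), n = 1)

# Merge the first pair, then merge the result with the third observation
merged <- merge_obs(merge_obs(o1, o2), o3)
```

After the two merges, merged$time equals the plain mean of 20, 21, and 22, and merged$inten is the element-wise sum of the three intensity vectors.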

Having described the binning algorithm, it is apparent that there are scenarios in which the order in which observations are merged affects the outcome of the algorithm. Since a minimum requirement of any binning algorithm is that it be invariant to the ordering of the observations in the data, this algorithm abides by the following rules. The observations in the data are sorted in increasing order by mass-to-charge value, peak elution time, and electrical charge state, respectively. Then when choosing an observation to compare to the rest of the set, we start with the observation at the top of the sort ordering, and compare it one-at-a-time to the other elements in the set according to the same ordering. When a consolidated observation is complete in that no other observation left in the working set satisfies the merging criteria, then this consolidated observation can be removed from consideration for all future merges.
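The sorting rule and the greedy merging loop can be put together in a short base R sketch. This is an illustration written from the description above under simplifying assumptions (a single intensity column named inten, hypothetical column names, and a hypothetical function bin_sketch); it is not the PepSAVIms implementation.

```r
# Hypothetical sketch of the greedy binning loop; not the PepSAVIms source.
# `obs` has columns mtoz, time, charge, plus one intensity column `inten`.
bin_sketch <- function(obs, mtoz_diff, time_diff) {
  # Fix the processing order: m/z, then elution time, then charge state
  obs <- obs[order(obs$mtoz, obs$time, obs$charge), ]
  obs$n <- 1                       # originals represented by each row
  done <- obs[0, ]                 # completed consolidated observations

  while (nrow(obs) > 0) {
    cur  <- obs[1, ]               # top of the sort ordering
    rest <- obs[-1, ]
    repeat {
      hit <- which(abs(rest$mtoz - cur$mtoz) < mtoz_diff &
                   abs(rest$time - cur$time) < time_diff &
                   rest$charge == cur$charge)
      if (length(hit) == 0) break
      nxt <- rest[hit[1], ]        # first match in the sort ordering
      n   <- cur$n + nxt$n
      cur$mtoz  <- (cur$mtoz * cur$n + nxt$mtoz * nxt$n) / n
      cur$time  <- (cur$time * cur$n + nxt$time * nxt$n) / n
      cur$inten <- cur$inten + nxt$inten
      cur$n     <- n
      rest <- rest[-hit[1], ]      # merged row leaves the working set
    }
    done <- rbind(done, cur)       # no further matches: observation complete
    obs  <- rest
  }
  done
}

# Toy input deliberately given out of order
toy <- data.frame(
  mtoz   = c(500.03, 700.00, 500.00, 500.01),
  time   = c(20.2, 30.0, 20.0, 20.1),
  charge = c(2, 3, 2, 2),
  inten  = c(5, 7, 1, 2)
)

binned <- bin_sketch(toy, mtoz_diff = 0.05, time_diff = 1)
```

Because the rows are sorted before any comparison is made, permuting the rows of toy does not change the result in this sketch: the three observations near m/z 500 are consolidated into one row and the observation at m/z 700 is left on its own.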

Value

Returns an object of class binMS which inherits from msDat. This object is a list with the elements described below. The class is equipped with print, summary, and extractMS functions.

msDatObj

An object of class msDat that encapsulates the mass spectrometry data for the consolidated data.

summ_info

A list containing information pertaining to the consolidation process; for use by the summary function.

Examples

# Load mass spectrometry data
data(mass_spec)

# Perform consolidation via binMS
bin_out <- binMS(mass_spec = mass_spec,
                 mtoz = "m/z",
                 charge = "Charge",
                 mass = "Mass",
                 time_peak_reten = "Reten",
                 ms_inten = NULL,
                 time_range = c(14, 45),
                 mass_range = c(2000, 15000),
                 charge_range = c(2, 10),
                 mtoz_diff = 0.05,
                 time_diff = 60)

# print, summary function
bin_out
summary(bin_out)

# Extract consolidated mass spectrometry data as a matrix or msDat object
bin_matr <- extractMS(msObj = bin_out, type = "matrix")
bin_msDat <- extractMS(msObj = bin_out, type = "msDat")

PepSAVIms documentation built on May 1, 2019, 10:16 p.m.