```
#' Consolidate mass spectrometry observations
#'
#' Combines mass spectrometry observations that are believed to belong to the
#' same underlying compound into a single observation. In concept, the data
#' produced by the mass spectrometer may produce multiple reads for a single
#' compound; thus, \code{binMS} attempts to recover these underlying compounds
#' through a binning procedure, described in more detail in \code{Details}.
#'
#' @inheritParams msDat
#'
#' @param mass The information for the mass need not be provided, as it can be
#' derived using the mass-to-charge and charge information; in this case the
#' parameter should be given its default, i.e. \code{NULL}. If however the
#' information for mass is already included in the dataset in hand, then
#' providing it to the function will be slightly more efficient then
#' re-performing the calculations. The information for the \code{charge}
#' parameter can be provided in the same manner as for the mass-to-charge
#' values.
#'
#' @param time_peak_reten The information for the \code{time_peak_reten}
#' parameter can be provided in the same manner as for the mass-to-charge
#' and other information; this paramater specifies the time at which the
#' peak retention level of the compound was achieved.
#'
#' @param time_range A length-2 numeric vector specifying the lower bound and
#' upper bound (inclusive) of allowed peak retention time occurance for an
#' observation to be included in the consolidation process.
#'
#' @param mass_range A length-2 numeric vector specifying the lower bound and
#' upper bound (inclusive) of allowed mass for an observation to be included
#' in the consolidation process.
#'
#' @param charge_range A length-2 numeric vector specifying the lower bound and
#' upper bound (inclusive) of allowed electrical charge state for an
#' observation to be included in the consolidation process.
#'
#' @param mtoz_diff A single numerical value such that any two observations with
#' a larger absolute difference between their mass-to-charge values are
#' considered to have originated from different underlying compounds. Two
#' observations with a smaller absolute difference between their
#' mass-to-charge values could potentially be considered to originate from
#' the same underlying compound, contingent on other criteria also being
#' met. Nonnegative values are allowed; such a value has the effect of not
#' consolidating any groups, and consequently reduces the function to a
#' filtering routine only.
#'
#' @param time_diff A single numerical value such that any two observations with
#' a larger absolute difference between their peak elution times are
#' considered to have originated from different underlying compounds. Two
#' observations with a smaller absolute difference between their peak
#' elution times could potentially be considered to originate from the same
#' underlying compound, contingent on other criteria also being met.
#' Nonnegative values are allowed; such a value has the effect of not
#' consolidating any groups, and consequently reduces the function to a
#' filtering routine only.
#'
#' @details The algorithm described in what follows attempts to combines mass
#' spectrometry observations that are believed to belong to the same
#' underlying compound into a single observation for each compound. There
#' are two conceptually separate steps.
#'
#' The first step is as follows. All observations must satisfy each of the
#' following criteria for inclusion in the binning process.
#'
#' \enumerate{
#'
#' \item Each observation must have its peak elution time occur during the
#' interval specified by \code{time_range}
#'
#' \item Each observation must have a mass that falls within the interval
#' specified by \code{mass_range}
#'
#' \item Each observation must have an electrical charge state that falls
#' within the interval specified by \code{charge_range}
#'
#' }
#'
#' Once that a set of observations satisfying the above criteria is
#' obtained, then a second step attempts to combine observations believed to
#' belong to the same underlying compound. The algorithm considers two
#' observations that satisfy each of the following criteria to belong to the
#' same compound.
#'
#' \enumerate{
#'
#' \item The absolute difference in Daltons of the mass-to-charge value
#' between the two observations is less the the value specified by
#' \code{mtoz_diff}
#'
#' \item The absolute difference of the peak elution time between the two
#' observations is less than the value specified by \code{time_pr_diff}
#'
#' \item The electrical charge state must be the same for the two observations
#'
#' }
#'
#' Then the binning algorithm is defined as follows. Consider an
#' observation that satisfies the inclusion criteria; this observation is
#' compaired pairwise with every other observation that satisfies the
#' inclusion criteria. If a pair of observations satisfies the criteria
#' determining them to belong to the same underlying compound then the two
#' observations are merged into a single observation. The two previous
#' compounds are removed from the working set, and the process starts over
#' with the newly created observation. The process repeats until no other
#' observation in the working set meets the criteria determining it to
#' belong to the same underlying compound as that of the current
#' observation; at this point it is considered that all observations
#' belonging to the compound have been found, and the process starts over
#' with a new observation.
#'
#' The merging process has not yet been defined; it is performed by
#' averaging the mass-to-charge values and peak elution times, and summing
#' the mass spectrometry intensities at each fraction. Although
#' observations are merged pairwise, when multiple observations are combined
#' in a sequence of pairings, the averages are given equal weight for all of
#' the observations. In other words, if a pair of observations are merged,
#' and then a third observation is merged with the new observation created
#' by combining the original two, then the mass-to-charge value and peak
#' elution time values of the new observation are obtained by summing the
#' values for each of the three original observations and dividing by three.
#' The merging process for more than three observations is conducted
#' similarly.
#'
#' Having described the binning algorithm, it is apparent that there are
#' scenarios in which the order in which observations are merged affects the
#' outcome of the algorithm. Since it seems that a minumum requirement of
#' any binning algorithm is that the algorithm is invariant to the ordering
#' of the observations in the data, this algorithm abides by the following
#' rules. The observations in the data are sorted in increasing order by
#' mass-to-charge value, peak elution time, and electical charge state,
#' respectively. Then when choosing an observation to compare to the rest
#' of the set, we start with the observation at the top of the sort
#' ordering, and compare it one-at-a-time to the other elements in the set
#' according to the same ordering. When a consolidated observation is
#' complete in that no other observation left in the working set satisfies
#' the merging criteria, then this consolidated observation can be removed
#' from consideration for all future merges.
#'
#' @return Returns an object of class \code{binMS} which inherits from
#' \code{msDat}. This object is a \code{list} with elements described
#' below. The class is equipped with a \code{print}, \code{summary}, and
#' \code{extractMS} function.
#'
#' \describe{
#'
#' \item{\code{msDatObj}}{ An object of class \code{\link{msDat}} that
#' encapsulates the mass spectrometry data for the consolidated data. }
#'
#' \item{\code{summ_info}}{ A list containing information pertaining to the
#' consolidation process; for use by the summary function. }
#'
#' }
#'
#' @examples
#'
#' # Load mass spectrometry data
#' data(mass_spec)
#'
#' # Perform consolidation via binMS
#' bin_out <- binMS(mass_spec = mass_spec,
#' mtoz = "m/z",
#' charge = "Charge",
#' mass = "Mass",
#' time_peak_reten = "Reten",
#' ms_inten = NULL,
#' time_range = c(14, 45),
#' mass_range = c(2000, 15000),
#' charge_range = c(2, 10),
#' mtoz_diff = 0.05,
#' time_diff = 60)
#'
#' # print, summary function
#' bin_out
#' summary(bin_out)
#'
#' # Extract consolidated mass spectrometry data as a matrix or msDat object
#' bin_matr <- extractMS(msObj = bin_out, type = "matrix")
#' bin_msDat <- extractMS(msObj = bin_out, type = "matrix")
#'
#' @export
binMS <- function(mass_spec,
mtoz,
charge,
mass = NULL,
time_peak_reten,
ms_inten = NULL,
time_range,
mass_range,
charge_range,
mtoz_diff,
time_diff) {
# Check for validity of the form of the arguments
binMS_check_valid_input(mass_spec, mtoz, charge, mass, time_peak_reten, ms_inten, time_range,
mass_range, charge_range, mtoz_diff, time_diff)
# Translate arguments to values
dmtoz <- extract_var(mass_spec, mtoz)
dcharge <- extract_var(mass_spec, charge)
dtime_pr <- extract_var(mass_spec, time_peak_reten)
# Check if we need to calculate mass from scratch
if (is.null(mass)) {
dmass <- charge * (dmtoz - 1.007825)
# Can't pass NULL to extract_var so create a variable
mass <- dmass
} else {
dmass <- extract_var(mass_spec, mass)
}
dms_inten <- extract_var(mass_spec, ms_inten, TRUE, mtoz, charge, mass, time_peak_reten)
## Step 1: construct sorted indices of rows in the data that satisfy the
## mass, time of peak retention, and charge criteria
# Calculate inclusion criteria status
time_pr_bool <- (time_range[1] <= dtime_pr) & (dtime_pr <= time_range[2])
mass_bool <- (mass_range[1] <= dmass) & (dmass <= mass_range[2])
charge_bool <- (charge_range[1] <= dcharge) & (dcharge <= charge_range[2])
keepIdx <- which( Reduce("&", list(time_pr_bool, mass_bool, charge_bool)) )
# Number of observations satisfying the inclusion criteria
n_befbin <- length(keepIdx)
## Step 2: reduce data by row criteria and by selecting columns needed in
## future. Transpose so as to conform to column major order.
# case: some observations satisfied the inclusion criteria
if (!identical(n_befbin, 0L)) {
# Remove m/z levels that didn't meet inclusion criteria
dmtoz <- dmtoz[keepIdx]
dcharge <- dcharge[keepIdx]
dtime_pr <- dtime_pr[keepIdx]
dmass <- dmass[keepIdx]
dms_inten <- dms_inten[keepIdx, , drop=FALSE]
# Obtain indices of ordered data
sortIdx <- order(dmtoz, dtime_pr, dcharge)
# ms: transpose of the mass spec data
ms <- t( dms_inten[sortIdx, , drop=FALSE] )
# info: data used to identify and combine the mass-to-charge levels
info_orig <- matrix(c(dmtoz, dcharge, dtime_pr, dmass), nrow=4, byrow=TRUE)
info_orig <- info_orig[, sortIdx]
## Step 3: perform binning of observations that meet the similarity
## criteria based on mass-to-charge values, time of peak retention, and
## charge state
# Allocate memory for binned data. We keep the data in two matrices;
# one for what we call the information which are the values that
# determine whether or not to combine two m/z levels, and another for
# the mass spectrometry abundances. The reason for separating the two
# is that the information data gets averaged within a bin while the ms
# abundances are summed.
info_bin <- matrix(nrow=4, ncol=n_befbin)
ms_bin <- matrix(nrow=nrow(ms), ncol=n_befbin)
row.names(ms_bin) <- colnames(dms_inten)
} # end keepIdx not empty case
# Give names to rows (i.e. the variables) in the binned data information
# matrix so as to make the algorithm more readable. The variables
# correspond to the m/z value, charge, time of peak retention, and mass
# respectively.
rmtoz <- 1L
rchg <- 2L
rtime <- 3L
rmass <- 4L
binCtr <- 1L # Index for next binned data entry
origCtr <- 1L # Index for next orig data to be considered for binned data
innerCtr <- 1L # Index for orig data within inner loop that looks for all
# occurences of data within the allowed m/z difference
nbin <- 1L # Number of mass-to-charge levels in current bin
# Create an out-of-bounds charge value used to signal that a mass-to-charge
# level has already been included as part of a bin
flagv <- charge_range[1] - 1
# Each iteration compares an m/z level to see if we can start a new bin. If
# the current level of the iteration isn't part of a bin, then we start a
# new bin and compare to larger m/z levels, adding levels to the bin when
# criteria are met, and ending when m/z levels gets out of range.
while (origCtr <= n_befbin) {
# case: current row was already included as part of a bin; move on to
# next m/z level in the pre-binned data (recall that when a row was
# already included as part of a bin then the charge is set to
# charge_range[1] - 1 as a signal)
if (info_orig[rchg, origCtr] < charge_range[1]) {
origCtr <- origCtr + 1L
next
}
# If we've made it here then we've found a m/z level in the original
# data not yet included in any previous bin. Begin a new bin starting
# with this level as its first entry.
info_bin[, binCtr] <- info_orig[, origCtr]
ms_bin[, binCtr] <- ms[, origCtr]
# Set innerCtr to be the index of the first m/z level to consider for
# adding to the new bin
innerCtr <- origCtr + 1L
# Initial number of m/z levels in the bin
nbin <- 1L
# Each iteration compares a m/z value to the current m/z level until the
# values being compared against have a m/z outside of the allowed
# range. If all of the criteria for the iteration are met, then the m/z
# level is combined with the current bin.
#
# Note that when comparisons are made that we divide by nbin (i.e. the
# number of compounds in the bin) since we are only summing observations
# from the original data when placing it in the bin (and taking the mean
# occurs after the (following inner) loop) ends.
while ( (innerCtr <= n_befbin) &&
(info_orig[rmtoz, innerCtr] - (info_bin[rmtoz, binCtr] / nbin) < mtoz_diff) ) {
# case: criteria met. Add m/z level to current bin.
if ( (info_orig[rchg, innerCtr] == (info_bin[rchg, binCtr] / nbin))
&& ( abs(info_orig[rtime, innerCtr] - (info_bin[rtime, binCtr] / nbin))
< time_diff ) ) {
# add m/z level to current bin
info_bin[, binCtr] <- info_bin[, binCtr] + info_orig[, innerCtr]
ms_bin[, binCtr] <- ms_bin[, binCtr] + ms[, innerCtr]
# signal that m/z level is already part of a bin
info_orig[rchg, innerCtr] <- flagv
nbin <- nbin + 1L
}
# else: criteria not me; noop
# Update the counter indexing the next m/z level to consider
# for adding to the current bin
innerCtr <- innerCtr + 1L
} # end compare current bin to next m/z level loop
# Take the average of the binned m/z levels (note: don't need to do this
# for ms_bin as these values are strictly summed)
info_bin[, binCtr] <- info_bin[, binCtr] / nbin
# We've found every m/z level in the original data that belongs in the
# binned data. Update the counter indexing the next m/z level from the
# original data, and update the counter indexing the next available bin
# to start placing consolidated data into.
origCtr <- origCtr + 1L
binCtr <- binCtr + 1L
} # end iterate over original m/z levels loop
# Construct msDat object from binned data
nbinned <- binCtr - 1L
if (!identical(nbinned, 0L)) {
# It is possible that the binned data can be out of order. This happens
# when the minimum observation of one bin is smaller than that of
# another, but the mean is larger. binCtr indexes the next available
# bin, so 1 less is the number of bins.
resortIdx <- order(info_bin[rmtoz, 1:nbinned],
info_bin[rtime, 1:nbinned],
info_bin[rchg, 1:nbinned])
# Construct msDat object
msDatObj <- msDat(t(ms_bin[, resortIdx, drop=FALSE]),
info_bin[rmtoz, resortIdx],
info_bin[rchg, resortIdx])
}
else {
warning("No observations satisfied all of the inclusion criteria", call.=FALSE)
msDatObj <- NULL
}
# Info for use by summary.binMS to describe binning process
summ_info <- list(n_tot = nrow(mass_spec),
n_time_pr = sum(time_pr_bool),
n_mass = sum(mass_bool),
n_charge = sum(charge_bool),
n_tiMaCh = length(keepIdx),
n_binned = nbinned,
time_range = time_range,
mass_range = mass_range,
charge_range = charge_range,
mtoz_diff = mtoz_diff,
time_diff = time_diff)
# Construct binMS object
outObj <- list(msDatObj = msDatObj,
summ_info = summ_info)
structure(outObj, class=c("binMS", "msDat"))
}
#' @export
print.binMS <- function(x, ...) {
if (is.null(x$msDatObj)) {
cat("An object of class \"binMS\"; no observations ",
"satisfied all of the inclusion criteria.\n", sep="")
}
else {
cat("An object of class \"binMS\" with ", NROW(x$msDatObj$ms), " compounds and ",
NCOL(x$msDatObj$ms), " fractions.\n", sep="")
}
cat("Use summary to see more details regarding the consolidation process.\n",
"Use extractMS to extract the consolidated mass spectrometry data.\n\n", sep="")
}
#' Overview of the binning process
#'
#' Prints a text description of the binning process. Displays arguments passed
#' to the \code{binMS} routine, how many m/z levels were chosen for each
#' criterion, how many candidate compounds were chosen overall, and how many
#' candidate compounds were obtained after consolidation.
#'
#' @param object An object of class \code{\link{binMS}}
#'
#' @param ... Arguments passed to dot-dot-dot are ignored
#'
#' @export
summary.binMS <- function(object, ...) {
cat(format(object), sep="")
}
# Return a vector of strings supplying the output for summary.binMS
format.binMS <- function(x, ...) {
# Add pointers to summ_info variables for convenience
n_tot <- x$summ_info$n_tot
n_time_pr <- x$summ_info$n_time_pr
n_mass <- x$summ_info$n_mass
n_charge <- x$summ_info$n_charge
n_tiMaCh <- x$summ_info$n_tiMaCh
n_binned <- x$summ_info$n_binned
time_range <- x$summ_info$time_range
mass_range <- x$summ_info$mass_range
charge_range <- x$summ_info$charge_range
mtoz_diff <- x$summ_info$mtoz_diff
time_diff <- x$summ_info$time_diff
# Add commas to filter sizes and make a uniform width
nlev <- format_int( c(n_time_pr, n_mass, n_charge, n_tiMaCh) )
names(nlev) <- c("time", "mass", "charge", "all")
# Create two columns of numbers with uniform width and that line up
incl_crit_lo <- format_float( c(time_range[1L], mass_range[1L], charge_range[1L]) )
incl_crit_hi <- format_float( c(time_range[2L], mass_range[2L], charge_range[2L]) )
names(incl_crit_lo) <- names(incl_crit_hi) <- c("time", "mass", "charge")
# Begin string construction ----------------------------
# User-specified inclusion criteria values to binMS
inclusion_criteria <- sprintf(
paste0("The inclusion criteria was specified as follows:\n",
"------------------------------------------------\n",
" time of peak retention: between %s and %s\n",
" mass: between %s and %s\n",
" charge: between %s and %s\n",
"\n"),
incl_crit_lo["time"],
incl_crit_hi["time"],
incl_crit_lo["mass"],
incl_crit_hi["mass"],
incl_crit_lo["charge"],
incl_crit_hi["charge"])
# User-specified consolidation criteria
consolidation_criteria <- sprintf(
paste0("m/z levels were consolidated when each of the following criteria were met:\n",
"--------------------------------------------------------------------------\n",
" m/z levels were no more than %s units apart\n",
" the time peak retention occured no farther apart than %s units\n",
" the charge states were the same\n",
"\n"),
format(mtoz_diff, big.mark=","),
format(time_diff, big.mark=","))
# Size of the data prior to binning
size_prior <- sprintf(
paste0("The mass spectrometry data prior to binning had:\n",
"------------------------------------------------\n",
" %s m/z levels\n",
"\n"),
format(n_tot, big.mark=","))
# Number of remaining m/z levels after filtering by the criterions
nremain <- sprintf(
paste0("The number of remaining m/z levels after filtering by the inclusion criteria was:\n",
"---------------------------------------------------------------------------------\n",
" time of peak retention: %s\n",
" mass: %s\n",
" charge: %s\n",
" satisfied all: %s\n",
"\n"),
nlev["time"],
nlev["mass"],
nlev["charge"],
nlev["all"])
# Total number of remaining levels
ntotal <- sprintf(
paste0("After consolidating the m/z levels, there were:\n",
"-----------------------------------------------\n",
" %s levels\n",
"\n"),
format(n_binned, big.mark=","))
c(newl = "\n",
incl = inclusion_criteria,
cons = consolidation_criteria,
size = size_prior,
nrem = nremain,
ntot = ntotal)
}
```

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.