# R/BinMS.R In dpritchLibre/Bioactivity: PepSAVI-MS Data Analysis

#### Documented in binMSsummary.binMS

#' Consolidate mass spectrometry observations
#'
#' Combines mass spectrometry observations that are believed to belong to the
#' same underlying compound into a single observation.  In concept, the data
#' produced by the mass spectrometer may produce multiple reads for a single
#' compound; thus, \code{binMS} attempts to recover these underlying compounds
#' through a binning procedure, described in more detail in \code{Details}.
#'
#' @inheritParams msDat
#'
#' @param mass The information for the mass need not be provided, as it can be
#'     derived using the mass-to-charge and charge information; in this case the
#'     parameter should be given its default, i.e. \code{NULL}.  If however the
#'     information for mass is already included in the dataset in hand, then
#'     providing it to the function will be slightly more efficient then
#'     re-performing the calculations.  The information for the \code{charge}
#'     parameter can be provided in the same manner as for the mass-to-charge
#'     values.
#'
#' @param time_peak_reten The information for the \code{time_peak_reten}
#'     parameter can be provided in the same manner as for the mass-to-charge
#'     and other information; this paramater specifies the time at which the
#'     peak retention level of the compound was achieved.
#'
#' @param time_range A length-2 numeric vector specifying the lower bound and
#'     upper bound (inclusive) of allowed peak retention time occurance for an
#'     observation to be included in the consolidation process.
#'
#' @param mass_range A length-2 numeric vector specifying the lower bound and
#'     upper bound (inclusive) of allowed mass for an observation to be included
#'     in the consolidation process.
#'
#' @param charge_range A length-2 numeric vector specifying the lower bound and
#'     upper bound (inclusive) of allowed electrical charge state for an
#'     observation to be included in the consolidation process.
#'
#' @param mtoz_diff A single numerical value such that any two observations with
#'     a larger absolute difference between their mass-to-charge values are
#'     considered to have originated from different underlying compounds.  Two
#'     observations with a smaller absolute difference between their
#'     mass-to-charge values could potentially be considered to originate from
#'     the same underlying compound, contingent on other criteria also being
#'     met.  Nonnegative values are allowed; such a value has the effect of not
#'     consolidating any groups, and consequently reduces the function to a
#'     filtering routine only.
#'
#' @param time_diff A single numerical value such that any two observations with
#'     a larger absolute difference between their peak elution times are
#'     considered to have originated from different underlying compounds.  Two
#'     observations with a smaller absolute difference between their peak
#'     elution times could potentially be considered to originate from the same
#'     underlying compound, contingent on other criteria also being met.
#'     Nonnegative values are allowed; such a value has the effect of not
#'     consolidating any groups, and consequently reduces the function to a
#'     filtering routine only.
#'
#' @details The algorithm described in what follows attempts to combines mass
#'     spectrometry observations that are believed to belong to the same
#'     underlying compound into a single observation for each compound.  There
#'     are two conceptually separate steps.
#'
#'     The first step is as follows.  All observations must satisfy each of the
#'     following criteria for inclusion in the binning process.
#'
#'     \enumerate{
#'
#'     \item Each observation must have its peak elution time occur during the
#'         interval specified by \code{time_range}
#'
#'     \item Each observation must have a mass that falls within the interval
#'         specified by \code{mass_range}
#'
#'     \item Each observation must have an electrical charge state that falls
#'         within the interval specified by \code{charge_range}
#'
#'     }
#'
#'     Once that a set of observations satisfying the above criteria is
#'     obtained, then a second step attempts to combine observations believed to
#'     belong to the same underlying compound.  The algorithm considers two
#'     observations that satisfy each of the following criteria to belong to the
#'     same compound.
#'
#'     \enumerate{
#'
#'     \item The absolute difference in Daltons of the mass-to-charge value
#'         between the two observations is less the the value specified by
#'         \code{mtoz_diff}
#'
#'      \item The absolute difference of the peak elution time between the two
#'          observations is less than the value specified by \code{time_pr_diff}
#'
#'     \item The electrical charge state must be the same for the two observations
#'
#'     }
#'
#'     Then the binning algorithm is defined as follows.  Consider an
#'     observation that satisfies the inclusion criteria; this observation is
#'     compaired pairwise with every other observation that satisfies the
#'     inclusion criteria.  If a pair of observations satisfies the criteria
#'     determining them to belong to the same underlying compound then the two
#'     observations are merged into a single observation.  The two previous
#'     compounds are removed from the working set, and the process starts over
#'     with the newly created observation.  The process repeats until no other
#'     observation in the working set meets the criteria determining it to
#'     belong to the same underlying compound as that of the current
#'     observation; at this point it is considered that all observations
#'     belonging to the compound have been found, and the process starts over
#'     with a new observation.
#'
#'     The merging process has not yet been defined; it is performed by
#'     averaging the mass-to-charge values and peak elution times, and summing
#'     the mass spectrometry intensities at each fraction.  Although
#'     observations are merged pairwise, when multiple observations are combined
#'     in a sequence of pairings, the averages are given equal weight for all of
#'     the observations.  In other words, if a pair of observations are merged,
#'     and then a third observation is merged with the new observation created
#'     by combining the original two, then the mass-to-charge value and peak
#'     elution time values of the new observation are obtained by summing the
#'     values for each of the three original observations and dividing by three.
#'     The merging process for more than three observations is conducted
#'     similarly.
#'
#'     Having described the binning algorithm, it is apparent that there are
#'     scenarios in which the order in which observations are merged affects the
#'     outcome of the algorithm.  Since it seems that a minumum requirement of
#'     any binning algorithm is that the algorithm is invariant to the ordering
#'     of the observations in the data, this algorithm abides by the following
#'     rules.  The observations in the data are sorted in increasing order by
#'     mass-to-charge value, peak elution time, and electical charge state,
#'     respectively.  Then when choosing an observation to compare to the rest
#'     of the set, we start with the observation at the top of the sort
#'     ordering, and compare it one-at-a-time to the other elements in the set
#'     according to the same ordering.  When a consolidated observation is
#'     complete in that no other observation left in the working set satisfies
#'     the merging criteria, then this consolidated observation can be removed
#'     from consideration for all future merges.
#'
#' @return Returns an object of class \code{binMS} which inherits from
#'     \code{msDat}.  This object is a \code{list} with elements described
#'     below.  The class is equipped with a \code{print}, \code{summary}, and
#'     \code{extractMS} function.
#'
#'     \describe{
#'
#'     \item{\code{msDatObj}}{ An object of class \code{\link{msDat}} that
#'         encapsulates the mass spectrometry data for the consolidated data. }
#'
#'     \item{\code{summ_info}}{ A list containing information pertaining to the
#'         consolidation process; for use by the summary function. }
#'
#'     }
#'
#' @examples
#'
#' # Load mass spectrometry data
#' data(mass_spec)
#'
#' # Perform consolidation via binMS
#' bin_out <- binMS(mass_spec = mass_spec,
#'                  mtoz = "m/z",
#'                  charge = "Charge",
#'                  mass = "Mass",
#'                  time_peak_reten = "Reten",
#'                  ms_inten = NULL,
#'                  time_range = c(14, 45),
#'                  mass_range = c(2000, 15000),
#'                  charge_range = c(2, 10),
#'                  mtoz_diff = 0.05,
#'                  time_diff = 60)
#'
#' # print, summary function
#' bin_out
#' summary(bin_out)
#'
#' # Extract consolidated mass spectrometry data as a matrix or msDat object
#' bin_matr <- extractMS(msObj = bin_out, type = "matrix")
#' bin_msDat <- extractMS(msObj = bin_out, type = "matrix")
#'
#' @export

binMS <- function(mass_spec,
mtoz,
charge,
mass = NULL,
time_peak_reten,
ms_inten = NULL,
time_range,
mass_range,
charge_range,
mtoz_diff,
time_diff) {

# Check for validity of the form of the arguments
binMS_check_valid_input(mass_spec, mtoz, charge, mass, time_peak_reten, ms_inten, time_range,
mass_range, charge_range, mtoz_diff, time_diff)

# Translate arguments to values
dmtoz <- extract_var(mass_spec, mtoz)
dcharge <- extract_var(mass_spec, charge)
dtime_pr <- extract_var(mass_spec, time_peak_reten)
# Check if we need to calculate mass from scratch
if (is.null(mass)) {
dmass <- charge * (dmtoz - 1.007825)
# Can't pass NULL to extract_var so create a variable
mass <- dmass
} else {
dmass <- extract_var(mass_spec, mass)
}
dms_inten <- extract_var(mass_spec, ms_inten, TRUE, mtoz, charge, mass, time_peak_reten)

## Step 1: construct sorted indices of rows in the data that satisfy the
## mass, time of peak retention, and charge criteria

# Calculate inclusion criteria status
time_pr_bool <- (time_range[1] <= dtime_pr) & (dtime_pr <= time_range[2])
mass_bool <- (mass_range[1] <= dmass) & (dmass <= mass_range[2])
charge_bool <- (charge_range[1] <= dcharge) & (dcharge <= charge_range[2])
keepIdx <- which( Reduce("&", list(time_pr_bool, mass_bool, charge_bool)) )

# Number of observations satisfying the inclusion criteria
n_befbin <- length(keepIdx)

## Step 2: reduce data by row criteria and by selecting columns needed in
## future.  Transpose so as to conform to column major order.

# case: some observations satisfied the inclusion criteria
if (!identical(n_befbin, 0L)) {

# Remove m/z levels that didn't meet inclusion criteria
dmtoz <- dmtoz[keepIdx]
dcharge <- dcharge[keepIdx]
dtime_pr <- dtime_pr[keepIdx]
dmass <- dmass[keepIdx]
dms_inten <- dms_inten[keepIdx, , drop=FALSE]

# Obtain indices of ordered data
sortIdx <- order(dmtoz, dtime_pr, dcharge)

# ms: transpose of the mass spec data
ms <- t( dms_inten[sortIdx, , drop=FALSE] )
# info: data used to identify and combine the mass-to-charge levels
info_orig <- matrix(c(dmtoz, dcharge, dtime_pr, dmass), nrow=4, byrow=TRUE)
info_orig <- info_orig[, sortIdx]

## Step 3: perform binning of observations that meet the similarity
## criteria based on mass-to-charge values, time of peak retention, and
## charge state

# Allocate memory for binned data.  We keep the data in two matrices;
# one for what we call the information which are the values that
# determine whether or not to combine two m/z levels, and another for
# the mass spectrometry abundances.  The reason for separating the two
# is that the information data gets averaged within a bin while the ms
# abundances are summed.
info_bin <- matrix(nrow=4, ncol=n_befbin)
ms_bin <- matrix(nrow=nrow(ms), ncol=n_befbin)
row.names(ms_bin) <- colnames(dms_inten)

} # end keepIdx not empty case

# Give names to rows (i.e. the variables) in the binned data information
# matrix so as to make the algorithm more readable.  The variables
# correspond to the m/z value, charge, time of peak retention, and mass
# respectively.
rmtoz  <- 1L
rchg   <- 2L
rtime  <- 3L
rmass  <- 4L

binCtr   <- 1L  # Index for next binned data entry
origCtr  <- 1L  # Index for next orig data to be considered for binned data
innerCtr <- 1L  # Index for orig data within inner loop that looks for all
# occurences of data within the allowed m/z difference
nbin     <- 1L  # Number of mass-to-charge levels in current bin

# Create an out-of-bounds charge value used to signal that a mass-to-charge
# level has already been included as part of a bin
flagv <- charge_range[1] - 1

# Each iteration compares an m/z level to see if we can start a new bin.  If
# the current level of the iteration isn't part of a bin, then we start a
# new bin and compare to larger m/z levels, adding levels to the bin when
# criteria are met, and ending when m/z levels gets out of range.

while (origCtr <= n_befbin) {

# case: current row was already included as part of a bin; move on to
# next m/z level in the pre-binned data (recall that when a row was
# already included as part of a bin then the charge is set to
# charge_range[1] - 1 as a signal)
if (info_orig[rchg, origCtr] < charge_range[1]) {
origCtr <- origCtr + 1L
next
}

# If we've made it here then we've found a m/z level in the original
# data not yet included in any previous bin.  Begin a new bin starting
# with this level as its first entry.
info_bin[, binCtr] <- info_orig[, origCtr]
ms_bin[, binCtr] <- ms[, origCtr]
# Set innerCtr to be the index of the first m/z level to consider for
# adding to the new bin
innerCtr <- origCtr + 1L
# Initial number of m/z levels in the bin
nbin <- 1L

# Each iteration compares a m/z value to the current m/z level until the
# values being compared against have a m/z outside of the allowed
# range. If all of the criteria for the iteration are met, then the m/z
# level is combined with the current bin.
#
# Note that when comparisons are made that we divide by nbin (i.e. the
# number of compounds in the bin) since we are only summing observations
# from the original data when placing it in the bin (and taking the mean
# occurs after the (following inner) loop) ends.

while ( (innerCtr <= n_befbin) &&
(info_orig[rmtoz, innerCtr] - (info_bin[rmtoz, binCtr] / nbin) < mtoz_diff) ) {

# case: criteria met.  Add m/z level to current bin.
if ( (info_orig[rchg, innerCtr] == (info_bin[rchg, binCtr] / nbin))
&& ( abs(info_orig[rtime, innerCtr] - (info_bin[rtime, binCtr] / nbin))
< time_diff ) ) {

# add m/z level to current bin
info_bin[, binCtr] <- info_bin[, binCtr] + info_orig[, innerCtr]
ms_bin[, binCtr] <- ms_bin[, binCtr] + ms[, innerCtr]

# signal that m/z level is already part of a bin
info_orig[rchg, innerCtr] <- flagv
nbin <- nbin + 1L

}
# else: criteria not me; noop

# Update the counter indexing the next m/z level to consider
# for adding to the current bin
innerCtr <- innerCtr + 1L

} # end compare current bin to next m/z level loop

# Take the average of the binned m/z levels (note: don't need to do this
# for ms_bin as these values are strictly summed)
info_bin[, binCtr] <- info_bin[, binCtr] / nbin

# We've found every m/z level in the original data that belongs in the
# binned data.  Update the counter indexing the next m/z level from the
# original data, and update the counter indexing the next available bin
# to start placing consolidated data into.
origCtr <- origCtr + 1L
binCtr <- binCtr + 1L

} # end iterate over original m/z levels loop

# Construct msDat object from binned data
nbinned <- binCtr - 1L
if (!identical(nbinned, 0L)) {

# It is possible that the binned data can be out of order.  This happens
# when the minimum observation of one bin is smaller than that of
# another, but the mean is larger.  binCtr indexes the next available
# bin, so 1 less is the number of bins.
resortIdx <- order(info_bin[rmtoz, 1:nbinned],
info_bin[rtime, 1:nbinned],
info_bin[rchg, 1:nbinned])

# Construct msDat object
msDatObj <- msDat(t(ms_bin[, resortIdx,  drop=FALSE]),
info_bin[rmtoz, resortIdx],
info_bin[rchg, resortIdx])
}
else {
warning("No observations satisfied all of the inclusion criteria", call.=FALSE)
msDatObj <- NULL
}

# Info for use by summary.binMS to describe binning process
summ_info <- list(n_tot        = nrow(mass_spec),
n_time_pr    = sum(time_pr_bool),
n_mass       = sum(mass_bool),
n_charge     = sum(charge_bool),
n_tiMaCh     = length(keepIdx),
n_binned     = nbinned,
time_range   = time_range,
mass_range   = mass_range,
charge_range = charge_range,
mtoz_diff    = mtoz_diff,
time_diff    = time_diff)

# Construct binMS object
outObj <- list(msDatObj  = msDatObj,
summ_info = summ_info)

structure(outObj, class=c("binMS", "msDat"))
}

#' @export

print.binMS <- function(x, ...) {

if (is.null(x$msDatObj)) { cat("An object of class \"binMS\"; no observations ", "satisfied all of the inclusion criteria.\n", sep="") } else { cat("An object of class \"binMS\" with ", NROW(x$msDatObj$ms), " compounds and ", NCOL(x$msDatObj$ms), " fractions.\n", sep="") } cat("Use summary to see more details regarding the consolidation process.\n", "Use extractMS to extract the consolidated mass spectrometry data.\n\n", sep="") } #' Overview of the binning process #' #' Prints a text description of the binning process. Displays arguments passed #' to the \code{binMS} routine, how many m/z levels were chosen for each #' criterion, how many candidate compounds were chosen overall, and how many #' candidate compounds were obtained after consolidation. #' #' @param object An object of class \code{\link{binMS}} #' #' @param ... Arguments passed to dot-dot-dot are ignored #' #' @export summary.binMS <- function(object, ...) { cat(format(object), sep="") } # Return a vector of strings supplying the output for summary.binMS format.binMS <- function(x, ...) { # Add pointers to summ_info variables for convenience n_tot <- x$summ_info$n_tot n_time_pr <- x$summ_info$n_time_pr n_mass <- x$summ_info$n_mass n_charge <- x$summ_info$n_charge n_tiMaCh <- x$summ_info$n_tiMaCh n_binned <- x$summ_info$n_binned time_range <- x$summ_info$time_range mass_range <- x$summ_info$mass_range charge_range <- x$summ_info$charge_range mtoz_diff <- x$summ_info$mtoz_diff time_diff <- x$summ_info\$time_diff

# Add commas to filter sizes and make a uniform width
nlev <- format_int( c(n_time_pr, n_mass, n_charge, n_tiMaCh) )
names(nlev) <- c("time", "mass", "charge", "all")

# Create two columns of numbers with uniform width and that line up
incl_crit_lo <- format_float( c(time_range[1L], mass_range[1L], charge_range[1L]) )
incl_crit_hi <- format_float( c(time_range[2L], mass_range[2L], charge_range[2L]) )
names(incl_crit_lo) <- names(incl_crit_hi) <-  c("time", "mass", "charge")

# Begin string construction ----------------------------

# User-specified inclusion criteria values to binMS
inclusion_criteria <- sprintf(
paste0("The inclusion criteria was specified as follows:\n",
"------------------------------------------------\n",
"    time of peak retention:  between %s and %s\n",
"    mass:                    between %s and %s\n",
"    charge:                  between %s and %s\n",
"\n"),
incl_crit_lo["time"],
incl_crit_hi["time"],
incl_crit_lo["mass"],
incl_crit_hi["mass"],
incl_crit_lo["charge"],
incl_crit_hi["charge"])

# User-specified consolidation criteria
consolidation_criteria <- sprintf(
paste0("m/z levels were consolidated when each of the following criteria were met:\n",
"--------------------------------------------------------------------------\n",
"    m/z levels were no more than %s units apart\n",
"    the time peak retention occured no farther apart than %s units\n",
"    the charge states were the same\n",
"\n"),
format(mtoz_diff, big.mark=","),
format(time_diff, big.mark=","))

# Size of the data prior to binning
size_prior <- sprintf(
paste0("The mass spectrometry data prior to binning had:\n",
"------------------------------------------------\n",
"    %s m/z levels\n",
"\n"),
format(n_tot, big.mark=","))

# Number of remaining m/z levels after filtering by the criterions
nremain <-  sprintf(
paste0("The number of remaining m/z levels after filtering by the inclusion criteria was:\n",
"---------------------------------------------------------------------------------\n",
"    time of peak retention:  %s\n",
"    mass:                    %s\n",
"    charge:                  %s\n",
"    satisfied all:           %s\n",
"\n"),
nlev["time"],
nlev["mass"],
nlev["charge"],
nlev["all"])

# Total number of remaining levels
ntotal <- sprintf(
paste0("After consolidating the m/z levels, there were:\n",
"-----------------------------------------------\n",
"    %s levels\n",
"\n"),
format(n_binned, big.mark=","))

c(newl = "\n",
incl = inclusion_criteria,
cons = consolidation_criteria,
size = size_prior,
nrem = nremain,
ntot = ntotal)
}

dpritchLibre/Bioactivity documentation built on May 13, 2017, 5:37 a.m.