qsip: Quantitative stable isotope probing for microbial ecology

Documented in calc_excess

#' Calculation of excess atom fraction
#'
#' Calculates fractional isotope incorporation in excess of natural abundances
#'
#' @param data Data as a long-format data.table where each row represents a taxonomic feature within a single fraction.
#'  Typically, this is the output from the \code{calc_wad} function.
#' @param tax_id Column name specifying unique identifier for each taxonomic feature.
#' @param sample_id Column name specifying unique identifier for each replicate.
#' @param wads Column name specifying weighted average density values.
#'  It is possible to change this if preferred but not recommended.
#' @param iso_trt Column name specifying a two-level categorical column indicating whether a sample has been amended with a stable isotope (i.e., is "heavy") or if
#'  isotopic composition is at natural abundance (i.e., "light").
#'  Any terms may be applied but care should be taken for these values.
#'  If supplied as a factor, \code{calc_excess} will take the lowest level as the "light" treatment and the higher
#'  level as the "heavy" treatment.
#'  Alternatively, if supplied as a character, \code{calc_excess} will coerce the column to a factor and with the default behavior wherein the
#'  first value in alphabetical order will be assumed to be the lowest factor level (i.e. the "light" treatment).
#' @param isotope Column name specifying the isotope applied to each replicate. For "heavy" samples, these values should be one of "13C", "15N", or "18O".
#' @param bootstrap Whether to generate bootstrapped enrichment values for each taxonomic feature across groups of samples.
#'  If \code{TRUE}, replicates within specified treatment groupings will be randomly resampled with replacement and resulting
#'  WAD values used to generate a distribution of enrichment values for each taxon.
#' @param iters Integer specifying the number of bootstrap iterations to perform.
#' @param grouping_cols Column(s) used to indicate bootstrap resampling groups.
#'  Within each group, replicates will be resampled with replacement.
#'  Resampling will not occur across groups.
#' @param min_freq Minimum number of replicates a taxonomic feature must occur in to be kept.
#'  If treatment grouping columns are specified, frequencies will be assessed at this level.
#'  For unlabeled "light" samples, treatment groupings will be ignored when assessing adequate frequency of occurrence.
#' @param correction Whether to apply a correction to fractional enrichment values to ensure a certain proportion are positive.
#' @param rm_outlers Whether or not to remove low fractional enrichment values that are 1.5X lesser than the distance between the 25th quantile
#'  and interquartile range.
#'  If \code{bootstrap = TRUE}, outlier WAD values will be removed prior to resampling and enrichment calculation.
#' @param non_grower_prop Fractional value applied if \code{correction == TRUE} specifying the proportion of the community in each samples assumed to be
#'  non-growers and whose median enrichment values will be assumed to be zero. The adjustment necessary to place this median value at zero will be applied
#'  as a correction to all enrichment values in the sample.
#' @param total_enrich Argument of length one describing the total enrichment value of the target isotope(s) achieved. The final fractional isotope enrichment
#'  calculations will be divided by this amount. A numeric value should be a positive number less than one and will be applied to every sample. A character value
#'  indicates a column with one or more total enrichment values, so that sample- or group-specific total enrichment values may be applied.
#' @param wad_to_gc_slope Single numeric value indicating the slope of the relationship between WAD values and estimated genomic GC content.
#'  The default value represents that calculated by Hungate \emph{et al.} 2015.
#' @param wad_to_gc_intercept Single numeric value indicating the intercept of the relationship between WAD values and estimated genomic GC content.
#'  The default value represents that calculated by Hungate \emph{et al.} 2015.
#' @param nat_abund_13C Single numeric value indicating the default assumption for background 13C. See \code{details}.
#' @param nat_abund_15N Single numeric value indicating the default assumption for background 15N See \code{details}.
#' @param nat_abund_18O Single numeric value indicating the default assumption for background 18O. See \code{details}.
#'
#' @details \code{calc_excess} automatically averages the isotopically unamended WAD values for each taxonomic feature on the assumption that density values
#'   will be identical (or nearly identical) for those samples.
#'
#'   The equations for calculating the molecular weights of taxon \emph{i}, designated \eqn{M_{Lab,i}} for labeled and \eqn{M_{Light,i}} for
#'   unlabeled, are:
#'
#'   \deqn{M_{Light,i} = 0.496 \cdot G_{i} + 307.691}
#'   \deqn{M_{Lab,i} = \left( \frac{\Delta W}{W_{Light,i}} + 1 \right) \cdot M_{Light,i}}
#'
#'   Where
#'
#'   \deqn{G_{i} = \frac{1}{0.083506} \cdot (W_{Light,i} - 1.646057)}
#'   Which indicates the GC content of taxon \emph{i} based on the density of its DNA when unlabeled.
#'   Note that the slope and intercept in this equation may be changed using the \code{wad_to_gc_slope}, and \code{wad_to_gc_intercept} parameters.
#'
#'   The calculation for the fractional of enrichment of taxon \emph{i}, \eqn{A_{i}} is:
#'
#'   \deqn{A_{i} = \frac{M_{Lab,i} - M_{Light,i}}{M_{Heavymax,i} - M_{Light,i}} \cdot (1 - N_{x})}
#'
#'   Where
#'
#'   \eqn{N_{x}}: The natural abundance of heavy isotope. Default estimates are: \eqn{N_{18O} = 0.002000429}, \eqn{N_{13C} = 0.01111233},
#'   and \eqn{N_{15N} = 0.003663004}
#'
#'   \eqn{M_{Heavymax,i}}: The highest theoretical molecular weight of taxon \emph{i} assuming maximum labeling by the heavy isotope
#'
#'   \eqn{M_{O,Heavymax,i} = (12.07747 + M_{Light,i}) \cdot L}
#'
#'   \eqn{M_{C,Heavymax,i} = (-0.4987282 \cdot G_{i} + 9.974564 + M_{Light,i}) \cdot L}
#'
#'   \eqn{M_{N,Heavymax,i} =( 0.5024851 \cdot G_{i} + 3.517396 + M_{Light,i}) \cdot L}
#'
#'   \emph{L}: The maximum label possible based off the percent of heavy isotope making up the atoms of that element in the labeled treatment
#'
#' @return \code{calc_excess} returns a data.table where each row represents a taxonomic feature within a single replicate.
#'  The following columns are produced: excess atom fraction (\code{eaf}). \code{NA} values (usually when an organism is present in
#'  only the unlabeled or labeled samples) are removed.
#'
#'  Because fraction-level data are being condensed to replicate-level, a list of grouping columns to keep is not necessary
#'  \emph{unless} bootstrapping is specified, wherein those groupings will be used to organize the resampling effort.
#'
#'  Bootstrap functionality produces the columns: \code{2.5\%}, \code{50\%}, and \code{97.5\%}
#'  reflecting the median bootstrapped enrichment value and the 95\% confidence interval
#'  of enrichment values.
#'  In addition, probability that a taxon's enrichment was less than zero is represented
#'  in the \code{p_val} column which is based on the proportion sub-zero bootstrapped observations.
#'
#'  If \code{iso_trt} is not a factor, then it will be coerced to one and it will
#'  return a message telling the user which value was assigned to which isotope labeling level.
#'  Most users adopt a scheme of using the terms "light" and "label" to define their
#'  treatments but please note that the word "label" comes first in the alphabet and
#'  will be assigned by the function to be the unlabeled or light treatment.
#'  If you wish to avoid this (you almost certainly do), you must specify your factor levels beforehand.
#'
#'
#' @seealso \code{\link{calc_wad}, \link{wad_wide}}
#'
#' @examples
#'  # Load in example data
#'  data(example_qsip)
#'
#'  # relativize sequence abundances (should be done after taxonomic filtering)
#'  example_qsip[, rel_abund := seq_abund / sum(seq_abund), by = sampleID]
#'
#'  # ensure that the "light" treatment is the first factor level in the isotope treatment column
#'  levels(example_qsip$iso_trt)
#'
#'  # calculate weighted average densities
#'  wads <- calc_wad(example_qsip,
#'                   tax_id = 'asv_id', sample_id = 'sampleID', frac_id = 'fraction',
#'                   frac_dens = 'Density.g.ml', frac_abund = 'avg_16S_g_soil',
#'                   rel_abund = 'rel_abund',
#'                   grouping_cols = c('treatment', 'isotope', 'iso_trt', 'Phylum'))
#'
#'  # calculate fractional enrichment in excess of background
#'  eaf <- calc_excess(wads, tax_id = 'asv_id', sample_id = 'sampleID',
#'                     iso_trt = 'iso_trt', isotope = 'isotope')
#'
#' @references
#'  Hungate, Bruce, \emph{et al.} 2015. Quantitative microbial ecology through stable isotope probing.
#'  \emph{Applied and Environmental Microbiology} \strong{81}: 7570 - 7581.
#'
#'  Morrissey, Ember, \emph{et al.} 2018. Taxonomic patterns in nitrogen assimilation of soil prokaryotes.
#'  \emph{Environmental Microbiology} \strong{20}: 1112 - 1119.
#'
#' @export

calc_excess <- function(data, tax_id = c(), sample_id = c(), wads = 'wad',
                        iso_trt = c(), isotope = c(),
                        bootstrap = FALSE, iters = 999L, grouping_cols = c(), min_freq = 3,
                        correction = TRUE, rm_outliers = TRUE, non_grower_prop = 0.1,
                        total_enrich = 1,
                        wad_to_gc_slope = 0.083506, wad_to_gc_intercept = 1.646057,
                        nat_abund_13C = 0.01111233, nat_abund_15N = 0.003663004, nat_abund_18O = 0.002011429) {
  vars <- list(tax_id, sample_id, iso_trt, isotope)
  if(any(sapply(vars, is.null))) {
    null_vars <- which(sapply(vars, is.null))
    null_vars <- paste(c('taxon IDs', 'sample IDs', 'isotope addition (for each row "13C" or "15N" or "18O")',
                         'isotope treatment (amended or unamended)')[null_vars],
                       sep = ',')
    stop("Must supply the following columns: ", null_vars)
  }
  vars <- c(vars, wads)
  if(any(sapply(vars, function(x) !exists(x, data)))) {
    missing_vars <- which(sapply(vars, function(x) !exists(x, data)))
    missing_vars <- paste(vars[missing_vars], sep = ',')
    stop("Missing the following column(s) in supplied data: ", missing_vars)
  }
  bad_tot_enrich_msg <- 'Total_enrich must be either a single olumn name or a single numeric value between 0 and 1'
  if(length(total_enrich) > 1) {
    stop(bad_tot_enrich_msg)
  } else if(is.character(total_enrich)) {
    if(!exists(total_enrich, data)) stop(bad_tot_enrich_msg)
  } else if(is.numeric(total_enrich))
    if(total_enrich > 1 | total_enrich <= 0) stop(bad_tot_enrich_msg)
  #
  if(bootstrap == FALSE) {
    eafd <- wad_wide(data, tax_id = tax_id, sample_id = sample_id, wads = wads, iso_trt = iso_trt, isotope = isotope)
    setnames(eafd, old = isotope, new = 'iso')
    # calculate molecular weights
    eafd[, gc_prop := (1 / wad_to_gc_slope) * (light - wad_to_gc_intercept)
         ][, mw_light := (0.496 * gc_prop) + 307.691
           ][, mw_label := (((label - light) / light) + 1) * mw_light]
    # calculate enrichment
    eafd[iso == '18O', `:=` (mw_max = mw_light + 12.07747, nat_abund = nat_abund_18O)
         ][iso == '13C', `:=` (mw_max = mw_light + 9.974564 + (-0.4987282 * gc_prop), nat_abund = nat_abund_13C)
           ][iso == '15N', `:=` (mw_max = mw_light + 3.517396 + (0.5024851 * gc_prop), nat_abund = nat_abund_15N)
             ][, eaf := ((mw_label - mw_light) / (mw_max - mw_light)) * (1 - nat_abund)]
    # correct enrichment values
    if(rm_outliers == TRUE) {
      pos_out <- suppressWarnings(pos_outlier(eafd$eaf))
      neg_out <- suppressWarnings(neg_outlier(eafd$eaf))
    } else {
      neg_out <- -Inf
      pos_out <- Inf
    }
    eafd <- eafd[eaf > neg_out]
    if(correction) {
      shift <- eafd[!is.na(eaf)
                    ][order(eaf)
                      ][, .(shift = median(eaf[1:floor(non_grower_prop * .N)])),
                        by = sample_id]
      eafd <- merge(eafd, shift, by = sample_id, all.x = TRUE)
      eafd[, eaf := eaf - shift][, shift := NULL]
      if(rm_outliers) {
      } else {
        neg_out <- -Inf
      }
    }
    # adjust enrichment for total labeling environment - default value of 1 results in no change
    if(is.character(total_enrich)) {
      setnames(eafd, old = total_enrich, new = 'te')
      } else {
        eafd[, te := total_enrich]
      }
    eafd[, eaf := eaf / te]
    # clean final data output
    # remove NA EAF values - this will also remove all the unlabeled samples
    eafd <- eafd[!is.na(eaf), !c('label', 'light', 'gc_prop', 'te',
                                 'mw_light', 'mw_label', 'mw_max', 'nat_abund')]
    setnames(eafd, old = c('wvd_label', 'iso'), new = c('wvd', isotope), skip_absent = TRUE)
  ####################
  # BOOTSTRAPPING CODE
  ####################
  } else if(bootstrap == TRUE) {
    bd <- copy(data)
    setnames(bd, old = c(sample_id, iso_trt, wads, isotope), new = c('sid', 'iso_trt', 'wad', 'iso'))
    # account for total enrichment
    if(is.character(total_enrich)) setnames(bd, old = total_enrich, new = 'te') else bd[, te := total_enrich]
    # make sure iso_trt column is a factor
    if(is.factor(bd[[iso_trt]]) == FALSE || nlevels(bd[[iso_trt]]) > 2) {
      test_trt <- factor(bd[[iso_trt]])
      light_trt <- levels(test_trt)[1]
      message('Assigned user value "', light_trt, '" as the unamended or light treatment and user value "',
              levels(test_trt)[!levels(test_trt) %in% light_trt],
              '" as the heavy treatment(s).')
      bd[[iso_trt]] <- factor(bd[[iso_trt]])
    }
    # rename factor levels
    bd[, iso_trt := factor(iso_trt, levels = levels(iso_trt), labels = c('light', 'label'))]
    # assess minimum frequency
    bd <- bd[iso_trt == 'label', rep_freq := uniqueN(sid), by = c(tax_id, grouping_cols)
             ][iso_trt == 'light', rep_freq := uniqueN(sid), by = tax_id
               ][rep_freq >= min_freq
                 ][, rep_freq := NULL]
    # remove outlier WAD values
      if(rm_outliers) {
        pos_out <- suppressWarnings(pos_outlier(bd$wad))
        neg_out <- suppressWarnings(neg_outlier(bd$wad))
      } else {
        pos_out <- Inf
        neg_out <- -Inf
      }
    bd <- bd[wad > neg_out]
    # set keys and sort for faster merging in for-loop
    bd <- setkeyv(bd, c(tax_id, grouping_cols, 'iso'))
    # store permutation output in a data.table
    eaf_output <- subset(bd, iso_trt == 'label', select = c(tax_id, grouping_cols, 'iso'))
    eaf_output <- unique(eaf_output)
    # quickly generate a list of replicate resampling permutations within your grouping variables
    reps <- subset(bd, select = c('sid', 'iso_trt', grouping_cols, 'iso'))
    reps <- unique(reps)
    reps[, rep_count := uniqueN(sid), by = c('iso_trt', grouping_cols, 'iso')]
    resamps <- reps[, as.list(sample.int(rep_count, iters, replace = TRUE)), by = c('sid', grouping_cols, 'iso')]
    #----------------
    # BEGINNING OF FOR-LOOP
    for(i in 1:iters) {
      cat('Bootstrap iteration', i, 'of', iters, '\r')
      cols_for_subsample <- c('sid', grouping_cols, paste0('V', i))
      # subsample replicates
      dat_boot <- merge(bd, resamps[, ..cols_for_subsample],
                        by = c('sid', grouping_cols),
                        all.x = TRUE)
      setnames(dat_boot, paste0('V', i), 'resample_rep')
      dat_boot <- dat_boot[, `:=` (wad = wad[resample_rep],
                                   te = te[resample_rep]),
                           by = c(tax_id, grouping_cols, 'iso_trt')]
      # convert to wide format
      dat_boot <- wad_wide(dat_boot, tax_id = tax_id, sample_id = 'sid', wads = wads, iso_trt = iso_trt, isotope = 'iso')
      dat_boot <- dat_boot[iso %in% c('18O', '13C', '15N')]
      # calculate molecular weights
      dat_boot[, gc_prop := (1 / wad_to_gc_slope) * (light - wad_to_gc_intercept)
               ][, mw_light := (0.496 * gc_prop) + 307.691
                 ][, mw_label := (((label - light) / light) + 1) * mw_light]
      # calculate enrichment
      dat_boot[iso == '18O', `:=` (mw_max = mw_light + 12.07747, nat_abund = nat_abund_18O)
               ][iso == '13C', `:=` (mw_max = mw_light + 9.974564 + (-0.4987282 * gc_prop), nat_abund = nat_abund_13C)
                 ][iso == '15N', `:=` (mw_max = mw_light + 3.517396 + (0.5024851 * gc_prop), nat_abund = nat_abund_15N)
                   ][, eaf := ((mw_label - mw_light) / (mw_max - mw_light)) * (1 - nat_abund)]
      # adjust enrichment for total labeling environment - default value of 1 results in no change
      dat_boot[, eaf := eaf / te]
      # calculate mean enrichment for bootstrap iteration
      dat_boot <- dat_boot[, .(eaf = mean(eaf)), by = c(tax_id, grouping_cols, 'iso')]
      # add iteration count to EAF columns
      setnames(dat_boot, 'eaf', paste0('eaf_', i))
      # add EAF values to output data
      cols_to_add <- c(tax_id, grouping_cols, 'iso', paste0('eaf_', i))
      eaf_output <- merge(eaf_output,
                          dat_boot[, ..cols_to_add],
                          all.x = TRUE,
                          by = c(tax_id, grouping_cols, 'iso'),
                          sort = FALSE)
    }
    # END OF FOR-LOOP
    #----------------
    rm(dat_boot, i, cols_for_subsample, cols_to_add)
    # Calculate distribution of 18O EAF values
    eaf_output <- melt(eaf_output,
                       id.vars = c(tax_id, grouping_cols, 'iso'),
                       variable.name = 'iteration',
                       value.name = 'eaf',
                       na.rm = TRUE)
    # perform density correction for each tube based on distribution of EAF values
    if(correction) {
      shift <- eaf_output[!is.na(eaf)
                          ][order(eaf)
                            ][, .(shift = median(eaf[1:floor(non_grower_prop * .N)])),
                              by = c(grouping_cols, 'iso', 'iteration')]
      eaf_output <- merge(eaf_output, shift, by = c(grouping_cols, 'iso', 'iteration'), all.x = TRUE)
      eaf_output[, eaf := eaf - shift][, shift := NULL]
    }
    # note: your EAF confidence interval columns are called `2.5%`, `50%`, and `97.5%`
    eaf_med_ci <- eaf_output[, as.list(quantile(eaf, probs = c(0.025, .5, .975), na.rm = TRUE)),
                             by = c(tax_id, grouping_cols, 'iso')]
    # calculate p-value for EAF being greater than 0
    eaf_pval <- eaf_output[, .(p_val = 1 - (sum(eaf > 0) / .N)), by = c(tax_id, grouping_cols, 'iso')]
    # combine and remove NA values -- this will also remove all the unlabeled samples
    eafd <- merge(eaf_med_ci, eaf_pval)
    eafd <- eafd[!is.na(`50%`)]
    setnames(eafd, old = 'iso', new = isotope)
  }
  return(eafd)
}