README.md

metcourse

metcourse provides a simple algorithm for correcting systematic deviations in metabolomic time-courses. Specifically, it deals with the case where all metabolite concentrations are overestimated or underestimated by a constant percentage within a single sample. This can occur from inaccurate addition or processing of an internal standard as well as a result of variable solvent levels when dealing with intracellular metabolite extraction or biofluids. The correction method was validated by simulating realistic metabolic time-courses, with the simulation framework also provided in this package.

More information is available in the manuscript "A correction method for systematic error in 1H-NMR time-course data validated through stochastic cell culture simulation" (open access link - doi:10.1186/s12918-015-0197-4).

Original code

Following submission of the manuscript, the original scripts were refactored into this package with some changes to enhance usability. The original code is still available under "R/original" in the form of "Additional files" mentioned throughout the manuscript.

Installation

Installation can be performed directly from github using the devtools package:

library(devtools)
install_github('ssokolen/metcourse')

Table of contents

  1. Correction
  2. Simulation
  3. Trend shape
  4. Maximum concentrations
  5. Minimum concentrations
  6. Measurement error
  7. Putting it all together

Correction

The correction of systematic bias is performed by correct_rel_bias, which requires only three arguments -- time or sample number, metabolite concentration, and a column of metabolites corresponding to each concentration. We have found that the cubic regression spline smooth provided by the mgcv package yielded the best results (with a basis dimension, k, of 5) and this smooth is used by default. Custom smoothing functions can also be defined (see met_smooth_gam and met_smooth_loess).

  ## Generating a set of metabolic trends to perform correction

  # Using previously simulated data 40 metabolic trends with 10 time points
  # (see Simulation section below and the `simulate_timecourse` example)
  data(timecourse)

  # Artificially adding an error of 5% at sample 4
  logic <- timecourse$sample == 4
  timecourse$concentration[logic] <- timecourse$concentration[logic] * 1.05

  ## The correction itself

  timecourse$corrected <- correct_rel_bias(timecourse$time,
                                           timecourse$concentration,
                                           timecourse$metabolite)

  # f_smooth argument can be used to set any smoothing function that takes
  # an input of concentrations and returns corrected values.
  # met_smooth_loess and met_smooth_gam have been provided as simple wrappers
  # to gam and loess.

  # span is passed on to loess
  timecourse$corrected_loess <- correct_rel_bias(timecourse$time,
                                                 timecourse$concentration,
                                                 timecourse$metabolite,
                                                 f_smooth = met_smooth_loess,
                                                 span = 0.75)

  # k is passed on to gam
  timecourse$corrected_loess <- correct_rel_bias(timecourse$time,
                                                 timecourse$concentration,
                                                 timecourse$metabolite,
                                                 f_smooth = met_smooth_gam,
                                                 k = 5)

  # Plotting -- the original value of the corrected point is marked in red
  par(mfrow = c(8, 5), oma = c(5, 4, 1, 1) + 0.1, mar = c(1, 1, 1, 1) + 0.1)
  new.time <- seq(min(timecourse$time), max(timecourse$time), length.out=100)

  for (metabolite in unique(timecourse$metabolite)) {

    logic <- timecourse$metabolite == metabolite
    d <- timecourse[logic, ]

    logic2 <- d$concentration != d$corrected

    plot(d$time, d$corrected, pch = 16, xlab = '', ylab = '',
    ylim = c(min(d$concentration), max(d$concentration)))

    smoothed <- met_smooth_gam(d$time, d$corrected,
    new.time = new.time, k = 4)
    lines(new.time, smoothed)

    points(d$time[logic2], d$concentration[logic2], pch = 16, col = 'red')
  }

  title(xlab = 'Time post inoculation (hours)',
        ylab = 'Concentration (mM)', outer = TRUE, line = 3)

Simulation

The simulation of metabolite concentration time-courses is divided into 4 parameters -- the overall shape of the trend, maximum concentration, minimum concentration, and measurement variability (noise). Each step can be run separately or combined into a single simulation.

Trend shape

Metabolic time-course shapes are divided into decreasing, increasing, and concave. Decreasing and increasing trends are modelled by sigmoidal equations, y = (1 + exp((x-a)/b))^(-1) and y = 1 - (1 + exp((x-a)/b))^(-1), respectively. Concave trends are modelled by a beta distribution y = x^(a-1) * (1-x)^(b-1), truncated to allow different starting and final concentrations.

  # Parameters take the form of lower and upper bounds 
  # on the a and b coefficients in (1 + exp((x-a)/b))^(-1)
  par <- c(0.2, 0.6, 0.10, 0.18)

  # Simulating 1000 trends with 100 timepoints per trend
  trends <- simulate_decreasing(1000, 100, par)

  # Plotting all trends at once
  y_mat <- t(as.matrix(trends))
  x <- seq(0, 1, length.out=100)
  matplot(x, y_mat, type = 'l', lty = 1, lwd = 4, col = grey(0, 0.05),
          xlab = 'Relative culturing time', ylab = 'Relative concentration')

  # Trends can also be constructed from a mixture of two sets of parameters
  par1 <- c(0.2, 0.6, 0.10, 0.18)

  # Modifying the bounds on coefficient a
  par2 <- c(0.6, 0.9, 0.10, 0.18)

  # Only 0.05 of the trends will be generated using par1
  trends <- simulate_decreasing(1000, 100, par1, par2, 0.05)

  # Plotting
  y_mat <- t(as.matrix(trends))
  x <- seq(0, 1, length.out=100)
  matplot(x, y_mat, type = 'l', lty = 1, lwd = 4, col = grey(0, 0.05),
          xlab = 'Relative culturing time', ylab = 'Relative concentration')

See examples for simulate_increasing and simulate_concave i.e. example(simulate_increasing) and example(simulate_concave).

Maximum concentrations

Maximum metabolite concentrations are generated from a normal distribution of concentration logarithms.

  # Parameters take the form of mean and standard deviation values
  # in units of concentration (base-10 logarithms are taken internally)
  par <- c(7, 2)
  out <- simulate_max(10000, par)

  # Formatting and plotting output on logarithmic scale
  out <- log10(out)
  labels <- c(0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 50)
  hist(out, 20, axes = FALSE, probability = TRUE, 
       main = '', xlab = 'Maximum metabolite concentration')
  axis(side = 2)
  axis(at = log10(labels), labels = labels, side = 1)

  # A mixture of two distributions can be used to account for a large
  # number of low concentration metabolites. 
  par1 <- c(7, 2)
  par2 <- c(0.5, 2)

  # Only 0.3 of the maximum concentration are sampled from the
  # higher concentration distribution.
  out <- simulate_max(10000, par1, par2, 0.3)

  # Formatting and plotting output on logarithmic scale
  out <- log10(out)
  labels <- c(0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 50)
  hist(out, 20, axes = FALSE, probability = TRUE, 
       main = '', xlab = 'Maximum metabolite concentration')
  axis(side = 2)
  axis(at = log10(labels), labels = labels, side = 1)

  # Constraints can be set on upper and lower values
  # (while maintaining the number of maximum concentrations generated)
  out <- simulate_max(10000, par1, par2, 0.3, con=c(0.1, 20))

  # Formatting and plotting output on logarithmic scale
  out <- log10(out)
  labels <- c(0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 50)
  hist(out, 20, axes = FALSE, probability = TRUE, 
       main = '', xlab = 'Maximum metabolite concentration')
  axis(side = 2)
  axis(at = log10(labels), labels = labels, side = 1)

Minimum concentrations

To avoid correlations between minimum and maximum concentrations, minimum concentration are calculated by generating a relative (fractional) change from maximum concentration. A mixture of two beta distributions gives a fine degree of control over the distribution.

  # The syntax for simulate_rel_change is very similar to simulate_max

  # First set of parameters to favour changes in the 25% range
  par1 <- c(2, 5)

  # Second set of parameters to add high probability of either no
  # change, or very large change (optional)
  par2 <- c(0.5, 0.5)

  # (Constraints can also be used to set minimum/maximum changes)

  # Simulating
  out <- simulate_rel_change(10000, par1, par2, 0.7)

  # Plotting
  hist(out, 20, probability = TRUE,
       main = 'Relative change', xlab = 'Fractional change in concentration')

  # Repeating with only one set of parameters and with constraints
  out <- simulate_rel_change(10000, par1, con = c(0.1, 0.5))

  # Plotting
  hist(out, 20, probability = TRUE,
       main = 'Relative change', xlab = 'Fractional change in concentration')

Measurement error

The simulation of measurement error follows the same pattern as above, with the distribution of relative standard deviations modelled as a mixture of normal distributions.

  # First set of parameters for high precision measurements (sd ~ 5%)
  par1 <- c(0.05, 0.02)

  # Second set of parameters for lower precision measurements (sd ~ 10%)
  par2 <- c(0.10, 0.04)

  # Constraints are important to prevent negative or impossibly low variability
  con <- c(0.01, 1)

  # Simulating
  out <- simulate_rel_sd(10000, par1, par2, 0.7, con)

  # Plotting
  hist(out, 20, probability = TRUE,
       main = 'Measurement variability', xlab = 'Fractional standard deviation')

Putting it all together

The above functions can be combined manually to generate a set of trends. Alternatively, simulate_timecourse simplifies the process, but a large number of parameters are still required.

  # This example makes use of tidyr and dplyr packages
  library(dplyr)
  library(tidyr)

  # Making up metabolite names
  metabolites <- paste('met', 1:6, sep = '')

  # Generating a set of 6 decreasing trends with 20 timepoints
  trends <- simulate_decreasing(n = 6, n.samples = 20, 
                                par1 = c(0.2, 0.6, 0.10, 0.18))

  # Changing column names of samples
  colnames(trends) <- paste('sample', 1:20, sep = '')

  # Adding metabolite column
  trends$metabolite <- metabolites

  # Initializing data frame of parameters
  parameters <- data.frame(metabolite = metabolites, stringsAsFactors = FALSE)

  # Adding maximums
  parameters$maximum <- simulate_max(n = 6, par1 = c(7, 2), con = c(0, 50))

  # Adding minimums 
  parameters$change <- simulate_rel_change(n = 6, par1 = c(5, 2))
  parameters$minimum <- (1 - parameters$change) * parameters$maximum

  # Adding measurement error
  parameters$sdv <- simulate_rel_sd(n = 6, par1 = c(0.02, 0.02), 
                                    con = c(0.01, 1))

  # Combining
  combined <- left_join(trends, parameters, by = 'metabolite')  

  # Converting into "long" format
  combined <- gather(combined, sample, conc, sample1:sample20)

  # Scaling trends
  combined <- combined %>%
                group_by(metabolite) %>%
                mutate(conc = conc * (maximum - minimum) + minimum,
                       error = rnorm(20, 0, mean(conc) * sdv[1]),
                       conc = conc + error,
                       sample = as.numeric(gsub('sample', '', sample)))

  # Plotting
  par(mfrow = c(3, 2), oma = c(5, 4, 1, 1) + 0.1, mar = c(1, 1, 1, 1) + 0.1)

  for (metabolite in unique(combined$metabolite)) {
    logic <- combined$metabolite == metabolite
    plot(combined$sample[logic], combined$conc[logic],
    xlab='', ylab='')
  }

  title(xlab = 'Sample', ylab = 'Concentration', outer = TRUE, line = 3)

The following example simulates a full metabolic profile from a suspension cell culture (modelled on the growth of insect cells). The number of metabolites is limited to 10 for easier plotting.

  # All the parameters are defined at once in a single list. Global parameters
  # in the form of `par1.max` are used in the generation of maximum values for
  # all trend types. These can be overriden for specific types of trends by
  # appending the trend type to the parameter e.g. `par1.max.increasing`.

  param <- list(
    # Maximum concentrations are the same for every trend type
    p.max = 0.3,
    par1.max = c(7, 2),
    par2.max = c(0.5, 2),
    con.max = c(0, 50),

    # Global change parameters are near 100% for increasing/concave trends
    par1.change = c(5, 0.1),
    con.change = c(0.5, 1),

    # Decreasing trends can have a wide variety of changes
    p.change.decreasing = 0.7,
    par1.change.decreasing = c(2, 5),
    par2.change.decreasing = c(0.5, 0.5),
    con.change.decreasing = c(0.1, 1),

    # Linear trends are characterized by relatively small changes
    par1.change.linear = c(1, 5),
    con.change.linear = c(0, 0.1),

    # Measurement error is the same for every trend type (but no more than 20%)
    p.sd = 0.7,
    par1.sd = c(0.04, 0.02),
    par2.sd = c(0.11, 0.02),
    con.sd = c(0, 0.20),

    # Decreasing trend specification
    p.trend.decreasing = 0.05,
    par1.trend.decreasing = c(0.2, 0.6, 0.10, 0.18),
    par2.trend.decreasing = c(0.6, 0.9, 0.10, 0.18),

    # Increasing trend specification
    p.trend.increasing = 0.15,
    par1.trend.increasing = c(0.045, 0.055, 0.2, 0.4),
    par2.trend.increasing = c(0.945, 0.955, 0.1, 0.3),

    # Concave trend specification
    par1.trend.concave = c(3.5, 4.5, 2.5, 3.5, 0.0, 0.2, 0.8, 0.9),

    # Linear trends are equaly split between increasing and decreasing
    p.trend.linear = 0.5
  )

  # Generating trends
  timecourse <- simulate_timecourse(10, c(0.3, 0.3, 0.3), param)

  # Plotting
  par(mfrow = c(5, 2), oma = c(5, 4, 1, 1) + 0.1, mar = c(1, 1, 1, 1) + 0.1)

  for (metabolite in unique(timecourse$metabolite)) {
    logic <- timecourse$metabolite == metabolite
    plot(timecourse$sample[logic], timecourse$concentration[logic],
    xlab='', ylab='')
  }

  title(xlab = 'Sample', ylab = 'Concentration', outer = TRUE, line = 3)


ssokolen/metcourse documentation built on May 30, 2019, 8:43 a.m.