NB: Some data names and structures have been altered since completing this vignette, and it might be a bit broken. Fixing it is not a high priority right now.

A suite of replicate M-TRFLP reactions was carried out containing various controls and using a PCR volume of either 25 or 50 µL. Data from these were then processing using T-REX and eight different analysis settings. This vignette explores the effects of these treatments. The goal is to maximise within-sample similarity (i.e. similarity between replicates) and maximise between-sample differences. Settings chosen to optimise this will be used in analysis of the full sample suite.

Part 1: T-REX settings

When analysing TRFLP data with the online platform T-REX, choices of settings for the noise filtering and T-RF alignment algorithms can produce substantially different OTU profiles. This vignette compares eight such combinations of T-REX settings (see ?TRF_replicates for details) to explore their their effects on the measured similarity of replicates. I want settings which maximise within-replicate similarity and minimise between-sample similarity.

The fixed T-REX analysis settings used were as follows[^1]:

In addition, some parameters were varied:

This created a panel of 8x treatment combinations A - H, as follows:

| treatment code | sd in noise filtering | bp in peak alignment | >1 peak per T-RF | |----------------|-----------------------|----------------------|------------------| | A | 1 | 0.5 | No | | B | 1 | 0.5 | Yes | | C | 1 | 1 | No | | D | 1 | 1 | Yes | | E | 1.5 | 0.5 | No | | F | 1.5 | 0.5 | Yes | | G | 1.5 | 1 | No | | H | 1.5 | 1 | Yes |

The analysis proceeds as follows:

  1. Generating Simpson diversity measures for each sample and quantifying their variability, and how this is affected by T-REX settings A - H
  2. Generating a matrix of pairwise sample similarity indices for each T-REX setting and determining which setting minimises within-sample dissimilarity and maximises between-sample similarity (probably via NMDS and ANOVA)

Effect on sample diversity

This is not the main point of interest but should be explored.

# Import replicates OTU matrices & data

## Sample diversity
# Create Simpson diversity indices
replicates_Simpson <-, vegan::diversity, index = "simpson"))
# Split into list of treatment groups rather than one big list
treatments_diversity <- as.list(LETTERS[1:8])
for(i in 1:8) {
treatments_diversity[[i]] <- replicates_Simpson[,
  grep(paste(LETTERS[i], ".", sep = ""),
       fixed = TRUE)

# Compare boxplots of the richness treatments for each taxon
par_default <- par(no.readonly = TRUE)
par(mfrow = c(2, 4), mar = c(2, 2, 2, 2))
for(i in 1:length(treatments_diversity)) {
  boxplot(treatments_diversity[[i]], ylim = c(0.5, 1))
# NOTE: outliers are not all included in the y-axis range

Results are just about what you would expect. Increasing the bin width and permitting more than one peak per T-RF reduces the diversity of Archaea & Fungi in particular (treatments D and H), while increasing the noise filtering has a particular impact on the diversity of Eukarya (settings E - H), though to a lesser extent the changes are noticeable across the board. Increasing the noise filtering doesn't appear to make the sample set any more robust to changes in peak alignment settings, which suggests the 1 sd setting may be preferable. It doesn't appear to increase the noise from other treatments, and it's least likely to exclude "true" peaks.

Effect on sample similarity

I want to select the T-REX settings which maximise within-sample similarity and minimise between-sample similarity. I will use Bray-Curtis (and eventually the Forbes index) as my similarity metric.


  1. Generate pairwise sample similarity matrix for each treatment
  2. Generate NMDS plots for each treatment
  3. Build ANOVA & assess multivariate dispersion

Step 1: generate similarity or distance matrices

## Generate pairwise dissimilarity or distance matrix
# Initialist empty list
replicates_dist <- as.list(names(TRF_replicates))
# Pairwise sample similarity matrix
for(i in 1:length(replicates_dist)) {
  replicates_dist[[i]] <- vegan::vegdist(TRF_replicates[[i]]) # default method = bray-curtis
# Replace names - whoops
names(replicates_dist) <- names(TRF_replicates)
# Let's see if that's worked properly, taking a single distance matrix as an example:
  sample(1:length(replicates_dist), 1)]]), # Choose a random element of the list
  digits = 2) # Remove some information to shrink the table a little
# Yup, that's fine. It's still pretty big.

Step 2: Effect on ordination of samples

NMDS will find a 2-D ordination which best reflects the multidimensional (m = 27) ordination of the samples, using the distance matrix generated in the previous chunk.

replicates_NMDS <- as.list(names(replicates_dist))
replicates_NMDS <- lapply(TRF_replicates, vegan::metaMDS)

Ah. It seems that I have too few observations for NMDS, according to this FAQ. I mean, I have 27 "points" (i.e. samples), which is greater than the recommended 4 x k + 1 = 9 (where k = number of dimensions in solution, in my case I am seeking a 2-dimensional solution). However, my "points" are highly degenerate: each is one of three replicates, so really I only have 9 unique samples, and even then, 8 of them are simply different treatments (PCR volume = 25 or 50 µL) of the same sample. So my unique number of samples is more like 5. According to the FAQ, metric methods may be more appropriate.

I'll try PCoA instead.

# Generate the PCoA using vegan's wcmdscale()
replicates_PCoA <- as.list(names(replicates_dist))
replicates_PCoA <- lapply(replicates_dist, vegan::wcmdscale, eig = TRUE, x.ret = TRUE, w = rep(1, 27))
# Subset the list by taxon so I can look at them in a more manageable way
taxa <- c("Archaea", "Bacteria", "Fungi", "Eukarya")
taxa_PCoA <- as.list(taxa)
for(i in seq_along(taxa)){
  taxa_PCoA[[i]] <- replicates_PCoA[grep(taxa[i], names(replicates_PCoA))]
names(taxa_PCoA) <- taxa
# Stressplots
par(mfrow = c(4, 2), mar = c(2, 2, 2, 2))
taxa_stressplot <- function(PCoA_list) {
  for(i in seq_along(PCoA_list)){ # assumes all elements have same length as first element. I should abstract this later probably.
        title(main = names(PCoA_list)[i],
              xlab = "Bray-Curtis distance",
              ylab = "PCoA euclidean distance",
              line = -1)

lapply(taxa_PCoA, taxa_stressplot)

Looks like the stress values are rather high? Distinct differences between taxa.

Archaea: Performance appears generally poor, showing many Bray-Curtis differences poorly reflected by the PCoA distances. Bacteria: Performance is patchy, but generally poor. Very few PCoA distances between 0.2 and 0.6, for some reason. Fungi: Slightly better. Eukarya: As for Archaea.

Let's examine the eigenvalues

par(mfrow = c(4, 2), mar = c(2, 2, 2, 2))
taxa_screeplot <- function(PCoA_list) {
  for(i in seq_along(PCoA_list)){ # assumes all elements have same length as first element. I should abstract this later probably.
    # Create broken-stick model
    eig <- PCoA_list[[i]]$eig
    n <- length(eig)
    bsm <- data.frame(j = seq(1:n), p = 0)
    bsm$p[1] <- 1 / n
    for(x in 2 : n) {
      bsm$p[x] <- bsm$p[x - 1] + (1 / (n + 1 - x))
    bsm$p <- 100 * bsm$p / n
    # Initialise plot
    barplot(t(cbind(100 * eig / sum(eig), bsm$p[n:1])),
            beside = TRUE, col = c("bisque", 2), las = 2)
    # Plot legend
    legend("topright", c("% eigenvalue", "Broken stick model"),
           pch = 15, col = c("bisque", 2), bty = "n", title = names(PCoA_list)[i])
lapply(taxa_PCoA, taxa_screeplot)

No obvious cutoff dimensions for most plots, though some settings obviously improve the proportion of variance on the first couple of axes.

OK now we can finally look at the plot themselves:

# Plot the results
par(mfrow = c(4, 2), mar = c(2, 2, 2, 2))
taxa_PCoA_plot <- function(PCoA_list) {
  for(i in seq_along(PCoA_list)){ # assumes all elements have same length as first element. I should abstract this later probably.
        plot <- vegan::ordiplot(PCoA_list[[i]], display = "sites",
                                type = "none")
        text(plot, what = "sites", labels = paste(
           rownames(replicates_data))], # This is confusing, but selects the correct labels.
    abline(h = 0, lty = 3)
    abline(v = 0, lty = 3)

lapply(taxa_PCoA, taxa_PCoA_plot)

General points:

Archaea: Archaeal communities are generally well separated. Samples 44 and H2O are well separated, but sample 25 overlaps with both. Small bin window has better discrimination. Bacteria: Samples 44 and 1 substantially overlap, as do samples 25 and H2O. Changing the parameters doesn't seem to change this overmuch. Fungi: Everything overlaps except sample SYN. Weirdly, they still form three distinct clusters - but only one of those clusters (SYN) has anything to do with which sample it contains. Again, analysis settings don't have a large effect on discrimination. Eukarya: As for Archaea, but with more spread (except for SYN)

Overall, varying the settings doesn't change discriminatory power overmuch, though settings A and B (which are almost identical) may be marginally better as they reduce dispersion of replicates.

Statistical comparison of similarities/treatments

I have two metrics of interest:

I could use several different approaches (betadisper(), adonis(), dbRDA), but they all take a single distance matrix as an argument, so I'd need to create a single aggregate object for all treatments, supply two groups of factors (sample identity vs. treatment) and look at the interaction effects between treatment and sample metrics.

The alternative is to construct all 8 models separately (treatments A - H) and compare their metrics. This is the simplest approach which I'll take to begin with.

In order to preclude the necessity of comparing 4x models for each sample (one for each taxon), I'll work with object TRF_replicates, which contains OTUs for all four taxa side-by-side in the same matrix.

Evaluating 8x vegan::betadisper() models

# Create aggregate distance object for each treatment
aggregate_dist <- as.list(LETTERS[1:8])
aggregate_dist <- lapply(TRF_replicates, vegan::vegdist)
# Generalised function for matching file names to sample codes (sample_ref)
labels <- replicates_data$sample[
        attr(aggregate_dist[[2]], "Labels")

# Create betadisper() models
treatments_betadisper <- lapply(aggregate_dist, vegan::betadisper,
                                group = labels, bias = TRUE)
# Plot the dispersions
par(mfrow = c(4, 2), mar = c(2, 2, 2, 2))
lapply(treatments_betadisper, boxplot)

Not much difference between treatments. Setting "D" seems to minimise within-sample dispersion.

Evaluating adonis() models

# Reorder replicates_data to match the order in the distance matrix
replicates_data_reorder <- replicates_data[match(attr(aggregate_dist[[2]], "Labels"),
                                                 , ]
# Create adonis() models
treatments <- LETTERS[1:8]  
treatments_adonis <- as.list(treatments)
for(i in seq_along(treatments)) {
  treatments_adonis[[i]] <- vegan::adonis(formula = aggregate_dist[[i]] ~ sample,
                                   data = replicates_data_reorder)
names(treatments_adonis) <- treatments
sample_R2 <- numeric(length = 8)
for(i in seq_along(treatments_adonis)) {
  sample_R2[i] <- treatments_adonis[[i]]$$R2[1]
treatments[sample_R2 == max(sample_R2)]

Looks like setting D (1 sd noise filtering, 1 bp bin width, more than 1 peak per T-RF) maximises the explanatory power of the "sample" factor - in other words, maximises discrimination between samples. I'll proceed with this for now.

TODO: Raw peaks and PCR volume

Quantifying within- and between-sample variability in raw peaks (i.e. sample richness), as reported by GeneMapper, is a measure of M-TRFLP reproducibility independent of T-REX data processing choices.

In addition, for each sample, 3 replicate PCRs were carried out at two different reaction volumes: 25 or 50 µL. I want to measure the effect of PCR volume on raw richness, plus diversity and similarity. It's a bit of a moot point since I already went ahead with the 50 µL reaction volume for the full sample set after a cursory analysis, but still.

## Sample richness
# Import the raw M-TRFLP peaks data
# Quantify richness & variability: per-sample, per-replicate, per-taxon etc.

# Effect of PCR volume

[^1]: For more info, see the documentation on the T-REX website.

mixtrak/wellington.aquifer.ecol documentation built on Nov. 30, 2017, 4:25 a.m.