In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

title: "Sensitivity Analysis of Quality Filtering Parameters" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Sensitivity Analysis for Quality Filtering Parameters} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(neonMicrobe)
setBaseDirectory(dirname(getwd()))
knitr::opts_knit$set(
  root.dir = NEONMICROBE_DIR_BASE()
)

The purpose of this vignette is to demonstrate how a researcher can test the effects of various DADA2 quality filtering parameter combinations on the outputs of the DADA2 pipeline, and ultimately, on the ecological inference. Specifically, this vignette asks the following from a subset of NEON 16S marker gene sequences:

What are the effects of quality filtering parameters on the number of remaining reads?
What are the effects of quality filtering parameters on alpha and beta diversity metrics?

The parameters that we vary to observe their downstream effects include the following:

truncLen.R: Reverse reads that do not meet or exceed truncLen.R in length will be discarded. Reverse reads that exceed truncLen.R will be truncated to truncLen.R.
maxEE.R: After truncation, reverse reads with greater than maxEE.R expected errors will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10)).

Because we are interested in variation in these benchmark metrics across different parameter sets, we refer to this analysis as a sensitivity analysis. This vignette is intended to provide a boilerplate for users of this R package to construct their own sensitivity analyses.

Load libraries

library(neonMicrobe)

library(dada2)
library(ShortRead)
library(Biostrings)
library(tibble)
library(dplyr)
library(vegan)
library(phyloseq)
library(ggplot2)
library(tidyr)

Set up base directory

Setting the base directory using setBaseDirectory() will create a point of reference for the neonMicrobe package, so that it knows where to look for the raw sequence files, and where to save the processed data. This should be the same base directory that you used in the "Download NEON Data" vignette.

print(getwd())

setBaseDirectory()

Constants used for this sensitivity analysis:

Quality filter parameters that we do not vary in this sensitivity analysis

MAX_EE_FWD <- 8
TRUNC_LEN_FWD <- 240
MIN_LEN <- 50

Parameters specific to this sensitivity analysis

N_SAMPLES <- 200
DIRNAME_TEST <- "qf_test_3-29-2021"
runIDs <-  c("B69PP", "B69RN", "B9994", "BDNB6", "BF462", "BFDG8", "BNMJ5", "BNMWB", "BRPH4", "C24VW", "C25T6", "C5B2R", "C7WK3", "C8VMV", "C977L", "C983L", "CBJYB", "CBTWG", "CDHG2", "CDJ5J")

Parameter value grid. The following allows testing of two quality filtering parameters at a time.

PARAM1 <- "maxEE.R"
PARAM2 <- "truncLen.R"
grid <- expand.grid(c(4, 8, 16), c(170, 220, 250))
params <- matrix(
  c(grid[,1],     # PARAM1
    grid[,2]),    # PARAM2
  byrow=FALSE, ncol=2,
  dimnames = list(NULL, c(PARAM1, PARAM2))
)
param_sets <- apply(params, 1, function(x) paste(c(rbind(c(PARAM1, PARAM2), x)), collapse="_"))

Assign filepath variables

Special directories were used for this sensitivity analysis.

PATH_16S <- file.path(NEONMICROBE_DIR_SEQUENCE(), "16S")
PATH_RAW <- PATH_16S
PATH_TRIMMED <- file.path(PATH_16S, "1_trimmed")
PATH_TEST <- file.path("outputs", DIRNAME_TEST)
dir.create(file.path(PATH_TEST, "results"), recursive=TRUE)

Get fastq files for this analysis

Retrieve files that match the sequence run IDs specified earlier.

rawFs <- sort(list.files(PATH_RAW, pattern = "_R1", full.names = TRUE, recursive = FALSE))
rawRs <- sort(list.files(PATH_RAW, pattern = "_R2", full.names = TRUE, recursive = FALSE))

Retrieve metadata. If you saved your metadata to file, you can load it here:

meta <- read.csv("data/sequence_metadata/qc_metadata/mmg_metadata_16SrRNA_QCd_20210329.csv")

If you did not save your metadata to file, you can re-download it using downloadSequenceMetadata(), and quality-control it using qcMetadata().

meta <- downloadSequenceMetadata(targetGene = "16S")
meta <- qcMetadata(meta)

Remove any samples that only have forward reads or only have reverse reads.

matched_fn <- getPairedFastqFiles(c(rawFs, rawRs), meta)
rawFs <- matched_fn[[1]]
rawRs <- matched_fn[[2]]

To cut down on computation time, select up to N_SAMPLES samples from the runs, up to N_SAMPLES/length(runIDs) from each run.

if(length(rawFs) > N_SAMPLES) {
  rawFs_subset <- c()
  rawRs_subset <- c()
  for(i in 1:length(runIDs)) {
    rawFs_runID <- rawFs[grep(runIDs[i], rawFs)]
    rawRs_runID <- rawRs[grep(runIDs[i], rawRs)]
    if(length(rawFs_runID) > N_SAMPLES/length(runIDs)) {
      set.seed(101010+i)
      subset <- sample(seq(1,length(rawFs_runID)), N_SAMPLES/length(runIDs), FALSE)
      rawFs_subset <- c(rawFs_subset, rawFs_runID[subset])
      rawRs_subset <- c(rawRs_subset, rawRs_runID[subset])
    } else {
      rawFs_subset <- c(rawFs_subset, rawFs_runID)
      rawRs_subset <- c(rawRs_subset, rawRs_runID)
    }
  }
  rawFs <- rawFs_subset
  rawRs <- rawRs_subset
}
write.csv(cbind(rawFs, rawRs), file.path(PATH_TEST, "input_files.csv"))

input_files <- read.csv(file.path(PATH_TEST, "input_files.csv"), row.names = 1, stringsAsFactors = FALSE)
rawFs <- input_files$rawFs
rawRs <- input_files$rawRs

Plot quality profiles

profiles_list <- list()
for(i in 1:length(runIDs)) {
  # Retrieve only those files associated with the appropriate runID
  profiles_list[[i]] <- gridExtra::grid.arrange(
    plotQualityProfile(grep(runIDs[i], rawFs, value=TRUE), aggregate=TRUE),
    plotQualityProfile(grep(runIDs[i], rawRs, value=TRUE), aggregate=TRUE),
    ncol=2,
    top=runIDs[i]
  )
}
for(i in 1:length(profiles_list)) {
  plot(profiles_list[[i]])
}

Split complete filenames into "basenames" and directory names

fn_base <- basename(c(rawFs, rawRs))
PATH_PARAMSETS <- file.path(PATH_TEST, param_sets)
for(p in PATH_PARAMSETS) dir.create(p, showWarnings=FALSE)

Trim primers from sequences

Trim reads based on the primer sequences supplied in params.R.

trim_trackReads <- trimPrimers16S(fn_base, in_subdir="raw", out_explicitdir=PATH_TRIMMED, meta=meta,
                                  primer_16S_fwd="CCTACGGGNBGCASCAG", primer_16S_rev="GACTACNVGGGTATCTAATCC")
# trim_trackReads <- trimPrimers16S(fn_base, PATH_RAW, PATH_TRIMMED, "CCTACGGGNBGCASCAG", "GACTACNVGGGTATCTAATCC")

Run quality filter on sequences

To store the results arising from each set of parameter choices, we construct a list object where each element corresponds to the output given a different parameter set. For this step, the list object is filter_trackReads.

filter_trackReads <- list()
for(i in 1:length(param_sets)) {
  filter_trackReads[[i]] <- qualityFilter16S(
    fn_base,
    in_explicitdir = PATH_TRIMMED,
    out_explicitdir = PATH_PARAMSETS[[i]],
    meta = meta,
    maxEE=c(MAX_EE_FWD, params[i,1]), # Vary maxEE.R
    truncLen=c(TRUNC_LEN_FWD, params[i,2]), # Vary truncLen.R
    minLen=MIN_LEN,
    multithread=TRUE) # set FALSE for Windows computers
  rownames(filter_trackReads[[i]]) <- paste0(param_sets[i], "|", rownames(filter_trackReads[[i]]))
}
filter_trackReads_mat <- do.call(rbind, filter_trackReads)

Optionally, save the read-tracking table, in case the job fails:

write.csv(filter_trackReads_mat, file.path(PATH_TEST, "results",  "sensitivity_trackReads_filter.csv"), row.names=TRUE)

Run the rest of the processing pipeline

Although the quality filtering step is the only part of the processing pipeline where we vary the parameters, we must follow through with the rest of the pipeline to observe the downstream effects on remaining reads, merging rate, taxonomic resolution, and alpha- and beta-diversity estimates.

This requires us to redefine the list structure so that it becomes nested: in the first level, each element corresponds to a parameter set; in the second level, each element corresponds to a sequencing run ID. This is necessary because the dada sequence inference algorithm is sensitive to error rate estimates, and error rates may differ considerably between sequencing runs. Here we initialize the nested list structure for two types of outputs simultaneously.

seqtabs <- dada_trackReads <- lapply(1:length(param_sets), function(x) lapply(1:length(runIDs), function(y) list()))

Now we continue with the processing pipeline to produce different versions of the resulting sequence tables.

meta_fn <- matchFastqToMetadata(fn_base, meta)

for(i in 1:length(param_sets)) {
  for(j in 1:length(runIDs)) {
    message("Sensitivity analysis: parameter set ", param_sets[i], ", sequencing run ",  runIDs[j])

    # Retrieve only those files associated with the appropriate parameter set and runID
    meta_thisrun <- meta_fn[which(meta_fn$sequencerRunID==runIDs[j]),]
    fn_base_thisrun <- meta_thisrun$file

    dada_out <- runDada16S(
      fn_base_thisrun, 
      in_explicitdir = PATH_PARAMSETS[[i]],
      out_seqtab = file.path(PATH_PARAMSETS[i], paste0("sensitivity_seqtab_", runIDs[j], ".Rds")),
      multithread = TRUE, # set FALSE for Windows computers
      verbose = FALSE, 
      seed = 11001100)

    seqtabs[[i]][[j]] <- dada_out$seqtab
    dada_trackReads[[i]][[j]] <- dada_out$track

    rownames(seqtabs[[i]][[j]]) <- paste0(param_sets[i], "|", rownames(seqtabs[[i]][[j]]))
    rownames(dada_trackReads[[i]][[j]]) <- paste0(param_sets[i], "|", rownames(dada_trackReads[[i]][[j]]))

    # Can save work as you go:
    saveRDS(seqtab.list$seqtab.nochim, 
            file.path(PATH_PARAMSETS[i], paste0("sensitivity_seqtab_", runIDs[j], ".Rds")))
    write.csv(seqtab.list$track, 
              file.path(PATH_PARAMSETS[i], paste0("sensitivity_trackReads_dada_", runIDs[j], ".csv")),
              row.names=TRUE)
  }
}

dada_trackReads_mat <- do.call(rbind, lapply(dada_trackReads, function(x) do.call(rbind, x)))

Combine all read-tracking tables:

trim_trackReads_mat <- do.call(rbind, lapply(1:length(param_sets), function(i) {
  rownames(trim_trackReads) <- paste0(param_sets[i], "|", rownames(trim_trackReads))
  trim_trackReads
}))

combineReadTrackingTables16S(trim_trackReads, filter_trackReads, dada_trackReads_mat,
                             out_file = file.path(PATH_TEST, "results", "sensitivity_trackReads.csv"))

Now that sequence inference is complete, we can join the sequencing runs back together in each parameter set:

seqtabs_joinrun <- lapply(1:length(param_sets), function(x) list())
for(i in 1:length(param_sets)) {
  seqtabs_joinrun[[i]] <- mergeSequenceTables(tables = seqtabs[[i]])
}

Save the data so far:

saveRDS(seqtabs_joinrun, file.path(PATH_TEST, "results", "sensitivity_seqtabs_joinrun_list.Rds"))

You can reload RDS objects into R to pick up where you left off:

seqtabs_joinrun <- readRDS(file.path(PATH_TEST, "results", "sensitivity_seqtabs_joinrun_list.Rds"))
track <- read.csv(file.path(PATH_TEST, "results", "sensitivity_trackReads.csv"), row.names=1)

Sensitivity of read counts

We can now plot the number of reads remaining after each step in the processing pipeline.

First, add parameter information to the read tracking table. This custom function simply parses the parameter sets from the rownames of the tracking table.

track <- parseParamsFromRownames(track, PARAM1, PARAM2)

Next, perform a few more operations to make the tracking table ready for plotting:

# Reshape read tracking table
track_long <- tidyr::gather(track, key = "step", value = "reads", input:nonchim)
# Exclude metrics associated only with forward reads
track_long <- track_long[!grepl("F$", track_long$step),]
# Aggregate read counts by run ID
track_long[["step"]] <- factor(track_long[["step"]], levels=colnames(track)[1:9])
track_aggRun <- group_by(track_long, maxEE.R, truncLen.R, runID, step) %>%
  dplyr::summarise(reads = sum(reads)) %>%
  mutate(runID_print = sub("run", "", runID))

Plot!

theme_set(theme_bw())
ggplot(track_aggRun, aes(x=step, y=reads, col=as.factor(truncLen.R))) +
  geom_line(aes(linetype=as.factor(maxEE.R), group=interaction(maxEE.R, truncLen.R)), alpha=0.7, size=0.5) +
  facet_wrap(~runID_print, ncol=4) +
  labs(linetype="maxEE.R", color="truncLen.R") +
  scale_linetype_manual(expression(maxEE[R]), values=c("dotted", "dashed", "solid")) +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) +
  scale_y_continuous(trans="log10") +
  scale_color_manual(expression(truncLen[R]), values = c("#E69F00", "#56B4E9", "#CC79A7")) +
  xlab("Step") + 
  ylab("Reads") +
  theme(legend.position = "top",
        legend.text = element_text(size = 9),
        legend.key.size = unit(0.5, "cm"),
        strip.background = element_blank())
ggsave(file.path(PATH_TEST, "results", "track_reads_plot.png"), width=5.5, height=7, units="in")

Sensitivity of alpha diversity

First, we convert the sequence tables into phyloseq objects.

physeqs <- list() # Create a list of physeq objects
for(i in 1:length(param_sets)) {
  # Sample data (parameters)
  sampledata <- parseParamsFromRownames(seqtabs_joinrun[[i]], PARAM1, PARAM2, keep_rownames=TRUE, keep_original_cols = FALSE)
  physeqs[[i]] <- phyloseq(otu_table(seqtabs_joinrun[[i]], taxa_are_rows=FALSE),
                           sample_data(sampledata))
}

We make use of phyloseq's estimate_richness() function.

obsrich_list <- shannon_list <- list()
for(i in 1:length(physeqs)) {
  div <- suppressWarnings(
    estimate_richness(physeqs[[i]], measures=c("Observed","Shannon"), split=TRUE)
  )
  obsrich_list[[i]] <- div[,1]
  shannon_list[[i]] <- div[,2]
}

diversity_list <- lapply(seq_along(physeqs), function(i) { 
  parseParamsFromRownames(
    cbind(
      sample_data(physeqs[[i]]), 
      obsrich = obsrich_list[[i]], 
      shannon = shannon_list[[i]]
    ),
    PARAM1, PARAM2
  )
})

diversity_df <- do.call(rbind, diversity_list)
diversity_df$runID_print <- sub("run", "", diversity_df$runID)

Plot!

theme_set(theme_bw())
ggplot(diversity_df, aes(y=obsrich, col=as.factor(truncLen.R), x=as.factor(maxEE.R))) +
  geom_boxplot(size = 0.3) +
  facet_wrap(~runID_print, ncol=4) +
  labs(col=expression(truncLen[R]), x=expression(maxEE[R])) +
  ylab("Observed richness") +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#CC79A7")) +
  theme(legend.position = "top",
        legend.text = element_text(size = 9),
        legend.key.size = unit(0.5, "cm"),
        strip.background = element_blank())
ggsave(file.path(PATH_TEST, "results", "diversity_plot_obsrich.png"), width=5.5, height=7, units="in")

theme_set(theme_bw())
ggplot(diversity_df, aes(y=shannon, col=as.factor(truncLen.R), x=as.factor(maxEE.R))) +
  geom_boxplot(size = 0.3) +
  facet_wrap(~runID_print, ncol=4) +
  labs(col=expression(truncLen[R]), x=expression(maxEE[R])) +
  ylab("Shannon index") +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#CC79A7")) +
  theme(legend.position = "top",
        legend.text = element_text(size = 9),
        legend.key.size = unit(0.5, "cm"),
        strip.background = element_blank())
ggsave(file.path(PATH_TEST, "results", "diversity_plot_shannon.png"), width=5.5, height=7, units="in")

Test statistical significance using ANOVA

obsrich_aov <- aov(log(obsrich) ~ truncLen.R + maxEE.R + runID + truncLen.R:runID + maxEE.R:runID, data=diversity_df)

png(file.path(PATH_TEST, "results", "aov_diagnostics_obsrich.png"))
par(mfrow=c(2,2))
plot(obsrich_aov)
dev.off()

options(knitr.kable.NA = '')
knitr::kable(summary(obsrich_aov)[[1]], digits=3)

shannon_aov <- aov(shannon ~ truncLen.R + maxEE.R + runID + truncLen.R:runID + maxEE.R:runID, data=diversity_df)

png(file.path(PATH_TEST, "results", "aov_diagnostics_shannon.png"))
par(mfrow=c(2,2))
plot(shannon_aov)
dev.off()

options(knitr.kable.NA = '')
knitr::kable(summary(shannon_aov)[[1]], digits=3)

In conclusion, truncLen.R affects alpha-diversity inference, whereas maxEE.R does not (within the range of maxEE.R values that we tested). truncLen.R affects alpha-diversity inference to a different degree depending on the sequence run.

Sensitivity of beta diversity

First, combine all sequence tables into ONE large sequence table. A sample data table will distinguish samples processed using different parameters.

seqtab_joined <- mergeSequenceTables(tables=seqtabs_joinrun)

Unite sequence-length variants:

t1 <- Sys.time()
seqtab_joined_collapsed <- collapseNoMismatch(seqtab_joined)
t2 <- Sys.time()
t2-t1

Or load the collapsed version of the joined sequence table (recommended!)

seqtab_joined <- readRDS("./data/seqtab_joined_runs_and_params_COLLAPSED.Rds")

# Sample data (parameters)
sampledata <- parseParamsFromRownames(seqtab_joined, PARAM1, PARAM2, keep_rownames=TRUE, keep_original_cols = FALSE)

Custom function to match rownames in this sequence table (which are based on the original fastq filenames) to the sequence metadata.

matchSeqtabToMetadata <- function(tab, meta, verbose=TRUE) {
  rownms <- rownames(tab)
  samplenms <- sub(".*\\|", "", rownms)

  # Remove runID if appended to beginning of filename
  key <- sub("^run[A-Za-z0-9]*_", "", samplenms)

  # Append "_R1" to end of samplenames
  key <- paste0(key, "_R1.fastq.gz")

  # # Append ".gz" to end of filename if missing
  # key[!grepl(".gz$", key)] <- paste0(key[!grepl(".gz$", key)], ".gz")

  key_match <- match(key, as.character(meta$rawDataFileName))
  if(any(is.na(key_match))) {
    if(verbose) {
      message("Matching file names to metadata: ", sum(is.na(key_match)), " files did not have matching records in the provided metadata. ",
              "Double-check to ensure that the provided metadata is of the appropriate scope.")
    }
  }
  return(cbind(rowname = rownms, meta[key_match,], stringsAsFactors=FALSE))
}

meta_seqtab <- matchSeqtabToMetadata(seqtab_joined, read.csv("./data/sequence_metadata/qc_metadata/mmg_metadata_16SrRNA_QCed_20210208.csv"))
sampledata_with_meta <- cbind(sampledata, meta_seqtab)
all(rownames(sampledata_with_meta) == meta_seqtab$rowname)

meta_seqtab %>%
  group_by(dnaSampleID, siteID, plotID, collectDate) %>%
  dplyr::summarise(n=n(), .groups="drop") %>%
  dplyr::select(-n) ->
  meta_seqtab_summary
meta_seqtab_summary

Combine sequence table and sample data table into a phyloseq object:

physeq_joined <- phyloseq(otu_table(seqtab_joined,taxa_are_rows=FALSE),
                          sample_data(sampledata_with_meta))

Remove samples with zero total counts

physeq_joined_nonzero <- prune_samples(sample_sums(physeq_joined) > 0, physeq_joined)

Optionally, discard taxa with 10 or fewer reads:

physeq_joined_nonzero <- prune_taxa(taxa_sums(physeq_joined_nonzero) > 10, physeq_joined_nonzero)

Ordinate, one sequencing run at a time. (Samples from different sequencing runs may be highly dissimilar, producing difficult-to-interpret ordination plots.)

ord_list <- ps_runID <- list()

for (i in 1:length(runIDs)) {
  ps_runID[[i]] <- subset_samples(physeq_joined_nonzero, sequencerRunID==runIDs[i])
}
for (i in 1:length(runIDs)) {
  set.seed(1010101)
  ord_list[[i]] <- ordinate(ps_runID[[i]], "NMDS", "bray", k=2)
  saveRDS(ord_list[[i]], file.path(PATH_TEST, "results", paste0("sensitivity_ordination_", runIDs[i], "_COLLAPSED.Rds")))
}

Load ordination data if previously run:

for(i in 1:length(runIDs)) {
  ord_list[[i]] <- readRDS(file.path(PATH_TEST, "results", paste0("sensitivity_ordination_", runIDs[i], "_COLLAPSED.Rds")))
}

Plot the ordinations!

Generalize to plot with one facet for each runID:

stress_annotations <- lapply(ord_list, function(o) {
  data.frame(xloc=Inf, yloc=-Inf, 
             # label=paste0("Stress: ", formatC(signif(o$stress,digits=3), digits=3,format="fg", flag="#")),
             label=paste0("Stress: ", round(o$stress, 3)),
             hjust=1.05, vjust=-0.15)
})

theme_set(theme_bw())
ordplot_list <- list()
for(i in 1:length(ord_list)) {
  sampledata <- as(sample_data(ps_runID[[i]]), "data.frame")
  sampledata$maxEE.R <- as.factor(sampledata$maxEE.R)
  sampledata$truncLen.R <- as.factor(sampledata$truncLen.R)
  sampledata <- cbind(sampledata, scores(ord_list[[i]])[match(rownames(sampledata), rownames(scores(ord_list[[i]]))),])
  ordplot_list[[i]] <- ggplot(sampledata, aes(x=NMDS1, y=NMDS2)) +
    geom_point(aes(col=truncLen.R, shape=maxEE.R)) + 
    stat_ellipse(aes(group=sampleID), lwd=0.1, col="grey20") +
    scale_shape_manual(values=c(21, 22, 24)) +
    scale_color_manual(values = c("#E69F00", "#56B4E9", "#CC79A7")) +
    guides(col="none", shape="none") +
    ggtitle(runIDs[i]) +
    geom_text(data=stress_annotations[[i]], aes(x=xloc, y=yloc, label=label, hjust=hjust, vjust=vjust), size=3) +
    theme(plot.title = element_text(hjust=0.5, size=12),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          # axis.title = element_text(size=9),
          axis.title = element_blank())
}

# Sequencing run B9994 known to have convergence failure; optionally omit its plot
plot(ordplot_list[[3]])
ordplot_list[[3]] <- ordplot_list[[3]] + 
       geom_rect(aes(xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf),
            color="black", fill="white") +
       theme(axis.text = element_text(color="white"),
             axis.ticks = element_blank()) +
       annotate(geom = 'text', x = 4750, y = 0.5, label = "Plot omitted due to\nconvergence failure", size=3)


g <- gridExtra::arrangeGrob(grobs=ordplot_list, ncol=4)
plot(g)
ggsave(file.path(PATH_TEST, "results", "ordination_plot_COLLAPSED.png"), plot=g, width=6.6, height=7.7, units="in")

Get just the legend

library(cowplot)
library(grid)

legend <- cowplot::get_legend(ggplot(sampledata, aes(x=NMDS1, y=NMDS2)) +
      geom_point(aes(col=truncLen.R, shape=maxEE.R)) + 
      scale_shape_manual(expression(maxEE[R]), values=c(21, 22, 24)) +
      scale_color_manual(expression(truncLen[R]), values = c("#E69F00", "#56B4E9", "#CC79A7")) +
      theme(legend.position = "top"))

grid.newpage()
grid.draw(legend)
ggsave(file.path(PATH_TEST, "results", "ordination_plot_COLLAPSED_legend.png"), plot=legend, width=6.6, height=0.5, units="in")

Test statistical significance using permANOVA

Use statistical tests to confirm the effects of the quality filtering parameters. First we normalize the combined sequence table by proportionalizing the counts:

# Start by pruning low-sequencing depth samples and removing uncommon taxa
ps_nonzero_mindepth <- prune_samples(sample_sums(physeq_joined_nonzero) > 2500, physeq_joined_nonzero)
ps_nonzero_mindepth <- prune_taxa(taxa_sums(ps_nonzero_mindepth) > 15, ps_nonzero_mindepth)
ps_nonzero_mindepth_prop <- transform_sample_counts(ps_nonzero_mindepth, function(x) x/sum(x))
saveRDS(ps_nonzero_mindepth_prop, file.path(PATH_TEST, "results", paste0("sensitivity_ps_nonzero_mindepth_prop_COLLAPSED.Rds")))

Add more variables into the sampledata for ease of use.

sampledata <- as(sample_data(ps_nonzero_mindepth_prop), "data.frame")
sampledata$maxEE.R <- as.factor(sampledata$maxEE.R)
sampledata$truncLen.R <- as.factor(sampledata$truncLen.R)
sampledata <- cbind(sampledata, scores(ord_all)[match(rownames(sampledata), rownames(scores(ord_all))),])
sampledata$collectYear <- as.integer(format(as.Date(sampledata$collectDate), "%Y"))

First run permANOVA using all three levels of truncLen.R.

ps_joined_dist <- vegdist(otu_table(ps_nonzero_mindepth_prop)) # Takes about 2.5 hours
saveRDS(ps_joined_dist, file.path(PATH_TEST, "results", "sensitivity_dist_nonzero_mindepth_prop_COLLAPSED.Rds"))

ps_joined_dist <- readRDS(file.path(PATH_TEST, "results", "sensitivity_dist_nonzero_mindepth_prop_COLLAPSED.Rds"))

perm <- how(nperm=999)
setBlocks(perm) <- get_variable(ps_nonzero_mindepth_prop, "sampleID")
adonis2(ps_joined_dist ~ maxEE.R + truncLen.R,
        data = as(sample_data(ps_nonzero_mindepth_prop), "data.frame"),
        permutations=perm)

Distance between groups may be confounded with heterogeneity of within-group variances (dispersion) in the permANOVA test, so we use betadisper to test separately for heterogeneity of within-group variances.

beta_disper <- vegan::betadisper(ps_joined_dist, get_variable(ps_nonzero_mindepth_prop, "truncLen.R"))
anova(beta_disper)
mod.HSD <- TukeyHSD(beta_disper)
plot(mod.HSD)

There is less dispersion among communities that were processed with truncLen.R = 170. In order to confirm whether the significant effect of truncLen.R on mean community composition is robust to heterogeneity of dispersion, then, we re-run permANOVA using only top two levels of truncLen.R.

ps_nonzero_mindepth_prop_hitrunc <- subset_samples(ps_nonzero_mindepth_prop, truncLen.R != 170)
ps_joined_dist_hitrunc <- vegdist(otu_table(ps_nonzero_mindepth_prop_hitrunc))
saveRDS(ps_joined_dist_hitrunc, file.path(PATH_TEST, "results", "sensitivity_dist_nonzero_mindepth_prop_hitrunc_COLLAPSED.Rds"))

readRDS <- file.path(PATH_TEST, "results", "sensitivity_dist_nonzero_mindepth_prop_hitrunc_COLLAPSED.Rds")
perm <- how(nperm=999)
setBlocks(perm) <- get_variable(ps_nonzero_mindepth_prop_hitrunc, "sampleID")
adonis2(ps_joined_dist_hitrunc ~ maxEE.R + truncLen.R,
        data = as(sample_data(ps_nonzero_mindepth_prop_hitrunc), "data.frame"),
        permutations=perm)
beta_disper_hitrunc <- vegan::betadisper(ps_joined_dist_hitrunc, get_variable(ps_nonzero_mindepth_prop_hitrunc, "truncLen.R"))

Now use betadisper to test homogeneity of within-group variances (dispersion).

beta_disper_hitrunc <- vegan::betadisper(ps_joined_dist_hitrunc, get_variable(physeq_joined_nonzero_runC5B2R_hitrunc, "truncLen.R"))
anova(beta_disper_hitrunc)
mod.HSD_hitrunc <- TukeyHSD(beta_disper_hitrunc)
plot(mod.HSD_hitrunc)

In conclusion, both truncLen.R and maxEE.R affect beta-diversity inference, though to different extents. truncLen.R affects mean community composition and within-group variance (dispersion) in a non-linear fashion, with drastically different results for the lowest level of truncLen.R tested (truncLen.R == 160); This is probably due to the large drop-off in read volume at the merging step due to insufficient read overlap. When ANOVA is constrained to only the higher two levels of truncLen.R, both truncLen.R and maxEE.R are found to be significant in determining mean community composition, but with very small effect sizes.

claraqin/neonMicrobe documentation built on April 11, 2024, 11:47 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

claraqin/neonMicrobe
Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

Load libraries

Set up base directory

Constants used for this sensitivity analysis:

Assign filepath variables

Get fastq files for this analysis

Trim primers from sequences

Run quality filter on sequences

Run the rest of the processing pipeline

Sensitivity of read counts

Sensitivity of alpha diversity

Sensitivity of beta diversity

Test statistical significance using permANOVA

R Package Documentation

Browse R Packages

We want your feedback!

claraqin/neonMicrobe Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

In claraqin/neonMicrobe: Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data

Load libraries

Set up base directory

Constants used for this sensitivity analysis:

Assign filepath variables

Get fastq files for this analysis

Trim primers from sequences

Run quality filter on sequences

Run the rest of the processing pipeline

Sensitivity of read counts

Sensitivity of alpha diversity

Sensitivity of beta diversity

Test statistical significance using permANOVA

R Package Documentation

Browse R Packages

We want your feedback!

claraqin/neonMicrobe
Downloading, Pre-processing, and Assembling NEON Soil Microbe Marker Gene Sequence Data