---
title: "Sensitivity Analysis of Quality Filtering Parameters"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Sensitivity Analysis for Quality Filtering Parameters}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(neonMicrobe)
setBaseDirectory(dirname(getwd()))
knitr::opts_knit$set(
  root.dir = NEONMICROBE_DIR_BASE()
)
```
The purpose of this vignette is to demonstrate how a researcher can test the effects of various DADA2 quality filtering parameter combinations on the outputs of the DADA2 pipeline, and ultimately, on the ecological inference. Specifically, for a subset of NEON 16S marker gene sequences, this vignette examines how the parameter choices affect the number of reads retained at each processing step, alpha-diversity estimates, and beta-diversity estimates.
The parameters that we vary to observe their downstream effects include the following:
`truncLen.R`
: Reverse reads that do not meet or exceed `truncLen.R` in length will be discarded. Reverse reads that exceed `truncLen.R` will be truncated to `truncLen.R`.

`maxEE.R`
: After truncation, reverse reads with greater than `maxEE.R` expected errors will be discarded. Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10)).

Because we are interested in variation in these benchmark metrics across different parameter sets, we refer to this analysis as a sensitivity analysis. This vignette is intended to provide a boilerplate for users of this R package to construct their own sensitivity analyses.
```{r, message=FALSE, warning=FALSE}
library(neonMicrobe)
library(dada2)
library(ShortRead)
library(Biostrings)
library(tibble)
library(dplyr)
library(vegan)
library(phyloseq)
library(ggplot2)
library(tidyr)
```
Setting the base directory using `setBaseDirectory()` will create a point of reference for the `neonMicrobe` package, so that it knows where to look for the raw sequence files and where to save the processed data. This should be the same base directory that you used in the "Download NEON Data" vignette.
```{r}
print(getwd())
setBaseDirectory()
```
Quality filter parameters that we do not vary in this sensitivity analysis
```{r}
MAX_EE_FWD <- 8
TRUNC_LEN_FWD <- 240
MIN_LEN <- 50
```
Parameters specific to this sensitivity analysis
```{r}
N_SAMPLES <- 200
DIRNAME_TEST <- "qf_test_3-29-2021"
runIDs <- c("B69PP", "B69RN", "B9994", "BDNB6", "BF462", "BFDG8", "BNMJ5",
            "BNMWB", "BRPH4", "C24VW", "C25T6", "C5B2R", "C7WK3", "C8VMV",
            "C977L", "C983L", "CBJYB", "CBTWG", "CDHG2", "CDJ5J")
```
Parameter value grid. The following allows testing of two quality filtering parameters at a time.
```{r}
PARAM1 <- "maxEE.R"
PARAM2 <- "truncLen.R"
grid <- expand.grid(c(4, 8, 16), c(170, 220, 250))
params <- matrix(
  c(grid[,1],  # PARAM1
    grid[,2]), # PARAM2
  byrow=FALSE, ncol=2,
  dimnames = list(NULL, c(PARAM1, PARAM2))
)
param_sets <- apply(params, 1, function(x) {
  paste(c(rbind(c(PARAM1, PARAM2), x)), collapse="_")
})
```
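The labeling scheme above interleaves parameter names with their values, column-wise, via `rbind()` + `c()`. A standalone snippet (base R only; it reproduces the construction above rather than adding anything new) shows the resulting labels:

```r
# Reproduce the parameter-set labeling: names and values are interleaved
# column-wise via rbind() + c(), then joined with underscores.
p1 <- "maxEE.R"; p2 <- "truncLen.R"
grid <- expand.grid(c(4, 8, 16), c(170, 220, 250))
labels <- apply(grid, 1, function(x) {
  paste(c(rbind(c(p1, p2), x)), collapse = "_")
})
labels[[1]]     # "maxEE.R_4_truncLen.R_170"
length(labels)  # one label per row of the 3 x 3 grid
```

These labels double as output directory names and as rowname prefixes later in the pipeline, which is what lets the downstream plotting code recover the parameter values from each row.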
Special directories were used for this sensitivity analysis.
```{r}
PATH_16S <- file.path(NEONMICROBE_DIR_SEQUENCE(), "16S")
PATH_RAW <- PATH_16S
PATH_TRIMMED <- file.path(PATH_16S, "1_trimmed")
PATH_TEST <- file.path("outputs", DIRNAME_TEST)
dir.create(file.path(PATH_TEST, "results"), recursive=TRUE)
```
Retrieve files that match the sequence run IDs specified earlier.
```{r}
rawFs <- sort(list.files(PATH_RAW, pattern = "_R1", full.names = TRUE, recursive = FALSE))
rawRs <- sort(list.files(PATH_RAW, pattern = "_R2", full.names = TRUE, recursive = FALSE))
```
Retrieve metadata. If you saved your metadata to file, you can load it here:
```{r}
meta <- read.csv("data/sequence_metadata/qc_metadata/mmg_metadata_16SrRNA_QCd_20210329.csv")
```
If you did not save your metadata to file, you can re-download it using `downloadSequenceMetadata()`, and quality-control it using `qcMetadata()`.
```{r}
meta <- downloadSequenceMetadata(targetGene = "16S")
meta <- qcMetadata(meta)
```
Remove any samples that only have forward reads or only have reverse reads.
```{r}
matched_fn <- getPairedFastqFiles(c(rawFs, rawRs), meta)
rawFs <- matched_fn[[1]]
rawRs <- matched_fn[[2]]
```
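`getPairedFastqFiles()` handles this matching inside the package. Conceptually, the pairing can be sketched as stripping the read-direction token from each basename and keeping only samples present in both lists — an illustrative sketch with hypothetical file names, not the package implementation:

```r
# Illustrative only: pair R1/R2 fastq files by their shared sample prefix.
pair_fastqs <- function(fn_fwd, fn_rev) {
  key_fwd <- sub("_R1.*$", "", basename(fn_fwd))
  key_rev <- sub("_R2.*$", "", basename(fn_rev))
  shared <- intersect(key_fwd, key_rev)
  list(fwd = fn_fwd[key_fwd %in% shared],
       rev = fn_rev[key_rev %in% shared])
}

fwd_files <- c("S1_R1.fastq.gz", "S2_R1.fastq.gz", "S3_R1.fastq.gz")
rev_files <- c("S1_R2.fastq.gz", "S3_R2.fastq.gz")
pair_fastqs(fwd_files, rev_files)  # S2 is dropped: it has no reverse mate
```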
To cut down on computation time, select up to `N_SAMPLES` samples from the runs, up to `N_SAMPLES/length(runIDs)` from each run.
```{r}
if(length(rawFs) > N_SAMPLES) {
  rawFs_subset <- c()
  rawRs_subset <- c()
  for(i in 1:length(runIDs)) {
    rawFs_runID <- rawFs[grep(runIDs[i], rawFs)]
    rawRs_runID <- rawRs[grep(runIDs[i], rawRs)]
    if(length(rawFs_runID) > N_SAMPLES/length(runIDs)) {
      set.seed(101010+i)
      subset <- sample(seq(1, length(rawFs_runID)), N_SAMPLES/length(runIDs), FALSE)
      rawFs_subset <- c(rawFs_subset, rawFs_runID[subset])
      rawRs_subset <- c(rawRs_subset, rawRs_runID[subset])
    } else {
      rawFs_subset <- c(rawFs_subset, rawFs_runID)
      rawRs_subset <- c(rawRs_subset, rawRs_runID)
    }
  }
  rawFs <- rawFs_subset
  rawRs <- rawRs_subset
}
write.csv(cbind(rawFs, rawRs), file.path(PATH_TEST, "input_files.csv"))
```
```{r}
input_files <- read.csv(file.path(PATH_TEST, "input_files.csv"),
                        row.names = 1, stringsAsFactors = FALSE)
rawFs <- input_files$rawFs
rawRs <- input_files$rawRs
```
Plot quality profiles
```{r}
profiles_list <- list()
for(i in 1:length(runIDs)) {
  # Retrieve only those files associated with the appropriate runID
  profiles_list[[i]] <- gridExtra::grid.arrange(
    plotQualityProfile(grep(runIDs[i], rawFs, value=TRUE), aggregate=TRUE),
    plotQualityProfile(grep(runIDs[i], rawRs, value=TRUE), aggregate=TRUE),
    ncol=2, top=runIDs[i]
  )
}
for(i in 1:length(profiles_list)) {
  plot(profiles_list[[i]])
}
```
Split complete filenames into "basenames" and directory names
```{r}
fn_base <- basename(c(rawFs, rawRs))
PATH_PARAMSETS <- file.path(PATH_TEST, param_sets)
for(p in PATH_PARAMSETS) dir.create(p, showWarnings=FALSE)
```
Trim reads based on the primer sequences supplied in `params.R`.
```{r}
trim_trackReads <- trimPrimers16S(
  fn_base, in_subdir="raw", out_explicitdir=PATH_TRIMMED, meta=meta,
  primer_16S_fwd="CCTACGGGNBGCASCAG", primer_16S_rev="GACTACNVGGGTATCTAATCC"
)
# trim_trackReads <- trimPrimers16S(fn_base, PATH_RAW, PATH_TRIMMED,
#                                   "CCTACGGGNBGCASCAG", "GACTACNVGGGTATCTAATCC")
```
To store the results arising from each set of parameter choices, we construct a list object where each element corresponds to the output given a different parameter set. For this step, the list object is `filter_trackReads`.
```{r}
filter_trackReads <- list()
for(i in 1:length(param_sets)) {
  filter_trackReads[[i]] <- qualityFilter16S(
    fn_base,
    in_explicitdir = PATH_TRIMMED,
    out_explicitdir = PATH_PARAMSETS[[i]],
    meta = meta,
    maxEE = c(MAX_EE_FWD, params[i,1]),       # Vary maxEE.R
    truncLen = c(TRUNC_LEN_FWD, params[i,2]), # Vary truncLen.R
    minLen = MIN_LEN,
    multithread = TRUE # set FALSE for Windows computers
  )
  rownames(filter_trackReads[[i]]) <- paste0(param_sets[i], "|",
                                             rownames(filter_trackReads[[i]]))
}
filter_trackReads_mat <- do.call(rbind, filter_trackReads)
```
Optionally, save the read-tracking table, in case the job fails:
```{r}
write.csv(filter_trackReads_mat,
          file.path(PATH_TEST, "results", "sensitivity_trackReads_filter.csv"),
          row.names=TRUE)
```
Although the quality filtering step is the only part of the processing pipeline where we vary the parameters, we must follow through with the rest of the pipeline to observe the downstream effects on remaining reads, merging rate, taxonomic resolution, and alpha- and beta-diversity estimates.
This requires us to redefine the list structure so that it becomes nested: in the first level, each element corresponds to a parameter set; in the second level, each element corresponds to a sequencing run ID. This is necessary because the dada sequence inference algorithm is sensitive to error rate estimates, and error rates may differ considerably between sequencing runs. Here we initialize the nested list structure for two types of outputs simultaneously.
```{r}
seqtabs <- dada_trackReads <- lapply(1:length(param_sets), function(x) {
  lapply(1:length(runIDs), function(y) list())
})
```
Now we continue with the processing pipeline to produce different versions of the resulting sequence tables.
```{r}
meta_fn <- matchFastqToMetadata(fn_base, meta)
for(i in 1:length(param_sets)) {
  for(j in 1:length(runIDs)) {
    message("Sensitivity analysis: parameter set ", param_sets[i],
            ", sequencing run ", runIDs[j])
    # Retrieve only those files associated with the appropriate parameter set and runID
    meta_thisrun <- meta_fn[which(meta_fn$sequencerRunID==runIDs[j]),]
    fn_base_thisrun <- meta_thisrun$file
    dada_out <- runDada16S(
      fn_base_thisrun,
      in_explicitdir = PATH_PARAMSETS[[i]],
      out_seqtab = file.path(PATH_PARAMSETS[i],
                             paste0("sensitivity_seqtab_", runIDs[j], ".Rds")),
      multithread = TRUE, # set FALSE for Windows computers
      verbose = FALSE,
      seed = 11001100
    )
    seqtabs[[i]][[j]] <- dada_out$seqtab
    dada_trackReads[[i]][[j]] <- dada_out$track
    rownames(seqtabs[[i]][[j]]) <- paste0(param_sets[i], "|",
                                          rownames(seqtabs[[i]][[j]]))
    rownames(dada_trackReads[[i]][[j]]) <- paste0(param_sets[i], "|",
                                                  rownames(dada_trackReads[[i]][[j]]))
    # Can save work as you go:
    # saveRDS(seqtabs[[i]][[j]], file.path(PATH_PARAMSETS[i],
    #         paste0("sensitivity_seqtab_", runIDs[j], ".Rds")))
    # write.csv(dada_trackReads[[i]][[j]], file.path(PATH_PARAMSETS[i],
    #           paste0("sensitivity_trackReads_dada_", runIDs[j], ".csv")), row.names=TRUE)
  }
}
dada_trackReads_mat <- do.call(rbind, lapply(dada_trackReads, function(x) do.call(rbind, x)))
```
Combine all read-tracking tables:
```{r}
trim_trackReads_mat <- do.call(rbind, lapply(1:length(param_sets), function(i) {
  rownames(trim_trackReads) <- paste0(param_sets[i], "|", rownames(trim_trackReads))
  trim_trackReads
}))
combineReadTrackingTables16S(trim_trackReads_mat, filter_trackReads_mat, dada_trackReads_mat,
                             out_file = file.path(PATH_TEST, "results", "sensitivity_trackReads.csv"))
```
Now that sequence inference is complete, we can join the sequencing runs back together in each parameter set:
```{r}
seqtabs_joinrun <- lapply(1:length(param_sets), function(x) list())
for(i in 1:length(param_sets)) {
  seqtabs_joinrun[[i]] <- mergeSequenceTables(tables = seqtabs[[i]])
}
```
Save the data so far:
```{r}
saveRDS(seqtabs_joinrun, file.path(PATH_TEST, "results", "sensitivity_seqtabs_joinrun_list.Rds"))
```
You can reload RDS objects into R to pick up where you left off:
```{r}
seqtabs_joinrun <- readRDS(file.path(PATH_TEST, "results", "sensitivity_seqtabs_joinrun_list.Rds"))
track <- read.csv(file.path(PATH_TEST, "results", "sensitivity_trackReads.csv"), row.names=1)
```
We can now plot the number of reads remaining after each step in the processing pipeline.
First, add parameter information to the read tracking table. This custom function simply parses the parameter sets from the rownames of the tracking table.
```{r}
track <- parseParamsFromRownames(track, PARAM1, PARAM2)
```
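For intuition, an illustrative stand-in for such a parser (a hypothetical `parse_params()`, assuming rownames of the form `"maxEE.R_<value>_truncLen.R_<value>|<sample>"` as constructed earlier in this vignette; not the packaged implementation) might look like:

```r
# Illustrative sketch: recover numeric parameter values from rowname prefixes
# of the form "<param1>_<value1>_<param2>_<value2>|<original rowname>".
parse_params <- function(df, param1, param2) {
  prefix <- sub("\\|.*$", "", rownames(df))  # keep text before the "|"
  pattern <- paste0("^", param1, "_(.+)_", param2, "_(.+)$")
  df[[param1]] <- as.numeric(sub(pattern, "\\1", prefix))
  df[[param2]] <- as.numeric(sub(pattern, "\\2", prefix))
  df
}

x <- data.frame(reads = c(100, 200))
rownames(x) <- c("maxEE.R_4_truncLen.R_170|sampleA",
                 "maxEE.R_16_truncLen.R_250|sampleB")
parse_params(x, "maxEE.R", "truncLen.R")
```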
Next, perform a few more operations to make the tracking table ready for plotting:
```{r}
# Reshape read tracking table
track_long <- tidyr::gather(track, key = "step", value = "reads", input:nonchim)
# Exclude metrics associated only with forward reads
track_long <- track_long[!grepl("F$", track_long$step),]
# Aggregate read counts by run ID
track_long[["step"]] <- factor(track_long[["step"]], levels=colnames(track)[1:9])
track_aggRun <- group_by(track_long, maxEE.R, truncLen.R, runID, step) %>%
  dplyr::summarise(reads = sum(reads)) %>%
  mutate(runID_print = sub("run", "", runID))
```
Plot!
```{r}
theme_set(theme_bw())
ggplot(track_aggRun, aes(x=step, y=reads, col=as.factor(truncLen.R))) +
  geom_line(aes(linetype=as.factor(maxEE.R), group=interaction(maxEE.R, truncLen.R)),
            alpha=0.7, size=0.5) +
  facet_wrap(~runID_print, ncol=4) +
  labs(linetype="maxEE.R", color="truncLen.R") +
  scale_linetype_manual(expression(maxEE[R]), values=c("dotted", "dashed", "solid")) +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) +
  scale_y_continuous(trans="log10") +
  scale_color_manual(expression(truncLen[R]), values = c("#E69F00", "#56B4E9", "#CC79A7")) +
  xlab("Step") + ylab("Reads") +
  theme(legend.position = "top",
        legend.text = element_text(size = 9),
        legend.key.size = unit(0.5, "cm"),
        strip.background = element_blank())
ggsave(file.path(PATH_TEST, "results", "track_reads_plot.png"),
       width=5.5, height=7, units="in")
```
First, we convert the sequence tables into phyloseq objects.
```{r}
physeqs <- list() # Create a list of physeq objects
for(i in 1:length(param_sets)) {
  # Sample data (parameters)
  sampledata <- parseParamsFromRownames(seqtabs_joinrun[[i]], PARAM1, PARAM2,
                                        keep_rownames=TRUE, keep_original_cols = FALSE)
  physeqs[[i]] <- phyloseq(otu_table(seqtabs_joinrun[[i]], taxa_are_rows=FALSE),
                           sample_data(sampledata))
}
```
We make use of `phyloseq`'s `estimate_richness()` function.
```{r}
obsrich_list <- shannon_list <- list()
for(i in 1:length(physeqs)) {
  div <- suppressWarnings(
    estimate_richness(physeqs[[i]], measures=c("Observed", "Shannon"), split=TRUE)
  )
  obsrich_list[[i]] <- div[,1]
  shannon_list[[i]] <- div[,2]
}
diversity_list <- lapply(seq_along(physeqs), function(i) {
  parseParamsFromRownames(
    cbind(
      sample_data(physeqs[[i]]),
      obsrich = obsrich_list[[i]],
      shannon = shannon_list[[i]]
    ),
    PARAM1, PARAM2
  )
})
diversity_df <- do.call(rbind, diversity_list)
diversity_df$runID_print <- sub("run", "", diversity_df$runID)
```
Plot!
```{r}
theme_set(theme_bw())
ggplot(diversity_df, aes(y=obsrich, col=as.factor(truncLen.R), x=as.factor(maxEE.R))) +
  geom_boxplot(size = 0.3) +
  facet_wrap(~runID_print, ncol=4) +
  labs(col=expression(truncLen[R]), x=expression(maxEE[R])) +
  ylab("Observed richness") +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#CC79A7")) +
  theme(legend.position = "top",
        legend.text = element_text(size = 9),
        legend.key.size = unit(0.5, "cm"),
        strip.background = element_blank())
ggsave(file.path(PATH_TEST, "results", "diversity_plot_obsrich.png"),
       width=5.5, height=7, units="in")
```
```{r}
theme_set(theme_bw())
ggplot(diversity_df, aes(y=shannon, col=as.factor(truncLen.R), x=as.factor(maxEE.R))) +
  geom_boxplot(size = 0.3) +
  facet_wrap(~runID_print, ncol=4) +
  labs(col=expression(truncLen[R]), x=expression(maxEE[R])) +
  ylab("Shannon index") +
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#CC79A7")) +
  theme(legend.position = "top",
        legend.text = element_text(size = 9),
        legend.key.size = unit(0.5, "cm"),
        strip.background = element_blank())
ggsave(file.path(PATH_TEST, "results", "diversity_plot_shannon.png"),
       width=5.5, height=7, units="in")
```
Test statistical significance using ANOVA
```{r}
obsrich_aov <- aov(log(obsrich) ~ truncLen.R + maxEE.R + runID + truncLen.R:runID + maxEE.R:runID,
                   data=diversity_df)
png(file.path(PATH_TEST, "results", "aov_diagnostics_obsrich.png"))
par(mfrow=c(2,2))
plot(obsrich_aov)
dev.off()
```
```{r}
options(knitr.kable.NA = '')
knitr::kable(summary(obsrich_aov)[[1]], digits=3)
```
```{r}
shannon_aov <- aov(shannon ~ truncLen.R + maxEE.R + runID + truncLen.R:runID + maxEE.R:runID,
                   data=diversity_df)
png(file.path(PATH_TEST, "results", "aov_diagnostics_shannon.png"))
par(mfrow=c(2,2))
plot(shannon_aov)
dev.off()
```
```{r}
options(knitr.kable.NA = '')
knitr::kable(summary(shannon_aov)[[1]], digits=3)
```
In conclusion, `truncLen.R` affects alpha-diversity inference, whereas `maxEE.R` does not (within the range of `maxEE.R` values that we tested). `truncLen.R` affects alpha-diversity inference to a different degree depending on the sequencing run.
First, combine all sequence tables into ONE large sequence table. A sample data table will distinguish samples processed using different parameters.
```{r}
seqtab_joined <- mergeSequenceTables(tables=seqtabs_joinrun)
```
Unite sequence-length variants:
```{r}
t1 <- Sys.time()
seqtab_joined_collapsed <- collapseNoMismatch(seqtab_joined)
t2 <- Sys.time()
t2 - t1
```
Or load the collapsed version of the joined sequence table (recommended!)
```{r}
seqtab_joined <- readRDS("./data/seqtab_joined_runs_and_params_COLLAPSED.Rds")
```
```{r}
# Sample data (parameters)
sampledata <- parseParamsFromRownames(seqtab_joined, PARAM1, PARAM2,
                                      keep_rownames=TRUE, keep_original_cols = FALSE)
```
Custom function to match rownames in this sequence table (which are based on the original fastq filenames) to the sequence metadata.
```{r}
matchSeqtabToMetadata <- function(tab, meta, verbose=TRUE) {
  rownms <- rownames(tab)
  samplenms <- sub(".*\\|", "", rownms)
  # Remove runID if appended to beginning of filename
  key <- sub("^run[A-Za-z0-9]*_", "", samplenms)
  # Append "_R1" to end of sample names
  key <- paste0(key, "_R1.fastq.gz")
  # # Append ".gz" to end of filename if missing
  # key[!grepl(".gz$", key)] <- paste0(key[!grepl(".gz$", key)], ".gz")
  key_match <- match(key, as.character(meta$rawDataFileName))
  if(any(is.na(key_match))) {
    if(verbose) {
      message("Matching file names to metadata: ", sum(is.na(key_match)),
              " files did not have matching records in the provided metadata. ",
              "Double-check to ensure that the provided metadata is of the appropriate scope.")
    }
  }
  return(cbind(rowname = rownms, meta[key_match,], stringsAsFactors=FALSE))
}
meta_seqtab <- matchSeqtabToMetadata(
  seqtab_joined,
  read.csv("./data/sequence_metadata/qc_metadata/mmg_metadata_16SrRNA_QCed_20210208.csv")
)
sampledata_with_meta <- cbind(sampledata, meta_seqtab)
all(rownames(sampledata_with_meta) == meta_seqtab$rowname)
```
```{r}
meta_seqtab %>%
  group_by(dnaSampleID, siteID, plotID, collectDate) %>%
  dplyr::summarise(n=n(), .groups="drop") %>%
  dplyr::select(-n) -> meta_seqtab_summary
meta_seqtab_summary
```
Combine sequence table and sample data table into a phyloseq object:
```{r}
physeq_joined <- phyloseq(otu_table(seqtab_joined, taxa_are_rows=FALSE),
                          sample_data(sampledata_with_meta))
```
Remove samples with zero total counts
```{r}
physeq_joined_nonzero <- prune_samples(sample_sums(physeq_joined) > 0, physeq_joined)
```
Optionally, discard taxa with 10 or fewer reads:
```{r}
physeq_joined_nonzero <- prune_taxa(taxa_sums(physeq_joined_nonzero) > 10, physeq_joined_nonzero)
```
Ordinate, one sequencing run at a time. (Samples from different sequencing runs may be highly dissimilar, producing difficult-to-interpret ordination plots.)
```{r}
ord_list <- ps_runID <- list()
for (i in 1:length(runIDs)) {
  ps_runID[[i]] <- subset_samples(physeq_joined_nonzero, sequencerRunID==runIDs[i])
}
for (i in 1:length(runIDs)) {
  set.seed(1010101)
  ord_list[[i]] <- ordinate(ps_runID[[i]], "NMDS", "bray", k=2)
  saveRDS(ord_list[[i]],
          file.path(PATH_TEST, "results",
                    paste0("sensitivity_ordination_", runIDs[i], "_COLLAPSED.Rds")))
}
```
Load ordination data if previously run:
```{r}
for(i in 1:length(runIDs)) {
  ord_list[[i]] <- readRDS(file.path(PATH_TEST, "results",
                                     paste0("sensitivity_ordination_", runIDs[i], "_COLLAPSED.Rds")))
}
```
Plot the ordinations!
Generalize to plot with one facet for each runID:
```{r}
stress_annotations <- lapply(ord_list, function(o) {
  data.frame(xloc=Inf, yloc=-Inf,
             label=paste0("Stress: ", round(o$stress, 3)),
             hjust=1.05, vjust=-0.15)
})
theme_set(theme_bw())
ordplot_list <- list()
for(i in 1:length(ord_list)) {
  sampledata <- as(sample_data(ps_runID[[i]]), "data.frame")
  sampledata$maxEE.R <- as.factor(sampledata$maxEE.R)
  sampledata$truncLen.R <- as.factor(sampledata$truncLen.R)
  sampledata <- cbind(sampledata,
                      scores(ord_list[[i]])[match(rownames(sampledata),
                                                  rownames(scores(ord_list[[i]]))),])
  ordplot_list[[i]] <- ggplot(sampledata, aes(x=NMDS1, y=NMDS2)) +
    geom_point(aes(col=truncLen.R, shape=maxEE.R)) +
    stat_ellipse(aes(group=sampleID), lwd=0.1, col="grey20") +
    scale_shape_manual(values=c(21, 22, 24)) +
    scale_color_manual(values = c("#E69F00", "#56B4E9", "#CC79A7")) +
    guides(col="none", shape="none") +
    ggtitle(runIDs[i]) +
    geom_text(data=stress_annotations[[i]],
              aes(x=xloc, y=yloc, label=label, hjust=hjust, vjust=vjust), size=3) +
    theme(plot.title = element_text(hjust=0.5, size=12),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.title = element_blank())
}
# Sequencing run B9994 known to have convergence failure; optionally omit its plot
plot(ordplot_list[[3]])
ordplot_list[[3]] <- ordplot_list[[3]] +
  geom_rect(aes(xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf), color="black", fill="white") +
  theme(axis.text = element_text(color="white"), axis.ticks = element_blank()) +
  annotate(geom = 'text', x = 4750, y = 0.5,
           label = "Plot omitted due to\nconvergence failure", size=3)
g <- gridExtra::arrangeGrob(grobs=ordplot_list, ncol=4)
plot(g)
ggsave(file.path(PATH_TEST, "results", "ordination_plot_COLLAPSED.png"),
       plot=g, width=6.6, height=7.7, units="in")
```
Get just the legend
```{r}
library(cowplot)
library(grid)
legend <- cowplot::get_legend(
  ggplot(sampledata, aes(x=NMDS1, y=NMDS2)) +
    geom_point(aes(col=truncLen.R, shape=maxEE.R)) +
    scale_shape_manual(expression(maxEE[R]), values=c(21, 22, 24)) +
    scale_color_manual(expression(truncLen[R]), values = c("#E69F00", "#56B4E9", "#CC79A7")) +
    theme(legend.position = "top")
)
grid.newpage()
grid.draw(legend)
ggsave(file.path(PATH_TEST, "results", "ordination_plot_COLLAPSED_legend.png"),
       plot=legend, width=6.6, height=0.5, units="in")
```
Use statistical tests to confirm the effects of the quality filtering parameters. First we normalize the combined sequence table by proportionalizing the counts:
```{r}
# Start by pruning low-sequencing-depth samples and removing uncommon taxa
ps_nonzero_mindepth <- prune_samples(sample_sums(physeq_joined_nonzero) > 2500,
                                     physeq_joined_nonzero)
ps_nonzero_mindepth <- prune_taxa(taxa_sums(ps_nonzero_mindepth) > 15, ps_nonzero_mindepth)
ps_nonzero_mindepth_prop <- transform_sample_counts(ps_nonzero_mindepth, function(x) x/sum(x))
saveRDS(ps_nonzero_mindepth_prop,
        file.path(PATH_TEST, "results", "sensitivity_ps_nonzero_mindepth_prop_COLLAPSED.Rds"))
```
Add more variables into the sampledata for ease of use.
```{r}
sampledata <- as(sample_data(ps_nonzero_mindepth_prop), "data.frame")
sampledata$maxEE.R <- as.factor(sampledata$maxEE.R)
sampledata$truncLen.R <- as.factor(sampledata$truncLen.R)
# `ord_all` refers to an ordination of all samples together; skip this line
# if such an ordination has not been computed
sampledata <- cbind(sampledata,
                    scores(ord_all)[match(rownames(sampledata), rownames(scores(ord_all))),])
sampledata$collectYear <- as.integer(format(as.Date(sampledata$collectDate), "%Y"))
```
First, run permANOVA using all three levels of `truncLen.R`.
```{r}
ps_joined_dist <- vegdist(otu_table(ps_nonzero_mindepth_prop)) # Takes about 2.5 hours
saveRDS(ps_joined_dist,
        file.path(PATH_TEST, "results", "sensitivity_dist_nonzero_mindepth_prop_COLLAPSED.Rds"))
```
```{r}
ps_joined_dist <- readRDS(file.path(PATH_TEST, "results",
                                    "sensitivity_dist_nonzero_mindepth_prop_COLLAPSED.Rds"))
perm <- how(nperm=999)
setBlocks(perm) <- get_variable(ps_nonzero_mindepth_prop, "sampleID")
adonis2(ps_joined_dist ~ maxEE.R + truncLen.R,
        data = as(sample_data(ps_nonzero_mindepth_prop), "data.frame"),
        permutations=perm)
```
Distance between groups may be confounded with heterogeneity of within-group variances (dispersion) in the permANOVA test, so we use betadisper to test separately for heterogeneity of within-group variances.
```{r}
beta_disper <- vegan::betadisper(ps_joined_dist,
                                 get_variable(ps_nonzero_mindepth_prop, "truncLen.R"))
anova(beta_disper)
mod.HSD <- TukeyHSD(beta_disper)
plot(mod.HSD)
```
There is less dispersion among communities that were processed with `truncLen.R` = 170. To confirm whether the significant effect of `truncLen.R` on mean community composition is robust to heterogeneity of dispersion, then, we re-run permANOVA using only the top two levels of `truncLen.R`.
```{r}
ps_nonzero_mindepth_prop_hitrunc <- subset_samples(ps_nonzero_mindepth_prop, truncLen.R != 170)
ps_joined_dist_hitrunc <- vegdist(otu_table(ps_nonzero_mindepth_prop_hitrunc))
saveRDS(ps_joined_dist_hitrunc,
        file.path(PATH_TEST, "results",
                  "sensitivity_dist_nonzero_mindepth_prop_hitrunc_COLLAPSED.Rds"))
```
```{r}
ps_joined_dist_hitrunc <- readRDS(file.path(PATH_TEST, "results",
                                            "sensitivity_dist_nonzero_mindepth_prop_hitrunc_COLLAPSED.Rds"))
perm <- how(nperm=999)
setBlocks(perm) <- get_variable(ps_nonzero_mindepth_prop_hitrunc, "sampleID")
adonis2(ps_joined_dist_hitrunc ~ maxEE.R + truncLen.R,
        data = as(sample_data(ps_nonzero_mindepth_prop_hitrunc), "data.frame"),
        permutations=perm)
```
Now use betadisper to test homogeneity of within-group variances (dispersion).
```{r}
beta_disper_hitrunc <- vegan::betadisper(ps_joined_dist_hitrunc,
                                         get_variable(ps_nonzero_mindepth_prop_hitrunc, "truncLen.R"))
anova(beta_disper_hitrunc)
mod.HSD_hitrunc <- TukeyHSD(beta_disper_hitrunc)
plot(mod.HSD_hitrunc)
```
In conclusion, both `truncLen.R` and `maxEE.R` affect beta-diversity inference, though to different extents. `truncLen.R` affects mean community composition and within-group variance (dispersion) in a non-linear fashion, with drastically different results for the lowest level of `truncLen.R` tested (`truncLen.R == 170`). This is probably due to the large drop-off in read volume at the merging step, caused by insufficient read overlap. When the analysis is constrained to only the higher two levels of `truncLen.R`, both `truncLen.R` and `maxEE.R` are found to be significant in determining mean community composition, but with very small effect sizes.