knitr::opts_chunk$set( comment = " ", message=F, warning=F, fig.width=7, fig.height=6, highlight=FALSE ) requireNamespace("vegan", quietly = TRUE) requireNamespace("knitr", quietly = TRUE) requireNamespace("rmarkdown", quietly = TRUE) requireNamespace("magrittr", quietly = TRUE) requireNamespace("ggplot2", quietly = TRUE)
The open-source R package bouquet
facilitates easy and transparent preprocessing and GC-MS datasets. The package flexibly accepts integrated and identified peaks from a set of chromatograms and performs customizable filtering steps. Using internal or external standards, emission rates can be quantified and standardized by a measure of biomass. Visualizations of statistics for each compound help determine filtering thresholds.
To begin, install the package from GitHub and load the magrittr
package to join multiple functions with the pipe %>%
. To build this package vignette you will also need to install the packages listed below. After loading bouquet
, help for each function, including details for each argument, can be accessed by typing ?function_name
. Bugs or feature requests can be submitted as GitHub issues.
install.packages(c("vegan", "knitr", "rmarkdown", "magrittr", "ggplot2")) #needed to build this vignette install.packages("remotes") #for installing from github remotes::install_github("jmpowers/bouquet", build_vignettes = TRUE)
library(bouquet) library(magrittr)
The example dataset GCMSfloral
is provided with the package and contains 128 floral samples and 16 ambient controls. The floral samples were taken from two species of Schiedea and their hybrids (Cross: KAAL, HOOK, KAHO, HOKA) at two times of day (Time: Day, Night). A subset of these samples is analyzed in Powers et al. (2020). The dataset comprises two tables: GCMS_metadata
holds metadata for each sample in the experiment, and GCMS_output
holds the integrated and identified chromatographic peaks from all samples.
The experimental information about each sample and control is given in the GCMS_metadata
table for this example, but would normally be imported from a spreadsheet (for example using GCMS_metadata <- read.csv("GCMS_metadata.csv")
). Once that is done, load_metadata()
identifies the important columns that will be used later, and standardizes their names. Two columns are required: sample
identifies the sample uniquely, and type
specifies whether the row represents a floral sample or an ambient or vegetative control (must be either "floral", "ambient", or "leaf"). The optional group
argument indicates the column(s) with experimental groups. Users may also provide the date
of sampling and the amount
of sample (e.g. flower number or mass) that can be used for standardization later (with standardize_finaltable()
).
data(GCMSfloral) metadata <- load_metadata(GCMS_metadata, date = "SampleDate", sample = "Filename", group = c("Cross", "Time"), type = "Type", amount = "Flrs") metadata[c(56,75,84,104,1,17),]
In this example, GCMS_output
contains a row for each peak in each floral and ambient control sample, including the sample name, retention time, integrated peak area, compound identification, and best match (similarity) score to the mass spectral library. This table can be generated by picking, integrating, and identifying peaks using software provided by the instrument manufacturer or freely available tools such as AMDIS or OpenChrom. load_longdata()
standardizes the column names and the match scores.
longdata <- load_longdata(GCMS_output, sample = "Sample", RT = "RT", name = "Name", area = "Area", match = "Match", maxmatch=100) longdata[1:4,]
make_sampletable()
reshapes the data into a table of peak areas organized with samples in rows and compounds in columns. Only a subset of the r ncol(sampletable)
columns are shown here.
sampletable <- make_sampletable(longdata, metadata) sampletable[c(56,75,84,104,1,17),c("oct-1-en-3-ol", "(Z)-hex-3-en-1-ol", "Caprolactam")]
make_chemtable()
calculates descriptive statistics for each compound including:
If the original longdata is not available (e.g. in the case of a published study), make_chemtable()
can also accept a wide-format sampletable: make_chemtable(sampletable, metadata, datatype="wide")
.
chemtable <- make_chemtable(longdata, metadata) str(chemtable, vec.len=1)
Eleven filter_*()
functions have been implemented that append a column with the verdict for each compound; view the complete listing and help for each argument with help(package = "bouquet")
. The example here uses filters for:
filter_RT()
: retention time between 4 and 15 minutesfilter_match()
: match score above 80%filter_freq()
: minimum frequency of 10% across all floral samples and within each experimental group (generates one filter for each group)filter_contaminant()
: a manual contaminant list, which here includes any compound with "siloxane" in the namefilter_area()
: maximum peak area across all samples. The last filter, filter_ambient_ttest()
, performs t-tests of the compound peak areas observed in samples vs. ambient controls and corrects for multiple comparisons using the false discovery rate method. Alternatively, filter_ambient_presence()
excludes any compounds found in controls, and filter_ambient_ratio()
and filter_leaf_ratio()
require the mean area across samples to exceed the mean area in ambient or vegetative controls by a specified ratio. To compare samples to controls taken on the same date, use filter_ambient_date()
.
chemtable <- chemtable %>% filter_RT(4, 15) %>% filter_match(0.8) %>% filter_freq(0.2, group = TRUE) %>% filter_contaminant(cont.list = c("Caprolactam", grep("siloxane", chemtable$name, value=TRUE))) %>% filter_area(min_maximum = 5e5) %>% filter_ambient_ratio(sampletable, metadata, ratio = 4) %>% filter_ambient_ttest(sampletable, metadata, alpha = 0.05, adjust = "fdr") str(chemtable[,30:43], vec.len=1)
To tally the number of filters that each compound passed (filters_passed
), use combine_filters()
, and then leave the filter_final
column as the default (compound must meet all specified filters), or customize it with your own logical combination of the filter columns. In this example, compounds are marked for inclusion in the filter_final
column if they pass the ratio test or t-test against the ambient controls (|
is logical OR) and pass the retention time filter and occur in 10% of floral samples (&
is logical AND). You can see the filters that each compound passed using View(chemtable)
.
chemtable <- chemtable %>% combine_filters() %>% within(filter_final <- (filter_ambient_ratio == "OK" | filter_ambient_ttest == "OK") & filter_RT == "OK" & filter_freq.floral == "OK") str(chemtable[,44:45])
#save.image(file = "./data/GCMSfloral.rda")
You can visualize the filtering criteria in several ways with plot_filters()
.
The "prop" option shows the proportion of compounds that passed each filter (OK
), or gives the reason why they failed. In this case only a small fraction of compounds passed the required combination of filters (filter_final
).
plot_filters(chemtable, option = "prop")
The "rarity" option creates a plot of compound occurrence (x-axis) and maximum peak area (y-axis, log scale) in floral samples. The size of the point reflects the mean peak area (excluding zeros). The ambient ratio (bottom line of each label) shows how areas compare in samples versus controls. The dashed lines show the applied frequency (vertical line) and area (horizontal line) filters. The point colors show whether the compound was included in filter_final
(green, passed the logical combination of criteria specified above), passed the maximum area and frequency filters but not filter_final
(orange, upper-right quadrant) or did not meet both the maximum area and frequency filters (red).
plot_filters(chemtable, option = "rarity", yrange=2.5, fontsize=2.5, pointsize=15, alpha_excluded=0.5)
The "ambient" option creates a plot comparing mean peak areas in floral samples (y-axis, log scale) vs. ambient controls (x-axis, log scale). The size of the point reflects the frequency in floral samples. The dashed line shows the applied ambient ratio filter, and the solid 1:1 line shows where mean amounts in samples and controls are equal. The point colors show whether the compound was included in filter_final
(green, passed the logical combination of criteria specified above), passed the t-test against ambient controls but not filter_final
(orange) or did not pass the t-test (red).
plot_filters(chemtable, option = "ambient", yrange=4, pointsize=6)
The "volcano" option creates a plot of the statistical significance (y-axis, negative log10 adjusted P-value) of the t-test between floral samples and ambient controls, vs. the ambient ratio (x-axis, also known as fold change, log2). This plot is often used in gene expression studies to compare two treatments. The dashed vertical line shows the applied ambient ratio filter, and the dashed horizontal line shows the applied alpha threshold for p-values. The point color and size are the same as for the "ambient" option.
plot_filters(chemtable, option = "volcano", yrange=NA, pointsize=6)
To cut out the compounds that did not meet filter_final, use prune_sampletable()
.
finaltable <- prune_sampletable(sampletable, chemtable, metadata) finalmetadata <- subset(metadata, type == "floral") dim(finaltable)
In this example, only r ncol(finaltable)
of the r ncol(sampletable)
original compounds remain in the final filtered table. The filtered table is now ready for analysis. As an example, with the vegan
package you can plot volatiles accumulation curves, conduct a PERMANOVA for the effect of hybrid cross, run a CAP (canonical analysis of principal coordinates) ordination, and plot a heatmap.
Accumulation curves show how the number of (filtered) volatiles detected increases with the number of sampled individuals within each experimental group (in this case, cross type). Means and standard errors are generated by permuting the data. If the curve does not saturate, additional sampling may be required.
for(cr in 1:nlevels(finalmetadata$Cross)) { Cross_subset <- finaltable[finalmetadata$Cross==levels(finalmetadata$Cross)[cr],] Cross_sa <- vegan::specaccum(Cross_subset, permutations = 999) plot(Cross_sa, col = cr+1, add = cr != 1, lwd=2, xlim=c(1,max(table(finalmetadata$Cross))), xlab="Samples", ylab="Compounds", main="Volatiles accumulation curve") } legend("bottomright", levels(finalmetadata$Cross), col = 2:5, pch = 19, title = "Cross")
library(vegan) (permanova <- vegan::adonis2(finaltable ~ Cross, data = finalmetadata))
(cap <- vegan::capscale(finaltable ~ Cross, data = finalmetadata, distance = "bray")) plot(cap, type = "n", main = paste("CAP of", ncol(finaltable), "volatiles in", nrow(finaltable), "samples vs. cross type")) legend("bottomright", levels(finalmetadata$Cross), col = 2:5, pch = 19, title = "Cross") points(cap, col = as.integer(finalmetadata$Cross)+1, pch = 19) text(cap, display = "species", col="grey50") text(cap, display = "cn", labels=levels(finalmetadata$Cross))
heatmap(as.matrix(finaltable)^(1/4), scale = "none", distfun = vegdist, cexCol = 1, margins = c(15, 0), labRow = NA, col=colorRampPalette(c("white","darkblue"))(256), RowSideColors = as.character(as.integer(finalmetadata$Cross)+1)) legend("bottomleft", levels(finalmetadata$Cross), title = "Cross", col = 2:5, pch = 19, inset=c(0,-0.2), xpd=T)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.