In jmpowers/bouquet: Filter and Augment Biological GC-MS Data

knitr::opts_chunk$set(
  comment = "  ", message=F, warning=F,
  fig.width=7, fig.height=6, highlight=FALSE
)
requireNamespace("vegan", quietly = TRUE)
requireNamespace("knitr", quietly = TRUE)
requireNamespace("rmarkdown", quietly = TRUE)
requireNamespace("magrittr", quietly = TRUE)
requireNamespace("ggplot2", quietly = TRUE)

Getting started

The open-source R package bouquet facilitates easy and transparent preprocessing and GC-MS datasets. The package flexibly accepts integrated and identified peaks from a set of chromatograms and performs customizable filtering steps. Using internal or external standards, emission rates can be quantified and standardized by a measure of biomass. Visualizations of statistics for each compound help determine filtering thresholds.

To begin, install the package from GitHub and load the magrittr package to join multiple functions with the pipe %>%. To build this package vignette you will also need to install the packages listed below. After loading bouquet, help for each function, including details for each argument, can be accessed by typing ?function_name. Bugs or feature requests can be submitted as GitHub issues.

install.packages(c("vegan", "knitr", "rmarkdown", "magrittr", "ggplot2")) #needed to build this vignette
install.packages("remotes") #for installing from github
remotes::install_github("jmpowers/bouquet", build_vignettes = TRUE)

library(bouquet)
library(magrittr)

Data format

The example dataset GCMSfloral is provided with the package and contains 128 floral samples and 16 ambient controls. The floral samples were taken from two species of Schiedea and their hybrids (Cross: KAAL, HOOK, KAHO, HOKA) at two times of day (Time: Day, Night). A subset of these samples is analyzed in Powers et al. (2020). The dataset comprises two tables: GCMS_metadata holds metadata for each sample in the experiment, and GCMS_output holds the integrated and identified chromatographic peaks from all samples.

Metadata

The experimental information about each sample and control is given in the GCMS_metadata table for this example, but would normally be imported from a spreadsheet (for example using GCMS_metadata <- read.csv("GCMS_metadata.csv")). Once that is done, load_metadata() identifies the important columns that will be used later, and standardizes their names. Two columns are required: sample identifies the sample uniquely, and type specifies whether the row represents a floral sample or an ambient or vegetative control (must be either "floral", "ambient", or "leaf"). The optional group argument indicates the column(s) with experimental groups. Users may also provide the date of sampling and the amount of sample (e.g. flower number or mass) that can be used for standardization later (with standardize_finaltable()).

data(GCMSfloral)
metadata <- load_metadata(GCMS_metadata, date = "SampleDate", 
                          sample = "Filename", group = c("Cross", "Time"), 
                          type = "Type", amount = "Flrs")

metadata[c(56,75,84,104,1,17),]

GCMS data

In this example, GCMS_output contains a row for each peak in each floral and ambient control sample, including the sample name, retention time, integrated peak area, compound identification, and best match (similarity) score to the mass spectral library. This table can be generated by picking, integrating, and identifying peaks using software provided by the instrument manufacturer or freely available tools such as AMDIS or OpenChrom. load_longdata() standardizes the column names and the match scores.

longdata <- load_longdata(GCMS_output, sample = "Sample", RT = "RT", 
                          name = "Name", area = "Area", 
                          match = "Match", maxmatch=100)
longdata[1:4,]

make_sampletable() reshapes the data into a table of peak areas organized with samples in rows and compounds in columns. Only a subset of the r ncol(sampletable) columns are shown here.

sampletable <- make_sampletable(longdata, metadata)

sampletable[c(56,75,84,104,1,17),c("oct-1-en-3-ol", "(Z)-hex-3-en-1-ol", "Caprolactam")]

Filtering

make_chemtable() calculates descriptive statistics for each compound including:

the mean, median, and maximum (max) peak area in floral samples and ambient controls (the "nonzero" columns exclude zeros from the mean)
the mean and variance (var) of the match score (match) and retention time (RT)
the count and frequency (freq) of occurrence in ambient controls, floral samples, and each experimental group specified in [Data format] (in this case the cross abbreviations KAAL, HOOK, KAHO, HOKA)

If the original longdata is not available (e.g. in the case of a published study), make_chemtable() can also accept a wide-format sampletable: make_chemtable(sampletable, metadata, datatype="wide").

chemtable <- make_chemtable(longdata, metadata)

str(chemtable, vec.len=1)

Eleven filter_*() functions have been implemented that append a column with the verdict for each compound; view the complete listing and help for each argument with help(package = "bouquet"). The example here uses filters for:

filter_RT(): retention time between 4 and 15 minutes
filter_match(): match score above 80%
filter_freq(): minimum frequency of 10% across all floral samples and within each experimental group (generates one filter for each group)
filter_contaminant(): a manual contaminant list, which here includes any compound with "siloxane" in the name
filter_area(): maximum peak area across all samples.

The last filter, filter_ambient_ttest(), performs t-tests of the compound peak areas observed in samples vs. ambient controls and corrects for multiple comparisons using the false discovery rate method. Alternatively, filter_ambient_presence() excludes any compounds found in controls, and filter_ambient_ratio() and filter_leaf_ratio() require the mean area across samples to exceed the mean area in ambient or vegetative controls by a specified ratio. To compare samples to controls taken on the same date, use filter_ambient_date().

chemtable <- chemtable %>% 
  filter_RT(4, 15) %>% 
  filter_match(0.8) %>% 
  filter_freq(0.2, group = TRUE) %>% 
  filter_contaminant(cont.list = c("Caprolactam", grep("siloxane", chemtable$name, value=TRUE))) %>% 
  filter_area(min_maximum = 5e5) %>%
  filter_ambient_ratio(sampletable, metadata, ratio = 4) %>% 
  filter_ambient_ttest(sampletable, metadata, 
                       alpha = 0.05, adjust = "fdr")

str(chemtable[,30:43], vec.len=1)

To tally the number of filters that each compound passed (filters_passed), use combine_filters(), and then leave the filter_final column as the default (compound must meet all specified filters), or customize it with your own logical combination of the filter columns. In this example, compounds are marked for inclusion in the filter_final column if they pass the ratio test or t-test against the ambient controls (| is logical OR) and pass the retention time filter and occur in 10% of floral samples (& is logical AND). You can see the filters that each compound passed using View(chemtable).

chemtable <- chemtable %>% 
  combine_filters() %>% 
  within(filter_final <- 
              (filter_ambient_ratio == "OK" | filter_ambient_ttest == "OK") & filter_RT == "OK" & filter_freq.floral == "OK")

str(chemtable[,44:45])

#save.image(file = "./data/GCMSfloral.rda")

Visualizing filters

You can visualize the filtering criteria in several ways with plot_filters().

Proportion that pass

The "prop" option shows the proportion of compounds that passed each filter (OK), or gives the reason why they failed. In this case only a small fraction of compounds passed the required combination of filters (filter_final).

plot_filters(chemtable, option = "prop")

Rarity and amount

The "rarity" option creates a plot of compound occurrence (x-axis) and maximum peak area (y-axis, log scale) in floral samples. The size of the point reflects the mean peak area (excluding zeros). The ambient ratio (bottom line of each label) shows how areas compare in samples versus controls. The dashed lines show the applied frequency (vertical line) and area (horizontal line) filters. The point colors show whether the compound was included in filter_final (green, passed the logical combination of criteria specified above), passed the maximum area and frequency filters but not filter_final (orange, upper-right quadrant) or did not meet both the maximum area and frequency filters (red).

plot_filters(chemtable, option = "rarity", yrange=2.5, fontsize=2.5, pointsize=15, alpha_excluded=0.5)

Comparison with controls

The "ambient" option creates a plot comparing mean peak areas in floral samples (y-axis, log scale) vs. ambient controls (x-axis, log scale). The size of the point reflects the frequency in floral samples. The dashed line shows the applied ambient ratio filter, and the solid 1:1 line shows where mean amounts in samples and controls are equal. The point colors show whether the compound was included in filter_final (green, passed the logical combination of criteria specified above), passed the t-test against ambient controls but not filter_final (orange) or did not pass the t-test (red).

plot_filters(chemtable, option = "ambient", yrange=4, pointsize=6)

The "volcano" option creates a plot of the statistical significance (y-axis, negative log10 adjusted P-value) of the t-test between floral samples and ambient controls, vs. the ambient ratio (x-axis, also known as fold change, log2). This plot is often used in gene expression studies to compare two treatments. The dashed vertical line shows the applied ambient ratio filter, and the dashed horizontal line shows the applied alpha threshold for p-values. The point color and size are the same as for the "ambient" option.

plot_filters(chemtable, option = "volcano", yrange=NA, pointsize=6)

Finish and analyze

To cut out the compounds that did not meet filter_final, use prune_sampletable().

finaltable <- prune_sampletable(sampletable, chemtable, metadata)
finalmetadata <- subset(metadata, type == "floral")

dim(finaltable)

In this example, only r ncol(finaltable) of the r ncol(sampletable) original compounds remain in the final filtered table. The filtered table is now ready for analysis. As an example, with the vegan package you can plot volatiles accumulation curves, conduct a PERMANOVA for the effect of hybrid cross, run a CAP (canonical analysis of principal coordinates) ordination, and plot a heatmap.

Accumulation curves

Accumulation curves show how the number of (filtered) volatiles detected increases with the number of sampled individuals within each experimental group (in this case, cross type). Means and standard errors are generated by permuting the data. If the curve does not saturate, additional sampling may be required.

for(cr in 1:nlevels(finalmetadata$Cross)) {
  Cross_subset <- finaltable[finalmetadata$Cross==levels(finalmetadata$Cross)[cr],]
  Cross_sa <- vegan::specaccum(Cross_subset, permutations = 999) 
  plot(Cross_sa, col = cr+1, add = cr != 1, lwd=2, xlim=c(1,max(table(finalmetadata$Cross))),
       xlab="Samples", ylab="Compounds", main="Volatiles accumulation curve")
}
legend("bottomright", levels(finalmetadata$Cross), 
       col = 2:5, pch = 19, title = "Cross")

PERMANOVA

library(vegan)
(permanova <- vegan::adonis2(finaltable ~ Cross, data = finalmetadata))

CAP

(cap <- vegan::capscale(finaltable ~ Cross, data = finalmetadata, 
                 distance = "bray"))
plot(cap, type = "n", 
     main = paste("CAP of", ncol(finaltable), "volatiles in", 
                  nrow(finaltable), "samples vs. cross type"))
legend("bottomright", levels(finalmetadata$Cross), 
       col = 2:5, pch = 19, title = "Cross")
points(cap, col = as.integer(finalmetadata$Cross)+1, pch = 19)
text(cap, display = "species", col="grey50")
text(cap, display = "cn", labels=levels(finalmetadata$Cross))

Heatmap

heatmap(as.matrix(finaltable)^(1/4), 
        scale = "none", distfun = vegdist,
        cexCol = 1, margins = c(15, 0), labRow = NA,
        col=colorRampPalette(c("white","darkblue"))(256),
        RowSideColors = as.character(as.integer(finalmetadata$Cross)+1))
legend("bottomleft", levels(finalmetadata$Cross), title = "Cross",
       col = 2:5, pch = 19, inset=c(0,-0.2),  xpd=T)

jmpowers/bouquet documentation built on Nov. 30, 2024, 12:02 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jmpowers/bouquet
Filter and Augment Biological GC-MS Data

In jmpowers/bouquet: Filter and Augment Biological GC-MS Data

Getting started

Data format

Metadata

GCMS data

Filtering

Visualizing filters

Proportion that pass

Rarity and amount

Comparison with controls

Finish and analyze

Accumulation curves

PERMANOVA

CAP

Heatmap

R Package Documentation

Browse R Packages

We want your feedback!

jmpowers/bouquet Filter and Augment Biological GC-MS Data

In jmpowers/bouquet: Filter and Augment Biological GC-MS Data

Getting started

Data format

Metadata

GCMS data

Filtering

Visualizing filters

Proportion that pass

Rarity and amount

Comparison with controls

Finish and analyze

Accumulation curves

PERMANOVA

CAP

Heatmap

R Package Documentation

Browse R Packages

We want your feedback!

jmpowers/bouquet
Filter and Augment Biological GC-MS Data