Integration of chem16S with phyloseq

Zc <- "<i>Z</i><sub>C</sub>"
nH2O <- "<i>n</i>H<sub>2</sub>O"
nO2 <- "<i>n</i>O<sub>2</sub>"
H2O <- "H<sub>2</sub>O"
O2 <- "O<sub>2</sub>"
oldopt <- options(width = 80)
# Use pngquant to optimize PNG images
library(knitr)
knit_hooks$set(pngquant = hook_pngquant)
pngquant <- "--speed=1 --quality=0-25"
if (!nzchar(Sys.which("pngquant"))) pngquant <- NULL
# To make warnings appear in text box 20230619
# https://selbydavid.com/2017/06/18/rmarkdown-alerts/
knitr::knit_hooks$set(
   error = function(x, options) {
     paste('\n\n<div class="alert alert-danger">',
           gsub('##', '\n', gsub('^##\ Error', '**Error:**', x)),
           '</div>', sep = '\n')
   },
   warning = function(x, options) {
     paste('\n\n<div class="alert alert-warning">',
           gsub('##', '\n', gsub('^##\ Warning:', '**Warning:**', x)),
           '</div>', sep = '\n')
   },
   message = function(x, options) {
     paste('\n\n<div class="alert alert-info">',
           gsub('##', '\n', x),
           '</div>', sep = '\n')
   }
)
## Don't evaluate chunks if phyloseq is not available 20230619
#if(!requireNamespace("phyloseq", quietly = TRUE)) {
#  knitr::opts_chunk$set(eval = FALSE)
#  day <- imin <- AA.RDP <- map.RDP <- map.silva <- NULL
#  warning("The **phyloseq** package is not available, so this vignette shows only the code without the results.")
#}

This vignette demonstrates the integration of chem16S with phyloseq to calculate chemical metrics of community reference proteomes.

Load required packages

library(chem16S)
library(phyloseq)
library(ggplot2)
theme_set(theme_bw())

This vignette was compiled on r Sys.Date() with chem16S version r sessionInfo()$otherPkgs$chem16S$Version and phyloseq version r sessionInfo()$otherPkgs$phyloseq$Version.

Mouse microbiome data

The mothur MiSeq SOP has an example dataset with 16S rRNA gene sequences for mouse gut communities, which was extracted from the complete dataset reported by @SSZ+12. This example dataset includes data collected at 0--9 (early) and 141-150 (late) days post-weaning from one mouse. This dataset is used in the DADA2 Pipeline Tutorial to demonstrate construction of amplicon sequence variants (ASV), taxonomic classification of the ASVs, and analysis using the phyloseq package. Here, we load the phyloseq-class object that was created on 2023-06-14 using dada2 (version 1.28.0) by following the steps in the DADA2 pipeline tutorial (version 1.16). Note that taxonomy was assigned to genus level using a newer version of the Silva training set (silva_nr99_v138.1_train_set.fa.gz) than that indicated in the DADA2 tutorial on this date (silva_nr_v132_train_set.fa.gz).

data(mouse.silva)
mouse.silva

To start to visualize this data, let's recreate the bar plot from the DADA2 tutorial:

top20.silva <- names(sort(taxa_sums(mouse.silva), decreasing = TRUE))[1:20]
mouse.silva.top20 <- transform_sample_counts(mouse.silva, function(OTU) OTU/sum(OTU))
mouse.silva.top20 <- prune_taxa(top20.silva, mouse.silva.top20)
plot_bar(mouse.silva.top20, x = "Day", fill = "Family") + facet_wrap(~When, scales = "free_x")

Lowest-level taxa

The chem16S package contains the amino acid compositions of archaeal, bacterial, and viral taxa at levels from genus to phylum constructed from the RefSeq database; see the vignette Chemical metrics of reference proteomes for more information. There are denoted as reference proteomes for taxa. Several functions are available to identify the lowest-level taxon for each classified read (or in this case, each ASV), combine the reference proteomes for all taxa to create community reference proteomes, and calculate chemical metrics from them.

First, let's make a table that lists the lowest-level taxon for each ASV and their abundances. Because chem16S was first developed to analyze the output from the RDP Classifier, this data frame has a format that is similar to the file created by using the classify -h command of the RDP Classifier followed by the merge-count command. That is, the data frame has columns named taxid, rank, lineage, and name, followed by columns for all samples. Because of the long text in the taxid column (see below), let's look at the other columns first.

tc.silva <- ps_taxacounts(mouse.silva)
head(tc.silva[, -1])

ASVs with identical lowest-level taxonomic assignments are merged, so this data frame has fewer rows than the number of ASVs. The short names of the ASVs are stored in the taxid column, and the names of multiple ASVs with the same lowest-level taxonomic classification are separated by a semicolon.

head(tc.silva$taxid)

As explained in the DADA2 tutorial, the ASVs themselves (that is, the DNA sequences, not their short names) are available in the phyloseq-class object, but they are not used here. Now let's list the number of unique lowest-level classifications at each taxonomic level:

table(tc.silva$rank)

These taxa will be used below to construct community reference proteomes.

Taxonomic mapping

A challenge of using reference proteomes is that they come from one source (GTDB and RefSeq are available in chem16S), but taxonomic assignments may be made using a different training set (such as Silva or RDP). While identical names and ranks can be used to automatically match taxonomic assignments to reference proteomes, inherent differences in available taxonomies pose difficulties. For example, RDP uses the name Escherichia/Shigella, for which the closest correspondence in RefSeq is Escherichia. This and other manual mappings were used by @DT23 for mapping between RDP and RefSeq.

Let's analyze a phyloseq-class object that was made using the Silva training set but use the manual mapping that was designed to map from RDP to RefSeq.

::: {.infobox .note} The manual mapping from RDP to RefSeq was not designed for taxonomic assignments made using Silva. This is just an example to show the limitations of the technique. :::

map.silva <- map_taxa(tc.silva, refdb = "RefSeq_206")

The messages show that one group was manually mapped, and the total mapping rate (including both automatic name matching and manual mapping) is r round(100 - sum(attr(map.silva, "unmapped_percent")), 1)%.

Not let's analyze a phyloseq-class object that was made using the RDP training set. This results in a considerably higher mapping rate to the NCBI taxonomy.

data(mouse.RDP)
tc.RDP <- ps_taxacounts(mouse.RDP)
map.RDP <- map_taxa(tc.RDP, refdb = "RefSeq_206")

Now the total mapping rate is r round(100 - sum(attr(map.RDP, "unmapped_percent")), 1)%.

::: {.infobox .note} Manual mapping does not overcome inconsistencies between different taxonomies but is simply a starting point for getting the most out of the available taxonomic classifications and reference proteomes. Using the Genome Taxonomy Database (GTDB) for both taxonomic classification and reference proteomes obviates the need for manual mapping and is described further below. :::

From taxonomy to amino acid composition

The ps_metrics() function in chem16S wraps the previous two functions (ps_taxacounts() and map_taxa()) as well as another function (get_metrics()). In order, these functions get the lowest-level taxon for each OTU or ASV, map taxonomic names to RefSeq, and combine taxonomic abundances with reference proteomes for taxa to calculate the amino acid composition of community reference proteomes, which are then used to calculate chemical metrics. To peek into this process, let's begin by listing the amino acid compositions of some community reference proteomes.

AA.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE, return_AA = TRUE)
head(AA.RDP)

In this data frame, columns named Run and chains have the sample names and abundance-weighted numbers of reference proteomes -- that is, the total abundance of ASVs that have taxonomic classifications mapped to the NCBI taxonomy -- that are used to compute the amino acid composition. We can use canprot::calc_metrics() to calculate the average protein length of each community reference proteome, then make a histogram:

lengths <- canprot::calc_metrics(AA.RDP, "length")[, 1]
hist(lengths)
imin <- which.min(lengths)
text(lengths[imin], 1.5, AA.RDP$Run[imin], adj = 1)

Compared to the other samples, that from Day r if(!is.null(AA.RDP)) strsplit(AA.RDP$Run[imin], "D")[[1]][2] has the lowest average protein length of the community reference proteome.

From taxonomy to chemical metrics

We're now ready to calculcate chemical metrics from the amino acid compositions.

metrics.RDP <- ps_metrics(mouse.RDP, refdb = "RefSeq_206", quiet = TRUE)
head(metrics.RDP)

This data frame has columns for r Zc (carbon oxidation state), r nO2 (stoichiometric oxidation state), and r nH2O (stoichiometric hydration state). r Zc is computed from the elemental formula, while r nO2 and r nH2O are coefficients on r O2 and r H2O in theoretical formation reactions of the proteins, per residue, from a particular set of thermodynamic components, also known as basis species. The rationale for the choice of basis species used here (glutamine, glutamic acid, cysteine, r O2, and r H2O, abbreviated as QEC) was given by @DYT20. Because they are both measures of oxidation state, in general there is a strong positive correlation between r Zc and r nO2. For this dataset, there is in addition a strong negative correlation between r Zc and r nH2O, but other datasets do not show such a correlation.

cor(metrics.RDP$Zc, metrics.RDP$nO2)
cor(metrics.RDP$Zc, metrics.RDP$nH2O)

Let's plot each of these chemical metrics against the sampling day.

plot_ps_metrics(mouse.RDP, x = "Day", color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) +
  geom_point(size = 3)

Let's plot two chemical metrics against each other.

plot_ps_metrics2(mouse.RDP, color = "When", shape = "When", refdb = "RefSeq_206", quiet = TRUE) +
  geom_point(size = 3)

The community reference proteomes for Late samples tend to be more oxidized and less hydrated (i.e., they have higher r Zc and lower r nH2O) than Early samples, but there is some overlap between the groups.

Using GTDB for reference proteomes

As noted above, mapping taxon names from Silva or RDP to RefSeq is not perfect. To overcome this limitation, the GTDB can be used for both taxonomic classification of 16S rRNA gene sequences and for reference proteomes of taxa. Here we load a phyloseq-class object prepared following the DADA2 Pipeline Tutorial modified to classify 16S rRNA gene sequences using GTDB r220 [@Ali24]. Then, we plot chemical metrics calculated for reference proteomes also derived from the GTDB, but a later version. (The reason for this is that DADA2-formatted 16S rRNA sequences from GTDB r220 are not yet available in the Zenodo repository maintained by @Ali24.)

data(mouse.GTDB_220)
plot_ps_metrics2(mouse.GTDB_220, refdb = "GTDB_220", color = "When", shape = "When") + geom_point(size = 3)

Compared to using the RDP for classification of 16S rRNA sequences and RefSeq for reference proteomes, using the GTDB for both yields a clearer separation of Early and Late samples in chemical space. There is one Early sample with anomalously high r Zc and low r nH2O; let's identify it:

metrics.GTDB <- ps_metrics(mouse.GTDB_220)
is.early <- sample_data(mouse.GTDB_220)$When == "Early"
iout <- which.min(metrics.GTDB[is.early, ]$nH2O)
(day <- sample_data(mouse.GTDB_220)[is.early, ]$Day[iout])

This is the sample from day r day. Let's look again at the taxonomic classifications, this time at the phylum level in GTDB:

top20.GTDB <- names(sort(taxa_sums(mouse.GTDB_220), decreasing = TRUE))[1:20]
mouse.GTDB_220.top20 <- transform_sample_counts(mouse.GTDB_220, function(OTU) OTU/sum(OTU))
mouse.GTDB_220.top20 <- prune_taxa(top20.GTDB, mouse.GTDB_220.top20)
plot_bar(mouse.GTDB_220.top20, x = "Day", fill = "Phylum") + facet_wrap(~When, scales = "free_x")

This plot suggests that a relatively high proportion of Bacteroidota makes the Day r day sample more similar to samples in the Late group. Bacteroidetes (an older name for Bacteroidota) have some of the lowest r nH2O among major bacterial phyla [see @DT23], which could partially explain the low r nH2O observed for the Day r day sample.

Take-home messages

  1. Taxonomic mapping from RDP to RefSeq is incomplete. If you use this, it is advisable to report the percentage of mapped taxonomic classifications as well as the major unmapped groups, which can be listed with the quiet = FALSE argument to map_taxa(), ps_metrics(), plot_ps_metrics(), and plot_ps_metrics2().
  2. Consistent taxonomy can be achieved by using the GTDB for both reference proteomes and taxonomic classification of 16S rRNA gene sequences. For the dataset analyzed here, this leads to a clearer separation between sample groups in chemical space and helps to identify the taxonomic basis for possible outliers.
  3. Gut communities in later stages of mouse growth are characterized by relatively oxidized and dehydrated reference proteomes.

Beyond conventional diversity metrics used for microbiome analysis, visualizing the chemical metrics of community reference proteomes can provide new insight into genomic adaptation. See the next vignette (Plotting two chemical metrics) for more examples.

References

options(oldopt)


Try the chem16S package in your browser

Any scripts or data that you put into this service are public.

chem16S documentation built on April 3, 2025, 10:47 p.m.