knitr::opts_chunk$set(fig.width = 7.15, warning = FALSE, message = FALSE)
set.seed(1)

metafiler provides a suite of functions for performing operations commonly applied to metagenomic taxonomic abundance data. This vignette utilizes results generated by MetaPhlan, a metagenomic phylogenetic analysis tool. The data used here comes from profiles of 51 vaginal metagenomic samples from the Human Microbiome Project. All taxa with an abundance of 0.5% in at least one sample are reported.

Data Representation

metafiler represents taxanomic abundance data using Bioconductor's S4 ExpressionSet class. While the ExpressionSet class was originally designed for high-throughput gene expression data, its structure is well suited for representing taxanomic abundance data. Abundance values are stored in the assayData slot as a $m \times n$ matrix, where $m$ is the number of taxa and $n$ is the number of samples, and associated feature and sample metadata are stored in the featureData and sampleData slots, respectively. I will cover basic ExpressionSet operations here but more information can be found on the object help page (?Biobase::ExpressionSet).

The example dataset included with metafiler is an ExpressionSet variable named profiles. Printing out profiles indicates our data includes 30 features (i.e., taxa) and 51 samples.

library(metafiler)
data(profiles)

profiles

We could also obtain this information using dim()

dim(profiles)

Just like a standard R matrix, indexing operations use []s where the first argument corresponds to rows (features) and the second corresponds to columns (samples).

# subset profiles for top 3 taxa across first 5 samples
profiles[1:3, 1:5]

Sample and feature names can be obtained using the sampleNames() and featureNames() functions, respectively, which can also be used for extraction/replacement purproses.

# order taxa alphabetically
profiles <- profiles[sort(featureNames(profiles)), ]

As indicated earlier, sample/feature metadata can be stored in the sampleData and featureData slots of an ExpressionSet object. These data are stored as data.frames where row represents a sample (or feature) and each column represents a variable. For example, we can add a new column to profile's phenoData slot that indicates whether each sample is pregnant or not.

pData(profiles)$pregnant <-
  sample(c(TRUE, FALSE), ncol(profiles), replace = TRUE)

Visualizations

Stacked barplot

We can generate a stacked barplot to visualize each sample's metagenomic taxa profile:

profile_barplot(profiles)

Great but not very informative. Let's use one metafiler's helper functions to improve this plot. We can use reorder_features() to ensure taxa are consistently ordered within each sample's bar based on the average rank of their abundance across all samples.

profiles <- reorder_features(profiles)
<<base-profile-plot>>

We can improve this further by reordering samples based on their dominant taxa. That is, we assign each sample to a group based on its most abundant taxa (sometimes referred to as an enterotype). This is accomplished using assign_max_group(), which assigns each sample to a new group based on the feature with the highest value. The new group variable is saved as a column in the phenoData slot, which can be accessed using pData().

profiles <- add_max_feature(profiles, group = "enterotype")
head(pData(profiles))

If reorder.sample = TRUE (the default) samples are then ordered based on group size and then rank of the dominant feature within each group, which improves the organization of the barplot.

<<base-profile-plot>>

Because these datasets typically include too many taxa for a legend to be useful, legend = FALSE is the default. However, we can add an informative legend limited to just the top taxa using the top.n argument.

profile_barplot(profiles, top.n = 5, legend = TRUE) +
  ggplot2::theme(legend.position = "top")

Dot plots

We can also visualize these data using a dot plot where taxa abundance is represented by the area of each sample's point.

# reload profiles and order taxa by mean abundance
data(profiles)
profiles <- reorder_features(profiles, fun.aggregate = mean)
profile_dotplot(profiles, top.n = 5, max.size = 3, color = "feature")

The color argument for profile_dotplot() can be set to "sample" or "feature" to color code by sample/feature names, or any of the column names for variables stored in the phenoDadta and featureData slots.



aaronwolen/metafiler documentation built on Feb. 16, 2024, 12:41 a.m.