knitr::opts_chunk$set(fig.width = 7.15, warning = FALSE, message = FALSE) set.seed(1)
metafiler provides a suite of functions for performing operations commonly applied to metagenomic taxonomic abundance data. This vignette utilizes results generated by MetaPhlan, a metagenomic phylogenetic analysis tool. The data used here comes from profiles of 51 vaginal metagenomic samples from the Human Microbiome Project. All taxa with an abundance of 0.5% in at least one sample are reported.
metafiler represents taxanomic abundance data using Bioconductor's S4 ExpressionSet
class. While the ExpressionSet
class was originally designed for high-throughput gene expression data, its structure is well suited for representing taxanomic abundance data. Abundance values are stored in the assayData
slot as a $m \times n$ matrix, where $m$ is the number of taxa and $n$ is the number of samples, and associated feature and sample metadata are stored in the featureData
and sampleData
slots, respectively. I will cover basic ExpressionSet
operations here but more information can be found on the object help page (?Biobase::ExpressionSet
).
The example dataset included with metafiler is an ExpressionSet
variable named profiles
. Printing out profiles
indicates our data includes 30 features (i.e., taxa) and 51 samples.
library(metafiler) data(profiles) profiles
We could also obtain this information using dim()
dim(profiles)
Just like a standard R matrix
, indexing operations use []
s where the first argument corresponds to rows (features) and the second corresponds to columns (samples).
# subset profiles for top 3 taxa across first 5 samples profiles[1:3, 1:5]
Sample and feature names can be obtained using the sampleNames()
and featureNames()
functions, respectively, which can also be used for extraction/replacement purproses.
# order taxa alphabetically profiles <- profiles[sort(featureNames(profiles)), ]
As indicated earlier, sample/feature metadata can be stored in the sampleData
and featureData
slots of an ExpressionSet
object. These data are stored as data.frame
s where row represents a sample (or feature) and each column represents a variable. For example, we can add a new column to profile
's phenoData
slot that indicates whether each sample is pregnant or not.
pData(profiles)$pregnant <- sample(c(TRUE, FALSE), ncol(profiles), replace = TRUE)
We can generate a stacked barplot to visualize each sample's metagenomic taxa profile:
profile_barplot(profiles)
Great but not very informative. Let's use one metafiler's helper functions to improve this plot. We can use reorder_features()
to ensure taxa are consistently ordered within each sample's bar based on the average rank of their abundance across all samples.
profiles <- reorder_features(profiles) <<base-profile-plot>>
We can improve this further by reordering samples based on their dominant taxa. That is, we assign each sample to a group based on its most abundant taxa (sometimes referred to as an enterotype). This is accomplished using assign_max_group()
, which assigns each sample to a new group based on the feature with the highest value. The new group variable is saved as a column in the phenoData
slot, which can be accessed using pData()
.
profiles <- add_max_feature(profiles, group = "enterotype") head(pData(profiles))
If reorder.sample = TRUE
(the default) samples are then ordered based on group size and then rank of the dominant feature within each group, which improves the organization of the barplot.
<<base-profile-plot>>
Because these datasets typically include too many taxa for a legend to be useful, legend = FALSE
is the default. However, we can add an informative legend limited to just the top taxa using the top.n
argument.
profile_barplot(profiles, top.n = 5, legend = TRUE) + ggplot2::theme(legend.position = "top")
We can also visualize these data using a dot plot where taxa abundance is represented by the area of each sample's point.
# reload profiles and order taxa by mean abundance data(profiles) profiles <- reorder_features(profiles, fun.aggregate = mean) profile_dotplot(profiles, top.n = 5, max.size = 3, color = "feature")
The color
argument for profile_dotplot()
can be set to "sample"
or "feature"
to color code by sample/feature names, or any of the column names for variables stored in the phenoDadta
and featureData
slots.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.