This is a package to combine counts, sample metadata, and taxonomic annotations into one long-form tidy dataframe for further analysis. It offers functions to switch between phyloseq objects and these 'agglomerated' dataframes.
devtools::install_github("eclarke/metaglomr")
suppressPackageStartupMessages(library(tidyverse))
library(metaglomr)

data(features)  # A counts matrix, with rows as samples and columns as features
data(samples)   # A dataframe with sample metadata
data(taxa)      # A dataframe or matrix with taxonomic annotations for features

# Combine all of these datasets into one long-form dataframe
(agg <- agglomerated(features, samples, taxa, "sample_id"))
In metagenomics, we often work with three distinct datasets: a feature count table, a sample table, and a taxonomic annotation table.
The feature count table is a matrix of features $\times$ samples, and each cell of the matrix is the number of times that feature was seen in that sample. Features can be OTUs, ASVs, species, genomes, or any other countable unit. Often these come from a biom file.
data(features)
features[1:5, 1:15]
The sample metadata can be anything about the samples, but usually includes at least the sample type and study group of each sample. In QIIME, this is called a mapping file.
data(samples)
as_tibble(samples)
The taxonomy table contains taxonomic annotations for all the features (OTUs or otherwise), split by rank (Kingdom, Phylum, and so on).
data(taxa)
taxa[1:5, ]
For users of the tidyverse, it's frequently easiest to work with long-form melted datasets, where each row is a unique observation or data point. The unit of observation shared by these three datasets is the count of a feature in a particular sample. Therefore, we can create a dataframe where each row is a unique combination of feature + sample + count, with additional columns describing the sample and feature further.
(agg <- agglomerated(features, samples, taxa, "sample_id"))
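To make the shape of the result concrete, here is a minimal sketch of the same idea done by hand with tidyr and dplyr on toy data. It is not how agglomerated() is implemented; the toy_* objects are hypothetical, and the counts matrix is assumed to have samples as rows and features as columns.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy inputs (hypothetical; not the package's bundled data).
# Counts matrix: rows are samples, columns are features (an assumption).
toy_features <- matrix(
  c(10, 0, 5,
     3, 7, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("s1", "s2"), c("otu1", "otu2", "otu3"))
)
toy_samples <- tibble(
  sample_id   = c("s1", "s2"),
  study_group = c("case", "control")
)
toy_taxa <- tibble(
  otu_id = c("otu1", "otu2", "otu3"),
  Phylum = c("Bacteroidetes", "Firmicutes", "Firmicutes")
)

# Melt the matrix to one row per sample + feature, then attach
# the sample metadata and taxonomy as extra columns.
toy_agg <- toy_features %>%
  as.data.frame() %>%
  rownames_to_column("sample_id") %>%
  pivot_longer(-sample_id, names_to = "otu_id", values_to = "count") %>%
  left_join(toy_samples, by = "sample_id") %>%
  left_join(toy_taxa, by = "otu_id")

toy_agg
```

Every cell of the matrix becomes one row, so two samples by three features gives six rows, each carrying its own copy of the sample and taxonomy columns.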
While this may seem overly repetitive (the metadata is duplicated across many rows), R and dplyr handle it well. Things only start to break down with feature tables of more than 100,000,000 cells. What this buys you is the ability to use standard tidyverse verbs and operations easily.
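Because the long form has one row per cell of the counts matrix, you can estimate its size before converting. A rough sketch, using a hypothetical empty matrix in place of real counts:

```r
# Hypothetical 200-sample by 5000-feature matrix, for illustration only.
toy_counts <- matrix(0L, nrow = 200, ncol = 5000)

# The long-form dataframe will have one row per matrix cell.
n_cells <- prod(dim(toy_counts))

# Compare against the rough limit quoted above.
n_cells < 1e8
```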
Subsetting is easy with the filter verb:
filter(agg, study_group == "case")
filter(agg, Phylum == "Bacteroidetes")
Here's how to calculate proportions and add them to your dataframe:
agg <- agg %>%
  group_by(sample_id) %>%
  mutate(proportion = count / sum(count))

# Showing just a subset of the data:
select(agg, sample_id, otu_id, count, proportion)
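A quick sanity check is that the proportions within each sample sum to 1. The sketch below repeats the calculation on a small hypothetical table (toy_agg is not the package data) and asserts that property. Note that a sample whose counts are all zero would yield NaN, since 0/0 is NaN in R.

```r
library(dplyr)

# Hypothetical long-form counts: two samples, three features each.
toy_agg <- tibble(
  sample_id = c("s1", "s1", "s1", "s2", "s2", "s2"),
  otu_id    = rep(c("otu1", "otu2", "otu3"), 2),
  count     = c(10, 0, 5, 3, 7, 0)
)

# Per-sample proportions, as in the example above.
toy_props <- toy_agg %>%
  group_by(sample_id) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup()

# Each sample's proportions should sum to 1 (within floating-point error).
check <- toy_props %>%
  group_by(sample_id) %>%
  summarize(total = sum(proportion))

stopifnot(all(abs(check$total - 1) < 1e-8))
```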
Aggregate based on taxonomic rank:
agg %>%
  group_by(sample_id, Phylum) %>%
  summarize(count = sum(count)) %>%
  ungroup() %>%
  # re-add sample data that got lost in the summarizing
  left_join(get_samples(agg, sample_id, study_group, sample_type))
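An alternative to re-joining the sample data is to carry the sample-level columns through the grouping, which works because they are constant within each sample. A minimal sketch on hypothetical data (toy_agg is made up for illustration; the .groups argument assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# Hypothetical long-form table with one sample-level metadata column.
toy_agg <- tibble(
  sample_id   = c("s1", "s1", "s1", "s2", "s2", "s2"),
  study_group = c(rep("case", 3), rep("control", 3)),
  Phylum      = rep(c("Bacteroidetes", "Firmicutes", "Firmicutes"), 2),
  count       = c(10, 0, 5, 3, 7, 0)
)

# study_group is constant within a sample, so adding it to the grouping
# keeps it in the output without changing the groups themselves.
toy_phyla <- toy_agg %>%
  group_by(sample_id, study_group, Phylum) %>%
  summarize(count = sum(count), .groups = "drop")

toy_phyla
```

This is convenient when only a few metadata columns are needed; with many columns, the join shown above is likely to be tidier.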
Or find the most prevalent phyla in your study groups:
agg %>%
  ungroup() %>%
  group_by(study_group, Phylum) %>%
  summarize(mean_proportion = mean(proportion)) %>%
  top_n(1, mean_proportion)
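In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). The sketch below shows the same pattern with slice_max() on hypothetical data that is already aggregated to one row per sample and phylum (toy_agg and its values are made up for illustration).

```r
library(dplyr)

# Hypothetical phylum-level proportions: two samples per study group.
toy_agg <- tibble(
  study_group = rep(c("case", "control"), each = 4),
  sample_id   = rep(c("s1", "s2", "s3", "s4"), each = 2),
  Phylum      = rep(c("Bacteroidetes", "Firmicutes"), 4),
  proportion  = c(0.7, 0.3, 0.6, 0.4, 0.2, 0.8, 0.4, 0.6)
)

# Mean proportion per phylum within each study group; "drop_last" leaves
# the result grouped by study_group, so slice_max() picks one per group.
toy_top <- toy_agg %>%
  group_by(study_group, Phylum) %>%
  summarize(mean_proportion = mean(proportion), .groups = "drop_last") %>%
  slice_max(mean_proportion, n = 1) %>%
  ungroup()

toy_top
```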