options(width = 80) set.seed(1) # Knitr library(knitr) library(grid) opts_chunk$set(dev = 'png', fig.width = 7, fig.height = 7, warning = FALSE, message = FALSE)
Metabarcoding is revolutionizing microbial ecology and presenting new challenges:
MetacodeR is an R package that attempts to addresses these issues:
This is only a short introduction. See the full documentation at http://grunwaldlab.github.io/metacoder_documentation.
Most databases have a unique file format and taxonomic hierarchy/nomenclature. Taxonomic data can be extracted from any text-based file format using the extract_taxonomy function. Classifications can be parsed offline or retrieved from online databases if a taxon name, taxon ID, or sequence ID is present. A regular expression with capture groups and a corresponding key is used to define how to parse the file. The example code below parses a 1000 sequence subset of the 16s Ribosome Database Project training set for Mothur.
# Load the package library(metacoder) # Load the input FASTA file seqs <- seqinr::read.fasta(system.file("extdata", "mothur_16S_training_subset.fasta.gz", package = "metacoder")) # Print an example of the sequence headers cat(names(seqs)) # Extract the taxonomic information of the sequences data <- extract_taxonomy(seqs, regex = "^(.*)\\t(.*)", key = c(id = "obs_info", "class"), class_sep = ";")
The resulting object contains sequence information associated with an inferred taxonomic hierarchy. The standard print method of the object shows a part of every kind of data it contains:
taxon_funcs part contains function names (e.g.
n_obs) used to calculate taxon statistics.
You can see the results of these calculations using the
There is corresponding
obs_data function and
obs_funcs list, but they are not used in this case.
The hierarchical nature of taxonomic data makes it difficult to plot effectively. Most often, bar charts, stacked bar charts, or pie graphs are used, but these are ineffective when plotting many taxa or multiple ranks. MetacodeR maps taxonomic data (e.g. sequence abundance) to color/size of tree components in what we call a Metadiversity Plot:
heat_tree(data, node_size = n_obs, node_label = name, node_color = n_obs)
The default size range of nodes and edges displayed is optimized for each plot.
The legend represents the number of sequences for each taxon as both a color gradient and width of nodes.
To see the long list of available plotting options, type
Taxonomic data in the form used in MetacodeR can be manipulated using functions inspired by
For example, taxa can be subset using
filter, users can choose preserve or remove the subtaxa, supertaxa, and associated observation data of the selected taxa.
filter_taxa can be used to look at just the Archaea:
set.seed(1) heat_tree(filter_taxa(data, name == "Archaea", subtaxa = TRUE), node_size = n_obs, node_label = name, node_color = n_obs, layout = "fruchterman-reingold")
Any column displayed by
taxon_data can be used with
filter_taxa (and most other MetacodeR commands) as if it were a variable on its own.
To make the Archaea-Bacteria division more clear, the "Root" taxon can be removed, resulting in two separate trees:
subsetted <- filter_taxa(data, n_supertaxa > 0) set.seed(2) heat_tree(subsetted, node_size = n_obs, node_label = name, node_color = n_obs, tree_label = name)
Although observations (information in
obs_data) are typically assigned to the tips of the taxonomy, they can also be assigned to any internal node.
When a taxon is removed by a filtering operation, the observations assigned to it are reassigned to an unfiltered supertaxon by default.
This makes it easy to remove lower ranks of a taxonomy without discarding observations assigned to the tips:
set.seed(1) filter_taxa(data, n_supertaxa <= 4) %>% heat_tree(node_size = n_obs, node_label = name, node_color = n_obs)
There is also a
filter_obs function which can filter out observations and the taxa they are assigned to.
When calculating statistics for taxa, the amount of data should be balanced across taxa and there should be enough data per taxon to provide unbiased estimates.
Random samples from large reference databases are biased towards overrepresented taxa.
MetacodeR provides two ways to randomly sample taxonomic data.
taxonomic_sample is used to create taxonomically balanced random samples.
The acceptable range of sequence or subtaxa counts can be defined for each taxonomic rank; taxa with too few are excluded and taxa with too many are randomly subsampled.
The code below samples the data such that rank 5 taxa (i.e. those with 5 supertaxa) will have 1 sequence and rank 3 taxa (phyla) will have less than 10:
set.seed(1) sampled <- taxonomic_sample(subsetted, max_counts = c("3" = 10, "5" = 1), min_counts = c("5" = 1)) sampled <- filter_taxa(sampled, n_obs > 0, subtaxa = FALSE)
To better see that this worked, the plot below makes rank 3 node labels larger so that they are easier to read.
set.seed(3) heat_tree(sampled, node_size = n_obs, node_label_size = n_obs * ifelse(n_supertaxa == 3, 10, 1), edge_size = n_obs, node_label = n_obs, node_color = n_obs, tree_label = name)
Something similar can be accomplished using the
dplyr equivalents to
sample_frac by weighting the probability of sampling observations by the inverse of the number of observations:
set.seed(6) sample_n_obs(subsetted, size = 400, taxon_weight = 1 / n_obs, unobserved = FALSE) %>% heat_tree(node_size = n_obs, node_label = n_obs, overlap_avoidance = 0.5, node_color = n_obs, tree_label = name)
This document is only a short introduction to MetacodeR and there is much that is not covered. For more information, see our website at http://grunwaldlab.github.io/metacoder_documentation/ and our github repository at https://github.com/grunwaldlab/metacoder.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.