An introduction to MetacodeR

options(width = 80)
# Knitr
opts_chunk$set(dev = 'png', fig.width = 7, fig.height = 7, warning = FALSE, message = FALSE)

Metabarcoding is revolutionizing microbial ecology and presenting new challenges:

MetacodeR is an R package that attempts to addresses these issues:


This is only a short introduction. See the full documentation at

Extracting taxonomic data

Most databases have a unique file format and taxonomic hierarchy/nomenclature. Taxonomic data can be extracted from any text-based file format using the extract_taxonomy function. Classifications can be parsed offline or retrieved from online databases if a taxon name, taxon ID, or sequence ID is present. A regular expression with capture groups and a corresponding key is used to define how to parse the file. The example code below parses a 1000 sequence subset of the 16s Ribosome Database Project training set for Mothur.

# Load the package
# Load the input FASTA file
seqs <- seqinr::read.fasta(system.file("extdata",
                                       package = "metacoder"))
# Print an example of the sequence headers
# Extract the taxonomic information of the sequences
data <- extract_taxonomy(seqs, regex = "^(.*)\\t(.*)",
                         key = c(id = "obs_info", "class"),
                         class_sep = ";")

The resulting object contains sequence information associated with an inferred taxonomic hierarchy. The standard print method of the object shows a part of every kind of data it contains:


Note the taxon_funcs part contains function names (e.g. n_obs) used to calculate taxon statistics. You can see the results of these calculations using the taxon_data function:


There is corresponding obs_data function and obs_funcs list, but they are not used in this case.

Heat trees

The hierarchical nature of taxonomic data makes it difficult to plot effectively. Most often, bar charts, stacked bar charts, or pie graphs are used, but these are ineffective when plotting many taxa or multiple ranks. MetacodeR maps taxonomic data (e.g. sequence abundance) to color/size of tree components in what we call a Metadiversity Plot:

heat_tree(data, node_size = n_obs, node_label = name, node_color = n_obs)

The default size range of nodes and edges displayed is optimized for each plot. The legend represents the number of sequences for each taxon as both a color gradient and width of nodes. To see the long list of available plotting options, type ?heat_tree.


Taxonomic data in the form used in MetacodeR can be manipulated using functions inspired by dplyr. For example, taxa can be subset using filter_taxa. Unlike the dplyr function filter, users can choose preserve or remove the subtaxa, supertaxa, and associated observation data of the selected taxa. For example, filter_taxa can be used to look at just the Archaea:

heat_tree(filter_taxa(data, name == "Archaea", subtaxa = TRUE),
          node_size = n_obs, node_label = name, 
          node_color = n_obs, layout = "fruchterman-reingold")

Any column displayed by taxon_data can be used with filter_taxa (and most other MetacodeR commands) as if it were a variable on its own. To make the Archaea-Bacteria division more clear, the "Root" taxon can be removed, resulting in two separate trees:

subsetted <- filter_taxa(data, n_supertaxa > 0)
heat_tree(subsetted, node_size = n_obs, node_label = name,
          node_color = n_obs, tree_label = name)

Although observations (information in obs_data) are typically assigned to the tips of the taxonomy, they can also be assigned to any internal node. When a taxon is removed by a filtering operation, the observations assigned to it are reassigned to an unfiltered supertaxon by default. This makes it easy to remove lower ranks of a taxonomy without discarding observations assigned to the tips:

filter_taxa(data, n_supertaxa <= 4) %>%
  heat_tree(node_size = n_obs, node_label = name, node_color = n_obs)

There is also a filter_obs function which can filter out observations and the taxa they are assigned to.


When calculating statistics for taxa, the amount of data should be balanced across taxa and there should be enough data per taxon to provide unbiased estimates. Random samples from large reference databases are biased towards overrepresented taxa. MetacodeR provides two ways to randomly sample taxonomic data. The function taxonomic_sample is used to create taxonomically balanced random samples. The acceptable range of sequence or subtaxa counts can be defined for each taxonomic rank; taxa with too few are excluded and taxa with too many are randomly subsampled. The code below samples the data such that rank 5 taxa (i.e. those with 5 supertaxa) will have 1 sequence and rank 3 taxa (phyla) will have less than 10:

sampled <- taxonomic_sample(subsetted, max_counts = c("3" = 10, "5" = 1), min_counts = c("5" = 1))
sampled <- filter_taxa(sampled, n_obs > 0, subtaxa = FALSE) 

To better see that this worked, the plot below makes rank 3 node labels larger so that they are easier to read.

          node_size = n_obs,
          node_label_size = n_obs * ifelse(n_supertaxa == 3, 10, 1),
          edge_size = n_obs, 
          node_label = n_obs,
          node_color = n_obs,
          tree_label = name)

Something similar can be accomplished using the dplyr equivalents to sample_n and sample_frac by weighting the probability of sampling observations by the inverse of the number of observations:

sample_n_obs(subsetted, size = 400, taxon_weight = 1 / n_obs, unobserved = FALSE) %>%
  heat_tree(node_size = n_obs, node_label = n_obs, overlap_avoidance = 0.5,
            node_color = n_obs, tree_label = name)

More inforamtion

This document is only a short introduction to MetacodeR and there is much that is not covered. For more information, see our website at and our github repository at

Try the metacoder package in your browser

Any scripts or data that you put into this service are public.

metacoder documentation built on May 30, 2017, 3:11 a.m.