Warning: do not copy taxatree_models examples from versions before 0.11.0

The documentation from earlier versions of microViz included an incorrect example of taxatree_models use. Specifically, it accidentally demonstrated log transforming abundance data before aggregation. Sincere apologies to anyone who followed this incorrect procedure. Examples have been corrected in microViz docs and website for version 0.11.0 and later. Please reach out to me with any questions about this issue.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This article will give an example of statistical modelling of the abundances of individual taxa, and visual presentation of the results using microViz taxonomic association tree plots.

Setup

library(microViz)
library(corncob)
library(dplyr)
library(ggplot2)

First we'll get some OTU abundance data from inflammatory bowel disease patients and controls from the corncob package.

data("ibd")
ibd

We'll keep only the Ulcerative Colitis and Healthy Control samples, to simplify the analyses for this example. We'll also remove the Species rank information, as most OTUs in this dataset are not assigned to a species. We'll also use tax_fix to fill any gaps where the Genus is unknown, with the family name or whatever higher rank classification is known.

phylo <- ibd %>%
  ps_filter(DiseaseState %in% c("UC", "nonIBD")) %>%
  tax_mutate(Species = NULL) %>%
  tax_fix()
phylo

Let's have a quick look at the sample data using the skimr package.

phylo %>%
  samdat_tbl() %>%
  dplyr::mutate(across(where(is.character), as.factor)) %>%
  skimr::skim()

Let's make some sample data variables that are easier to use and compare in the statistical modelling ahead. We will convert dichotomous categorical variables into similar binary variables (values: 1 for true, or 0 for false). We will also scale and center the numeric variable for age.

phylo <- phylo %>%
  ps_mutate(
    UC = ifelse(DiseaseState == "UC", yes = 1, no = 0),
    female = ifelse(gender == "female", yes = 1, no = 0),
    antibiotics = ifelse(abx == "abx", yes = 1, no = 0),
    steroids = ifelse(steroids == "steroids", yes = 1, no = 0),
    age_scaled = scale(age, center = TRUE, scale = TRUE)
  )

Modelling one taxon at a time

TSS log2 linear regression

We will start by creating a linear regression model for one genus, Bacteroides. We will transform the count data by first making it proportions with tax_transform("compositional"). We will replace the zeros with half the minimum observed abundance proportion (of any taxon) before log2 transformation. We do this second transformation step by passing trans = "log2" to the tax_model function. This two-step transformation is borrowed from MaAsLin2 (except in MaAsLin2 the compositional transformation is named "Total Sum Scaling (TSS)").

parabacteroides_lm <- phylo %>%
  tax_fix() %>%
  tax_transform("compositional", rank = "Genus") %>%
  tax_model(
    type = "lm", rank = "Genus", taxa = "Parabacteroides",
    trans = "log2", trans_args = list(zero_replace = "halfmin"),
    variables = c("UC", "female", "antibiotics", "steroids", "age_scaled"),
    return_psx = FALSE
  )
parabacteroides_lm$Parabacteroides
summary(parabacteroides_lm$Parabacteroides)

This model suggests Parabacteroides abundances are significantly lower in Ulcerative Colitis patients than controls, on average.

Plotting TSS log2 data

Let's boxplot the transformed data to see what this Parabacteroides association looks like as a crude group difference.

plot_data <- phylo %>%
  tax_fix() %>%
  tax_transform("compositional", rank = "Genus") %>%
  tax_transform("log2", zero_replace = "halfmin", chain = TRUE) %>%
  ps_get() %>%
  ps_otu2samdat("Parabacteroides") %>% # adds Parabacteroides as sample data!
  samdat_tbl()

ggplot(plot_data, aes(x = DiseaseState, y = Parabacteroides)) +
  geom_boxplot(width = 0.5, colour = "grey35") +
  geom_jitter(width = 0.2, alpha = 0.5) +
  scale_y_continuous(
    breaks = log2(1 / 2^(0:13)),
    labels = function(x) paste0(100 * round(2^x, digits = 5), "%"),
    limits = c(log2(0.00005), log2(0.25))
  ) +
  theme_bw()

Beta binomial regression

You can also use other regression modelling functions that take a formula. For example the beta binomial modelling provided in the corncob package. This approach models both abundance and dispersion, and directly uses untransformed counts. By default, microViz's tax_model() will use the same formula for both abundance and dispersion modelling, but you can override this by setting the phi.formula argument yourself. See vignette("corncob-intro", package = "corncob") for more info on these models.

parabacteroides_bb <- phylo %>%
  tax_fix() %>%
  tax_model(
    type = corncob::bbdml, rank = "Genus", taxa = "Parabacteroides",
    variables = c("UC", "female", "antibiotics", "steroids", "age_scaled"),
    return_psx = FALSE
  )
parabacteroides_bb$Parabacteroides

Model all the taxa!

Now we will fit a similar model for almost every taxon at every rank. The code for taxatree_models is quite similar to tax_model. However, you might need to run tax_prepend_ranks to ensure that each taxon at each rank is always unique. As an example of the problem, Actinobacteria is the name of both a Phylum and a Class!

lm_models <- phylo %>%
  tax_fix() %>%
  tax_prepend_ranks() %>%
  # it makes sense to perform the compositional transformation BEFORE filtering
  tax_transform("compositional", rank = "Genus", keep_counts = TRUE) %>%
  tax_filter(min_prevalence = 0.1, undetected = 0, use_counts = TRUE) %>%
  taxatree_models(
    type = lm,
    trans = "log2", trans_args = list(zero_replace = "halfmin"),
    ranks = NULL, # uses every rank available except the first
    variables = c("UC", "female", "antibiotics", "steroids", "age_scaled")
  )

Why filter the taxa? It's less likely that we are interested in rare taxa, and models of rare taxon abundances are more likely to be unreliable. Reducing the the number of taxa modelled also makes visualising the results easier!

lm_models

Getting stats from the models

Next we will get a data.frame containing the regression coefficient estimates, test statistics and corresponding p values from all these regression models. The function taxatree_models2stats() can do this for any type of model that has a broom::tidy() method, as well as for beta binomial regression models calculated with the corncob package bbdml() function.

lm_stats <- taxatree_models2stats(lm_models)
lm_stats
lm_stats %>% taxatree_stats_get()

Adjusting p values

Using the taxatree_stats_p_adjust() function, you can correct for multiple testing / control the false discovery rate or family-wise error rate.

Instead of applying these adjustment methods across all 88 taxa models at all ranks, the default behaviour is to control the family-wise error rate per rank.

lm_stats <- taxatree_stats_p_adjust(
  data = lm_stats, method = "BH", grouping = "rank"
)
# notice the new variable
lm_stats %>% taxatree_stats_get()

Plot all the taxatree_stats!

taxatree_plots() allows you to plot statistics (e.g. point estimates and significance) from all of the taxa models onto a tree layout. The taxon models are organised by rank, radiating out from the central root node from e.g. Phyla around the center to Genera in the outermost ring.

taxatree_plots() itself returns a list of plots, which you can arrange into one figure with the patchwork package for example (and/or cowplot).

lm_stats %>%
  taxatree_plots(
    node_size_range = c(1, 3), var_renamer = toupper
  ) %>%
  patchwork::wrap_plots(
    ncol = 2, guides = "collect"
  )

Taxatree Key

But how do we know which taxa are which nodes? We can create a labelled grey tree with taxatree_plotkey. This labels taxa based on certain conditions.

set.seed(123) # label position
key <- taxatree_plotkey(
  data = lm_stats,
  taxon_renamer = function(x) stringr::str_remove(x, "[PFG]: "),
  # 2 lines of conditions below, for filtering taxa to be labelled
  rank == "Phylum" | rank == "Genus" & prevalence > 0.25,
  !grepl("Kingdom", taxon)
) +
  # add a bit more space for the longer labels by expanding the x axis
  scale_x_continuous(expand = expansion(mult = 0.2))
# all phyla are labelled, and all genera with a prevalence of over 0.2
# except for any taxa whose names (partly) match "Kingdom"
# (i.e. an unclassified taxon)
key

Key + Trees

Let's put the key and some of the trees together in one patchwork figure. Getting the sizing right on these combined plots can be very tricky! Pay attention to the absolute height and width of the plot output.

gridExtra::grid.arrange() or cowplot::plot_grid() are alternatives you can also try. cowplot::get_legend() can be particularly useful.

trees <- lm_stats %>%
  taxatree_plots(node_size_range = c(1, 2.25)) %>%
  .[1:4] %>%
  patchwork::wrap_plots(
    ncol = 2, guides = "collect"
  )

panel <- patchwork::wrap_plots(key, trees, nrow = 1, widths = 8:7)
set.seed(111)
panel

You could save the plot with ggsave() like this.

set.seed(111)
ggsave("test.png", panel, width = 13, height = 5.5, dpi = 120, device = "png")

Alternative label styling

You can change the default styling of the labels by first suppressing the automatic drawing of labels with .draw_label = FALSE in taxatree_plotkey() and then adding your own custom-style labels with taxatree_plot_labels(). Here we will draw some yellow labels.

taxatree_plotkey(
  data = lm_stats, .draw_label = FALSE,
  rank %in% c("Phylum", "Family") & !grepl("Bacteria", taxon),
  prevalence > 0.2 | rank == "Phylum"
) %>%
  taxatree_plot_labels(
    taxon_renamer = function(x) stringr::str_remove(x, "[PFGO]: "),
    # default fun is ggrepel::geom_text_repel
    fun = ggrepel::geom_label_repel,
    # arguments modifying label style
    size = 2.5, alpha = 0.5, colour = "black", fill = "gold1",
    label.size = 0.05, label.r = unit(0.05, "lines"),
    label.padding = unit(0.15, "lines"), segment.size = 0.5,
    # arguments affecting label positioning
    box.padding = 0.05, x_nudge = 0.4, y_nudge = 0.05,
    hjust = 0.5, seed = 123
  )

Directly labelling taxa

You can directly label taxatree_plots too, but it is better to only do this for a few taxa. You must run taxatree_label() first to create a "label" indicator variable.

lm_stats %>%
  taxatree_label(
    rank == "Genus", p.value < 0.05 | prevalence > 0.5, estimate > 0
  ) %>%
  taxatree_plots() %>%
  .[[1]] %>% # show only the first plot
  taxatree_plot_labels(
    taxon_renamer = function(x) stringr::str_remove(x, "G: "),
    fun = ggrepel::geom_label_repel, x_nudge = 0.7, hjust = 0.5, size = 2
  )

Changing color palette

Choosing another color palette is easy, just name any diverging palette from colorspace hcl diverging palettes. See your options below.

colorspace::hcl_palettes(type = "diverging", plot = TRUE, n = 11)

By default, the colour scale is transformed by taking the square root of the absolute values. But you can change this to "identity" to have no palette transformation.

By default the range of the data is used to set symmetrical limits on the colour scale, which are the same for all plots in the list produced. You can set alternative limits. If some data lie outside these limits, their values will be "squished" to match the max or min limit value.

For finer control of the palette luminance range, you can set custom values for l1 and l2, e.g. if the extremes are too bright or dark. This is done by default for the Green-Brown palette.

lm_stats %>%
  taxatree_label(
    rank == "Genus", p.value < 0.05 | prevalence > 0.5, estimate > 0
  ) %>%
  taxatree_plots(
    colour_lims = c(-5, 5), colour_trans = "identity",
    palette = "Blue-Red 3", l2 = 90
  ) %>%
  .[[1]] %>% # show only the first plot
  taxatree_plot_labels(
    taxon_renamer = function(x) stringr::str_remove(x, "G: "),
    fun = ggrepel::geom_label_repel, x_nudge = 0.7, hjust = 0.5, size = 2
  )

Palettes like "Berlin" that go through a black midpoint would probably only make sense with a darker background!

lm_stats %>%
  taxatree_label(
    rank == "Genus", p.value < 0.05 | prevalence > 0.5, estimate > 0
  ) %>%
  taxatree_plots(
    palette = "Berlin", colour_lims = c(-5, 5), size_guide = NULL
  ) %>%
  .[[1]] %>% # show only the first plot
  taxatree_plot_labels(
    taxon_renamer = function(x) stringr::str_remove(x, "G: "),
    fun = ggrepel::geom_label_repel,
    x_nudge = 0.7, xlim = c(-1.7, 1.5), hjust = 0.5, size = 2
  ) +
  theme(
    text = element_text(colour = "white"),
    plot.background = element_rect(fill = "grey30"),
    plot.title = element_text(size = 20, colour = "white")
  )

Sorting taxa nodes

If you like, you can sort the nodes by sorting the taxa in the ps_extra object.

lm_stats %>%
  tax_sort(by = "prev", at = "Genus") %>%
  taxatree_plots() %>%
  .[[1]] # show only the first plot

You can chain multiple tax_sort() calls together to fine-tune the order of the nodes on the tree to your own preference.

lm_stats %>%
  tax_sort(by = "prev", at = "Family") %>%
  tax_sort(by = "name", at = "Phylum") %>%
  tax_sort(by = "rev") %>%
  taxatree_plots() %>%
  .[[1]] # show only the first plot

Plotting adjusted p values

Remember we made adjusted p values earlier? Let's plot those instead. Just to show how it's done, we'll also change the symbol used to identify the significant sites to a cross, and we'll also relax the significance threshold to 0.1.

It looks like only the disease state (having ulcerative colitis) shows any significant associations after this FDR correction.

lm_stats %>%
  taxatree_plots(
    sig_stat = "p.adj.BH.rank", sig_threshold = 0.1,
    sig_shape = "cross", sig_size = 1.5,
    node_size_range = c(1, 3), var_renamer = toupper
  ) %>%
  .[1:4] %>% # keep only first 4 plots
  patchwork::wrap_plots(
    ncol = 2, guides = "collect"
  )

Plotting multiple significance markers

You can also plot multiple significance markers. You must start with the strictest threshold. Here we will plot the FDR corrected significance markers at for p.adj \< 0.05 (as thick white crosses) and then also unadjusted significance markers for p \< 0.05 (as outlined white circles).

lm_stats %>%
  taxatree_plots(
    sig_stat = c("p.adj.BH.rank", "p.value"), sig_threshold = 0.05,
    sig_shape = c("cross", "circle filled"), sig_colour = "white",
    sig_size = c(1.5, 1), sig_stroke = c(1, 0.25),
    node_size_range = c(1, 3), var_renamer = toupper
  ) %>%
  .[1:4] %>% # keep only first 4 plots
  patchwork::wrap_plots(
    ncol = 2, guides = "collect"
  )

Beta-binomial regression example

The corncob package provides beta-binomial regression models. See the paper here, and the helpful package vignette: vignette("corncob-intro", package = "corncob").

We will filter the taxa more strictly (by a higher prevalence threshold) before this type of modelling. We do not need to transform the data, as this approach uses counts.

bb_models <- phylo %>%
  tax_fix() %>%
  tax_prepend_ranks() %>%
  tax_filter(min_prevalence = 0.3) %>%
  taxatree_models(
    type = corncob::bbdml,
    ranks = c("Phylum", "Class", "Order", "Family"),
    variables = c("UC", "female", "antibiotics", "steroids", "age_scaled")
  )
bb_models

When extracting stats from corncob beta-binomial models, you need to specify which parameter estimate you want, "mu" for differential abundance, or "phi" for differential variability or overdispersion.

bb_stats <- taxatree_models2stats(bb_models, param = "mu")
bb_stats
bb_stats %>% taxatree_stats_get()
bb_stats %>%
  taxatree_plots(
    node_size_range = c(1, 4), colour_trans = "identity"
  ) %>%
  # keep only first 4 plots
  .[1:4] %>%
  patchwork::wrap_plots(
    ncol = 2, guides = "collect"
  )

Alternative layouts

You do not need to make circular tree plots if you don't want to!

alt_trees <- bb_stats %>%
  taxatree_plots(
    node_size_range = c(1, 4), circular = FALSE, colour_trans = "identity"
  ) %>%
  # keep only first 4 plots
  .[1:4] %>%
  patchwork::wrap_plots(
    ncol = 2, guides = "collect"
  ) & # & is used by patchwork to modify multiple ggplots (instead of +)
  coord_flip() &
  scale_y_reverse()

alt_trees

Let's add the key for this layout and label it manually with taxatree_plot_labels().

alt_tree_key <- bb_stats %>%
  taxatree_plotkey(circular = FALSE, .draw_label = FALSE, rank == "Family") %>%
  taxatree_plot_labels(
    circular = FALSE, hjust = 0.5, force = 0, nudge_y = 2, size = 3,
    taxon_renamer = function(x) stringr::str_remove(x, "[PFG]: ")
  ) +
  coord_flip() +
  scale_y_reverse(expand = expansion(add = c(0.5, 1.5))) +
  theme(plot.title = element_text(size = 14))

patchwork::wrap_plots(alt_tree_key, alt_trees, nrow = 2, heights = 1:2)

You don't have to use a regular tree!

Alternative layouts from the igraph package are possible, such as the Kamada and Kawai spring algorithm ("kk") or Fruchterman and Reingold force-directed algorithm ("fr"). You must set a layout_seed number for these layouts to ensure they are always the same.

bb_stats %>%
  taxatree_plots(
    node_size_range = c(1, 4),
    colour_trans = "identity", layout = "kk", layout_seed = 321
  ) %>%
  # keep only first 4 plots
  .[1:4] %>%
  patchwork::wrap_plots(
    ncol = 2, guides = "collect"
  )

Session info

devtools::session_info()


david-barnett/microViz documentation built on April 17, 2025, 4:25 a.m.