The documentation from earlier versions of microViz included an incorrect example of
taxatree_models
use. Specifically, it accidentally demonstrated log transforming abundance data before aggregation. Sincere apologies to anyone who followed this incorrect procedure.
Examples have been corrected in microViz docs and website for version 0.11.0 and later.
Please reach out to me with any questions about this issue.
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This article will give an example of statistical modelling of the abundances of individual taxa, and visual presentation of the results using microViz taxonomic association tree plots.
library(microViz) library(corncob) library(dplyr) library(ggplot2)
First we'll get some OTU abundance data from inflammatory bowel disease patients and controls from the corncob package.
data("ibd") ibd
We'll keep only the Ulcerative Colitis and Healthy Control samples, to
simplify the analyses for this example. We'll also remove the Species
rank information, as most OTUs in this dataset are not assigned to a
species. We'll also use tax_fix
to fill any gaps where the Genus is
unknown, with the family name or whatever higher rank classification is
known.
phylo <- ibd %>% ps_filter(DiseaseState %in% c("UC", "nonIBD")) %>% tax_mutate(Species = NULL) %>% tax_fix() phylo
Let's have a quick look at the sample data using the skimr
package.
phylo %>% samdat_tbl() %>% dplyr::mutate(across(where(is.character), as.factor)) %>% skimr::skim()
Let's make some sample data variables that are easier to use and compare in the statistical modelling ahead. We will convert dichotomous categorical variables into similar binary variables (values: 1 for true, or 0 for false). We will also scale and center the numeric variable for age.
phylo <- phylo %>% ps_mutate( UC = ifelse(DiseaseState == "UC", yes = 1, no = 0), female = ifelse(gender == "female", yes = 1, no = 0), antibiotics = ifelse(abx == "abx", yes = 1, no = 0), steroids = ifelse(steroids == "steroids", yes = 1, no = 0), age_scaled = scale(age, center = TRUE, scale = TRUE) )
We will start by creating a linear regression model for one genus,
Bacteroides. We will transform the count data by first making it
proportions with tax_transform("compositional")
.
We will replace the zeros with half the minimum observed abundance proportion
(of any taxon) before log2 transformation.
We do this second transformation step by passing trans = "log2"
to the tax_model function.
This two-step transformation is borrowed from MaAsLin2
(except in MaAsLin2 the compositional transformation is named
"Total Sum Scaling (TSS)").
parabacteroides_lm <- phylo %>% tax_fix() %>% tax_transform("compositional", rank = "Genus") %>% tax_model( type = "lm", rank = "Genus", taxa = "Parabacteroides", trans = "log2", trans_args = list(zero_replace = "halfmin"), variables = c("UC", "female", "antibiotics", "steroids", "age_scaled"), return_psx = FALSE ) parabacteroides_lm$Parabacteroides
summary(parabacteroides_lm$Parabacteroides)
This model suggests Parabacteroides abundances are significantly lower in Ulcerative Colitis patients than controls, on average.
Let's boxplot the transformed data to see what this Parabacteroides association looks like as a crude group difference.
plot_data <- phylo %>% tax_fix() %>% tax_transform("compositional", rank = "Genus") %>% tax_transform("log2", zero_replace = "halfmin", chain = TRUE) %>% ps_get() %>% ps_otu2samdat("Parabacteroides") %>% # adds Parabacteroides as sample data! samdat_tbl() ggplot(plot_data, aes(x = DiseaseState, y = Parabacteroides)) + geom_boxplot(width = 0.5, colour = "grey35") + geom_jitter(width = 0.2, alpha = 0.5) + scale_y_continuous( breaks = log2(1 / 2^(0:13)), labels = function(x) paste0(100 * round(2^x, digits = 5), "%"), limits = c(log2(0.00005), log2(0.25)) ) + theme_bw()
You can also use other regression modelling functions that take a
formula. For example the beta binomial modelling provided in the corncob
package. This approach models both abundance and dispersion, and
directly uses untransformed counts. By default, microViz's tax_model()
will use the same formula for both abundance and dispersion modelling,
but you can override this by setting the phi.formula
argument
yourself. See vignette("corncob-intro", package = "corncob")
for more
info on these models.
parabacteroides_bb <- phylo %>% tax_fix() %>% tax_model( type = corncob::bbdml, rank = "Genus", taxa = "Parabacteroides", variables = c("UC", "female", "antibiotics", "steroids", "age_scaled"), return_psx = FALSE ) parabacteroides_bb$Parabacteroides
Now we will fit a similar model for almost every taxon at every rank.
The code for taxatree_models
is quite similar to tax_model
. However,
you might need to run tax_prepend_ranks
to ensure that each taxon at
each rank is always unique. As an example of the problem, Actinobacteria
is the name of both a Phylum and a Class!
lm_models <- phylo %>% tax_fix() %>% tax_prepend_ranks() %>% # it makes sense to perform the compositional transformation BEFORE filtering tax_transform("compositional", rank = "Genus", keep_counts = TRUE) %>% tax_filter(min_prevalence = 0.1, undetected = 0, use_counts = TRUE) %>% taxatree_models( type = lm, trans = "log2", trans_args = list(zero_replace = "halfmin"), ranks = NULL, # uses every rank available except the first variables = c("UC", "female", "antibiotics", "steroids", "age_scaled") )
Why filter the taxa? It's less likely that we are interested in rare taxa, and models of rare taxon abundances are more likely to be unreliable. Reducing the the number of taxa modelled also makes visualising the results easier!
lm_models
Next we will get a data.frame containing the regression coefficient
estimates, test statistics and corresponding p values from all these
regression models. The function taxatree_models2stats()
can do this
for any type of model that has a broom::tidy()
method, as well as for
beta binomial regression models calculated with the corncob
package
bbdml()
function.
lm_stats <- taxatree_models2stats(lm_models) lm_stats
lm_stats %>% taxatree_stats_get()
Using the taxatree_stats_p_adjust()
function, you can correct for
multiple testing / control the false discovery rate or family-wise error
rate.
Instead of applying these adjustment methods across all 88 taxa models at all ranks, the default behaviour is to control the family-wise error rate per rank.
lm_stats <- taxatree_stats_p_adjust( data = lm_stats, method = "BH", grouping = "rank" ) # notice the new variable lm_stats %>% taxatree_stats_get()
taxatree_plots()
allows you to plot statistics (e.g. point estimates
and significance) from all of the taxa models onto a tree layout. The
taxon models are organised by rank, radiating out from the central root
node from e.g. Phyla around the center to Genera in the outermost ring.
taxatree_plots()
itself returns a list of plots, which you can arrange
into one figure with the
patchwork
package for example
(and/or
cowplot
).
lm_stats %>% taxatree_plots( node_size_range = c(1, 3), var_renamer = toupper ) %>% patchwork::wrap_plots( ncol = 2, guides = "collect" )
But how do we know which taxa are which nodes? We can create a labelled
grey tree with taxatree_plotkey
. This labels taxa based on certain
conditions.
set.seed(123) # label position key <- taxatree_plotkey( data = lm_stats, taxon_renamer = function(x) stringr::str_remove(x, "[PFG]: "), # 2 lines of conditions below, for filtering taxa to be labelled rank == "Phylum" | rank == "Genus" & prevalence > 0.25, !grepl("Kingdom", taxon) ) + # add a bit more space for the longer labels by expanding the x axis scale_x_continuous(expand = expansion(mult = 0.2)) # all phyla are labelled, and all genera with a prevalence of over 0.2 # except for any taxa whose names (partly) match "Kingdom" # (i.e. an unclassified taxon) key
Let's put the key and some of the trees together in one patchwork
figure. Getting the sizing right on these combined plots can be very
tricky! Pay attention to the absolute height and width of the plot
output.
gridExtra::grid.arrange()
or cowplot::plot_grid()
are alternatives
you can also try. cowplot::get_legend()
can be particularly useful.
trees <- lm_stats %>% taxatree_plots(node_size_range = c(1, 2.25)) %>% .[1:4] %>% patchwork::wrap_plots( ncol = 2, guides = "collect" ) panel <- patchwork::wrap_plots(key, trees, nrow = 1, widths = 8:7) set.seed(111) panel
You could save the plot with ggsave()
like this.
set.seed(111) ggsave("test.png", panel, width = 13, height = 5.5, dpi = 120, device = "png")
You can change the default styling of the labels by first suppressing
the automatic drawing of labels with .draw_label = FALSE
in
taxatree_plotkey()
and then adding your own custom-style labels with
taxatree_plot_labels()
. Here we will draw some yellow labels.
taxatree_plotkey( data = lm_stats, .draw_label = FALSE, rank %in% c("Phylum", "Family") & !grepl("Bacteria", taxon), prevalence > 0.2 | rank == "Phylum" ) %>% taxatree_plot_labels( taxon_renamer = function(x) stringr::str_remove(x, "[PFGO]: "), # default fun is ggrepel::geom_text_repel fun = ggrepel::geom_label_repel, # arguments modifying label style size = 2.5, alpha = 0.5, colour = "black", fill = "gold1", label.size = 0.05, label.r = unit(0.05, "lines"), label.padding = unit(0.15, "lines"), segment.size = 0.5, # arguments affecting label positioning box.padding = 0.05, x_nudge = 0.4, y_nudge = 0.05, hjust = 0.5, seed = 123 )
You can directly label taxatree_plots too, but it is better to only do
this for a few taxa. You must run taxatree_label()
first to create a
"label" indicator variable.
lm_stats %>% taxatree_label( rank == "Genus", p.value < 0.05 | prevalence > 0.5, estimate > 0 ) %>% taxatree_plots() %>% .[[1]] %>% # show only the first plot taxatree_plot_labels( taxon_renamer = function(x) stringr::str_remove(x, "G: "), fun = ggrepel::geom_label_repel, x_nudge = 0.7, hjust = 0.5, size = 2 )
Choosing another color palette is easy, just name any diverging palette from colorspace hcl diverging palettes. See your options below.
colorspace::hcl_palettes(type = "diverging", plot = TRUE, n = 11)
By default, the colour scale is transformed by taking the square root of the absolute values. But you can change this to "identity" to have no palette transformation.
By default the range of the data is used to set symmetrical limits on the colour scale, which are the same for all plots in the list produced. You can set alternative limits. If some data lie outside these limits, their values will be "squished" to match the max or min limit value.
For finer control of the palette luminance range, you can set custom values for l1 and l2, e.g. if the extremes are too bright or dark. This is done by default for the Green-Brown palette.
lm_stats %>% taxatree_label( rank == "Genus", p.value < 0.05 | prevalence > 0.5, estimate > 0 ) %>% taxatree_plots( colour_lims = c(-5, 5), colour_trans = "identity", palette = "Blue-Red 3", l2 = 90 ) %>% .[[1]] %>% # show only the first plot taxatree_plot_labels( taxon_renamer = function(x) stringr::str_remove(x, "G: "), fun = ggrepel::geom_label_repel, x_nudge = 0.7, hjust = 0.5, size = 2 )
Palettes like "Berlin" that go through a black midpoint would probably only make sense with a darker background!
lm_stats %>% taxatree_label( rank == "Genus", p.value < 0.05 | prevalence > 0.5, estimate > 0 ) %>% taxatree_plots( palette = "Berlin", colour_lims = c(-5, 5), size_guide = NULL ) %>% .[[1]] %>% # show only the first plot taxatree_plot_labels( taxon_renamer = function(x) stringr::str_remove(x, "G: "), fun = ggrepel::geom_label_repel, x_nudge = 0.7, xlim = c(-1.7, 1.5), hjust = 0.5, size = 2 ) + theme( text = element_text(colour = "white"), plot.background = element_rect(fill = "grey30"), plot.title = element_text(size = 20, colour = "white") )
If you like, you can sort the nodes by sorting the taxa in the ps_extra object.
lm_stats %>% tax_sort(by = "prev", at = "Genus") %>% taxatree_plots() %>% .[[1]] # show only the first plot
You can chain multiple tax_sort()
calls together to fine-tune the
order of the nodes on the tree to your own preference.
lm_stats %>% tax_sort(by = "prev", at = "Family") %>% tax_sort(by = "name", at = "Phylum") %>% tax_sort(by = "rev") %>% taxatree_plots() %>% .[[1]] # show only the first plot
Remember we made adjusted p values earlier? Let's plot those instead. Just to show how it's done, we'll also change the symbol used to identify the significant sites to a cross, and we'll also relax the significance threshold to 0.1.
It looks like only the disease state (having ulcerative colitis) shows any significant associations after this FDR correction.
lm_stats %>% taxatree_plots( sig_stat = "p.adj.BH.rank", sig_threshold = 0.1, sig_shape = "cross", sig_size = 1.5, node_size_range = c(1, 3), var_renamer = toupper ) %>% .[1:4] %>% # keep only first 4 plots patchwork::wrap_plots( ncol = 2, guides = "collect" )
You can also plot multiple significance markers. You must start with the strictest threshold. Here we will plot the FDR corrected significance markers at for p.adj \< 0.05 (as thick white crosses) and then also unadjusted significance markers for p \< 0.05 (as outlined white circles).
lm_stats %>% taxatree_plots( sig_stat = c("p.adj.BH.rank", "p.value"), sig_threshold = 0.05, sig_shape = c("cross", "circle filled"), sig_colour = "white", sig_size = c(1.5, 1), sig_stroke = c(1, 0.25), node_size_range = c(1, 3), var_renamer = toupper ) %>% .[1:4] %>% # keep only first 4 plots patchwork::wrap_plots( ncol = 2, guides = "collect" )
The corncob package provides beta-binomial regression models. See the
paper here, and
the helpful package vignette:
vignette("corncob-intro", package = "corncob")
.
We will filter the taxa more strictly (by a higher prevalence threshold) before this type of modelling. We do not need to transform the data, as this approach uses counts.
bb_models <- phylo %>% tax_fix() %>% tax_prepend_ranks() %>% tax_filter(min_prevalence = 0.3) %>% taxatree_models( type = corncob::bbdml, ranks = c("Phylum", "Class", "Order", "Family"), variables = c("UC", "female", "antibiotics", "steroids", "age_scaled") ) bb_models
When extracting stats from corncob beta-binomial models, you need to specify which parameter estimate you want, "mu" for differential abundance, or "phi" for differential variability or overdispersion.
bb_stats <- taxatree_models2stats(bb_models, param = "mu") bb_stats bb_stats %>% taxatree_stats_get()
bb_stats %>% taxatree_plots( node_size_range = c(1, 4), colour_trans = "identity" ) %>% # keep only first 4 plots .[1:4] %>% patchwork::wrap_plots( ncol = 2, guides = "collect" )
You do not need to make circular tree plots if you don't want to!
alt_trees <- bb_stats %>% taxatree_plots( node_size_range = c(1, 4), circular = FALSE, colour_trans = "identity" ) %>% # keep only first 4 plots .[1:4] %>% patchwork::wrap_plots( ncol = 2, guides = "collect" ) & # & is used by patchwork to modify multiple ggplots (instead of +) coord_flip() & scale_y_reverse() alt_trees
Let's add the key for this layout and label it manually with
taxatree_plot_labels()
.
alt_tree_key <- bb_stats %>% taxatree_plotkey(circular = FALSE, .draw_label = FALSE, rank == "Family") %>% taxatree_plot_labels( circular = FALSE, hjust = 0.5, force = 0, nudge_y = 2, size = 3, taxon_renamer = function(x) stringr::str_remove(x, "[PFG]: ") ) + coord_flip() + scale_y_reverse(expand = expansion(add = c(0.5, 1.5))) + theme(plot.title = element_text(size = 14)) patchwork::wrap_plots(alt_tree_key, alt_trees, nrow = 2, heights = 1:2)
You don't have to use a regular tree!
Alternative layouts from the igraph package are possible, such as the Kamada and Kawai spring algorithm ("kk") or Fruchterman and Reingold force-directed algorithm ("fr"). You must set a layout_seed number for these layouts to ensure they are always the same.
bb_stats %>% taxatree_plots( node_size_range = c(1, 4), colour_trans = "identity", layout = "kk", layout_seed = 321 ) %>% # keep only first 4 plots .[1:4] %>% patchwork::wrap_plots( ncol = 2, guides = "collect" )
devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.