library(rbiom) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", R.options = list( pillar.print_min = 5, pillar.print_max = 5 ))
Visualizations are one of the best ways to identify correlations in your dataset. If you can see a trend with your eyes, then you're on the right track. Rbiom's plotting functions are
cat(glue::glue(' <table class="table-borderless"> <tr>{adiv_boxplot}{bdiv_corrplot}{rare_stacked}{taxa_boxplot}</tr> <tr>{adiv_corrplot}{bdiv_heatmap}{rare_corrplot}{taxa_corrplot}</tr> <tr>{bdiv_boxplot}{bdiv_ord_plot}{taxa_stacked}{taxa_heatmap}</tr> </table>', .transformer = function (text, ...) { sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text) }))
The *_boxplot()
, *_corrplot()
, and bdiv_ord_plot()
functions will
automatically add p-values to your figures whenever possible. The ggplot object
they return has $data
, $code
, $stats
, and $stats$code
attributes you
can use to automate, reproduce, and customize your figures.
To generate statistics without creating a plot, use one of the following:
cat(glue::glue(' <table class="table-borderless"> <tr>{adiv_stats}{bdiv_stats}{distmat_stats}{taxa_stats}</tr> </table>', .transformer = function (text, ...) { sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text) }))
\cr
Your metadata field of interest and microbiome property will determine which rbiom function to use.
Metadata Property |
Microbiome Property | ||
---|---|---|---|
Alpha Diversity Shannon, Simpson |
Beta Diversity UniFrac, Jaccard |
Taxa Abundance Phylum, Genus |
|
Categorical Sex, Body Site |
`adiv_boxplot()` | `bdiv_boxplot()` `bdiv_ord_plot()` |
`taxa_boxplot()` |
Numeric Age, BMI |
`adiv_corrplot()` | `bdiv_corrplot()` | `taxa_corrplot()` |
Any | `adiv_stats()` | `bdiv_stats()` | `taxa_stats()` |
For instance, to explore the effect of Body Site (a categorical metadata field)
on Shannon Diversity (an alpha diversity metric), we'd use adiv_boxplot()
to
produce a plot with statistics, or adiv_stats()
if we only want the stats.
Although all the functions have important differences, the statistical methods they employ can be grouped by the three plot types: ordination plot, box plot, and correlation plot.
\cr
Statistics for ordination plots are the most straight-forward. Set a
categorical metadata field to the color.by
parameter to test whether
inter-sample distances are correlated with that variable.
p <- bdiv_ord_plot( biom = rarefy(hmp50), color.by = "Body Site", bdiv = c("Jaccard", "Bray-Curtis"), ord = c("PCoA", "UMAP") ) p p$stats p$stats$code
The plot subtitles have the summary statistics. Additionally, p$stats
contains a tibble data.frame with the full statistics table, and p$stats$code
shows the R commands for reproducing the statistics outside of rbiom.
Note that the ordination statistics are not dependent on the ordination, only the distance metric. This is because the statistics are based on beta diversity distances which are computed prior to ordination.
By default, bdiv_ord_plot()
applies the perMANOVA test. You can change this
to MRPP by specifying test="mrpp"
. Details on the available tests are below.
| Test | Function | Method |
| --------- | ------------------ | ----------------------------------------------------------- |
| adonis2
| vegan::adonis2()
| Permutational Multivariate Analysis of Variance (perMANOVA) |
| mrpp
| vegan::mrpp()
| Multiple Response Permutation Procedure (MRPP) |
\cr
Statistics on box plots will automatically toggle between pairwise and
group-wise statistics based on the values of x
and color.by
: x
controls
pairwise and color.by
controls group-wise. You can set x
and color.by
to
the same categorical metadata field to get colored pairwise statistics, or set
them to different categorical metadata fields to get multiple group-wise
statistics per plot.
biom <- rarefy(hmp50) %>% subset(`Body Site` %in% c('Saliva', 'Stool', 'Buccal mucosa')) p1 <- adiv_boxplot(biom, x = "Body Site") p2 <- adiv_boxplot(biom, color.by = "Body Site") p3 <- adiv_boxplot(biom, x = "Body Site", color.by = "Body Site") p4 <- adiv_boxplot(biom, x = "Sex", color.by = "Body Site") plots <- list( p1 + ggplot2::labs(subtitle = 'x = "Body Site"'), p2 + ggplot2::labs(subtitle = 'color.by = "Body Site"'), p3 + ggplot2::labs(subtitle = 'x = color.by = "Body Site"'), p4 + ggplot2::labs(subtitle = 'x = "Sex", color.by = "Body Site"')) %>% lapply(`+`, ggplot2::labs(x = NULL, y = NULL, caption = NULL)) patchwork::wrap_plots(plots, guides = "collect")
Above, the two plots on the left are annotated with pairwise statistics while the two on the right have group-wise statistics. As with other plots, you can find the full statistics tables and reproducible R code in the plot attributes.
p1$stats p2$stats p2$stats$code
Internally, rbiom uses the non-parametric functions listed below.
| Test | Function | Method |
| ------------- | ----------------------- | -------------------------------------------------------- |
| pairwise | stats::wilcox.test()
| Two-sample Wilcoxon Rank Sum Test, aka Mann-Whitney Test |
| group-wise | stats::kruskal.test()
| Kruskal-Wallis Rank Sum Test |
\cr
A normal distribution is visualized as a "bell curve", where values further from the mean are observed less often. Microbial abundances do not follow this pattern; it's common to observe high or low abundances more often than a "medium" abundance.
library(ggplot2) patchwork::wrap_plots( widths = c(1, 1.5), ggplot() + geom_histogram(aes(x=rnorm(1000)), bins = 10) + ggtitle("Normal Distribution"), ggplot(data = taxa_table(rarefy(hmp50), taxa = 4)) + geom_histogram(aes(x=.abundance), bins = 10) + facet_wrap(".taxa") + ggtitle("Genera Abundance Distributions") )
To compensate for this non-normality, rbiom uses the following non-parametric tests for categorical variables that are based on ranking or permutations.
| Test | Function | Used For |
| ----------------------- | ----------------------- | -------------------------- |
| Wilcoxon Rank-Sum | stats::wilcox.test()
| Pairwise boxplot |
| Kruskal-Wallis Rank Sum | stats::kruskal.test()
| Groupwise boxplot |
| Permutational MANOVA | vegan::adonis2()
| bdiv_ord_plot()
clusters |
For correlation/regression analysis, rbiom uses estimated marginal means
(emmeans::emmeans()
) on rank-transformed values.
Further reading:
\cr
Compositional data arises when the counts don't represent the entire population. In microbiome studies, the number of microbes that get sequenced is far less than the number of microbes from where the sample was collected. Articles by Gloor et al and McMurdie and Holmes propose the use of their analysis tools ( ALDEx2 and metagenomeSeq, respectively) to apply the proper statistical methods for this situation. Conversely, rbiom does not correct for compositionality. This is because correcting for compositionality introduces extra noise into your dataset and severely limits your selection of metrics and visualizations, typically without any significant benefit to analysis.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.