library(rbiom) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", R.options = list( pillar.print_min = 5, pillar.print_max = 5 ))
The rbiom package includes many statistical functions. If you have an rbiom object, you can use dedicated functions for alpha diversity, beta diversity, and taxa abundance. Otherwise, look to the generic functions that operate on any data.frame or distance matrix.
Your metadata field and microbiome property of interest will determine which rbiom function to use.
Metadata Field |
Microbiome Property | ||
---|---|---|---|
Alpha Diversity Shannon, Simpson |
Beta Diversity UniFrac, Jaccard |
Taxa Abundance Phylum, Genus |
|
Categorical Sex, Body Site |
`adiv_boxplot()` | `bdiv_boxplot()` `bdiv_ord_plot()` |
`taxa_boxplot()` |
Numeric Age, BMI |
`adiv_corrplot()` | `bdiv_corrplot()` | `taxa_corrplot()` |
Any | `adiv_stats()` | `bdiv_stats()` | `taxa_stats()` |
For instance, to explore the effect of Body Site (a categorical metadata field)
on Shannon Diversity (an alpha diversity metric), we'd use adiv_boxplot()
to
produce a plot with statistics, or adiv_stats()
if we only want the stats.
If your data is in a data.frame (or tibble), use:
cat(glue::glue(' <table class="table-borderless"> <tr>{stats_boxplot}{stats_corrplot}{stats_table}</tr> </table>', .transformer = function (text, ...) { sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text) }))
Or, for a distance matrix:
cat(glue::glue(' <table class="table-borderless"> <tr>{distmat_stats}</tr> </table>', .transformer = function (text, ...) { sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text) }))
The function stats_table()
and functions ending in "_stats" (e.g. adiv_stats()
, distmat_stats()
) will return a statistics table. Functions with "plot" in their name (e.g. adiv_boxplot()
, stats_corrplot()
) will return a plot object p
. You can access the plot's associated statistics table with p$stats
.
p <- adiv_boxplot( biom = rarefy(hmp50), # Dataset as an rbiom object adiv = c("Shannon", "Simpson"), # Alpha diversity metrics stat.by = "Body Site", # Statistical groups facet.by = "Sex" ) # Split data prior to stats p$stats
This table has the following information:
In the upper-left corner, we see Model: kruskal.test(.diversity ~ `Body Site`)
. This tells us that the underlying test used here is base R's kruskal.test()
, grouping values of .diversity
by `Body Site`
. In rbiom, .diversity
is the default name for the column containing alpha diversity values.
The above table has four rows, meaning we ran data through the model four times. In this case, it was two alpha diversity metrics times two Sex
facets. The first two columns of the table (Sex
and .adiv
) tell us which row is for each combination. Additionally, the .adj.p
column is the .p.val
adjusted for multiple comparisons.
The three columns .stat
, .h1
, and .p.val
together show the hypothesis and its outcome. Above, we're asking if .stat
is greater than zero. When .p.val
is less than 0.05 we can say that it is.
That's the very simplified explanation.
Recall from your statistic classes that you are testing a null hypothesis H0 against an alternate hypothesis H1. In this case, our null hypothesis is that the Kruskal-Wallis statistic is zero (.stat == 0
), indicating all `Body Site`
groups have similar alpha diversity. The p-value is the probability that the null hypothesis is correct (a p-value of 0.6 is interpreted as a 60% chance that .stat == 0
). When the p-value is below a certain value (usually 0.05) we accept the alternative hypothesis instead - in this case that .stat > 0
, meaning .diversity
does vary by `Body Site`
.
The last two columns are:
.n
- The number of samples. After rarefying, we have 30 females and 19 males..df
- Degrees of freedom. One less than the number of groups. The HMP50 dataset has five body sites: Anterior nares, buccal mucosa, mid vagina, saliva, and stool. However, the male facets don't have any mid vagina samples.You can inspect the data passed to kruskal.test()
by looking at p$data
. Additionally, the code used to generate the statistics is available in p$stats$code
. These attributes can help you reproduce and customize your data analysis.
p$data p$stats$code
When generating statistics with st <- adiv_stats()
, bdiv_stats()
, or taxa_stats()
, the data and code are in st$data
and st$code
, respectively. For st <- stats_table(df = df)
, the data and code are in df
and st$code
, respectively.
Care has been taken to keep rbiom's statistics tables consistent across functions. However, some tables will provide more information when it is available from the underlying statistical function.
Below is a quick reference guide to all columns that may appear in an rbiom statistics table.
| Field | Description |
| -------------- | ----------------------------------------------------- |
| .stat
| Wilcoxon or Kruskal-Wallis rank sum statistic. |
| .mean
| Estimated marginal mean. See emmeans::emmeans()
. |
| .mean.diff
| Difference in means. |
| .slope
| Trendline slope. See emmeans::emtrends()
. |
| .slope.diff
| Difference in slopes. |
| .h1
| Alternate hypothesis. |
| .p.val
| Probability that null hypothesis is correct. |
| .adj.p
| .p.val
after adjusting for multiple comparisons. |
| .effect.size
| Effect size. See emmeans::eff_size()
. |
| .lower
| Confidence interval lower bound. |
| .upper
| Confidence interval upper bound. |
| .se
| Standard error. |
| .n
| Number of samples. |
| .df
| Degrees of freedom. |
| .t.ratio
| (.mean, .mean.diff, .slope, or .slope.diff) / .se
|
| .z
| Std. effect size. See vegan::summary.permustats()
. |
| .r.sqr
| Percent of variation explained by the model. |
| .adj.r
| .r.sqr
, taking degrees of freedom into account. |
| .aic
| Akaike Information Criterion (predictive models). |
| .bic
| Bayesian Information Criterion (descriptive models). |
| .loglik
| Log-likelihood goodness-of-fit score. |
| .fit.p
| P-value for observing this fit by chance. |
The .h1
field will always come immediately after the column it is testing against.
Visualizations are one of the best ways to identify correlations in your dataset. If you can see a trend with your eyes, then you're on the right track. The statistics-supported plotting functions in rbiom are ordination plots, box plots, and correlation plots.
Statistics for ordination plots are the most straight-forward. Set a
categorical metadata field to the stat.by
parameter to test whether
inter-sample distances are correlated with that variable.
p <- bdiv_ord_plot( biom = rarefy(hmp50), stat.by = "Body Site", bdiv = c("Jaccard", "Bray-Curtis"), ord = c("PCoA", "UMAP") ) p p$stats p$stats$code
The plot subtitles have the summary statistics. Additionally, p$stats
contains a tibble data.frame with the full statistics table, and p$stats$code
shows the R commands for reproducing the statistics outside of rbiom.
Note that the ordination statistics are not dependent on the ordination, only the distance metric. This is because the statistics are based on beta diversity distances which are computed prior to ordination.
By default, bdiv_ord_plot()
applies the perMANOVA test. You can change this
to MRPP by specifying test="mrpp"
. Details on the available tests are below.
| Test | Function | Method |
| --------- | ------------------ | ----------------------------------------------------------- |
| adonis2
| vegan::adonis2()
| Permutational Multivariate Analysis of Variance (perMANOVA) |
| mrpp
| vegan::mrpp()
| Multiple Response Permutation Procedure (MRPP) |
Statistics on box plots will automatically toggle between pairwise and
group-wise statistics based on the values of x
and stat.by
: x
controls
pairwise and stat.by
controls group-wise. You can set x
and stat.by
to
the same categorical metadata field to get colored pairwise statistics, or set
them to different categorical metadata fields to get multiple group-wise
statistics per plot.
biom <- rarefy(hmp50) %>% subset(`Body Site` %in% c('Saliva', 'Stool', 'Buccal mucosa')) p1 <- adiv_boxplot(biom, x = "Body Site", stat.by = NULL) p2 <- adiv_boxplot(biom, x = NULL, stat.by = "Body Site") p3 <- adiv_boxplot(biom, x = "Body Site", stat.by = "Body Site") p4 <- adiv_boxplot(biom, x = "Sex", stat.by = "Body Site") plots <- list( p1 + ggplot2::labs(subtitle = 'x = "Body Site", stat.by = NULL'), p2 + ggplot2::labs(subtitle = 'x = NULL, stat.by = "Body Site"'), p3 + ggplot2::labs(subtitle = 'x = "Body Site", stat.by = "Body Site"'), p4 + ggplot2::labs(subtitle = 'x = "Sex", stat.by = "Body Site"') ) %>% lapply(`+`, ggplot2::labs(x = NULL, y = NULL, caption = NULL)) %>% lapply(`+`, ggplot2::theme(plot.subtitle = ggplot2::element_text(size = 10))) patchwork::wrap_plots(plots, guides = "collect")
Above, the lower-left plot is annotated with pairwise statistics while the two on the right have group-wise statistics. As with other plots, you can find the full statistics tables and reproducible R code in the plot attributes.
p3$stats p2$stats p2$stats$code
Internally, rbiom uses the non-parametric functions listed below.
| Test | Function | Method |
| ------------- | ----------------------- | -------------------------------------------------------- |
| pairwise | stats::wilcox.test()
| Two-sample Wilcoxon Rank Sum Test, aka Mann-Whitney Test |
| group-wise | stats::kruskal.test()
| Kruskal-Wallis Rank Sum Test |
For an in-depth description of correlation plots, see the rbiom regression article.
Depending on the arguments given to stat.by
and test
, you can test:
biom <- gems %>% subset(diarrhea == "Control") %>% subset(country %in% c("Bangladesh", "Kenya")) %>% rarefy() p1 <- adiv_corrplot(biom, x = "age", test = "emmeans") p2 <- adiv_corrplot(biom, x = "age", test = "emmeans", stat.by = "country") p3 <- adiv_corrplot(biom, x = "age", test = "emtrends") p4 <- adiv_corrplot(biom, x = "age", test = "emtrends", stat.by = "country") plots <- list( p1 + ggplot2::labs(subtitle = 'Is the mean non-zero?'), p2 + ggplot2::labs(subtitle = 'Do means vary by group?'), p3 + ggplot2::labs(subtitle = 'Is the slope non-horizontal?'), p4 + ggplot2::labs(subtitle = 'Do slopes vary by group?')) %>% lapply(`+`, ggplot2::labs(x = NULL, y = NULL, caption = NULL)) patchwork::wrap_plots(plots, guides = "collect")
A normal distribution is visualized as a "bell curve", where values further from the mean are observed less often. Microbial abundances do not follow this pattern; it's common to observe high or low abundances more often than a "medium" abundance.
library(ggplot2) patchwork::wrap_plots( widths = c(1, 1.5), ggplot() + geom_histogram(aes(x=rnorm(1000)), bins = 10) + ggtitle("Normal Distribution"), ggplot(data = taxa_table(rarefy(hmp50), taxa = 4)) + geom_histogram(aes(x=.abundance), bins = 10) + facet_wrap(".taxa") + ggtitle("Genera Abundance Distributions") )
To compensate for this non-normality, rbiom uses the following non-parametric tests for categorical variables that are based on ranking or permutations.
| Test | Function | Used For |
| ----------------------- | ----------------------- | -------------------------- |
| Wilcoxon Rank-Sum | stats::wilcox.test()
| Pairwise boxplot |
| Kruskal-Wallis Rank Sum | stats::kruskal.test()
| Groupwise boxplot |
| Permutational MANOVA | vegan::adonis2()
| bdiv_ord_plot()
clusters |
For correlation/regression analysis, rbiom provides diagnostic plots to determine when residual distributions are cause for concern. To enable this feature, set check = TRUE
.
adiv_corrplot(biom, x = "age", test = "emmeans", check = TRUE)
Further reading:
Compositional data arises when the counts don't represent the entire population. In microbiome studies, the number of microbes that get sequenced is far less than the number of microbes from where the sample was collected. Articles by Gloor et al and McMurdie and Holmes propose the use of their analysis tools ( ALDEx2 and metagenomeSeq, respectively) to apply the proper statistical methods for this situation. Conversely, rbiom does not correct for compositionality. This is because correcting for compositionality introduces extra noise into the dataset and severely limits the selection of metrics and visualizations, typically without any significant benefit to analysis.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.