library(rbiom)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  R.options = list(
    pillar.print_min = 5,
    pillar.print_max = 5 ))

Introduction

Visualizations are one of the best ways to identify correlations in your dataset. If you can see a trend with your eyes, then you're on the right track. Rbiom's plotting functions are

cat(glue::glue('
  <table class="table-borderless">
  <tr>{adiv_boxplot}{bdiv_corrplot}{rare_stacked}{taxa_boxplot}</tr>
  <tr>{adiv_corrplot}{bdiv_heatmap}{rare_corrplot}{taxa_corrplot}</tr>
  <tr>{bdiv_boxplot}{bdiv_ord_plot}{taxa_stacked}{taxa_heatmap}</tr>
  </table>',
  .transformer = function (text, ...) {
    sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text)
  }))

The *_boxplot(), *_corrplot(), and bdiv_ord_plot() functions will automatically add p-values to your figures whenever possible. The ggplot object they return has $data, $code, $stats, and $stats$code attributes you can use to automate, reproduce, and customize your figures.

To generate statistics without creating a plot, use one of the following:

cat(glue::glue('
  <table class="table-borderless">
  <tr>{adiv_stats}{bdiv_stats}{distmat_stats}{taxa_stats}</tr>
  </table>',
  .transformer = function (text, ...) {
    sprintf("<td><code><a href='../reference/%s.html'>%s()</a></code></td>", text, text)
  }))

\cr

Quick Start

Your metadata field of interest and microbiome property will determine which rbiom function to use.

Metadata
Property
Microbiome Property
Alpha Diversity
Shannon, Simpson
Beta Diversity
UniFrac, Jaccard
Taxa Abundance
Phylum, Genus
Categorical
Sex, Body Site
`adiv_boxplot()` `bdiv_boxplot()`
`bdiv_ord_plot()`
`taxa_boxplot()`
Numeric
Age, BMI
`adiv_corrplot()` `bdiv_corrplot()` `taxa_corrplot()`
Any `adiv_stats()` `bdiv_stats()` `taxa_stats()`

For instance, to explore the effect of Body Site (a categorical metadata field) on Shannon Diversity (an alpha diversity metric), we'd use adiv_boxplot() to produce a plot with statistics, or adiv_stats() if we only want the stats.

Although all the functions have important differences, the statistical methods they employ can be grouped by the three plot types: ordination plot, box plot, and correlation plot.

\cr

Ordination Plots

Statistics for ordination plots are the most straight-forward. Set a categorical metadata field to the color.by parameter to test whether inter-sample distances are correlated with that variable.

p <- bdiv_ord_plot(
  biom     = rarefy(hmp50), 
  color.by = "Body Site", 
  bdiv     = c("Jaccard", "Bray-Curtis"), 
  ord      = c("PCoA", "UMAP") )
p
p$stats

p$stats$code

The plot subtitles have the summary statistics. Additionally, p$stats contains a tibble data.frame with the full statistics table, and p$stats$code shows the R commands for reproducing the statistics outside of rbiom.

Note that the ordination statistics are not dependent on the ordination, only the distance metric. This is because the statistics are based on beta diversity distances which are computed prior to ordination.

By default, bdiv_ord_plot() applies the perMANOVA test. You can change this to MRPP by specifying test="mrpp". Details on the available tests are below.

| Test | Function | Method | | --------- | ------------------ | ----------------------------------------------------------- | | adonis2 | vegan::adonis2() | Permutational Multivariate Analysis of Variance (perMANOVA) | | mrpp | vegan::mrpp() | Multiple Response Permutation Procedure (MRPP) |

\cr

Box Plots

Statistics on box plots will automatically toggle between pairwise and group-wise statistics based on the values of x and color.by: x controls pairwise and color.by controls group-wise. You can set x and color.by to the same categorical metadata field to get colored pairwise statistics, or set them to different categorical metadata fields to get multiple group-wise statistics per plot.

biom <- rarefy(hmp50) %>% 
  subset(`Body Site` %in% c('Saliva', 'Stool', 'Buccal mucosa'))

p1 <- adiv_boxplot(biom, x = "Body Site")
p2 <- adiv_boxplot(biom, color.by = "Body Site")
p3 <- adiv_boxplot(biom, x = "Body Site", color.by = "Body Site")
p4 <- adiv_boxplot(biom, x = "Sex", color.by = "Body Site")

plots <- list(
  p1 + ggplot2::labs(subtitle = 'x = "Body Site"'), 
  p2 + ggplot2::labs(subtitle = 'color.by = "Body Site"'), 
  p3 + ggplot2::labs(subtitle = 'x = color.by = "Body Site"'), 
  p4 + ggplot2::labs(subtitle = 'x = "Sex", color.by = "Body Site"')) %>%
  lapply(`+`, ggplot2::labs(x = NULL, y = NULL, caption = NULL))

patchwork::wrap_plots(plots, guides = "collect")

Above, the two plots on the left are annotated with pairwise statistics while the two on the right have group-wise statistics. As with other plots, you can find the full statistics tables and reproducible R code in the plot attributes.

p1$stats

p2$stats

p2$stats$code

Internally, rbiom uses the non-parametric functions listed below.

| Test | Function | Method | | ------------- | ----------------------- | -------------------------------------------------------- | | pairwise | stats::wilcox.test() | Two-sample Wilcoxon Rank Sum Test, aka Mann-Whitney Test | | group-wise | stats::kruskal.test() | Kruskal-Wallis Rank Sum Test |

\cr

Background

Normality

A normal distribution is visualized as a "bell curve", where values further from the mean are observed less often. Microbial abundances do not follow this pattern; it's common to observe high or low abundances more often than a "medium" abundance.

library(ggplot2)

patchwork::wrap_plots(
  widths = c(1, 1.5),

  ggplot() + 
    geom_histogram(aes(x=rnorm(1000)), bins = 10) + 
    ggtitle("Normal Distribution"),

  ggplot(data = taxa_table(rarefy(hmp50), taxa = 4)) + 
    geom_histogram(aes(x=.abundance), bins = 10) + 
    facet_wrap(".taxa") + 
    ggtitle("Genera Abundance Distributions")
)

To compensate for this non-normality, rbiom uses the following non-parametric tests for categorical variables that are based on ranking or permutations.

| Test | Function | Used For | | ----------------------- | ----------------------- | -------------------------- | | Wilcoxon Rank-Sum | stats::wilcox.test() | Pairwise boxplot | | Kruskal-Wallis Rank Sum | stats::kruskal.test() | Groupwise boxplot | | Permutational MANOVA | vegan::adonis2() | bdiv_ord_plot() clusters |

For correlation/regression analysis, rbiom uses estimated marginal means (emmeans::emmeans()) on rank-transformed values.

Further reading:

\cr

Compositionality

Compositional data arises when the counts don't represent the entire population. In microbiome studies, the number of microbes that get sequenced is far less than the number of microbes from where the sample was collected. Articles by Gloor et al and McMurdie and Holmes propose the use of their analysis tools ( ALDEx2 and metagenomeSeq, respectively) to apply the proper statistical methods for this situation. Conversely, rbiom does not correct for compositionality. This is because correcting for compositionality introduces extra noise into your dataset and severely limits your selection of metrics and visualizations, typically without any significant benefit to analysis.



cmmr/rbiom documentation built on April 28, 2024, 6:38 a.m.