title: 'rbiom: An R package to for microbiome analysis' authors: - name: Daniel P Smith orcid: 0000-0002-2479-2044 corresponding: true affiliation: 1 - name: Joseph F Petrosino orcid: 0000-0002-4046-6898 affiliation: 1 affiliations: - name: Alkek Center for Metagenomics and Microbiome Research, Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, Texas, USA index: 1 date: 21 April 2023 bibliography: paper.bib
Microbes live all around us, on us, and even inside our bodies. Their influence on health and disease is profound, and only beginning to be fully understood. Studying microbial populations is becoming easier with modern DNA sequencing technology; examining trends across thousands of samples today is common. The bottleneck is longer collecting data, but rather analyzing and interpreting the results.
rbiom
is an R package for working with abundance datasets, such as OTU or ASV
counts from 16S amplicon sequencing. It enables importing/exporting all BIOM
formats, subsetting, rarefying, manipulation of metadata/taxonomy/phylogeny,
computation of alpha and beta diversity metrics, and summarizing counts per
taxonomic rank. Computationally intensive tasks (including UniFrac [@unifrac])
have been implemented with multithreaded C++ to greatly reduce calculation
time.
Visualization is a key component of rbiom
. Rarefaction curves, taxa
abundances, alpha diversity, and beta diversity can all be plotted in a variety
of graphical formats, including correlation, heatmap, ordination, stacked bar,
and box plots. In rbiom
, box plots can be any combination of box, bar,
violin, dot, strip, and/or range layers. Each plot includes provenance and
modification history as attributes, as well as the ggplot2
[@ggplot2] call
used to render it to encourage downstream user customization.
Correlations between sample metadata and microbiome structure can be identified
by mapping one or more metadata variables of interest to a plot's axes, facets,
and/or aesthetics. These mappings can optionally define color/shape/pattern
assignments, category ordering, or subsetting parameters. When metadata is
associated with a axis or aesthetic, rbiom
will automatically run the
appropriate statistical test, correct for multiple comparisons, and display
significant differences on the plot, captioning it with a brief methodology.
Currently, rbiom
can perform four types of significance testing. On
correlation plots with a numeric metadata variable on the x-axis (e.g., Age,
BMI), linear regression will be computed with R's lm
linear model function.
For plots with two categories (e.g. Male vs Female), a Mann-Whitney test
[@Mann1947] is run with R's wilcox.test
. When three or more categories are
compared, the Kruskal-Wallis rank sum test [@Kruskal1952] is used instead via
R's kruskal.test
function. P-values for ordinations are derived using the
adonis2
function from the vegan
R package [@vegan], which randomly
re-categorizes samples 1,000 times to estimate the significance of the observed
clustering. P-values are corrected for multiple comparisons using the method
described by @Benjamini1995 via R's p.adjust
function to control for the
false discovery rate.
QIIME2
[@qiime2], mothur
[@mothur], and Phyloseq
[@phyloseq] offer
overlapping functionality with rbiom, but with important distinctions. The
first two are designed for command-line interaction, making them difficult to
integrate into R projects. Phyloseq
has been a staple of R bioinformatics for
a decade, but is frustratingly slow for studies with thousands of samples.
This package is designed for users of all experience levels. Novice R users
will appreciate that a couple commands will produce publication-ready figures.
Advanced R users can use rbiom
to complement their existing pipelines with
faster and more flexible functions. rbiom
is cross-platform compatible and
available from CRAN and conda-forge. The latest development version is on
GitHub.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.