In qpmnguyen/teaR: Competitive Balances for Taxonomic Enrichment Analysis in R

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)

## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    phyloseq = citation("phyloseq")[1],
    treesummarizedexperiment = citation("TreeSummarizedExperiment")[1],
    tidyverse = citation("tidyverse")[1],
    mixtools = citation("mixtools")[1],
    fitdistrplus = citation("fitdistrplus")[1],
    CBEA = citation("CBEA")[1],
    broom = citation("broom")[1], 
    BiocParallel = citation("BiocParallel")[1]
)

Introduction

This package implements the CBEA approach for performing set-based enrichment analysis for microbiome relative abundance data. A preprint of the package can be found on bioXriv. In summary, CBEA (Competitive Balances for taxonomic Enrichment Analysis) provides an estimate of the activity of a set by transforming an input taxa-by-sample data matrix into a corresponding set-by-sample data matrix. The resulting output can be used for additional downstream analyses such as differential abundance, classification, clustering, etc. using set-based features instead of the original units.

The transformation that CBEA applies is based on the isometric log ratio transformation:
$$ CBEA_{i,\mathbb{S}} = \sqrt{\frac{|\mathbb{S}||\mathbb{S_c}|}{|\mathbb{S}| + |\mathbb{S_c}|}} \ln \frac{g(X_{i,j | j\in \mathbb{S}})}{g(X_{i,j | j \notin \mathbb{S}})} $$ Where $\mathbb{S}$ is the set of interest, $\mathbb{S}_C$ is it's complement, $g()$ is the geometric mean operation, and $X$ is the original data matrix where $i$ is the index representing samples and $j$ is the index representing variables (or taxa).

The inference procedure is performed through estimating the null distribution of the test statistic. This can be done either via permutations or a parametric fit of a distributional form on the permuted scores. Users can also adjust for variance inflation due to inter-taxa correlation. Please refer to the main manuscript for any additional details.

Usage guide

Install `CBEA`

r Biocpkg("CBEA") is an R package available via the Bioconductor repository for packages. It requires installing the R open source statistical programming language, which can be accessed on any operating system from CRAN. After which you can install r Biocpkg("CBEA") by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("CBEA")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

If there are any issues with the installation procedure or package features, the best place would be to file an issue at the GitHub repository. For any additional support you can use the Bioconductor support site and use the CBEA tag and check the older posts. Please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

library("CBEA")
library(BiocSet)
library(tidyverse)
set.seed(1020)

Loading sample data

First, we can load some pre-packaged data sets contained in r Biocpkg("CBEA"). Here we're loading the data from the Human Microbiome Project (HMP) in TreeSummarizedExperiment data container from the r Biocpkg("TreeSummarizedExperiment"). This package does not support phyloseq from the r Biocpkg("phyloseq") package but users can leverage the r Biocpkg("mia") package to convert between the data types.

In addition, users can also input raw matrices or data frames, however those require additional arguments. The taxa_are_rows argument requires users specify whether the data frame/matrix has taxa abundances as rows (or as columns). The id_col argument requires users to specify (for data frames only) a vector of names of row metadata that will be excluded from the analysis.

data("hmp_gingival")
abun <- hmp_gingival$data
metab_sets <- hmp_gingival$set
abun # this is a TreeSummarizedExperiment object

Input sets

CBEA accepts any type of sets, as long as it is in the BiocSet format where the elements in the sets can be matched to taxa names in the data set. The main function will check if these names match.

metab_sets

For more information on BiocSet, please refer to the documentation from r Biocpkg("BiocSet"). However, simply speaking, BiocSet acts similar to a list of three data frames and can be used in conjunction with r CRANpkg("dplyr")/r CRANpkg("tidyr").

Applying CBEA

After specifying the inputs, cbea is the main function to apply the method. If there are zeros in the abundance data, the cbea will add a pseudocount to avoid issues with the log-ratio transformation (but will throw a warning). If a different zero-handling approach is desired, users should pre-process the abundance data with the appropriate method. For parametric fits, cbea relies on the r CRANpkg("fitdistrplus") and r CRANpkg("mixtools") packages to estimate the parameters of the null. Specific arguments to control this fitting procedure can be provided as a named list in the control argument.
Applying cbea is one command:

results <- cbea(abun, set = metab_sets, abund_values = "16SrRNA",
              output = "cdf", distr = "mnorm", adj = TRUE, thresh = 0.05, n_perm = 10)
results

Some important arguments to control the behaviour of CBEA.

output: This controls what type of output is being returned. CBEA usually estimates a parametric null and users can specify what they want in return. If users want to perform downstream analysis with set-level features, they can return CDF values or z-scores of each raw score computed against that distribution (options cdf or zscore). Alternatively, users can just return the raw scores themselves (no distribution fitting will be performed) using raw as the option. Users can also use this distribution to estimate unadjusted p-values (option pval) to see whether a set is enriched at each sample. These unadjusted p-values can be converted based on a threshold (based on thresh which is default to be set at 0.05) into a dummy variable indicating enrichment (option sig). Note: CDF values and Z-scores are not available for non-parametric null estimations.
parametric: This is a logical argument to specify whether a the null distribution will be specified via parametric fit or via non-parametric permutation testing. If parametric is TRUE, users need to specify distr and adj. If parametric is FALSE, users need to increase n_perm.
distr: The form of the distribution if parametric fit is desired. As of now only supports norm, mnorm.
adj: Whether the distribution should be adjusted for variance inflation. This procedure is done by combining the mean estimate from scores computed from permuted data set and the variance estimate from raw scores (computed on the unpermuted data set).

Model output

The output object is of class CBEAout, which is an S3 object. The underlying data structure is a list of lists, where the outer lists represent different aspects of the output. For example R represent the final scores while diagnostic represent certain goodness-of-fit statistics.

names(results)

Within each aspect, there is a list of size equivalent to the total number of sets evaluated. For example, the results object is of size 3 representing the evaluated sets.

str(results$R)

Users can use tidy and glance following the r CRANpkg("broom") to process CBEAout into nice objects. The tidy function returns a tibble of scores (samples by set). The glance function returns some diagnostics. There are two options for the glance function: fit_comparison allows users to compare the l-moments of the data, the permuted data, and the final fitted distribution; fit_diagnostic shows goodness-of-fit statistics of the distribution fitting procedure itself, with log-likelihoods and Anderson-Darling (column "ad") statistics.

tidy(results)
glance(results, "fit_comparison")
glance(results, "fit_diagnostic")

Parallel computing

CBEA has in-built capacity to perform calculations paralelled across the total number of sets. The engine for parallelization is r Biocpkg("BiocParallel"). If NULL, SerialParam backend will be used.

BiocParallel::registered()

cbea(abun, set = metab_sets, abund_values = "16SrRNA",
     output = "cdf", distr = "mnorm", adj = TRUE, thresh = 0.05, n_perm = 10, 
     parallel_backend = MulticoreParam(workers = 2))

Citing `CBEA`

We hope that r Biocpkg("CBEA") will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("CBEA")

Reproducibility

The r Biocpkg("CBEA") package r Citep(bib[["CBEA"]]) was made possible thanks to:

R r Citep(bib[["R"]])
r Biocpkg("BiocStyle") r Citep(bib[["BiocStyle"]])
r CRANpkg("knitr") r Citep(bib[["knitr"]])
r CRANpkg("RefManageR") r Citep(bib[["RefManageR"]])
r CRANpkg("rmarkdown") r Citep(bib[["rmarkdown"]])
r CRANpkg("broom") r Citep(bib[["broom"]])
r CRANpkg("sessioninfo") r Citep(bib[["sessioninfo"]])
r CRANpkg("testthat") r Citep(bib[["testthat"]])
r CRANpkg("mixtools") r Citep(bib[["mixtools"]])
r CRANpkg("fitdistrplus") r Citep(bib[["fitdistrplus"]])
r CRANpkg("tidyverse") r Citep(bib[["tidyverse"]])
r Biocpkg("BiocSet") r Citep(bib[["BiocSet"]])
r Biocpkg("phyloseq") r Citep(bib[["phyloseq"]])
r Biocpkg("BiocParallel") r Citep(bib[["BiocParallel"]])

This package was developed using r BiocStyle::Biocpkg("biocthis").

Code for creating the vignette

## Create the vignette
library("rmarkdown")
system.time(render("basic_usage.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("basic_usage.Rmd", tangle = TRUE)

Date the vignette was generated.

## Date the vignette was generated
Sys.time()

Wallclock time spent generating the vignette.

## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)

R session information.

## Session info
library("sessioninfo")
options(width = 120)
session_info()

Bibliography

This vignette was generated using r Biocpkg("BiocStyle") r Citep(bib[["BiocStyle"]]) with r CRANpkg("knitr") r Citep(bib[["knitr"]]) and r CRANpkg("rmarkdown") r Citep(bib[["rmarkdown"]]) running behind the scenes.

Citations made with r CRANpkg("RefManageR") r Citep(bib[["RefManageR"]]).

## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))

qpmnguyen/teaR documentation built on April 4, 2022, 7:26 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

qpmnguyen/teaR
Competitive Balances for Taxonomic Enrichment Analysis in R

In qpmnguyen/teaR: Competitive Balances for Taxonomic Enrichment Analysis in R

Introduction

Usage guide

Install `CBEA`

Loading sample data

Input sets

Applying CBEA

Model output

Parallel computing

Citing `CBEA`

Reproducibility

Bibliography

R Package Documentation

Browse R Packages

We want your feedback!

qpmnguyen/teaR Competitive Balances for Taxonomic Enrichment Analysis in R

In qpmnguyen/teaR: Competitive Balances for Taxonomic Enrichment Analysis in R

Introduction

Usage guide

Install CBEA

Loading sample data

Input sets

Applying CBEA

Model output

Parallel computing

Citing CBEA

Reproducibility

Bibliography

R Package Documentation

Browse R Packages

We want your feedback!

qpmnguyen/teaR
Competitive Balances for Taxonomic Enrichment Analysis in R

Install `CBEA`

Citing `CBEA`