ancombc2: Analysis of Compositions of Microbiomes with Bias Correction...

View source: R/ancombc2.R

ancombc2R Documentation

Analysis of Compositions of Microbiomes with Bias Correction 2 (ANCOM-BC2)

Description

Determine taxa whose absolute abundances, per unit volume, of the ecosystem (e.g., gut) are significantly different with changes in the covariate of interest (e.g., group). The current version of ancombc2 function implements Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC2) in cross-sectional and repeated measurements data. In addition to the two-group comparison, ANCOM-BC2 also supports testing for continuous covariates and multi-group comparisons, including the global test, pairwise directional test, Dunnett's type of test, and trend test.

Usage

ancombc2(
  data,
  assay.type = assay_name,
  assay_name = "counts",
  rank = tax_level,
  tax_level = NULL,
  fix_formula,
  rand_formula = NULL,
  p_adj_method = "holm",
  pseudo = 0,
  pseudo_sens = TRUE,
  prv_cut = 0.1,
  lib_cut = 0,
  s0_perc = 0.05,
  group = NULL,
  struc_zero = FALSE,
  neg_lb = FALSE,
  alpha = 0.05,
  n_cl = 1,
  verbose = FALSE,
  global = FALSE,
  pairwise = FALSE,
  dunnet = FALSE,
  trend = FALSE,
  iter_control = list(tol = 0.01, max_iter = 20, verbose = FALSE),
  em_control = list(tol = 1e-05, max_iter = 100),
  lme_control = lme4::lmerControl(),
  mdfdr_control = list(fwer_ctrl_method = "holm", B = 100),
  trend_control = list(contrast = NULL, node = NULL, solver = "ECOS", B = 100)
)

Arguments

data

the input data. The data parameter should be either a phyloseq or a TreeSummarizedExperiment object, which consists of a feature table (microbial count table), a sample metadata table, a taxonomy table (optional), and a phylogenetic tree (optional). Ensure that the row names of the metadata table match the sample names in the feature table, and the row names of the taxonomy table match the taxon (feature) names in the feature table. For detailed information, refer to ?phyloseq::phyloseq or ?TreeSummarizedExperiment::TreeSummarizedExperiment. It is recommended to use low taxonomic levels, such as OTU or species level, as the estimation of sampling fractions requires a large number of taxa.

assay.type

alias for assay_name.

assay_name

character. Name of the count table in the data object (only applicable if data object is a (Tree)SummarizedExperiment). Default is "counts". See ?SummarizedExperiment::assay for more details.

rank

alias for tax_level.

tax_level

character. The taxonomic or non taxonomic(rowData) level of interest. The input data can be analyzed at any taxonomic or rowData level without prior agglomeration. Note that tax_level must be a value from taxonomyRanks or rowData, which includes "Kingdom", "Phylum" "Class", "Order", "Family" "Genus" "Species" etc. See ?mia::taxonomyRanks for more details. Default is NULL, i.e., do not perform agglomeration, and the ANCOM-BC2 analysis will be performed at the lowest taxonomic level of the input data.

fix_formula

the character string expresses how the microbial absolute abundances for each taxon depend on the fixed effects in metadata. When specifying the fix_formula, make sure to include the group variable in the formula if it is not NULL.

rand_formula

the character string expresses how the microbial absolute abundances for each taxon depend on the random effects in metadata. ANCOM-BC2 follows the lmerTest package in formulating the random effects. See ?lmerTest::lmer for more details. Default is NULL.

p_adj_method

character. method to adjust p-values. Default is "holm". Options include "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none". See ?stats::p.adjust for more details.

pseudo

numeric. Add pseudo-counts to the data. Please note that this option is now deprecated in ANCOM-BC2. The software will utilize the complete data (nonzero counts) as its default analysis input. Specifying a pseudo-count will not affect the analysis or its results.

pseudo_sens

logical. Whether to perform the sensitivity analysis to the pseudo-count addition. Default is TRUE. While ANCOM-BC2 utilizes complete data (nonzero counts) by default for its analysis, a comprehensive evaluation of result robustness is performed by assessing how pseudo-count addition to zeros may affect the outcomes. For a detailed discussion on this sensitivity analysis, refer to the Details section.

prv_cut

a numerical fraction between 0 and 1. Taxa with prevalences (the proportion of samples in which the taxon is present) less than prv_cut will be excluded in the analysis. For example, if there are 100 samples, and a taxon has nonzero counts present in less than 100*prv_cut samples, it will not be considered in the analysis. Default is 0.10.

lib_cut

a numerical threshold for filtering samples based on library sizes. Samples with library sizes less than lib_cut will be excluded in the analysis. Default is 0, i.e. do not discard any sample.

s0_perc

a numerical fraction between 0 and 1. Inspired by Significance Analysis of Microarrays (SAM) methodology, a small positive constant is added to the denominator of ANCOM-BC2 test statistic corresponding to each taxon to avoid the significance due to extremely small standard errors, especially for rare taxa. This small positive constant is chosen as s0_perc-th percentile of standard error values for each fixed effect. Default is 0.05 (5th percentile).

group

character. the name of the group variable in metadata. The group parameter should be a character string representing the name of the group variable in the metadata. The group variable should be discrete, meaning it consists of categorical values. Specifying the group variable is required if you are interested in detecting structural zeros and performing performing multi-group comparisons (global test, pairwise directional test, Dunnett's type of test, and trend test). However, if these analyses are not of interest to you, you can leave the group parameter as NULL. If the group variable of interest contains only two categories, you can also leave the group parameter as NULL. Default is NULL.

struc_zero

logical. Whether to detect structural zeros based on group. Default is FALSE. See Details for a more comprehensive discussion on structural zeros.

neg_lb

logical. Whether to classify a taxon as a structural zero using its asymptotic lower bound. Default is FALSE.

alpha

numeric. Level of significance. Default is 0.05.

n_cl

numeric. The number of nodes to be forked. For details, see ?parallel::makeCluster. Default is 1 (no parallel computing).

verbose

logical. Whether to generate verbose output during the ANCOM-BC2 fitting process. Default is FALSE.

global

logical. Whether to perform the global test. Default is FALSE.

pairwise

logical. Whether to perform the pairwise directional test. Default is FALSE.

dunnet

logical. Whether to perform the Dunnett's type of test. Default is FALSE.

trend

logical. Whether to perform trend test. Default is FALSE.

iter_control

a named list of control parameters for the iterative MLE or RMEL algorithm, including 1) tol: the iteration convergence tolerance (default is 1e-02), 2) max_iter: the maximum number of iterations (default is 20), and 3)verbose: whether to show the verbose output (default is FALSE).

em_control

a named list of control parameters for the E-M algorithm, including 1) tol: the iteration convergence tolerance (default is 1e-05) and 2) max_iter: the maximum number of iterations (default is 100).

lme_control

a list of control parameters for mixed model fitting. See ?lme4::lmerControl for details.

mdfdr_control

a named list of control parameters for mixed directional false discover rate (mdFDR), including 1) fwer_ctrl_method: family wise error (FWER) controlling procedure, such as "holm", "hochberg", "bonferroni", etc (default is "holm") and 2) B: the number of bootstrap samples (default is 100). Increase B will lead to a more accurate p-values. See Details for a more comprehensive discussion on mdFDR.

trend_control

a named list of control parameters for the trend test, including 1) contrast: the list of contrast matrices for constructing inequalities, 2) node: the list of positions for the nodal parameter, 3) solver: a string indicating the solver to use (default is "ECOS"), and 4) B: the number of bootstrap samples (default is 100). Increase B will lead to a more accurate p-values. See vignette for the corresponding trend test examples.

Details

A taxon is considered to have structural zeros in some (>=1) groups if it is completely (or nearly completely) missing in these groups. For instance, suppose there are three groups: g1, g2, and g3. If the counts of taxon A in g1 are 0 but nonzero in g2 and g3, then taxon A will be considered to contain structural zeros in g1. In this example, taxon A is declared to be differentially abundant between g1 and g2, g1 and g3, and consequently, it is globally differentially abundant with respect to this group variable. Such taxa are not further analyzed using ANCOM-BC2, but the results are summarized in the overall summary. For more details about the structural zeros, please go to the ANCOM-II paper. Setting neg_lb = TRUE indicates that you are using both criteria stated in section 3.2 of ANCOM-II to detect structural zeros; otherwise, the algorithm will only use the equation 1 in section 3.2 for declaring structural zeros. Generally, it is recommended to set neg_lb = TRUE when the sample size per group is relatively large (e.g. > 30).

Like other differential abundance analysis methods, ANCOM-BC2 applies a log transformation to the observed counts. However, the presence of zero counts poses a challenge, and researchers often consider adding a pseudo-count before the log transformation. However, it has been shown that the choice of pseudo-count can impact the results and lead to an inflated false positive rate (Costea et al. (2014); Paulson, Bravo, and Pop (2014)). To address this issue, we conduct a sensitivity analysis to assess the impact of different pseudo-counts on zero counts for each taxon. This analysis involves adding a series of pseudo-counts (ranging from 0.01 to 0.5 in increments of 0.01) to the zero counts of each taxon. Linear regression models are then performed on the bias-corrected log abundance table using the different pseudo-counts. The sensitivity score for each taxon is calculated as the proportion of times that the p-value exceeds the specified significance level (alpha). If all p-values consistently show significance or nonsignificance across different pseudo-counts and are consistent with the results obtained without adding pseudo-counts to zero counts (using the default settings), then the taxon is considered not sensitive to the pseudo-count addition.

When performning pairwise directional (or Dunnett's type of) test, the mixed directional false discover rate (mdFDR) should be taken into account. The mdFDR is the combination of false discovery rate due to multiple testing, multiple pairwise comparisons, and directional tests within each pairwise comparison. For example, suppose we have five taxa and three experimental groups: g1, g2, and g3. Thus, we are performing five tests corresponding to five taxa. For each taxon, we are also conducting three pairwise comparisons (g1 vs. g2, g2 vs. g3, and g1 vs. g3). Within each pairwise comparison, we wish to determine if the abundance has increased or decreased or did not change (direction of the effect size). Errors could occur in each step. The overall false discovery rate is controlled by the mdFDR methodology we adopted from Guo, Sarkar, and Peddada (2010) and Grandhi, Guo, and Peddada (2016).

Value

a list with components:

  • feature_table, a data.frame of pre-processed (based on prv_cut and lib_cut) microbial count table.

  • bias_correct_log_table, a data.frame of bias-corrected log abundance table.

  • ss_tab, a data.frame of sensitivity scores for pseudo-count addition to 0s.

  • zero_ind, a logical data.frame with TRUE indicating the taxon is detected to contain structural zeros in some specific groups.

  • samp_frac, a numeric vector of estimated sampling fractions in log scale (natural log).

  • delta_em, estimated sample-specific biases through E-M algorithm.

  • delta_wls, estimated sample-specific biases through weighted least squares (WLS) algorithm.

  • res, a data.frame containing ANCOM-BC2 primary result:

    • columns started with lfc: log fold changes obtained from the ANCOM-BC2 log-linear (natural log) model.

    • columns started with se: standard errors (SEs) of lfc.

    • columns started with W: test statistics. W = lfc/se.

    • columns started with p: p-values. P-values are obtained from two-sided Z-test using the test statistic W.

    • columns started with q: adjusted p-values. Adjusted p-values are obtained by applying p_adj_method to p.

    • columns started with diff: TRUE if the taxon is significant (has q less than alpha).

    • columns started with passed_ss: TRUE if the taxon passed the sensitivity analysis, i.e., adding different pseudo-counts to 0s would not change the results.

  • res_global, a data.frame containing ANCOM-BC2 global test result for the variable specified in group, each column is:

    • W, test statistics.

    • p_val, p-values, which are obtained from two-sided Chi-square test using W.

    • q_val, adjusted p-values. Adjusted p-values are obtained by applying p_adj_method to p_val.

    • diff_abn, A logical vector. TRUE if the taxon has q_val less than alpha.

    • passed_ss, A logical vector. TRUE if the taxon has passed the sensitivity analysis.

  • res_pair, a data.frame containing ANCOM-BC2 pairwise directional test result for the variable specified in group:

    • columns started with lfc: log fold changes.

    • columns started with se: standard errors (SEs).

    • columns started with W: test statistics.

    • columns started with p: p-values.

    • columns started with q: adjusted p-values.

    • columns started with diff: TRUE if the taxon is significant (has q less than alpha).

    • columns started with passed_ss: TRUE if the taxon has passed the sensitivity analysis.

  • res_dunn, a data.frame containing ANCOM-BC2 Dunnett's type of test result for the variable specified in group:

    • columns started with lfc: log fold changes.

    • columns started with se: standard errors (SEs).

    • columns started with W: test statistics.

    • columns started with p: p-values.

    • columns started with q: adjusted p-values.

    • columns started with diff: TRUE if the taxon is significant (has q less than alpha).

    • columns started with passed_ss: TRUE if the taxon has passed the sensitivity analysis.

  • res_trend, a data.frame containing ANCOM-BC2 trend test result for the variable specified in group:

    • columns started with lfc: log fold changes.

    • columns started with se: standard errors (SEs).

    • W: test statistics.

    • p_val: p-values.

    • q_val: adjusted p-values.

    • diff_abn: TRUE if the taxon is significant (has q less than alpha).

    • passed_ss, A logical vector. TRUE if the taxon has passed the sensitivity analysis.

Author(s)

Huang Lin

References

\insertRef

kaul2017analysisANCOMBC

\insertRef

lin2020analysisANCOMBC

\insertRef

tusher2001significanceANCOMBC

\insertRef

costea2014fairANCOMBC

\insertRef

paulson2014replyANCOMBC

\insertRef

guo2010controllingANCOMBC

\insertRef

grandhi2016multipleANCOMBC

See Also

ancom ancombc

Examples

#===========Build a TreeSummarizedExperiment Object from Scratch=============
library(mia)

# microbial count table
otu_mat = matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10)
rownames(otu_mat) = paste0("taxon", 1:nrow(otu_mat))
colnames(otu_mat) = paste0("sample", 1:ncol(otu_mat))
assays = SimpleList(counts = otu_mat)

# sample metadata
smd = data.frame(group = sample(LETTERS[1:4], size = 10, replace = TRUE),
                 row.names = paste0("sample", 1:ncol(otu_mat)),
                 stringsAsFactors = FALSE)
smd = DataFrame(smd)

# taxonomy table
tax_tab = matrix(sample(letters, 70, replace = TRUE),
                 nrow = nrow(otu_mat), ncol = 7)
rownames(tax_tab) = rownames(otu_mat)
colnames(tax_tab) = c("Kingdom", "Phylum", "Class", "Order",
                      "Family", "Genus", "Species")
# Can also contain non-taxonomic information, for instance
# colnames(tax_tab) = c("G1", "G2", "G3", "G4", "G5", "G6", "G7")
tax_tab = DataFrame(tax_tab)

# create TSE
tse = TreeSummarizedExperiment(assays = assays,
                               colData = smd,
                               rowData = tax_tab)

# convert TSE to phyloseq
pseq = makePhyloseqFromTreeSummarizedExperiment(tse)

#=======================Run ANCOMBC2 Using a Real Data=======================
library(ANCOMBC)
data(dietswap, package = "microbiome")
tse = mia::makeTreeSummarizedExperimentFromPhyloseq(dietswap)

colData(tse)$bmi_group = factor(colData(tse)$bmi_group,
                                levels = c("obese",
                                           "overweight",
                                           "lean"))

set.seed(123)
# Note that setting max_iter = 1 and B = 1 is only for the sake of speed
# Use default or larger values for max_iter and B for better performance
out = ancombc2(data = tse, assay_name = "counts", tax_level = "Phylum",
               fix_formula = "nationality + timepoint + bmi_group",
               rand_formula = NULL,
               p_adj_method = "holm", pseudo_sens = TRUE,
               prv_cut = 0.10, lib_cut = 1000, s0_perc = 0.05,
               group = "bmi_group", struc_zero = TRUE, neg_lb = TRUE,
               alpha = 0.05, n_cl = 1, verbose = TRUE,
               global = TRUE, pairwise = TRUE, dunnet = TRUE, trend = TRUE,
               iter_control = list(tol = 1e-2, max_iter = 1, verbose = TRUE),
               em_control = list(tol = 1e-5, max_iter = 1),
               lme_control = lme4::lmerControl(),
               mdfdr_control = list(fwer_ctrl_method = "holm", B = 1),
               trend_control = list(contrast =
                                          list(matrix(c(1, 0, -1, 1),
                                                      nrow = 2,
                                                      byrow = TRUE)),
                                      node = list(2),
                                      solver = "ECOS",
                                      B = 1))
res_prim = out$res
res_global = out$res_global
res_pair = out$res_pair
res_dunn = out$res_dunn
res_trend = out$res_trend


FrederickHuangLin/ANCOMBC documentation built on Jan. 4, 2024, 8:18 a.m.