ancombc: Analysis of Compositions of Microbiomes with Bias Correction...

View source: R/ancombc.R

ancombcR Documentation

Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC)

Description

Determine taxa whose absolute abundances, per unit volume, of the ecosystem (e.g., gut) are significantly different with changes in the covariate of interest (e.g., group). The current version of ancombc function implements Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC) in cross-sectional data while allowing for covariate adjustment.

Usage

ancombc(
  data = NULL,
  assay.type = NULL,
  assay_name = "counts",
  rank = NULL,
  tax_level = NULL,
  phyloseq = NULL,
  formula,
  p_adj_method = "holm",
  prv_cut = 0.1,
  lib_cut = 0,
  group = NULL,
  struc_zero = FALSE,
  neg_lb = FALSE,
  tol = 1e-05,
  max_iter = 100,
  conserve = FALSE,
  alpha = 0.05,
  global = FALSE,
  n_cl = 1,
  verbose = FALSE
)

Arguments

data

the input data. The data parameter should be either a phyloseq or a TreeSummarizedExperiment object, which consists of a feature table (microbial count table), a sample metadata table, a taxonomy table (optional), and a phylogenetic tree (optional). Ensure that the row names of the metadata table match the sample names in the feature table, and the row names of the taxonomy table match the taxon (feature) names in the feature table. For detailed information, refer to ?phyloseq::phyloseq or ?TreeSummarizedExperiment::TreeSummarizedExperiment.

assay.type

alias for assay_name.

assay_name

character. Name of the count table in the data object (only applicable if data object is a (Tree)SummarizedExperiment). Default is "counts". See ?SummarizedExperiment::assay for more details.

rank

alias for tax_level.

tax_level

character. The taxonomic level of interest. The input data can be agglomerated at different taxonomic levels based on your research interest. Default is NULL, i.e., do not perform agglomeration, and the ANCOM-BC anlysis will be performed at the lowest taxonomic level of the input data.

phyloseq

a phyloseq object. Will be deprecated.

formula

the character string expresses how microbial absolute abundances for each taxon depend on the variables in metadata. When specifying the formula, make sure to include the group variable in the formula if it is not NULL.

p_adj_method

character. method to adjust p-values. Default is "holm". Options include "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none". See ?stats::p.adjust for more details.

prv_cut

a numerical fraction between 0 and 1. Taxa with prevalences (the proportion of samples in which the taxon is present) less than prv_cut will be excluded in the analysis. For example, if there are 100 samples, and a taxon has nonzero counts present in less than 100*prv_cut samples, it will not be considered in the analysis. Default is 0.10.

lib_cut

a numerical threshold for filtering samples based on library sizes. Samples with library sizes less than lib_cut will be excluded in the analysis. Default is 0, i.e. do not discard any sample.

group

character. the name of the group variable in metadata. The group parameter should be a character string representing the name of the group variable in the metadata. The group variable should be discrete, meaning it consists of categorical values. Specifying the group variable is required if you are interested in detecting structural zeros and performing global tests. However, if these analyses are not of interest to you, you can leave the group parameter as NULL. If the group variable of interest contains only two categories, you can also leave the group parameter as NULL. Default is NULL.

struc_zero

logical. whether to detect structural zeros based on group. Default is FALSE.

neg_lb

logical. whether to classify a taxon as a structural zero using its asymptotic lower bound. Default is FALSE.

tol

numeric. the iteration convergence tolerance for the E-M algorithm. Default is 1e-05.

max_iter

numeric. the maximum number of iterations for the E-M algorithm. Default is 100.

conserve

logical. whether to use a conservative variance estimator for the test statistic. It is recommended if the sample size is small and/or the number of differentially abundant taxa is believed to be large. Default is FALSE.

alpha

numeric. level of significance. Default is 0.05.

global

logical. whether to perform the global test. Default is FALSE.

n_cl

numeric. The number of nodes to be forked. For details, see ?parallel::makeCluster. Default is 1 (no parallel computing).

verbose

logical. Whether to generate verbose output during the ANCOM-BC fitting process. Default is FALSE.

Details

A taxon is considered to have structural zeros in some (>=1) groups if it is completely (or nearly completely) missing in these groups. For instance, suppose there are three groups: g1, g2, and g3. If the counts of taxon A in g1 are 0 but nonzero in g2 and g3, then taxon A will be considered to contain structural zeros in g1. In this example, taxon A is declared to be differentially abundant between g1 and g2, g1 and g3, and consequently, it is globally differentially abundant with respect to this group variable. Such taxa are not further analyzed using ANCOM-BC, but the results are summarized in the overall summary. For more details about the structural zeros, please go to the ANCOM-II paper. Setting neg_lb = TRUE indicates that you are using both criteria stated in section 3.2 of ANCOM-II to detect structural zeros; otherwise, the algorithm will only use the equation 1 in section 3.2 for declaring structural zeros. Generally, it is recommended to set neg_lb = TRUE when the sample size per group is relatively large (e.g. > 30).

Value

a list with components:

  • feature_table, a data.frame of pre-processed (based on prv_cut and lib_cut) microbial count table.

  • zero_ind, a logical data.frame with TRUE indicating the taxon is detected to contain structural zeros in some specific groups.

  • samp_frac, a numeric vector of estimated sampling fractions in log scale (natural log).

  • delta_em, estimated sample-specific biases through E-M algorithm.

  • delta_wls, estimated sample-specific biases through weighted least squares (WLS) algorithm.

  • res, a list containing ANCOM-BC primary result, which consists of:

    • lfc, a data.frame of log fold changes obtained from the ANCOM-BC log-linear (natural log) model.

    • se, a data.frame of standard errors (SEs) of lfc.

    • W, a data.frame of test statistics. W = lfc/se.

    • p_val, a data.frame of p-values. P-values are obtained from two-sided Z-test using the test statistic W.

    • q_val, a data.frame of adjusted p-values. Adjusted p-values are obtained by applying p_adj_method to p_val.

    • diff_abn, a logical data.frame. TRUE if the taxon has q_val less than alpha.

  • res_global, a data.frame containing ANCOM-BC global test result for the variable specified in group, each column is:

    • W, test statistics.

    • p_val, p-values, which are obtained from two-sided Chi-square test using W.

    • q_val, adjusted p-values. Adjusted p-values are obtained by applying p_adj_method to p_val.

    • diff_abn, A logical vector. TRUE if the taxon has q_val less than alpha.

Author(s)

Huang Lin

References

\insertRef

kaul2017analysisANCOMBC

\insertRef

lin2020analysisANCOMBC

See Also

ancom ancombc2

Examples

#===========Build a TreeSummarizedExperiment Object from Scratch=============
library(mia)

# microbial count table
otu_mat = matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10)
rownames(otu_mat) = paste0("taxon", 1:nrow(otu_mat))
colnames(otu_mat) = paste0("sample", 1:ncol(otu_mat))
assays = SimpleList(counts = otu_mat)

# sample metadata
smd = data.frame(group = sample(LETTERS[1:4], size = 10, replace = TRUE),
                 row.names = paste0("sample", 1:ncol(otu_mat)),
                 stringsAsFactors = FALSE)
smd = DataFrame(smd)

# taxonomy table
tax_tab = matrix(sample(letters, 70, replace = TRUE),
                 nrow = nrow(otu_mat), ncol = 7)
rownames(tax_tab) = rownames(otu_mat)
colnames(tax_tab) = c("Kingdom", "Phylum", "Class", "Order",
                      "Family", "Genus", "Species")
tax_tab = DataFrame(tax_tab)

# create TSE
tse = TreeSummarizedExperiment(assays = assays,
                               colData = smd,
                               rowData = tax_tab)

# convert TSE to phyloseq
pseq = makePhyloseqFromTreeSummarizedExperiment(tse)

#========================Run ANCOMBC Using a Real Data=======================
library(ANCOMBC)
data(atlas1006, package = "microbiome")
tse = mia::makeTreeSummarizedExperimentFromPhyloseq(atlas1006)

# subset to baseline
tse = tse[, tse$time == 0]

# run ancombc function
set.seed(123)
out = ancombc(data = tse, assay_name = "counts",
              tax_level = "Family", phyloseq = NULL,
              formula = "age + nationality + bmi_group",
              p_adj_method = "holm", prv_cut = 0.10, lib_cut = 1000,
              group = "bmi_group", struc_zero = TRUE, neg_lb = FALSE,
              tol = 1e-5, max_iter = 100, conserve = TRUE,
              alpha = 0.05, global = TRUE, n_cl = 1, verbose = TRUE)

res_prim = out$res
res_global = out$res_global

# to run ancombc using the phyloseq object
tse_alt = agglomerateByRank(tse, "Family")
pseq = makePhyloseqFromTreeSummarizedExperiment(tse_alt)
set.seed(123)
out = ancombc(data = NULL, assay_name = NULL,
              tax_level = "Family", phyloseq = pseq,
              formula = "age + nationality + bmi_group",
              p_adj_method = "holm", prv_cut = 0.10, lib_cut = 1000,
              group = "bmi_group", struc_zero = TRUE, neg_lb = FALSE,
              tol = 1e-5, max_iter = 100, conserve = TRUE,
              alpha = 0.05, global = TRUE, n_cl = 1, verbose = TRUE)


FrederickHuangLin/ANCOMBC documentation built on Jan. 4, 2024, 8:18 a.m.