#' @title Analysis of Compositions of Microbiomes with Bias Correction
#' (ANCOM-BC)
#'
#' @description Determine taxa whose absolute abundances, per unit volume, of
#' the ecosystem (e.g., gut) are significantly different with changes in the
#' covariate of interest (e.g., group). The current version of
#' \code{ancombc} function implements Analysis of Compositions of Microbiomes
#' with Bias Correction (ANCOM-BC) in cross-sectional data while allowing
#' for covariate adjustment.
#'
#' @details A taxon is considered to have structural zeros in some (>=1)
#' groups if it is completely (or nearly completely) missing in these groups.
#' For instance, suppose there are three groups: g1, g2, and g3.
#' If the counts of taxon A in g1 are 0 but nonzero in g2 and g3,
#' then taxon A will be considered to contain structural zeros in g1.
#' In this example, taxon A is declared to be differentially abundant between
#' g1 and g2, g1 and g3, and consequently, it is globally differentially
#' abundant with respect to this group variable.
#' Such taxa are not further analyzed using ANCOM-BC, but the results are
#' summarized in the overall summary. For more details about the structural
#' zeros, please go to the
#' \href{https://doi.org/10.3389/fmicb.2017.02114}{ANCOM-II} paper.
#' Setting \code{neg_lb = TRUE} indicates that you are using both criteria
#' stated in section 3.2 of
#' \href{https://doi.org/10.3389/fmicb.2017.02114}{ANCOM-II}
#' to detect structural zeros; otherwise, the algorithm will only use the
#' equation 1 in section 3.2 for declaring structural zeros. Generally, it is
#' recommended to set \code{neg_lb = TRUE} when the sample size per group is
#' relatively large (e.g. > 30).
#'
#' @param data the input data. The \code{data} parameter should be either a
#' \code{matrix}, \code{data.frame}, \code{phyloseq} or a \code{TreeSummarizedExperiment}
#' object. Both \code{phyloseq} and \code{TreeSummarizedExperiment} objects
#' consist of a feature table (microbial count table), a sample metadata table,
#' a taxonomy table (optional), and a phylogenetic tree (optional).
#' If a \code{matrix} or \code{data.frame} is provided, ensure that the row
#' names of the \code{metadata} match the sample names (column names if
#' \code{taxa_are_rows} is TRUE, and row names otherwise) in \code{data}.
#' if a \code{phyloseq} or a \code{TreeSummarizedExperiment} is used, this
#' standard has already been enforced. For detailed information, refer to
#' \code{?phyloseq::phyloseq} or
#' \code{?TreeSummarizedExperiment::TreeSummarizedExperiment}.
#' It is recommended to use low taxonomic levels, such as OTU or species level,
#' as the estimation of sampling fractions requires a large number of taxa.
#' @param taxa_are_rows logical. Whether taxa are positioned in the rows of the
#' feature table. Default is TRUE.
#' @param assay_name character. Name of the count table in the data object
#' (only applicable if data object is a \code{(Tree)SummarizedExperiment}).
#' Default is "counts".
#' See \code{?SummarizedExperiment::assay} for more details.
#' @param assay.type alias for \code{assay_name}.
#' @param tax_level character. The taxonomic level of interest. The input data
#' can be agglomerated at different taxonomic levels based on your research
#' interest. Default is NULL, i.e., do not perform agglomeration, and the
#' ANCOM-BC anlysis will be performed at the lowest taxonomic level of the
#' input \code{data}.
#' @param rank alias for \code{tax_level}.
#' @param aggregate_data The abundance data that has been aggregated to the desired
#' taxonomic level. This parameter is required only when the input data is in
#' \code{matrix} or \code{data.frame} format. For \code{phyloseq} or \code{TreeSummarizedExperiment}
#' data, aggregation is performed by specifying the \code{tax_level} parameter.
#' @param meta_data a \code{data.frame} containing sample metadata.
#' This parameter is mandatory when the input \code{data} is a generic
#' \code{matrix} or \code{data.frame}. Ensure that the row names of the \code{metadata} match the
#' sample names (column names if \code{taxa_are_rows} is TRUE, and row names
#' otherwise) in \code{data}.
#' @param formula the character string expresses how microbial absolute
#' abundances for each taxon depend on the variables in metadata. When
#' specifying the \code{formula}, make sure to include the \code{group} variable
#' in the formula if it is not NULL.
#' @param p_adj_method character. method to adjust p-values. Default is "holm".
#' Options include "holm", "hochberg", "hommel", "bonferroni", "BH", "BY",
#' "fdr", "none". See \code{?stats::p.adjust} for more details.
#' @param prv_cut a numerical fraction between 0 and 1. Taxa with prevalences
#' (the proportion of samples in which the taxon is present)
#' less than \code{prv_cut} will be excluded in the analysis. For example,
#' if there are 100 samples, and a taxon has nonzero counts present in less than
#' 100*prv_cut samples, it will not be considered in the analysis.
#' Default is 0.10.
#' @param lib_cut a numerical threshold for filtering samples based on library
#' sizes. Samples with library sizes less than \code{lib_cut} will be
#' excluded in the analysis. Default is 0, i.e. do not discard any sample.
#' @param group character. the name of the group variable in metadata.
#' The \code{group} parameter should be a character string representing the name
#' of the group variable in the metadata. The \code{group} variable should be
#' discrete, meaning it consists of categorical values. Specifying the
#' \code{group} variable is required if you are interested in detecting
#' structural zeros and performing global tests. However, if these analyses are
#' not of interest to you, you can leave the \code{group} parameter as NULL.
#' If the \code{group} variable of interest contains only two categories, you
#' can also leave the \code{group} parameter as NULL. Default is NULL.
#' @param struc_zero logical. whether to detect structural zeros based on
#' \code{group}. Default is FALSE.
#' @param neg_lb logical. whether to classify a taxon as a structural zero using
#' its asymptotic lower bound. Default is FALSE.
#' @param tol numeric. the iteration convergence tolerance for the E-M
#' algorithm. Default is 1e-05.
#' @param max_iter numeric. the maximum number of iterations for the E-M
#' algorithm. Default is 100.
#' @param conserve logical. whether to use a conservative variance estimator for
#' the test statistic. It is recommended if the sample size is small and/or
#' the number of differentially abundant taxa is believed to be large.
#' Default is FALSE.
#' @param alpha numeric. level of significance. Default is 0.05.
#' @param global logical. whether to perform the global test. Default is FALSE.
#' @param n_cl numeric. The number of nodes to be forked. For details, see
#' \code{?parallel::makeCluster}. Default is 1 (no parallel computing).
#' @param verbose logical. Whether to generate verbose output during the
#' ANCOM-BC fitting process. Default is FALSE.
#'
#' @return a \code{list} with components:
#' \itemize{
#' \item{ \code{feature_table}, a \code{data.frame} of pre-processed
#' (based on \code{prv_cut} and \code{lib_cut}) microbial count table.}
#' \item{ \code{zero_ind}, a logical \code{data.frame} with TRUE
#' indicating the taxon is detected to contain structural zeros in
#' some specific groups.}
#' \item{ \code{samp_frac}, a numeric vector of estimated sampling
#' fractions in log scale (natural log).}
#' \item{ \code{delta_em}, estimated sample-specific biases
#' through E-M algorithm.}
#' \item{ \code{delta_wls}, estimated sample-specific biases through
#' weighted least squares (WLS) algorithm.}
#' \item{ \code{res}, a \code{list} containing ANCOM-BC primary result,
#' which consists of:}
#' \itemize{
#' \item{ \code{lfc}, a \code{data.frame} of log fold changes
#' obtained from the ANCOM-BC log-linear (natural log) model.}
#' \item{ \code{se}, a \code{data.frame} of standard errors (SEs) of
#' \code{lfc}.}
#' \item{ \code{W}, a \code{data.frame} of test statistics.
#' \code{W = lfc/se}.}
#' \item{ \code{p_val}, a \code{data.frame} of p-values. P-values are
#' obtained from two-sided Z-test using the test statistic \code{W}.}
#' \item{ \code{q_val}, a \code{data.frame} of adjusted p-values.
#' Adjusted p-values are obtained by applying \code{p_adj_method}
#' to \code{p_val}.}
#' \item{ \code{diff_abn}, a logical \code{data.frame}. TRUE if the
#' taxon has \code{q_val} less than \code{alpha}.}
#' }
#' \item{ \code{res_global}, a \code{data.frame} containing ANCOM-BC
#' global test result for the variable specified in \code{group},
#' each column is:}
#' \itemize{
#' \item{ \code{W}, test statistics.}
#' \item{ \code{p_val}, p-values, which are obtained from two-sided
#' Chi-square test using \code{W}.}
#' \item{ \code{q_val}, adjusted p-values. Adjusted p-values are
#' obtained by applying \code{p_adj_method} to \code{p_val}.}
#' \item{ \code{diff_abn}, A logical vector. TRUE if the taxon has
#' \code{q_val} less than \code{alpha}.}
#' }
#' }
#'
#' @seealso \code{\link{ancom}} \code{\link{ancombc2}}
#'
#' @examples
#' library(ANCOMBC)
#' data(atlas1006, package = "microbiome")
#' tse = mia::makeTreeSummarizedExperimentFromPhyloseq(atlas1006)
#'
#' # subset to baseline
#' tse = tse[, tse$time == 0]
#'
#' # run ancombc function
#' set.seed(123)
#' out = ancombc(data = tse, assay_name = "counts",
#' tax_level = "Family", phyloseq = NULL,
#' formula = "age + nationality + bmi_group",
#' p_adj_method = "holm", prv_cut = 0.10, lib_cut = 1000,
#' group = "bmi_group", struc_zero = TRUE, neg_lb = FALSE,
#' tol = 1e-5, max_iter = 100, conserve = TRUE,
#' alpha = 0.05, global = TRUE, n_cl = 1, verbose = TRUE)
#'
#' res_prim = out$res
#' res_global = out$res_global
#'
#' # to run ancombc using the phyloseq object
#' tse_alt = agglomerateByRank(tse, "Family")
#' pseq = makePhyloseqFromTreeSummarizedExperiment(tse_alt)
#' set.seed(123)
#' out = ancombc(data = NULL, assay_name = NULL,
#' tax_level = "Family", phyloseq = pseq,
#' formula = "age + nationality + bmi_group",
#' p_adj_method = "holm", prv_cut = 0.10, lib_cut = 1000,
#' group = "bmi_group", struc_zero = TRUE, neg_lb = FALSE,
#' tol = 1e-5, max_iter = 100, conserve = TRUE,
#' alpha = 0.05, global = TRUE, n_cl = 1, verbose = TRUE)
#'
#' @author Huang Lin
#'
#' @references
#' \insertRef{kaul2017analysis}{ANCOMBC}
#'
#' \insertRef{lin2020analysis}{ANCOMBC}
#'
#' @rawNamespace import(stats, except = filter)
#' @importFrom microbiome abundances meta aggregate_taxa
#' @importFrom SummarizedExperiment assay colData rowData
#' @importFrom mia taxonomyRanks agglomerateByRank
#' @importFrom S4Vectors DataFrame SimpleList
#' @importFrom parallel makeCluster stopCluster
#' @importFrom foreach foreach %dopar% registerDoSEQ
#' @importFrom doParallel registerDoParallel
#' @importFrom doRNG %dorng%
#' @importFrom MASS ginv
#' @importFrom nloptr neldermead
#' @importFrom Rdpack reprompt
#'
#' @export
ancombc = function(data = NULL, taxa_are_rows = TRUE,
assay.type = NULL, assay_name = "counts",
rank = NULL, tax_level = NULL,
aggregate_data = NULL, meta_data = NULL,
formula, p_adj_method = "holm", prv_cut = 0.10,
lib_cut = 0, group = NULL, struc_zero = FALSE,
neg_lb = FALSE, tol = 1e-05, max_iter = 100,
conserve = FALSE, alpha = 0.05, global = FALSE,
n_cl = 1, verbose = TRUE){
message("'ancombc' has been fully evolved to 'ancombc2'. \n",
"Explore the enhanced capabilities of our refined method!")
if (n_cl > 1) {
cl = parallel::makeCluster(n_cl)
doParallel::registerDoParallel(cl)
} else {
foreach::registerDoSEQ()
}
# 1. Data pre-processing
# Data sanity check
check_results = data_sanity_check(data = data,
taxa_are_rows = taxa_are_rows,
assay.type = assay_name,
assay_name = assay_name,
rank = tax_level,
tax_level = tax_level,
aggregate_data = aggregate_data,
meta_data = meta_data,
fix_formula = formula,
group = group,
struc_zero = struc_zero,
global = global,
verbose = verbose)
feature_table = check_results$feature_table
feature_table_aggregate = check_results$feature_table_aggregate
meta_data = check_results$meta_data
global = check_results$global
# Filter data by prevalence and library size
core = .data_core(data = feature_table_aggregate,
meta_data = meta_data,
prv_cut = prv_cut, lib_cut = lib_cut,
tax_keep = NULL, samp_keep = NULL)
feature_table = core$feature_table
meta_data = core$meta_data
tax_keep = core$tax_keep
n_tax = nrow(feature_table)
tax_name = rownames(feature_table)
if (n_tax < 10) {
warn_txt = sprintf(paste("ANCOM-BC results would be unreliable when the number of taxa is too small (e.g. < 10)",
"The number of taxa in the current dataset is: ",
n_tax, sep = "\n"))
warning(warn_txt, call. = FALSE)
}
# Add pseudocount (1) and take logarithm.
y = log(feature_table + 1)
options(na.action = "na.pass") # Keep NA's in rows of x
x = model.matrix(formula(paste0("~", formula)), data = meta_data)
options(na.action = "na.omit") # Switch it back
covariates = colnames(x)
n_covariates = length(covariates)
# 2. Identify taxa with structural zeros
if (struc_zero) {
if (is.null(group)) {
stop_txt = paste("Please specify the group variable for",
"detecting structural zeros.",
"Otherwise, set struc_zero = FALSE to proceed")
stop(stop_txt, call. = FALSE)
}
zero_ind = .get_struc_zero(data = feature_table_aggregate,
meta_data = meta_data,
group = group, neg_lb = neg_lb)
zero_ind = zero_ind[tax_keep, ]
rownames(zero_ind) = NULL
}else{ zero_ind = NULL }
# 3. Estimation of parameters
if (verbose) {
message("Obtaining initial estimates ...")
}
para = .iter_mle(x = x, y = y, meta_data = meta_data,
formula = formula, theta = NULL, tol = tol,
max_iter = max_iter, verbose = FALSE)
beta = para$beta
vcov_hat = para$vcov_hat
var_hat = para$var_hat
# 4. Estimation of the sample-specific bias
if (verbose) {
message("Estimating sample-specific biases ...")
}
fun_list = list(.bias_em)
bias = foreach(i = seq_len(ncol(beta)), .combine = rbind) %dorng% {
output = fun_list[[1]](beta = beta[, i],
var_hat = var_hat[, i],
tol = tol,
max_iter = max_iter)
}
bias = data.frame(bias, row.names = covariates, check.names = FALSE)
delta_em = bias$delta_em
delta_wls = bias$delta_wls
var_delta = bias$var_delta
# 5. Obtain coefficients, standard errors, and sampling fractions
beta_hat = beta
beta_hat = t(t(beta_hat) - delta_em)
# Account for the variance of delta_hat
if (conserve) {
var_hat = sweep(var_hat, 2, var_delta, "+") +
2 * sqrt(sweep(var_hat, 2, var_delta, "*"))
vcov_hat = lapply(seq_len(n_tax), function(i) {
diag(vcov_hat[[i]]) = var_hat[i, ]
return(vcov_hat[[i]])
})
se_hat = sqrt(var_hat)
}else{ se_hat = sqrt(var_hat) }
theta_hat = matrix(NA, nrow = n_tax, ncol = ncol(y))
for (i in seq_len(n_tax)) {
theta_hat[i, ] = y[i, ] - x %*% beta_hat[i, ]
}
theta_hat = colMeans(theta_hat, na.rm = TRUE)
# 6. Primary results
if (verbose) {
message("ANCOM-BC primary results ...")
}
W = beta_hat/se_hat
p = 2 * pnorm(abs(W), mean = 0, sd = 1, lower.tail = FALSE)
q = apply(p, 2, function(x) p.adjust(x, method = p_adj_method))
diff_abn = q <= alpha & !is.na(q)
beta_prim = cbind(taxon = data.frame(taxon = tax_name),
data.frame(beta_hat, check.names = FALSE,
row.names = NULL))
se_prim = cbind(taxon = data.frame(taxon = tax_name),
data.frame(se_hat, check.names = FALSE,
row.names = NULL))
W_prim = cbind(taxon = data.frame(taxon = tax_name),
data.frame(W, check.names = FALSE,
row.names = NULL))
p_prim = cbind(taxon = data.frame(taxon = tax_name),
data.frame(p, check.names = FALSE,
row.names = NULL))
q_prim = cbind(taxon = data.frame(taxon = tax_name),
data.frame(q, check.names = FALSE,
row.names = NULL))
diff_prim = cbind(taxon = data.frame(taxon = tax_name),
data.frame(diff_abn, check.names = FALSE,
row.names = NULL))
res = list(lfc = beta_prim,
se = se_prim,
W = W_prim,
p_val = p_prim,
q_val = q_prim,
diff_abn = diff_prim)
# 7. Global test results
if (global) {
if (verbose) {
message("ANCOM-BC global test ...")
}
res_global = .ancombc_global_F(x = x, group = group,
beta_hat = beta_hat,
vcov_hat = vcov_hat,
p_adj_method = p_adj_method,
alpha = alpha)
} else { res_global = NULL }
# 8. Combine the information of structural zeros
# Set p/q-values and SEs of structural zeros to be 0s.
if (struc_zero) {
if (verbose) {
txt = paste0("Merge the information of structural zeros ... \n",
"Note that taxa with structural zeros will have ",
"0 p/q-values and SEs")
message(txt)
}
zero_idx = as.matrix(zero_ind[, -1])
group_ind = grepl(group, c("taxon", covariates))
zero_mask = 1 - abs((zero_idx - zero_idx[, 1]))
zero_mask = zero_mask[, -1, drop = FALSE]
res$se[, group_ind] = res$se[, group_ind] * zero_mask
res$p_val[, group_ind] = res$p_val[, group_ind] * zero_mask
res$q_val[, group_ind] = res$q_val[, group_ind] * zero_mask
res$diff_abn[, group_ind] = data.frame(res$q_val[, group_ind] <= alpha &
!is.na(res$q_val[, group_ind]),
check.names = FALSE)
# Global test
if (global) {
zero_mask = 1 - apply(zero_idx, 1, function(x)
sum(x) > 0 & sum(x) < ncol(zero_idx))
res_global[, "p_val"] = res_global[, "p_val"] * zero_mask
res_global[, "q_val"] = res_global[, "q_val"] * zero_mask
res_global[, "diff_abn"] = res_global[, "q_val"] <= alpha &
!is.na(res_global[, "q_val"])
}
}
# 9. Outputs
out = list(feature_table = feature_table, zero_ind = zero_ind,
samp_frac = theta_hat, delta_em = delta_em,
delta_wls = delta_wls, res = res, res_global = res_global)
if (n_cl > 1) {
parallel::stopCluster(cl)
}
return(out)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.