R/secom_linear.R

#' @title Sparse estimation of linear correlations among microbiomes
#'
#' @description Obtain the sparse correlation matrix for linear correlations
#' between taxa. The current version of the \code{secom_linear} function
#' supports two correlation coefficients: Pearson and Spearman.
#'
#' @param data a \code{list} of the input data.
#' The \code{data} parameter should be either a
#' \code{matrix}, \code{data.frame}, \code{phyloseq} or a \code{TreeSummarizedExperiment}
#' object. Both \code{phyloseq} and \code{TreeSummarizedExperiment} objects
#' consist of a feature table (microbial count table), a sample metadata table,
#' a taxonomy table (optional), and a phylogenetic tree (optional).
#' If a \code{matrix} or \code{data.frame} is provided, ensure that the row
#' names of the \code{metadata} match the sample names (column names if
#' \code{taxa_are_rows} is TRUE, and row names otherwise) in \code{data}.
#' If a \code{phyloseq} or a \code{TreeSummarizedExperiment} is used, this
#' standard has already been enforced. For detailed information, refer to
#' \code{?phyloseq::phyloseq} or
#' \code{?TreeSummarizedExperiment::TreeSummarizedExperiment}.
#' It is recommended to use low taxonomic levels, such as OTU or species level,
#' as the estimation of sampling fractions requires a large number of taxa.
#' If working with multiple ecosystems, such as gut and tongue, stack the data
#' by specifying the list of input data as
#' \code{data = list(gut = pseq1, tongue = pseq2)}.
#' @param taxa_are_rows logical. Whether taxa are positioned in the rows of the
#' feature table. Default is TRUE.
#' @param assay_name character. Name of the feature table within the data object
#' (only applicable if the data object is a \code{(Tree)SummarizedExperiment}).
#' Default is "counts".
#' See \code{?SummarizedExperiment::assay} for more details.
#' @param assay.type alias for \code{assay_name}.
#' @param tax_level character. The taxonomic level of interest. The input data
#' can be agglomerated at different taxonomic levels based on your research
#' interest. Default is NULL, i.e., do not perform agglomeration, and the
#' SECOM analysis will be performed at the lowest taxonomic level of the
#' input \code{data}.
#' @param rank alias for \code{tax_level}.
#' @param aggregate_data The abundance data that has been aggregated to the desired
#' taxonomic level. This parameter is required only when the input data is in
#' \code{matrix} or \code{data.frame} format. For \code{phyloseq} or \code{TreeSummarizedExperiment}
#' data, aggregation is performed by specifying the \code{tax_level} parameter.
#' @param meta_data a \code{data.frame} containing sample metadata.
#' This parameter is mandatory when the input \code{data} is a generic
#' \code{matrix} or \code{data.frame}. Ensure that the row names of the \code{metadata} match the
#' sample names (column names if \code{taxa_are_rows} is TRUE, and row names
#' otherwise) in \code{data}.
#' @param pseudo numeric. Add pseudo-counts to the data.
#' Default is 0 (no pseudo-counts).
#' @param prv_cut a numerical fraction between 0 and 1. Taxa with prevalences
#' (the proportion of samples in which the taxon is present)
#' less than \code{prv_cut} will be excluded from the analysis. For example,
#' if there are 100 samples, and a taxon has nonzero counts in fewer than
#' 100*prv_cut samples, it will not be considered in the analysis.
#' Default is 0.50.
#' @param lib_cut a numerical threshold for filtering samples based on library
#' sizes. Samples with library sizes less than \code{lib_cut} will be
#' excluded from the analysis. Default is 1000.
#' @param corr_cut numeric. To avoid false positives caused by taxa with small
#' variances, taxa whose Pearson correlation coefficients with the estimated
#' sample-specific bias are greater than \code{corr_cut} will be flagged.
#' The pairwise correlation coefficients between flagged taxa will be set to 0.
#' Default is 0.5.
#' @param wins_quant a numeric vector of probabilities with values between
#' 0 and 1. Replace extreme values in the abundance data with less
#' extreme values. Default is \code{c(0.05, 0.95)}. For details,
#' see \code{?DescTools::Winsorize}.
#' @param method character. Which correlation coefficient to compute;
#' either "pearson" or "spearman".
#' @param soft logical. \code{TRUE} indicates that soft thresholding is applied
#' to achieve the sparsity of the correlation matrix. \code{FALSE} indicates
#' that hard thresholding is applied to achieve the sparsity of the correlation
#' matrix. Default is \code{FALSE}.
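#' In the usual definitions, hard thresholding sets entries whose absolute value
#' is at most the threshold \eqn{\lambda} to 0 and leaves the others unchanged,
#' whereas soft thresholding additionally shrinks the surviving entries toward 0:
#' \eqn{sign(\rho)(|\rho| - \lambda)_{+}}.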
#' @param alpha_grid a numeric vector of penalty parameters for the element-wise
#' L1 norm to induce sparsity. Default is 0.
#' @param thresh_len numeric. Grid-search is implemented to find the optimal
#' values over \code{thresh_len} thresholds for the thresholding operator.
#' Default is 100.
#' @param n_cv numeric. The fold number in cross validation.
#' Default is 10 (10-fold cross validation).
#' @param thresh_hard numeric. Pairwise correlation coefficients
#' (in absolute value) that are less than or equal to \code{thresh_hard}
#' will be set to 0. Default is 0 (no ad hoc hard thresholding).
#' @param max_p numeric. Obtain the sparse correlation matrix by
#' p-value filtering. Pairwise correlation coefficients with p-value greater
#' than \code{max_p} will be set to 0s. Default is 0.005.
#' @param n_cl numeric. The number of nodes to be forked. For details, see
#' \code{?parallel::makeCluster}. Default is 1 (no parallel computing).
#' @param verbose logical. Whether to display detailed progress messages.
#' Default is TRUE.
#'
#' @return a \code{list} with components:
#'         \itemize{
#'         \item{ \code{s_diff_hat}, a numeric vector of estimated
#'         sample-specific biases.}
#'         \item{ \code{y_hat}, a matrix of bias-corrected abundances.}
#'         \item{ \code{cv_error}, a numeric vector of cross-validation error
#'         estimates, which are the Frobenius norm differences between
#'         correlation matrices using training set and validation set,
#'         respectively.}
#'         \item{ \code{thresh_grid}, a numeric vector of thresholds
#'         in the cross-validation.}
#'         \item{ \code{thresh_opt}, numeric. The optimal threshold through
#'         cross-validation.}
#'         \item{ \code{mat_cooccur}, a matrix of taxon-taxon co-occurrence
#'         pattern. The number in each cell represents the number of complete
#'         (nonzero) samples for the corresponding pair of taxa.}
#'         \item{ \code{corr}, the sample correlation matrix (using the measure
#'         specified in \code{method}) computed using the bias-corrected
#'         abundances \code{y_hat}.}
#'         \item{ \code{corr_p}, the p-value matrix corresponding to the sample
#'         correlation matrix \code{corr}.}
#'         \item{ \code{corr_th}, the sparse correlation matrix obtained by
#'         thresholding based on the method specified in \code{soft}.}
#'         \item{ \code{corr_fl}, the sparse correlation matrix obtained by
#'         p-value filtering based on the cutoff specified in \code{max_p}.}
#'         \item{ \code{corr_reg}, the correlation matrix obtained by
#'         winsorizing small eigenvalues.}
#'         }
#'
#' @seealso \code{\link{secom_dist}}
#'
#' @examples
#' library(ANCOMBC)
#' if (requireNamespace("microbiome", quietly = TRUE)) {
#'     data(atlas1006, package = "microbiome")
#'     # subset to baseline
#'     pseq = phyloseq::subset_samples(atlas1006, time == 0)
#'
#'     # run secom_linear function
#'     set.seed(123)
#'     res_linear = secom_linear(data = list(pseq), taxa_are_rows = TRUE,
#'                               tax_level = "Phylum",
#'                               aggregate_data = NULL, meta_data = NULL, pseudo = 0,
#'                               prv_cut = 0.5, lib_cut = 1000, corr_cut = 0.5,
#'                               wins_quant = c(0.05, 0.95), method = "pearson",
#'                               soft = FALSE, alpha_grid = 0,
#'                               thresh_len = 20, n_cv = 10,
#'                               thresh_hard = 0.3, max_p = 0.005, n_cl = 2)
#'
#'     corr_th = res_linear$corr_th
#'     corr_fl = res_linear$corr_fl
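#'
#'     # For multiple ecosystems, the inputs would be stacked as a named list.
#'     # A commented sketch (not run) with hypothetical phyloseq objects
#'     # pseq_gut and pseq_tongue that share sample names:
#'     # res_multi = secom_linear(data = list(gut = pseq_gut, tongue = pseq_tongue),
#'     #                          tax_level = c("Phylum", "Phylum"),
#'     #                          method = "pearson", n_cl = 2)
#'     # Taxa in the output are prefixed by the list names, e.g., "gut - Bacteroidetes".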
#' } else {
#'     message("The 'microbiome' package is not installed. Please install it to use this example.")
#' }
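#'
#' # For a generic count matrix, the sample metadata (and, optionally, a
#' # pre-aggregated table) are supplied directly. A commented sketch (not run)
#' # with hypothetical objects count_mat (taxa x samples) and meta_df
#' # (row names = sample names); the list() wrappers mirror the per-dataset
#' # indexing used inside the function:
#' # res_mat = secom_linear(data = list(count_mat), taxa_are_rows = TRUE,
#' #                        aggregate_data = list(count_mat),
#' #                        meta_data = list(meta_df), method = "pearson")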
#'
#' @author Huang Lin
#'
#' @importFrom energy dcor dcor.test
#' @importFrom parallel makeCluster stopCluster
#' @importFrom foreach foreach %dopar%
#' @importFrom doParallel registerDoParallel
#' @importFrom doRNG %dorng%
#' @importFrom gtools smartbind
#' @importFrom Hmisc rcorr
#' @importFrom DescTools Winsorize
#' @importFrom Rdpack reprompt
#'
#' @export
secom_linear = function(data, taxa_are_rows = TRUE,
                        assay.type = assay_name, assay_name = "counts",
                        rank = tax_level, tax_level = NULL,
                        aggregate_data = NULL, meta_data = NULL,
                        pseudo = 0, prv_cut = 0.5, lib_cut = 1000,
                        corr_cut = 0.5, wins_quant = c(0.05, 0.95),
                        method = c("pearson", "spearman"),
                        soft = FALSE, alpha_grid = 0,
                        thresh_len = 100, n_cv = 10,
                        thresh_hard = 0, max_p = 0.005, n_cl = 1,
                        verbose = TRUE) {

    # ===========Sampling fraction and absolute abundance estimation============
    if (length(data) == 1) {
        # Data sanity check
        check_results = data_sanity_check(data = data[[1]],
                                          taxa_are_rows = taxa_are_rows,
                                          assay.type = assay.type,
                                          assay_name = assay_name,
                                          rank = rank,
                                          tax_level = tax_level,
                                          aggregate_data = aggregate_data[[1]],
                                          meta_data = meta_data[[1]],
                                          fix_formula = NULL,
                                          verbose = verbose)
        feature_table = check_results$feature_table
        feature_table_aggregate = check_results$feature_table_aggregate
        meta_data = check_results$meta_data

        abn_list = .abn_est(data = feature_table,
                            aggregate_data = feature_table_aggregate,
                            meta_data = meta_data, pseudo = pseudo,
                            prv_cut = prv_cut, lib_cut = lib_cut)
        s_diff_hat = abn_list$s_diff_hat
        y_hat = abn_list$y_hat
    } else {
        if (is.null(names(data))) names(data) = paste0("data", seq_along(data))

        # Data sanity check
        check_results_list = lapply(seq_along(data), function(i) {
            check_results = data_sanity_check(data = data[[i]],
                                              taxa_are_rows = taxa_are_rows,
                                              assay.type = assay.type[i],
                                              assay_name = assay_name[i],
                                              rank = rank[i],
                                              tax_level = tax_level[i],
                                              aggregate_data = aggregate_data[[i]],
                                              meta_data = meta_data[[i]],
                                              fix_formula = NULL,
                                              verbose = verbose)
            return(check_results)
        })
        feature_table_list = lapply(seq_along(data), function(i) {
            check_results_list[[i]]$feature_table
        })
        feature_table_aggregate_list = lapply(seq_along(data), function(i) {
            check_results_list[[i]]$feature_table_aggregate
        })
        meta_data_list = lapply(seq_along(data), function(i) {
            check_results_list[[i]]$meta_data
        })

        # Check common samples
        samp_names = lapply(feature_table_list, function(x) colnames(x))
        samp_common = Reduce(intersect, samp_names)
        samp_txt = sprintf("Number of common samples across datasets: %d",
                           length(samp_common))
        message(samp_txt)
        if (length(samp_common) < 10) {
            stop_txt = paste0("Insufficient common samples: ",
                              "Multi-dataset computation not recommended")
            stop(stop_txt)
        }

        # Rename taxa
        for (i in seq_along(data)) {
            rownames(feature_table_list[[i]]) =
                paste(names(data)[[i]],
                      rownames(feature_table_list[[i]]),
                      sep = " - ")
        }

        for (i in seq_along(data)) {
            rownames(feature_table_aggregate_list[[i]]) =
                paste(names(data)[[i]],
                      rownames(feature_table_aggregate_list[[i]]),
                      sep = " - ")
        }

        abn_list = lapply(seq_along(data), function(i) {
            .abn_est(data = feature_table_list[[i]],
                     aggregate_data = feature_table_aggregate_list[[i]],
                     meta_data = meta_data_list[[i]],
                     pseudo = pseudo,
                     prv_cut = prv_cut,
                     lib_cut = lib_cut)
        })
        s_diff_hat = lapply(abn_list, function(x) x$s_diff_hat)
        y_hat = do.call(gtools::smartbind, lapply(abn_list, function(x) as.data.frame(x$y_hat)))
        y_hat_rownames = do.call(c, lapply(abn_list, function(x) rownames(x$y_hat)))
        y_hat = as.matrix(y_hat)
        rownames(y_hat) = y_hat_rownames
    }

    # =================Sparse estimation on linear correlations=================
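    # Register a foreach backend: a parallel cluster when n_cl > 1, otherwise a
    # sequential backend so that downstream %dopar% calls run without a
    # "no parallel backend registered" warning.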
    if (n_cl > 1) {
      cl = parallel::makeCluster(n_cl)
      doParallel::registerDoParallel(cl)
    } else {
      foreach::registerDoSEQ()
    }

    # If the full default vector is supplied, fall back to the first element
    method = method[1]

    if (method %in% c("pearson", "spearman")) {
        res_corr = .sparse_linear(mat = t(y_hat), wins_quant, method,
                                  soft, alpha_grid, thresh_len, n_cv,
                                  thresh_hard, max_p)
    } else {
        stop_txt = paste0("The specified correlation coefficient type is not valid \n",
                          "Please choose either 'pearson' or 'spearman' as the type of correlation coefficient")
        stop(stop_txt, call. = FALSE)
    }

    if (n_cl > 1) {
      parallel::stopCluster(cl)
    }

    # To prevent false positives from taxa with extremely small variances
    if (length(data) == 1) {
        corr_s = cor(cbind(s_diff_hat, t(y_hat)),
                     use = "pairwise.complete.obs")[1, -1]
        fp_ind1 = replicate(nrow(y_hat), corr_s > corr_cut)
        fp_ind2 = t(replicate(nrow(y_hat), corr_s > corr_cut))
        fp_ind = (fp_ind1 * fp_ind2 == 1)
        diag(fp_ind) = FALSE
        res_corr$corr[fp_ind] = 0
        res_corr$corr_th[fp_ind] = 0
        res_corr$corr_fl[fp_ind] = 0
        res_corr$corr_reg[fp_ind] = 0
        res_corr$corr_p[fp_ind] = 1
    } else {
        for (i in seq_along(data)) {
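            # Match the dataset-specific bias estimates to the combined
            # bias-corrected abundance table by sample ID before correlating
            # the bias with each taxon.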
            df_s = data.frame(sample_id = names(s_diff_hat[[i]]),
                              s = s_diff_hat[[i]])
            rownames(df_s) = NULL
            df_y = data.frame(sample_id = rownames(t(y_hat)), t(y_hat),
                              check.names = FALSE)
            rownames(df_y) = NULL

            df_merge = df_y
            df_merge$s = df_s$s[match(df_y$sample_id, df_s$sample_id)]
            df_merge$sample_id = NULL
            df_merge = df_merge[c('s', setdiff(names(df_merge), 's'))]

            corr_s = cor(df_merge, use = "pairwise.complete.obs")[1, -1]
            fp_ind1 = replicate(nrow(y_hat), corr_s > corr_cut)
            fp_ind2 = t(replicate(nrow(y_hat), corr_s > corr_cut))
            fp_ind = (fp_ind1 * fp_ind2 == 1)
            diag(fp_ind) = FALSE
            res_corr$corr[fp_ind] = 0
            res_corr$corr_th[fp_ind] = 0
            res_corr$corr_fl[fp_ind] = 0
            res_corr$corr_reg[fp_ind] = 0
            res_corr$corr_p[fp_ind] = 1
        }
    }

    # ==================================Outputs=================================
    res = c(list(s_diff_hat = s_diff_hat, y_hat = y_hat), res_corr)
    return(res)
}