secom_dist: Sparse estimation of distance correlations among microbiomes

View source: R/secom_dist.R

secom_distR Documentation

Sparse estimation of distance correlations among microbiomes

Description

Obtain the sparse correlation matrix for distance correlations between taxa.

Usage

secom_dist(
  data,
  taxa_are_rows = TRUE,
  assay.type = assay_name,
  assay_name = "counts",
  rank = tax_level,
  tax_level = NULL,
  aggregate_data = NULL,
  meta_data = NULL,
  pseudo = 0,
  prv_cut = 0.5,
  lib_cut = 1000,
  corr_cut = 0.5,
  wins_quant = c(0.05, 0.95),
  R = 1000,
  thresh_hard = 0,
  max_p = 0.005,
  n_cl = 1,
  verbose = TRUE
)

Arguments

data

a list of the input data. The data parameter should be either a matrix, data.frame, phyloseq or a TreeSummarizedExperiment object. Both phyloseq and TreeSummarizedExperiment objects consist of a feature table (microbial count table), a sample metadata table, a taxonomy table (optional), and a phylogenetic tree (optional). If a matrix or data.frame is provided, ensure that the row names of the metadata match the sample names (column names if taxa_are_rows is TRUE, and row names otherwise) in data. if a phyloseq or a TreeSummarizedExperiment is used, this standard has already been enforced. For detailed information, refer to ?phyloseq::phyloseq or ?TreeSummarizedExperiment::TreeSummarizedExperiment. It is recommended to use low taxonomic levels, such as OTU or species level, as the estimation of sampling fractions requires a large number of taxa. If working with multiple ecosystems, such as gut and tongue, stack the data by specifying the list of input data as data = list(gut = pseq1, tongue = pseq2).

taxa_are_rows

logical. Whether taxa are positioned in the rows of the feature table. Default is TRUE.

assay.type

alias for assay_name.

assay_name

character. Name of the count table in the data object (only applicable if data object is a (Tree)SummarizedExperiment). Default is "counts". See ?SummarizedExperiment::assay for more details.

rank

alias for tax_level.

tax_level

character. The taxonomic level of interest. The input data can be agglomerated at different taxonomic levels based on your research interest. Default is NULL, i.e., do not perform agglomeration, and the SECOM anlysis will be performed at the lowest taxonomic level of the input data.

aggregate_data

The abundance data that has been aggregated to the desired taxonomic level. This parameter is required only when the input data is in matrix or data.frame format. For phyloseq or TreeSummarizedExperiment data, aggregation is performed by specifying the tax_level parameter.

meta_data

a data.frame containing sample metadata. This parameter is mandatory when the input data is a generic matrix or data.frame. Ensure that the row names of the metadata match the sample names (column names if taxa_are_rows is TRUE, and row names otherwise) in data.

pseudo

numeric. Add pseudo-counts to the data. Default is 0 (no pseudo-counts).

prv_cut

a numerical fraction between 0 and 1. Taxa with prevalences (the proportion of samples in which the taxon is present) less than prv_cut will be excluded in the analysis. For example, if there are 100 samples, and a taxon has nonzero counts present in less than 100*prv_cut samples, it will not be considered in the analysis. Default is 0.50.

lib_cut

a numerical threshold for filtering samples based on library sizes. Samples with library sizes less than lib_cut will be excluded in the analysis. Default is 1000.

corr_cut

numeric. To avoid false positives caused by taxa with small variances, taxa with Pearson correlation coefficients greater than corr_cut with the estimated sample-specific bias will be flagged. When taxa are flagged, the pairwise correlation coefficient between them will be set to 0s. Default is 0.5.

wins_quant

a numeric vector of probabilities with values between 0 and 1. Replace extreme values in the abundance data with less extreme values. Default is c(0.05, 0.95). For details, see ?DescTools::Winsorize.

R

numeric. The number of replicates in calculating the p-value for distance correlation. For details, see ?energy::dcor.test. Default is 1000.

thresh_hard

Numeric. Pairwise correlation coefficients (in their absolute value) that are less than or equal to thresh_hard will be set to 0. Default is 0.3.

max_p

numeric. Obtain the sparse correlation matrix by p-value filtering. Pairwise correlation coefficients with p-value greater than max_p will be set to 0s. Default is 0.005.

n_cl

numeric. The number of nodes to be forked. For details, see ?parallel::makeCluster. Default is 1 (no parallel computing).

verbose

logical. Whether to display detailed progress messages.

Details

The distance correlation, which is a measure of dependence between two random variables, can be used to quantify any dependence, whether linear, monotonic, non-monotonic or nonlinear relationships.

Value

a list with components:

  • s_diff_hat, a numeric vector of estimated sample-specific biases.

  • y_hat, a matrix of bias-corrected abundances

  • mat_cooccur, a matrix of taxon-taxon co-occurrence pattern. The number in each cell represents the number of complete (nonzero) samples for the corresponding pair of taxa.

  • dcorr, the sample distance correlation matrix computed using the bias-corrected abundances y_hat.

  • dcorr_p, the p-value matrix corresponding to the sample distance correlation matrix dcorr.

  • dcorr_fl, the sparse correlation matrix obtained by p-value filtering based on the cutoff specified in max_p.

Author(s)

Huang Lin

See Also

secom_linear

Examples

library(ANCOMBC)
if (requireNamespace("microbiome", quietly = TRUE)) {
    data(atlas1006, package = "microbiome")
    # subset to baseline
    pseq = phyloseq::subset_samples(atlas1006, time == 0)

    # run secom_linear function
    set.seed(123)
    res_dist = secom_dist(data = list(pseq), taxa_are_rows = TRUE,
                          tax_level = "Phylum",
                          aggregate_data = NULL, meta_data = NULL, pseudo = 0,
                          prv_cut = 0.5, lib_cut = 1000, corr_cut = 0.5,
                          wins_quant = c(0.05, 0.95), R = 1000,
                          thresh_hard = 0.3, max_p = 0.005, n_cl = 2)

    dcorr_fl = res_dist$dcorr_fl
} else {
    message("The 'microbiome' package is not installed. Please install it to use this example.")
}


FrederickHuangLin/ANCOMBC documentation built on Jan. 8, 2025, 6:20 a.m.