secom_dist: Sparse estimation of distance correlations among microbiomes

View source: R/secom_dist.R

secom_distR Documentation

Sparse estimation of distance correlations among microbiomes

Description

Obtain the sparse correlation matrix for distance correlations between taxa.

Usage

secom_dist(
  data,
  assay_name = "counts",
  tax_level = NULL,
  pseudo = 0,
  prv_cut = 0.5,
  lib_cut = 1000,
  corr_cut = 0.5,
  wins_quant = c(0.05, 0.95),
  R = 1000,
  thresh_hard = 0,
  max_p = 0.005,
  n_cl = 1
)

Arguments

data

a list of the input data. Each element of the list can be a phyloseq, SummarizedExperiment, or TreeSummarizedExperiment object, which consists of a feature table (microbial count table), a sample metadata, a taxonomy table (optional), and a phylogenetic tree (optional). The row names of the metadata must match the sample names of the feature table, and the row names of the taxonomy table must match the taxon (feature) names of the feature table. See ?phyloseq::phyloseq, ?SummarizedExperiment::SummarizedExperiment, or ?TreeSummarizedExperiment::TreeSummarizedExperiment for more details. It is highly recommended that the input data are in low taxonomic levels, such as OTU or species level, as the estimation of sampling fractions requires a large number of taxa. For multiple ecosystems, simply stack the data. For example, for two ecosystems, such as gut and tongue, specify the list of input data as data = list(gut = tse1, tongue = tse2).

assay_name

character. Name of the count table in the data object (only applicable if data object is a (Tree)SummarizedExperiment). Default is "counts". See ?SummarizedExperiment::assay for more details.

tax_level

character. The taxonomic level of interest. The input data can be agglomerated at different taxonomic levels based on your research interest. Default is NULL, i.e., do not perform agglomeration, and the SECOM anlysis will be performed at the lowest taxonomic level of the input data.

pseudo

numeric. Add pseudo-counts to the data. Default is 0 (no pseudo-counts).

prv_cut

a numerical fraction between 0 and 1. Taxa with prevalences less than prv_cut will be excluded in the analysis. For instance, suppose there are 100 samples, if a taxon has nonzero counts presented in less than 10 samples, it will not be further analyzed. Default is 0.50.

lib_cut

a numerical threshold for filtering samples based on library sizes. Samples with library sizes less than lib_cut will be excluded in the analysis. Default is 1000.

corr_cut

numeric. To prevent false positives due to taxa with small variances, taxa with Pearson correlation coefficients greater than corr_cut with the estimated sample-specific bias will be flagged. The pairwise correlation coefficient between flagged taxa will be set to 0s. Default is 0.5.

wins_quant

a numeric vector of probabilities with values between 0 and 1. Replace extreme values in the abundance data with less extreme values. Default is c(0.05, 0.95). For details, see ?DescTools::Winsorize.

R

numeric. The number of replicates in calculating the p-value for distance correlation. For details, see ?energy::dcor.test. Default is 1000.

thresh_hard

Numeric. Set a hard threshold for the correlation matrix. Pairwise distance correlation less than or equal to thresh_hard will be set to 0. Default is 0 (No ad-hoc hard thresholding).

max_p

numeric. Obtain the sparse correlation matrix by p-value filtering. Pairwise correlation coefficient with p-value greater than max_p will be set to 0. Default is 0.005.

n_cl

numeric. The number of nodes to be forked. For details, see ?parallel::makeCluster. Default is 1 (no parallel computing).

Details

The distance correlation, which is a measure of dependence between two random variables, can be used to quantify any dependence, whether linear, monotonic, non-monotonic or nonlinear relationships.

Value

a list with components:

  • s_diff_hat, a numeric vector of estimated sample-specific biases.

  • y_hat, a matrix of bias-corrected abundances

  • mat_cooccur, a matrix of taxon-taxon co-occurrence pattern. The number in each cell represents the number of complete (nonzero) samples for the corresponding pair of taxa.

  • dcorr, the sample distance correlation matrix computed using the bias-corrected abundances y_hat.

  • dcorr_p, the p-value matrix corresponding to the sample distance correlation matrix dcorr.

  • dcorr_fl, the sparse correlation matrix obtained by p-value filtering based on the cutoff specified in max_p.

Author(s)

Huang Lin

See Also

secom_linear

Examples

library(ANCOMBC)
data(dietswap)

# subset to baseline
tse = dietswap[, dietswap$timepoint == 1]

set.seed(123)
res_dist = secom_dist(data = list(tse), assay_name = "counts",
                      tax_level = "Phylum", pseudo = 0,
                      prv_cut = 0.5, lib_cut = 1000, corr_cut = 0.5,
                      wins_quant = c(0.05, 0.95), R = 1000,
                      thresh_hard = 0.3, max_p = 0.005, n_cl = 2)

dcorr_fl = res_dist$dcorr_fl


FrederickHuangLin/ANCOMBC documentation built on Feb. 23, 2023, 11:13 p.m.