secom_linear: Sparse estimation of linear correlations among microbiomes

View source: R/secom_linear.R

secom_linearR Documentation

Sparse estimation of linear correlations among microbiomes


Obtain the sparse correlation matrix for linear correlations between taxa. The current version of secom_linear function supports either of the three correlation coefficients: Pearson, Spearman, and Kendall's τ.


  assay_name = "counts",
  tax_level = NULL,
  pseudo = 0,
  prv_cut = 0.5,
  lib_cut = 1000,
  corr_cut = 0.5,
  wins_quant = c(0.05, 0.95),
  method = c("pearson", "kendall", "spearman"),
  soft = FALSE,
  thresh_len = 100,
  n_cv = 10,
  thresh_hard = 0,
  max_p = 0.005,
  n_cl = 1



a list of the input data. Each element of the list can be a phyloseq, SummarizedExperiment, or TreeSummarizedExperiment object, which consists of a feature table (microbial count table), a sample metadata, a taxonomy table (optional), and a phylogenetic tree (optional). The row names of the metadata must match the sample names of the feature table, and the row names of the taxonomy table must match the taxon (feature) names of the feature table. See ?phyloseq::phyloseq, ?SummarizedExperiment::SummarizedExperiment, or ?TreeSummarizedExperiment::TreeSummarizedExperiment for more details. It is highly recommended that the input data are in low taxonomic levels, such as OTU or species level, as the estimation of sampling fractions requires a large number of taxa. For multiple ecosystems, simply stack the data. For example, for two ecosystems, such as gut and tongue, specify the list of input data as data = list(gut = tse1, tongue = tse2).


character. Name of the count table in the data object (only applicable if data object is a (Tree)SummarizedExperiment). Default is "counts". See ?SummarizedExperiment::assay for more details.


character. The taxonomic level of interest. The input data can be agglomerated at different taxonomic levels based on your research interest. Default is NULL, i.e., do not perform agglomeration, and the SECOM anlysis will be performed at the lowest taxonomic level of the input data.


numeric. Add pseudo-counts to the data. Default is 0 (no pseudo-counts).


a numerical fraction between 0 and 1. Taxa with prevalences less than prv_cut will be excluded in the analysis. For instance, suppose there are 100 samples, if a taxon has nonzero counts presented in less than 10 samples, it will not be further analyzed. Default is 0.50.


a numerical threshold for filtering samples based on library sizes. Samples with library sizes less than lib_cut will be excluded in the analysis. Default is 1000.


numeric. To prevent false positives due to taxa with small variances, taxa with Pearson correlation coefficients greater than corr_cut with the estimated sample-specific bias will be flagged. The pairwise correlation coefficient between flagged taxa will be set to 0s. Default is 0.5.


a numeric vector of probabilities with values between 0 and 1. Replace extreme values in the abundance data with less extreme values. Default is c(0.05, 0.95). For details, see ?DescTools::Winsorize.


character. It indicates which correlation coefficient is to be computed. One of "pearson", "kendall", or "spearman": can be abbreviated.


logical. TRUE indicates that soft thresholding is applied to achieve the sparsity of the correlation matrix. FALSE indicates that hard thresholding is applied to achieve the sparsity of the correlation matrix. Default is FALSE.


numeric. Grid-search is implemented to find the optimal values over thresh_len thresholds for the thresholding operator. Default is 100.


numeric. The fold number in cross validation. Default is 10 (10-fold cross validation).


Numeric. Set a hard threshold for the correlation matrix. Pairwise correlation coefficient (in its absolute value) less than or equal to thresh_hard will be set to 0. Default is 0 (No ad-hoc hard thresholding).


numeric. Obtain the sparse correlation matrix by p-value filtering. Pairwise correlation coefficient with p-value greater than max_p will be set to 0. Default is 0.005.


numeric. The number of nodes to be forked. For details, see ?parallel::makeCluster. Default is 1 (no parallel computing).


a list with components:

  • s_diff_hat, a numeric vector of estimated sample-specific biases.

  • y_hat, a matrix of bias-corrected abundances

  • cv_error, a numeric vector of cross-validation error estimates, which are the Frobenius norm differences between correlation matrices using training set and validation set, respectively.

  • thresh_grid, a numeric vector of thresholds in the cross-validation.

  • thresh_opt, numeric. The optimal threshold through cross-validation.

  • mat_cooccur, a matrix of taxon-taxon co-occurrence pattern. The number in each cell represents the number of complete (nonzero) samples for the corresponding pair of taxa.

  • corr, the sample correlation matrix (using the measure specified in method) computed using the bias-corrected abundances y_hat.

  • corr_p, the p-value matrix corresponding to the sample correlation matrix corr.

  • corr_th, the sparse correlation matrix obtained by thresholding based on the method specified in soft.

  • corr_fl, the sparse correlation matrix obtained by p-value filtering based on the cutoff specified in max_p.


Huang Lin

See Also




# subset to baseline
tse = dietswap[, dietswap$timepoint == 1]

res_linear = secom_linear(data = list(tse), assay_name = "counts",
                          tax_level = "Phylum", pseudo = 0,
                          prv_cut = 0.5, lib_cut = 1000, corr_cut = 0.5,
                          wins_quant = c(0.05, 0.95), method = "pearson",
                          soft = FALSE, thresh_len = 20, n_cv = 10,
                          thresh_hard = 0.3, max_p = 0.005, n_cl = 2)

corr_th = res_linear$corr_th
corr_fl = res_linear$corr_fl

FrederickHuangLin/ANCOMBC documentation built on Feb. 23, 2023, 11:13 p.m.