continuous_discover: Unsupervised meta-analytical discovery and validation of...

View source: R/continuous_discover.R

continuous_discoverR Documentation

Unsupervised meta-analytical discovery and validation of continuous structures in microbial abundance data

Description

continuous_discover takes as input a feature-by-sample matrix of microbial abundances. It first performs unsupervised continuous structure discovery (PCA) within each batch. Loadings of top PCs from each batch are then mapped against each other to identify "consensus" loadings that are reproducible across batches with a network community discovery approach with igraph. The identified consensus loadings/scores can be viewed as continuous structures in microbial profiles that are recurrent across batches and valid in a meta-analyitical sense. continuous_discover returns, among other output, the identified consensus scores for continuous structures in the provided microbial abundance profiles, as well as the consensus PC loadings which can be used to assign continuous scores to any sample with the same set of microbial features.

Usage

continuous_discover(feature_abd, batch, data, control)

Arguments

feature_abd

feature-by-sample matrix of abundances (proportions or counts).

batch

name of the batch variable. This variable in data should be a factor variable and will be converted to so with a warning if otherwise.

data

data frame of metadata, columns must include batch.

control

a named list of additional control parameters. See details.

Details

control should be provided as a named list of the following components (can be a subset).

normalization

character. Similar to the normalization parameter in Maaslin2 but only "TSS" and "NONE" are allowed. Default to "TSS" (total sum scaling).

transform

character. Similar to the transform parameter in Maaslin2 but only "AST" and "LOG" are allowed. Default to "AST" (arcsine square root transformation).

pseudo_count

numeric. Pseudo count to add feature_abd before the transformation. Default to NULL, in which case pseudo count will be set automatically to 0 if transform="AST", and half of minimal non-zero values in feature_abd if transform="LOG".

var_perc_cutoff

numeric. A value between 0 and 1 that indicates the percentage variability explained to cut off at for selecting top PCs in each batch. Across batches, the top PCs that in total explain more than var_perc_cutoff of the total variability will be selected for meta-analytical continuous structure discovery. Default to 0.8 (PCs included need to explain at least 80 total variability).

cos_cutoff

numeric. A value between 0 and 1 that indicates cutoff for absolute cosine coefficients between PC loadings to construct the method's network with. Once the top PC loadings from each batch are selected, cosine coefficients between each loading pair are calculated which indicate their similarity. Loading pairs with absolute cosine coefficients surpassing cos_cutoff are then considered as associated with each other, and represented as an edge between the pair in a PC loading network. Network community discovery can then be performed on this network to identified densely connected "clusters" of PC loadings, which represent meta-analytically recurrent continuous structures.

cluster_function

function. cluster_function is used to perform community structure discovery in the constructed PC loading network. This can be any of the network cluster functions provided in igraph. Default to cluster_optimal. Note that this option can be slow for larger datasets, in which case cluster_fast_greedy is recommended.

network_plot

character. Name for the generated network figure file. Default to "clustered_network.pdf". Can be set to NULL in which case no output will be generated.

plot_size_cutoff

integer. Clusters with sizes smaller than or equal to plot_size_cutoff will be excluded in the visualized network. Defaul to 2 - visualized clusters must have at least three nodes (PC loadings).

diagnostic_plot

character. Name for the generated diagnostic figure file. Default to "continuous_diagnostic.pdf". Can be set to NULL in which case no output will be generated.

verbose

logical. Indicates whether or not verbose information will be printed.

Value

a list, with the following components:

consensus_scores

matrix of identified consensus continuous scores. Columns are the identified consensus scores and rows correspond to samples in feature_abd.

consensus_loadings

matrix of identified consensus loadings. Columns are the identified consensus scores and rows correspond to features in feature_abd.

mat_vali

matrix of validation cosine coefficients of the identified consensus loadings. Columns correspond to the identified consensus scores and rows correspond to batches.

network, communities, mat_cos

components for the constructed PC loading network and community discovery results. network is a igraph graph object for the constructed network of associated PC loadings. communities is a communities object for the identified consensus loading clusters in network (output from control$cluster_function). mat_cos is the matrix of cosine coefficients between all selected top PCs from all batches.

control

list of additional control parameters used in the function call.

Author(s)

Siyuan Ma, siyuanma@g.harvard.edu

Examples

data("CRC_abd", "CRC_meta")
fit_continuous <- continuous_discover(feature_abd = CRC_abd,
                                      batch = "studyID",
                                      data = CRC_meta)

biobakery/MMUPHin documentation built on March 30, 2024, 4:50 a.m.