View source: R/continuous_discover.R
continuous_discover | R Documentation |
continuous_discover
takes as input a feature-by-sample matrix of
microbial abundances. It first performs unsupervised continuous structure
discovery (PCA) within each batch. Loadings of top PCs from each batch are
then mapped against each other to identify "consensus" loadings that are
reproducible across batches, using a network community discovery approach
implemented with igraph. The identified consensus loadings/scores can be viewed as
continuous structures in microbial profiles that are recurrent across batches
and valid in a meta-analytical sense. continuous_discover returns,
among other output, the identified consensus scores for continuous
structures in the provided microbial abundance profiles, as well as the
consensus PC loadings which can be used to assign continuous scores to any
sample with the same set of microbial features.
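
The steps described above (PCA within each batch, then comparing top loadings across batches) can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy data, the choice of two PCs per batch, and the cutoff of 0.5 are all made up for the example, and the community discovery step on the resulting network is left out.

```r
set.seed(1)

# Toy "already transformed" abundance matrix: 6 features x 20 samples,
# split into two batches of 10 samples (all values are made up).
abd <- matrix(rnorm(6 * 20), nrow = 6,
              dimnames = list(paste0("f", 1:6), paste0("s", 1:20)))
batch <- factor(rep(c("A", "B"), each = 10))

# PCA within each batch; keep the first two loading vectors per batch.
# prcomp() expects samples in rows, hence the transpose.
top_loadings <- lapply(levels(batch), function(b) {
  pca <- prcomp(t(abd[, batch == b]))
  load <- pca$rotation[, 1:2, drop = FALSE]
  colnames(load) <- paste0(b, "_PC", 1:2)
  load
})
loadings <- do.call(cbind, top_loadings)  # features x 4 loading vectors

# Cosine coefficient between every pair of loading vectors.
norms <- sqrt(colSums(loadings^2))
mat_cos <- crossprod(loadings) / outer(norms, norms)

# Pairs whose absolute cosine exceeds the cutoff become network edges.
cos_cutoff <- 0.5  # illustrative value only
adjacency <- abs(mat_cos) > cos_cutoff
diag(adjacency) <- FALSE
```

Loadings from the same batch are orthogonal by construction, so any edges in `adjacency` connect PCs from different batches. A community discovery function would then be run on the graph built from this adjacency matrix.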
continuous_discover(feature_abd, batch, data, control)
feature_abd |
feature-by-sample matrix of abundances (proportions or counts). |
batch |
name of the batch variable. This variable in data should be a factor; if it is not, it will be converted to one with a warning. |
data |
data frame of metadata. Its columns must include batch. |
control |
a named list of additional control parameters. See details. |
control should be provided as a named list of the following components (can be a subset).
character. Similar to the normalization parameter in Maaslin2 but only "TSS" and "NONE" are allowed. Default to "TSS" (total sum scaling).
character. Similar to the transform parameter in Maaslin2 but only "AST" and "LOG" are allowed. Default to "AST" (arcsine square root transformation).
numeric. Pseudo count to add to feature_abd before the transformation. Default to NULL, in which case the pseudo count will be set automatically to 0 if transform="AST", and to half of the minimal non-zero value in feature_abd if transform="LOG".
numeric. A value between 0 and 1 giving the proportion of variability explained at which to cut off the selection of top PCs in each batch. Across batches, the top PCs that together explain more than var_perc_cutoff of the total variability are selected for meta-analytical continuous structure discovery. Default to 0.8 (the included PCs must explain at least 80% of total variability).
numeric. A value between 0 and 1 giving the cutoff on absolute cosine coefficients between PC loadings that is used to construct the method's network. Once the top PC loadings from each batch are selected, the cosine coefficient between each pair of loadings is calculated as a measure of their similarity. Loading pairs whose absolute cosine coefficient exceeds cos_cutoff are considered associated with each other and are represented as an edge in a PC loading network. Network community discovery is then performed on this network to identify densely connected "clusters" of PC loadings, which represent meta-analytically recurrent continuous structures.
function. cluster_function is used to perform community structure discovery in the constructed PC loading network. This can be any of the network cluster functions provided in igraph. Default to cluster_optimal. Note that this option can be slow for larger datasets, in which case cluster_fast_greedy is recommended.
character. Name for the generated network figure file. Default to "clustered_network.pdf". Can be set to NULL, in which case no output will be generated.
integer. Clusters with sizes smaller than or equal to plot_size_cutoff will be excluded from the visualized network. Default to 2, meaning visualized clusters must have at least three nodes (PC loadings).
character. Name for the generated diagnostic figure file. Default to "continuous_diagnostic.pdf". Can be set to NULL, in which case no output will be generated.
logical. Indicates whether or not verbose information will be printed.
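
The normalization, transformation, pseudo count, and PC selection options above can be illustrated with a small base R sketch. It is not the package's code: the toy counts and the per-PC variance shares are invented, the base-2 logarithm and adding the pseudo count before normalization are assumptions, and the cutoff is applied here to a single batch only.

```r
# Toy count matrix: 4 features x 3 samples (values are made up).
counts <- matrix(c(10,  0, 5, 85,
                   20, 30, 0, 50,
                    1,  1, 1, 97),
                 nrow = 4,
                 dimnames = list(paste0("f", 1:4), paste0("s", 1:3)))

# "TSS": divide each sample (column) by its total so columns sum to 1.
tss <- sweep(counts, 2, colSums(counts), "/")

# "AST": arcsine square root of the proportions, with pseudo count 0.
ast <- asin(sqrt(tss))

# "LOG": pseudo count set to half of the minimal non-zero value.
# Assumptions: pseudo count is added before normalization, and log base is 2.
pseudo_count <- min(counts[counts > 0]) / 2
shifted <- counts + pseudo_count
log_abd <- log2(sweep(shifted, 2, colSums(shifted), "/"))

# Selecting top PCs in one batch: the smallest number of leading PCs whose
# cumulative share of variance reaches the cutoff.
var_explained <- c(0.50, 0.25, 0.15, 0.10)  # hypothetical per-PC shares
var_perc_cutoff <- 0.8
n_top <- which(cumsum(var_explained) >= var_perc_cutoff)[1]
```

With these made-up shares the first two PCs explain 75% of the variance, so three PCs are needed to pass the 0.8 cutoff.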
a list, with the following components:
matrix of identified consensus continuous scores. Columns are the identified consensus scores and rows correspond to samples in feature_abd.
matrix of identified consensus loadings. Columns are the identified consensus scores and rows correspond to features in feature_abd.
matrix of validation cosine coefficients of the identified consensus loadings. Columns correspond to the identified consensus scores and rows correspond to batches.
components for the constructed PC loading network and community discovery results. network is an igraph graph object for the constructed network of associated PC loadings. communities is a communities object for the identified consensus loading clusters in network (output from control$cluster_function). mat_cos is the matrix of cosine coefficients between all selected top PCs from all batches.
list of additional control parameters used in the function call.
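
Because the consensus loadings are defined over features, they can in principle be used to score new samples that have the same features. The sketch below shows one plausible way to do this by projecting samples onto the loadings. The matrices, the column names `Cluster_1` and `Cluster_2`, and the absence of any centering step are all assumptions made for illustration; the package may process the data differently before scoring.

```r
# Hypothetical consensus loadings: 4 features x 2 consensus structures.
loadings <- matrix(c(0.5,  0.5, -0.5, -0.5,
                     0.5, -0.5,  0.5, -0.5),
                   nrow = 4,
                   dimnames = list(paste0("f", 1:4),
                                   c("Cluster_1", "Cluster_2")))

# Hypothetical new samples (features x samples), assumed to be normalized
# and transformed the same way as the data the loadings came from.
new_abd <- matrix(c(0.2, 0.1, 0.4, 0.3,
                    0.0, 0.6, 0.1, 0.3),
                  nrow = 4,
                  dimnames = list(paste0("f", 1:4), c("new1", "new2")))

# Align features by name, then project each sample onto each loading.
new_abd <- new_abd[rownames(loadings), , drop = FALSE]
scores <- t(new_abd) %*% loadings  # samples x consensus structures
```

The result has one row per new sample and one column per consensus structure, matching the layout described for the consensus score matrix.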
Siyuan Ma, siyuanma@g.harvard.edu
data("CRC_abd", "CRC_meta")
fit_continuous <- continuous_discover(feature_abd = CRC_abd,
                                      batch = "studyID",
                                      data = CRC_meta)